Low-dose computed tomography (LDCT) reduces X-ray radiation exposure but at the cost of compromised image quality, characterized by increased noise and artifacts. Recently, transformer models have emerged as a promising avenue for enhancing LDCT image quality. However, the success of such models relies on a large amount of paired noisy and clean images, which are often scarce in clinical settings. In computer vision and natural language processing, masked autoencoders (MAE) have been recognized as a powerful self-pretraining method for transformers, owing to their exceptional capability to extract representative features. However, the original pretraining and fine-tuning design fails on low-level vision tasks such as denoising. To address this challenge, we redesign the classical encoder-decoder learning paradigm and develop a simple yet effective low-level vision MAE, referred to as LoMAE, tailored to the LDCT denoising problem. Moreover, we introduce an MAE-GradCAM method to shed light on the latent learning mechanisms of the MAE/LoMAE. We also evaluate LoMAE's robustness and generalizability across a variety of noise levels. Experimental results show that the proposed LoMAE enhances the denoising capability of the transformer and substantially reduces its dependency on high-quality, ground-truth data. It also demonstrates remarkable robustness and generalizability across a spectrum of noise levels. In summary, the proposed LoMAE offers promising solutions to major issues in LDCT denoising, including interpretability, ground-truth data dependency, and model robustness/generalizability.