Prior research in video object segmentation (VOS) predominantly relies on densely annotated videos. However, obtaining pixel-level annotations is both costly and time-intensive. In this work, we show that an effective VOS model can be trained with remarkably sparse video annotations: as few as one or two labeled frames per training video, while maintaining nearly equivalent performance. We term this training paradigm low-shot video object segmentation, abbreviated as low-shot VOS. Central to this method is the generation of reliable pseudo labels for unlabeled frames during training, which are then used in tandem with the labeled frames to optimize the model. Notably, our strategy is extremely simple and can be incorporated into the vast majority of existing VOS models. For the first time, we propose a universal method for training VOS models on one-shot and two-shot VOS datasets. In the two-shot setting, using only 7.3% and 2.9% of the labeled data from the YouTube-VOS and DAVIS benchmarks respectively, our model achieves results on par with those trained on fully labeled datasets. In the one-shot setting, only a minor performance drop is observed relative to models trained on fully annotated data.
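To make the pseudo-labeling idea above concrete, the following is a minimal PyTorch sketch of one training step that mixes a labeled frame with a confidence-filtered pseudo-labeled frame. The model interface, the `CONF_THRESHOLD` value, the loss weighting, and all tensor shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

CONF_THRESHOLD = 0.9  # assumed cutoff for accepting pseudo-labeled pixels


def low_shot_training_step(model, optimizer, labeled_frame, gt_mask, unlabeled_frame):
    """One training step combining a labeled frame with a pseudo-labeled one.

    Assumed shapes: frames are (B, 3, H, W); gt_mask is (B, H, W) with
    integer object ids; model(frame) returns per-pixel logits (B, C, H, W).
    """
    # 1) Generate a pseudo label for the unlabeled frame, without gradients.
    model.eval()
    with torch.no_grad():
        probs_u = model(unlabeled_frame).softmax(dim=1)
        conf, pseudo_mask = probs_u.max(dim=1)  # both (B, H, W)
        valid = conf > CONF_THRESHOLD           # keep only confident pixels

    # 2) Supervise on the labeled frame as usual.
    model.train()
    loss_labeled = F.cross_entropy(model(labeled_frame), gt_mask)

    # 3) Supervise on the unlabeled frame, masking out low-confidence pixels.
    loss_pseudo_map = F.cross_entropy(
        model(unlabeled_frame), pseudo_mask, reduction="none")
    loss_pseudo = (loss_pseudo_map * valid).sum() / valid.sum().clamp(min=1)

    loss = loss_labeled + loss_pseudo
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that a real VOS model is sequence-conditioned (it takes reference frames and their masks as additional inputs), so the actual interface is richer than the single-frame one sketched here; the sketch only illustrates how confident pseudo labels and ground-truth labels can be combined in a single loss.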