Identification of the shortest species-specific oligonucleotide sequences

Genome Res. 2025 Jan 2:gr.280070.124. doi: 10.1101/gr.280070.124. Online ahead of print.

Abstract

Despite the exponential increase in sequencing information driven by massively parallel DNA sequencing technologies, universal and succinct genomic fingerprints for each organism are still missing. Identifying the shortest species-specific nucleic sequences offers insights into species evolution and holds potential practical applications in agriculture, wildlife conservation, and healthcare. We propose a new method for sequence analysis based on nucleic "quasi-primes": the shortest sequences occurring in each of 45,785 organismal reference genomes that are present in one genome and absent from every other examined genome. In the human genome, we find that the genomic loci of nucleic quasi-primes are most enriched for genes associated with brain development and cognitive function. In a single-cell case study focusing on the human primary motor cortex, nucleic quasi-prime genes account for a significantly larger proportion of the variation based on average gene expression. Non-neuronal cell types, including astrocytes, endothelial cells, microglia perivascular-macrophages, oligodendrocytes, and vascular and leptomeningeal cells, exhibited significant activation of quasi-prime-containing genes associated with cancer, while quasi-prime-containing genes associated with cognitive, mental, and developmental disorders were simultaneously suppressed. We also show that human disease-causing variants, eQTLs, mQTLs and sQTLs are 4.43-fold, 4.34-fold, 4.29-fold and 4.21-fold enriched at human quasi-prime loci, respectively. These findings indicate that nucleic quasi-primes are genomic loci linked to the evolution of species-specific traits and, in humans, they provide insights into the development of cognitive traits and human diseases, including neurodevelopmental disorders.
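
The quasi-prime definition given in the abstract (the shortest sequences present in one genome and absent from every other examined genome) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `kmer_set` and `shortest_unique_kmers`, the `max_k` cutoff, and the toy sequences are hypothetical, and the sketch assumes forward-strand matching only and genomes small enough to hold every k-mer in memory, which would not scale to 45,785 reference genomes.

```python
from collections import Counter


def kmer_set(seq, k):
    """Return the set of all length-k substrings of seq (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shortest_unique_kmers(genomes, max_k=16):
    """For each genome, find the smallest k at which it has k-mers seen in no other genome.

    genomes: dict mapping a (hypothetical) species name to its sequence string.
    Returns a dict: name -> (k, sorted list of k-mers unique to that genome).
    Genomes with no unique k-mer up to max_k are left out of the result.
    """
    result = {}
    pending = set(genomes)
    for k in range(1, max_k + 1):
        if not pending:
            break
        # k-mers are compared against ALL genomes, including already resolved ones.
        sets = {name: kmer_set(seq, k) for name, seq in genomes.items()}
        # Count in how many genomes each k-mer occurs.
        owners = Counter(kmer for s in sets.values() for kmer in s)
        for name in sorted(pending):
            unique = sorted(km for km in sets[name] if owners[km] == 1)
            if unique:
                result[name] = (k, unique)
        pending.difference_update(result)
    return result


# Toy input, invented purely for illustration.
toy = {"species_a": "AACC", "species_b": "AAGG", "species_c": "CCGGT"}
found = shortest_unique_kmers(toy)
for name in sorted(found):
    print(name, found[name])
```

In this toy input, species_c is resolved at k = 1 because only it contains a T, whereas species_a and species_b need k = 2. A realistic pipeline would likely also need to handle reverse complements and ambiguous bases, and would rely on a disk-backed or probabilistic k-mer index rather than in-memory sets.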