Background: Olfactory receptors (ORs), the largest mammalian gene superfamily (900-1400 genes), has >50% pseudogenes in humans. While most of these inactive genes are identified via coding frame (nonsense) disruptions, seemingly intact genes may also be inactive due to other deleterious (missense) mutations. An ultimate assessment of the actual size of the functional human OR repertoire thus requires an accurate distinction between genes and pseudogenes.
Results: To characterize inactive ORs with intact open reading frame, we have developed a probabilistic Classifier for Olfactory Receptor Pseudogenes (CORP). This algorithm is based on deviations from a functionally crucial consensus, constituting sixty highly conserved positions identified by a comparison of two evolutionarily-constrained OR repertoires (mouse and dog) with a small pseudogene fraction. We used a logistic regression analysis to assign appropriate coefficients to the conserved position and thus achieving maximal separation between active and inactive ORs. Consequently, the algorithms identified only 5% of the mouse functional ORs as pseudogenes, setting an upper limit of 0.05 to the false positive detection. Finally we used this algorithm to classify the 384 purportedly intact human OR genes. Of these, 135 were predicted as likely encoding non-functional proteins, and 38 were segregating between active and inactive forms due to missense polymorphisms.
Conclusion: We demonstrated that the CORP algorithm is capable to distinguish between functional and non-functional OR genes with high precision even when the encoded protein would differ by a single amino acid. Using the CORP algorithm, we predict that approximately 70% of human OR genes are likely non-functional pseudogenes, a much higher number than hitherto suspected. The method we present may be employed for better annotation of inactive members in other gene families as well. CORP algorithm is available at: http://bioportal.weizmann.ac.il/HORDE/CORP/