PLoS Comput Biol. 2016 Nov 3;12(11):e1005182.
doi: 10.1371/journal.pcbi.1005182. eCollection 2016 Nov.

WORMHOLE: Novel Least Diverged Ortholog Prediction Through Machine Learning

Free PMC article

George L Sutphin et al. PLoS Comput Biol. 2016.

Abstract

The rapid advancement of technology in genomics and targeted genetic manipulation has made comparative biology an increasingly prominent strategy to model human disease processes. Predicting orthology relationships between species is a vital component of comparative biology. Dozens of strategies for predicting orthologs have been developed using combinations of gene and protein sequence, phylogenetic history, and functional interaction with progressively increasing accuracy. A relatively new class of orthology prediction strategies combines aspects of multiple methods into meta-tools, resulting in improved prediction performance. Here we present WORMHOLE, a novel ortholog prediction meta-tool that applies machine learning to integrate 17 distinct ortholog prediction algorithms to identify novel least diverged orthologs (LDOs) between 6 eukaryotic species: humans, mice, zebrafish, fruit flies, nematodes, and budding yeast. Machine learning allows WORMHOLE to intelligently incorporate predictions from a wide spectrum of strategies in order to form aggregate predictions of LDOs with high confidence. In this study we demonstrate the performance of WORMHOLE across each combination of query and target species. We show that WORMHOLE is particularly adept at improving LDO prediction performance between distantly related species, expanding the pool of LDOs while maintaining low evolutionary distance and a high level of functional relatedness between genes in LDO pairs. We present extensive validation, including cross-validated prediction of PANTHER LDOs and evaluation of evolutionary divergence and functional similarity, and discuss future applications of machine learning in ortholog prediction. A WORMHOLE web tool has been developed and is available at http://wormhole.jax.org/.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Schematic representation of the WORMHOLE LDO prediction strategy.
(A) First-order features of gene pairs (e.g. sequence comparison, phylogenetic history, and functional interaction) are used by Layer 1 algorithms (B) to generate candidate LDO (cLDO) predictions, which are considered second-order features (C). The second-order features are used by the WORMHOLE Layer 2 methods (voting or SVMs) (D) to select high-confidence LDOs and filter out erroneous predictions.
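
The voting variant of Layer 2 described in the caption above can be illustrated with a minimal sketch. It assumes each Layer 1 algorithm contributes a set of candidate gene pairs and that a pair is accepted once a minimum number of algorithms agree. The function names, algorithm labels, gene identifiers, and the vote threshold are hypothetical and are not taken from the WORMHOLE implementation.

```python
from typing import Dict, Set, Tuple

GenePair = Tuple[str, str]  # (query gene, target gene)


def vote_counts(predictions: Dict[str, Set[GenePair]]) -> Dict[GenePair, int]:
    """Count how many Layer 1 algorithms predict each gene pair."""
    counts: Dict[GenePair, int] = {}
    for pairs in predictions.values():
        for pair in pairs:
            counts[pair] = counts.get(pair, 0) + 1
    return counts


def select_by_votes(predictions: Dict[str, Set[GenePair]], min_votes: int) -> Set[GenePair]:
    """Keep gene pairs supported by at least `min_votes` algorithms."""
    return {pair for pair, n in vote_counts(predictions).items() if n >= min_votes}


# Hypothetical candidate predictions from three placeholder algorithms.
example = {
    "algo_a": {("geneQ1", "geneT1"), ("geneQ2", "geneT2")},
    "algo_b": {("geneQ1", "geneT1"), ("geneQ2", "geneT3")},
    "algo_c": {("geneQ1", "geneT1"), ("geneQ2", "geneT2")},
}
high_confidence = select_by_votes(example, min_votes=2)
```

An SVM-based Layer 2 would replace the equal-weight count with learned per-algorithm weights; that step is not shown here.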
Fig 2. WORMHOLE SVMs improve prediction of PANTHER LDOs over constituent algorithms and voting to a degree dependent on the evolutionary separation of the compared species.
(A) Precision-recall performance charts for PANTHER LDO predictions made between vertebrate and invertebrate species separated into categories based on evolutionary distance. Points or lines represent the mean performance of the 17 constituent algorithms (black), BLASTp reciprocal best hits (RBHs) (red), voting (green), WORMHOLE SVMs (blue), or WORMHOLE RBHs (cyan) at predicting PANTHER LDOs across the 10 folds of the outer cross-validation (see Materials and Methods). WORMHOLE RBHs are reciprocal best hits selected based on the WORMHOLE Score and are introduced later in the Results section. Error bars and colored regions represent standard error of the mean for precision and recall across folds (due to the large number of gene pairs, error bars and regions are small and fall within the width of the point or line in most cases). Lines are generated by sampling the complete range of possible threshold values for each confidence score type. Color-matched points indicate the performance for specified threshold values (blue numbers) on each line. (B) Box and whisker plot representing the harmonic mean of precision and recall for each of the 17 constituent WORMHOLE algorithms, voting, BLASTp RBHs, WORMHOLE SVMs, and WORMHOLE RBHs when predicting PANTHER LDOs for each pair of query and target species. Ortholog prediction methods are ordered by median harmonic mean. For voting, SVMs, and WORMHOLE RBHs, values represent the maximum harmonic mean for each pair of query and target species (WORMHOLE Score ≥ 0.5).
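
The precision, recall, and harmonic mean reported in this figure can be computed at a given score threshold roughly as follows. This is a sketch under stated assumptions: confidence scores are held in a dictionary keyed by gene pair, a pair is predicted when its score meets the threshold, and the reference set stands in for the PANTHER LDOs. The function name and the toy data are hypothetical.

```python
from typing import Dict, Set, Tuple

GenePair = Tuple[str, str]


def precision_recall_hmean(
    scores: Dict[GenePair, float],
    reference: Set[GenePair],
    threshold: float,
) -> Tuple[float, float, float]:
    """Return (precision, recall, harmonic mean) for pairs scoring >= threshold."""
    predicted = {pair for pair, s in scores.items() if s >= threshold}
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    hmean = 2 * precision * recall / total if total else 0.0
    return precision, recall, hmean


# Hypothetical scores and reference set, for illustration only.
toy_scores = {("a", "x"): 0.9, ("b", "y"): 0.6, ("c", "z"): 0.4, ("d", "w"): 0.2}
toy_reference = {("a", "x"), ("c", "z")}

# Sweeping the threshold traces out one precision-recall line.
curve = [precision_recall_hmean(toy_scores, toy_reference, t) for t in (0.3, 0.5, 0.7)]
```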
Fig 3. Performance of WORMHOLE SVMs generalizes across species.
Box and whisker plot representing the harmonic mean of precision and recall for each WORMHOLE SVM trained on PANTHER LDOs between one pair of query and target species when applied to predict PANTHER LDOs between each other pair of query and target species. Values represent the maximum harmonic mean for each pair of query and target species (WORMHOLE Score ≥ 0.5). Training data sets are ordered by median harmonic mean. The box labelled "same pair" shows the performance of each model when applied to predict LDOs within the same species pair used to train that model (with cross-validation).
Fig 4. Weights given to constituent algorithm predictions by WORMHOLE SVMs are correlated across species comparisons.
(A) Distribution of Pearson correlations between weight vectors across models trained on different pairs of query and target species show reasonable concordance across species pairs (mean = 0.54), with considerable variation (standard deviation = 0.21) indicating species-pair-specific structure in the models. (B) Box and whisker plot of the weights given to predictions made by each constituent algorithm by WORMHOLE SVMs trained on each pair of query and target species show that each constituent algorithm has relatively consistent weight within each species pair comparison. Note that PANTHER has the highest average weight, as expected. Ortholog prediction methods are ordered by median SVM weight.
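
The pairwise comparison of weight vectors in panel (A) can be sketched as below. The sketch assumes each model is summarized by a list of per-algorithm weights in a fixed algorithm order. The model labels and weight values are hypothetical placeholders, not values from the study.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(weights: Dict[str, List[float]]) -> Dict[Tuple[str, str], float]:
    """Correlate the weight vectors of every pair of models."""
    return {
        (m1, m2): pearson(weights[m1], weights[m2])
        for m1, m2 in combinations(sorted(weights), 2)
    }


# Hypothetical weight vectors for three models (one entry per constituent algorithm).
toy_weights = {
    "model_1": [1.0, 2.0, 3.0],
    "model_2": [2.0, 4.0, 6.0],
    "model_3": [3.0, 2.0, 1.0],
}
toy_correlations = pairwise_correlations(toy_weights)
```

The mean and standard deviation of the resulting values summarize how consistent the weighting is across models.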
Fig 5. WORMHOLE SVMs reproduce the majority of PANTHER LDOs while expanding the total number of LDOs.
(A) Venn diagrams displaying the relative number of gene pairs in PANTHER, PANTHER LDOs, and WORMHOLE (WORMHOLE Score ≥ 0.5). Outer circles represent the complete set of gene pairs predicted by all of the constituent algorithms. Circle areas are proportional to the number of gene pairs in each data set. (B) The number of query genes with multiple LDO predictions by WORMHOLE SVMs as a function of WORMHOLE Score threshold. (C) Box and whisker plot representing the range of WORMHOLE Scores assigned to PANTHER LDOs or non-PANTHER LDOs (gene pairs in the WORMHOLE database but not in the PANTHER LDO reference set) within the set of genes with multiple LDO predictions by the WORMHOLE SVMs (WORMHOLE Score ≥ 0.5).
Fig 6. WORMHOLE identifies LDO pairs with a similar distribution of BLASTp alignment quality and evolutionary distance to the PANTHER LDOs and excludes low-scoring PANTHER LDOs.
Box and whisker plots representing BLASTp Bit Score (A) or evolutionary distance (B) for alignments between longest protein isoforms for genes in each gene pair in the indicated ortholog dataset. Novel WORMHOLE LDOs are gene pairs predicted by WORMHOLE that are not present in the PANTHER LDO training set. Excluded PANTHER LDOs are gene pairs in the PANTHER LDO training set that are excluded by WORMHOLE.
Fig 7. WORMHOLE SVMs improve prediction of FOSTA FEPs over constituent algorithms and voting to a degree dependent on the evolutionary separation of the compared species.
(A) Precision-recall performance charts for FOSTA FEP predictions made between vertebrate and invertebrate species separated into categories based on evolutionary distance. Points or lines represent the mean performance of the 17 constituent algorithms (black), BLASTp reciprocal best hits (RBHs) (red), voting (green), WORMHOLE SVMs (blue), or WORMHOLE RBHs (cyan) at predicting FOSTA FEPs. Lines are generated by sampling the complete range of possible threshold values for each confidence score type. Colored points indicate the performance for specified threshold values (blue numbers) on each line. (B) Box and whisker plot representing the harmonic mean of precision and recall for each of the 17 constituent WORMHOLE algorithms, voting, BLASTp RBHs, WORMHOLE SVMs, and WORMHOLE RBHs when predicting FOSTA FEPs for each pair of query and target species. Ortholog prediction methods are ordered by median harmonic mean. For voting and SVMs, values represent the maximum harmonic mean for each pair of query and target species (WORMHOLE Score ≥ 0.5).
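
Reciprocal best hits (RBHs), as named in the caption above, can be selected from any pairwise score along the following lines. This is a generic sketch, assuming a dictionary of scores keyed by (query gene, target gene), where the score could be a BLASTp bit score or a WORMHOLE Score. The tie-breaking rule (first pair encountered wins) and all identifiers are assumptions made for illustration.

```python
from typing import Dict, Set, Tuple

GenePair = Tuple[str, str]  # (query gene, target gene)


def reciprocal_best_hits(scores: Dict[GenePair, float]) -> Set[GenePair]:
    """Return pairs where each gene is the other's highest-scoring partner."""
    best_target: Dict[str, str] = {}  # query gene -> best target gene
    best_query: Dict[str, str] = {}   # target gene -> best query gene
    for (query, target), score in scores.items():
        # Strict '>' means the first pair seen wins ties (an assumption).
        if query not in best_target or score > scores[(query, best_target[query])]:
            best_target[query] = target
        if target not in best_query or score > scores[(best_query[target], target)]:
            best_query[target] = query
    return {
        (query, target)
        for query, target in best_target.items()
        if best_query.get(target) == query
    }


# Hypothetical scores between two query genes and two target genes.
toy_scores = {
    ("q1", "t1"): 0.9,
    ("q1", "t2"): 0.7,
    ("q2", "t1"): 0.8,
    ("q2", "t2"): 0.6,
}
rbh_pairs = reciprocal_best_hits(toy_scores)
```

In this toy input, q2's best target is t1, but t1's best query is q1, so only the q1/t1 pair is reciprocal.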
Fig 8. WORMHOLE SVMs produce an expanded set of LDOs while maintaining functional similarity relative to PANTHER LDOs.
Conservation of GO term annotation between genes in each gene pair is plotted against the number of gene pairs contained within each dataset for PANTHER (green point), all other constituent algorithms (black points), PANTHER LDOs (red points), WORMHOLE SVMs (blue lines), and WORMHOLE RBHs (cyan points). Points or lines indicate mean, and error bars or colored regions represent 95% confidence intervals, for Schlicker similarity in GO terms between genes (see Materials and Methods).
