Predicting HIV coreceptor usage on the basis of genetic and clinical covariates

Tobias Sing; Andrew J Low; Niko Beerenwinkel; Oliver Sander; Peter K Cheung; Francisco S Domingues; Joachim Büch; Martin Däumer; Rolf Kaiser; Thomas Lengauer; P Richard Harrigan

Antivir Ther. 2007;12(7):1097-106.

Authors

Tobias Sing¹, Andrew J Low, Niko Beerenwinkel, Oliver Sander, Peter K Cheung, Francisco S Domingues, Joachim Büch, Martin Däumer, Rolf Kaiser, Thomas Lengauer, P Richard Harrigan

Affiliation

¹ Max Planck Institute for Informatics, Saarbrücken, Germany.

PMID: 18018768

Abstract

Background: We compared several statistical learning methods for the prediction of HIV coreceptor use from clonal HIV third hypervariable (V3) loop sequences, and evaluated and improved their effectiveness on clinical samples.

Methods: Support vector machines (SVM), artificial neural networks, position-specific scoring matrices (PSSM) and mixtures of localized rules were estimated and tested using ten repetitions of ten-fold cross-validation on a clonal dataset consisting of 1,100 matched clonal genotype-phenotype pairs from 332 patients. Different SVMs were also trained and tested on a clinically derived dataset, representing 920 patient samples from British Columbia, Canada. Methods were evaluated using receiver operating characteristic (ROC) curves.
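
The resampling scheme named in the Methods can be sketched as follows. This is a minimal illustration of repeated ten-fold cross-validation index generation, not the authors' code: the function name `repeated_kfold`, the seed, and the round-robin fold assignment are assumptions. The abstract does not state whether clones from the same patient were kept in the same fold, and this sketch does not group by patient.

```python
import random

def repeated_kfold(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV.

    Hypothetical helper for illustration; samples are split at random,
    without any grouping by patient.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every repeat
        # Deal the shuffled indices round-robin into nearly equal folds.
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for fold, test_idx in enumerate(folds):
            # Training set: every index that is not in the held-out fold.
            train_idx = [i for j, f in enumerate(folds) if j != fold for i in f]
            yield repeat, fold, train_idx, sorted(test_idx)

# 1,100 genotype-phenotype pairs, as in the clonal dataset.
splits = list(repeated_kfold(1100))
print(len(splits), "train/test splits")
```

Each of the ten repeats holds out every sample exactly once, so a model is fitted 100 times and each sample receives ten out-of-fold predictions.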

Results: In the clonal analysis, the sensitivity of the 11/25 rule at 92.5% specificity was 59.5%. PSSMs and SVMs increased sensitivity to 71.9% and 76.4%, respectively, at the same specificity (P << 0.05). In clinical samples, the sensitivity of the 11/25 rule and SVM decreased to 25.9% (specificity 93.9%) and 39.8% (specificity 93.5%), respectively. However, the integration of clinical data resulted in a further 2.4-fold increase in sensitivity over the 11/25 rule (63%). Univariate analyses identified 41 V3 mutations significantly associated with coreceptor usage.
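
The two quantities compared in the Results can be illustrated with a short sketch. The 11/25 rule is shown in its commonly stated form (CXCR4 use is predicted when V3 position 11 or 25 carries a positively charged residue, R or K); the sketch assumes an ungapped, 35-residue V3 sequence. The second function reads one operating point off a ROC curve: sensitivity at the lowest score threshold that reaches a target specificity. Function names, the example sequences, and the toy scores are illustrative assumptions, not data from the study.

```python
def rule_11_25(v3):
    """11/25 rule: True (CXCR4-using) if position 11 or 25 (1-based) is R or K.

    Assumes an ungapped 35-residue V3 loop sequence.
    """
    if len(v3) != 35:
        raise ValueError("expected an ungapped 35-residue V3 sequence")
    return v3[10] in "RK" or v3[24] in "RK"

def sensitivity_at_specificity(scores, labels, target_spec):
    """Return (sensitivity, specificity) at the lowest threshold whose
    specificity is at least target_spec.

    labels: 1 = CXCR4-using (positive), 0 = CCR5-only (negative).
    A sample is called positive when its score exceeds the threshold.
    """
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pos = [s for s, y in zip(scores, labels) if y == 1]
    for t in sorted(set(scores)):
        spec = sum(s <= t for s in neg) / len(neg)
        if spec >= target_spec:
            return sum(s > t for s in pos) / len(pos), spec
    raise ValueError("target specificity must not exceed 1")

# Illustrative sequences: S at position 11 and E at 25, then R at 11.
print(rule_11_25("CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC"))
print(rule_11_25("CTRPNNNTRKRIHIGPGRAFYTTGEIIGDIRQAHC"))

# Toy classifier scores (higher = more likely CXCR4-using).
scores = [0.1, 0.2, 0.3, 0.4, 0.9, 0.35, 0.8, 0.7, 0.05, 0.95]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(sensitivity_at_specificity(scores, labels, target_spec=0.8))
```

Fixing specificity before comparing sensitivities, as the abstract does, puts a rule with a single operating point and score-based methods such as PSSMs and SVMs on the same footing.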

Conclusion: For all methods tested, a substantial sensitivity decrease is observed on clinical data, probably owing to the heterogeneity of the viral population in vivo. In response to these complications, we present an SVM-based approach that integrates sequence information with clinical and host data, resulting in improved performance and sensitivity compared with purely sequence-based approaches.
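
One way to picture the integration of sequence and clinical information is as a single feature vector passed to a classifier. The sketch below is an assumption-laden illustration only: the abstract does not specify the encoding or which covariates were used. Here the V3 sequence is one-hot encoded per position, and two hypothetical covariates (CD4 count and viral load, both of which appear among the record's MeSH terms) are appended after simple rescaling.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode_v3(v3):
    """One-hot encode a V3 sequence: 20 indicator values per position.

    Characters outside the 20 standard amino acids (gaps, ambiguity
    codes) are encoded as all zeros at that position.
    """
    features = []
    for residue in v3:
        features.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return features

def build_features(v3, cd4_count, viral_load):
    """Concatenate sequence indicators with clinical covariates.

    The covariates and their scaling (CD4 in thousands of cells/uL,
    log10 of viral load in copies/mL) are assumptions for illustration.
    """
    clinical = [cd4_count / 1000.0, math.log10(viral_load)]
    return encode_v3(v3) + clinical

x = build_features("CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC",
                   cd4_count=250, viral_load=100000)
print(len(x))  # 35 positions * 20 indicators + 2 covariates
```

A vector of this kind could be supplied to any standard SVM implementation; putting the clinical covariates on a scale comparable to the 0/1 sequence indicators keeps them from dominating a kernel computed on the combined input.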

Publication types

Evaluation Study
Research Support, Non-U.S. Gov't

MeSH terms

CD4 Lymphocyte Count
Genotype
HIV / genetics*
HIV / metabolism*
HIV Envelope Protein gp120 / genetics
HIV Infections / virology*
Humans
Models, Statistical*
Neural Networks, Computer
Peptide Fragments / genetics
Phenotype
Receptors, CCR5 / metabolism*
Receptors, CXCR4 / metabolism*
Reproducibility of Results
Sensitivity and Specificity
Sequence Alignment
Viral Load

Substances

HIV Envelope Protein gp120
HIV envelope protein gp120 (305-321)
Peptide Fragments
Receptors, CCR5
Receptors, CXCR4