Mining the unknown: a systems approach to metabolite identification combining genetic and metabolic information

PLoS Genet. 2012;8(10):e1003005. doi: 10.1371/journal.pgen.1003005. Epub 2012 Oct 18.


Recent genome-wide association studies (GWAS) with metabolomics data linked genetic variation in the human genome to differences in individual metabolite levels. A strong relevance of this metabolic individuality for biomedical and pharmaceutical research has been reported. However, a considerable number of the molecules currently quantified by modern metabolomics techniques are chemically unidentified. The identification of these "unknown metabolites" is still a demanding and intricate task, limiting their usability as functional markers of metabolic processes. As a consequence, previous GWAS largely ignored unknown metabolites as metabolic traits for the analysis. Here we present a systems-level approach that combines genome-wide association analysis and Gaussian graphical modeling with metabolomics to predict the identity of the unknown metabolites. We apply our method to original data of 517 metabolic traits, of which 225 are unknowns, and genotyping information on 655,658 genetic variants, measured in 1,768 human blood samples. We report previously undescribed genotype-metabotype associations for six distinct gene loci (SLC22A2, COMT, CYP3A5, CYP2C18, GBA3, UGT3A1) and one locus not related to any known gene (rs12413935). Overlaying the inferred genetic associations, metabolic networks, and knowledge-based pathway information, we derive testable hypotheses on the biochemical identities of 106 unknown metabolites. As a proof of principle, we experimentally confirm nine concrete predictions. We demonstrate the benefit of our method for the functional interpretation of previous metabolomics biomarker studies on liver detoxification, hypertension, and insulin resistance. Our approach is generic in nature and can be directly transferred to metabolomics data from different experimental platforms.
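
To illustrate the Gaussian graphical modeling step in general terms, the following is a minimal sketch of how partial correlations can be estimated from a concentration matrix. It is not the authors' pipeline: the function name `partial_correlations`, the synthetic three-metabolite chain, and all parameter values are assumptions chosen for illustration. In a Gaussian graphical model, two metabolites are connected only if they remain correlated after conditioning on all other measured metabolites, which tends to suppress indirect associations.

```python
import numpy as np


def partial_correlations(data):
    """Return the matrix of pairwise partial correlations.

    data: array of shape (n_samples, n_metabolites).
    Each entry (i, j) is the correlation of variables i and j
    conditioned on all remaining variables.
    """
    cov = np.cov(data, rowvar=False)
    precision = np.linalg.inv(cov)  # requires n_samples > n_metabolites
    scale = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcor, 1.0)
    return pcor


# Hypothetical synthetic data: a linear chain A -> B -> C, so A and C
# are related only indirectly, through B.
rng = np.random.default_rng(0)
n = 2000
a = rng.normal(size=n)
b = a + 0.5 * rng.normal(size=n)
c = b + 0.5 * rng.normal(size=n)
data = np.column_stack([a, b, c])

corr = np.corrcoef(data, rowvar=False)  # ordinary (marginal) correlations
pcor = partial_correlations(data)       # conditional on the third variable

print("marginal corr(A, C):", round(corr[0, 2], 3))
print("partial  corr(A, C | B):", round(pcor[0, 2], 3))
```

In this toy example the marginal correlation between A and C is high, while their partial correlation given B is close to zero, so a network built from partial correlations would keep the edges A-B and B-C and drop A-C. Plain matrix inversion only works when there are more samples than variables and the covariance matrix is well conditioned; otherwise a regularized estimator would be needed.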

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Computational Biology / methods
  • Data Mining / methods*
  • Genome-Wide Association Study*
  • Genomics / methods*
  • Humans
  • Metabolome
  • Metabolomics / methods*
  • Models, Statistical
  • Polymorphism, Single Nucleotide
  • Reproducibility of Results
  • Signal Transduction

Grant support

This work was funded in part by a grant from the German Federal Ministry of Education and Research (BMBF) to the German Center for Diabetes Research (DZD e.V.), by the European Research Council (starting grant “LatentCauses”), by BMBF Grant no. 03IS2061B (project Gani_Med), by BMBF Grant no. 0315494A (project SysMBo), by Era-Net grant no. 0315442A (project PathoGenoMics), and by the Initiative and Networking Fund of the Helmholtz Association within the Helmholtz Alliance on Systems Biology (project CoReNe). JK is supported by a PhD student fellowship from the “Studienstiftung des Deutschen Volkes.” KS is supported by Qatar Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.