Predicting the Presence of Uncommon Elements in Unknown Biomolecules from Isotope Patterns

Anal Chem. 2016 Aug 2;88(15):7556-66. doi: 10.1021/acs.analchem.6b01015. Epub 2016 Jul 22.

Abstract

Determining the molecular formula is one of the earliest and most important steps when investigating the chemical nature of an unknown compound. Common approaches use the isotopic pattern of a compound measured using mass spectrometry. Computational methods to determine the molecular formula from this isotopic pattern require a fixed set of elements. Considering all possible elements severely increases running times and, more importantly, the chance of false positive identifications, because the number of candidate formulas for a given target mass rises significantly if the constituent elements are not prefiltered. This negative effect grows stronger for compounds of higher molecular mass, as the effect of a single atom on the overall isotopic pattern grows smaller. On the other hand, hand-selected restrictions on this set of elements may prevent the identification of the correct molecular formula. Thus, it is crucial to determine the set of elements most likely to be present in the compound before assigning an elemental formula to an exact mass. In this paper, we present a method to determine the presence of certain elements (sulfur, chlorine, bromine, boron, and selenium) in the compound from its (high mass accuracy) isotopic pattern. We limit ourselves to biomolecules, in the sense of products from nature or synthetic products with potential bioactivity. The classifiers developed here predict the presence of an element with very high sensitivity and high specificity. We evaluate the classifiers on three real-world data sets with 663 isotope patterns in total: 184 isotope patterns containing sulfur, 187 containing chlorine, 14 containing bromine, one containing boron, and one containing selenium. In no case do we make a false negative prediction; for chlorine, bromine, boron, and selenium, we make ten false positive predictions in total.
We also demonstrate the impact of our method on the identification of molecular formulas, in particular on the number of considered candidates and running time. The element prediction will be part of the next SIRIUS release, available from https://bio.informatik.uni-jena.de/software/sirius/.
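
The claim that the candidate set grows when elements are not prefiltered can be illustrated with a small sketch. This is not the method of the paper or the SIRIUS implementation: it is a naive enumeration that counts every non-negative integer combination of atoms whose summed monoisotopic mass lies within a ppm window of the target, with no chemical plausibility filters. The function name count_formulas, the 5 ppm default, and the rounded mass table are assumptions made for illustration only.

```python
import math

# Approximate monoisotopic masses in Da (most abundant isotope, rounded).
# Illustrative values only; use a curated table for real work.
MONO_MASS = {
    "C": 12.000000, "H": 1.007825, "N": 14.003074, "O": 15.994915,
    "P": 30.973762, "S": 31.972071, "Cl": 34.968853, "Br": 78.918338,
    "B": 11.009305, "Se": 79.916522,
}


def count_formulas(target_mass, elements, ppm=5.0):
    """Count element-count combinations whose mass matches target_mass.

    Hypothetical helper: brute-force enumeration without any chemical
    validity checks, so the counts are upper bounds on sensible formulas.
    """
    tol = target_mass * ppm * 1e-6
    # Heaviest elements first keeps the outer loops short.
    masses = sorted((MONO_MASS[e] for e in elements), reverse=True)

    def recurse(i, remaining):
        m = masses[i]
        if i == len(masses) - 1:
            # Last element: count the integers n with |remaining - n*m| <= tol.
            lo = max(0, math.ceil((remaining - tol) / m))
            hi = math.floor((remaining + tol) / m)
            return max(0, hi - lo + 1)
        total = 0
        n = 0
        while n * m <= remaining + tol:
            total += recurse(i + 1, remaining - n * m)
            n += 1
        return total

    return recurse(0, target_mass)


if __name__ == "__main__":
    target = 180.0634  # close to the monoisotopic mass of C6H12O6
    base = ["C", "H", "N", "O", "P", "S"]
    extended = base + ["Cl", "Br", "B", "Se"]
    print("CHNOPS candidates:  ", count_formulas(target, base))
    print("extended candidates:", count_formulas(target, extended))
```

Because every formula over the smaller element set is also a formula over the larger set (with zero atoms of the added elements), the extended count can never be smaller than the base count. Predicting which elements are present before decomposition therefore limits the number of candidates that must be scored.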

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Chemical Phenomena*
  • Datasets as Topic
  • Elements*
  • Isotopes / chemistry*
  • Machine Learning*
  • Mass Spectrometry
  • Molecular Weight

Substances

  • Elements
  • Isotopes