Automated high confidence compound identification of electron ionization mass spectra for nontargeted analysis

J Chromatogr A. 2021 Dec 20:1660:462656. doi: 10.1016/j.chroma.2021.462656. Epub 2021 Oct 31.

Abstract

Nontargeted analysis based on mass spectrometry is a rising practice in environmental monitoring for identifying contaminants of emerging concern. Nontargeted analysis performed using comprehensive two-dimensional gas chromatography coupled with time-of-flight mass spectrometry (GC×GC/TOF-MS) generates large numbers of possible analytes. Moreover, the default spectral library similarity score-based search algorithm used by LECO® ChromaTOF® does not ensure that high similarity scores result in correct library matches. Therefore, additional manual screening is necessary, but it is prone to human error, especially when dealing with large amounts of data. To improve the speed and accuracy of chemical identification, we developed CINeMA.py (Classification Is Never Manual Again). This programming suite automates GC×GC/TOF-MS data interpretation by determining the confidence of a match between the observed analyte mass spectrum and the library hit generated by the LECO® ChromaTOF® software from the NIST Electron Ionization Mass Spectral (NIST EI-MS) library. Our script allows the user to evaluate the confidence of the match using an algorithmic method that mimics the manual curation process and two different machine learning approaches (neural networks and random forest). The script allows the user to adjust various parameters (e.g., similarity threshold) and study their effects on prediction accuracy. To test CINeMA.py, we used data from two different environmental contaminant studies: an EPA study on household dust and a study on stormwater runoff. Using a reference set based on the analysis performed by highly trained users of the ChromaTOF and GC×GC/TOF-MS systems, the random forest model had the highest prediction accuracies of 86% and 83% on the EPA and Stormwater data sets, respectively. The algorithmic approach had the second-best prediction accuracy (82% and 79%), while the neural network had the lowest (63% and 67%).
All the approaches required less than 1 min to classify 986 observed analytes, whereas manual data analysis required hours or days to complete. Our methods were also able to detect high-confidence matches missed during the manual review. Overall, CINeMA.py provides users with a powerful suite of tools that should significantly speed up data analysis while reducing the possibility of manual errors and discrepancies among users, and can be applied to other GC/EI-MS instrument-based nontargeted analysis.
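
As a rough illustration of what comparing an observed spectrum to a library hit against an adjustable similarity threshold can look like, the sketch below computes a plain cosine similarity between two spectra. It is not the CINeMA.py algorithm or the ChromaTOF scoring method, neither of which is specified in this abstract. The function names, the `{m/z: intensity}` dictionary representation, the example peak values, and the 0.8 default threshold are all assumptions made for illustration.

```python
import math


def cosine_similarity(observed, library):
    """Cosine similarity between two spectra given as {m/z: intensity} dicts.

    Hypothetical helper for illustration; not taken from CINeMA.py.
    """
    # Peaks missing from one spectrum are treated as zero intensity.
    all_mz = set(observed) | set(library)
    dot = sum(observed.get(mz, 0.0) * library.get(mz, 0.0) for mz in all_mz)
    norm_obs = math.sqrt(sum(v * v for v in observed.values()))
    norm_lib = math.sqrt(sum(v * v for v in library.values()))
    if norm_obs == 0.0 or norm_lib == 0.0:
        return 0.0  # an empty spectrum cannot match anything
    return dot / (norm_obs * norm_lib)


def is_high_confidence(observed, library, threshold=0.8):
    """Flag a match when the similarity reaches the (assumed) threshold."""
    return cosine_similarity(observed, library) >= threshold


# Made-up example spectra: the same peak pattern at different absolute scale.
observed_spectrum = {43: 100.0, 58: 50.0, 71: 10.0}
library_spectrum = {43: 200.0, 58: 100.0, 71: 20.0}
unrelated_spectrum = {91: 100.0, 65: 30.0}

match = is_high_confidence(observed_spectrum, library_spectrum)
mismatch = is_high_confidence(observed_spectrum, unrelated_spectrum)
```

Because cosine similarity ignores absolute intensity scale, the first pair is flagged as a match, while the second pair shares no peaks and is rejected. Raising or lowering `threshold` changes how many candidate matches pass, which is the kind of parameter effect the abstract says users can study.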

Keywords: ChromaTOF; Machine learning; Mass spectral comparison; Nontargeted analysis; PyAutoGUI; Suspect screening.

MeSH terms

  • Algorithms
  • Electrons*
  • Environmental Monitoring
  • Gas Chromatography-Mass Spectrometry
  • Humans
  • Software*