Listeria monocytogenes can cause severe foodborne illness, including miscarriage during pregnancy or death in newborn infants. When outbreaks of L. monocytogenes illness occur, it may be possible to determine the food source of the outbreak. However, most reported L. monocytogenes illnesses are sporadic rather than part of a recognized outbreak, and the food source of sporadic illness usually cannot be determined. In the United States, L. monocytogenes isolates from patients, foods, and environments are routinely sequenced and analyzed by whole genome multilocus sequence typing (wgMLST) for outbreak detection by PulseNet, the national molecular surveillance system for foodborne illnesses. We investigated whether machine learning approaches applied to wgMLST allele call data could help attribute L. monocytogenes isolates to their food sources. Using the metadata of L. monocytogenes isolates in PulseNet, we compiled isolates with a known source from five food categories (dairy, fruit, meat, seafood, and vegetable), deduplicated closely related isolates, and developed random forest models to predict the food source of each isolate. Prediction accuracy of the final model varied across the food categories: it was highest for meat (65%), followed by fruit (45%), vegetable (45%), dairy (44%), and seafood (37%). Overall accuracy was 49%, compared with a naive prediction accuracy of 28%. Our results show that random forest models can capture genetically complex features of high-resolution wgMLST data to attribute isolates to their sources.
Keywords: cluster analysis; foodborne; machine learning; predictive model; whole genome sequence.
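
The abstract reports three kinds of accuracy: per food category, overall, and naive. The sketch below shows one way such figures could be computed from true and predicted labels. It assumes that "naive prediction" means always predicting the most frequent category, which the abstract does not define. The function names and the toy labels are hypothetical and are not data from the study.

```python
from collections import Counter

def per_category_accuracy(y_true, y_pred):
    """Fraction of isolates in each true category that were predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {category: correct[category] / totals[category] for category in totals}

def overall_accuracy(y_true, y_pred):
    """Fraction of all isolates whose predicted category matches the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def naive_accuracy(y_true):
    """Accuracy of always predicting the most frequent category.

    Assumption: this is only one plausible reading of "naive prediction".
    """
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Toy labels for illustration only (not study data).
y_true = ["meat", "meat", "meat", "dairy", "dairy", "fruit", "seafood", "vegetable"]
y_pred = ["meat", "meat", "dairy", "dairy", "meat", "fruit", "meat", "vegetable"]

print(per_category_accuracy(y_true, y_pred))
print(overall_accuracy(y_true, y_pred))
print(naive_accuracy(y_true))
```

Comparing overall accuracy against the naive baseline, as the abstract does with 49% versus 28%, indicates how much the model adds beyond simply guessing the largest category.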