Understanding bacterial population genetics and evolution is crucial for epidemic outbreak studies and pathogen surveillance. However, all epidemiological studies are limited by their sampling capacity: samples are usually biased or restricted by economic constraints, which can obscure the true population structure of a given bacterial species. Mathematical models and large-scale simulations can provide a quantitative analytical framework for assessing whether, and how well, limited sampling can recover the true population structure. In this article, we address the large-scale simulation of the genetic evolution of bacterial populations, using the Wright-Fisher model, in the presence of complex host contact networks. We present an efficient approach for such simulations, using MapReduce on top of Apache Spark and its GraphX API. We evaluate the relation between cluster computing power and simulation speedup, and include insights on how bacterial population diversity can be affected by mutation and recombination rates and by network topology.
Keywords: GraphX; graph-parallel computations; large-scale simulations; population genetics; Spark.