Motivation: Genomes of individual organisms sequenced with next-generation technologies have, until now, mostly been compared through a reference genome. This is challenging when the reference is distant, and it introduces bias towards the exact sequence present in the reference. Recent improvements in both sequencing read length and the efficiency of assembly algorithms have brought direct comparison of individual genomes by de novo assembly, rather than through a reference genome, within reach.
Results: Here, we develop and test an algorithm, named Magnolya, that uses a Poisson mixture model to estimate the copy number of contigs assembled from sequencing data. We combine this with co-assembly to allow de novo detection of copy number variation (CNV) between two individual genomes, without mapping reads to a reference genome. In co-assembly, multiple sequencing samples are combined, generating a single contig graph in which the traversal counts of the nodes and edges differ between the samples. In the resulting 'coloured' graph, the contigs have integer copy numbers; this removes the need to segment genomic regions based on depth of coverage, as required for mapping-based detection methods. Magnolya is then used to assign integer copy numbers to contigs, after which CNV probabilities are easily inferred. The copy number estimator and CNV detector perform well on simulated data. Application of the algorithms to hybrid yeast genomes showed allotriploid content of different origin in the wine yeast Y12, and extensive CNV in aneuploid brewing yeast genomes. Integer CNV was also accurately detected in a short-term laboratory-evolved yeast strain.
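To make the idea of a Poisson mixture over contig read counts concrete, the following is a minimal sketch and not the Magnolya implementation. It assumes that the read count of a contig with copy number k is Poisson distributed with mean k * lam * length, that copy numbers range from 1 to a fixed maximum, and that most contigs are single-copy so the median read rate is a usable starting value for lam. The function names, the toy counts, and the independence assumption used in `cnv_probability` are all hypothetical choices made for illustration.

```python
import math
from statistics import median


def log_poisson(x, mu):
    """Log of the Poisson probability mass function at count x with mean mu."""
    return x * math.log(mu) - mu - math.lgamma(x + 1)


def fit_copy_numbers(counts, lengths, max_copy=4, n_iter=50):
    """Fit a Poisson mixture by EM; component k has mean k * lam * length."""
    n = len(counts)
    ks = range(1, max_copy + 1)
    # Assumption: most contigs are single-copy, so the median rate seeds lam.
    lam = median(x / l for x, l in zip(counts, lengths))
    weights = [1.0 / max_copy] * max_copy
    eps = 1e-6  # small pseudocount that keeps every mixing weight positive
    post = []
    for _ in range(n_iter):
        # E-step: posterior probability of each copy number for each contig.
        post = []
        for x, l in zip(counts, lengths):
            logs = [math.log(w) + log_poisson(x, k * lam * l)
                    for k, w in zip(ks, weights)]
            top = max(logs)  # subtract the maximum for numerical stability
            unnorm = [math.exp(v - top) for v in logs]
            total = sum(unnorm)
            post.append([u / total for u in unnorm])
        # M-step: update the mixing weights and the per-copy, per-base rate.
        weights = [(sum(p[j] for p in post) + eps) / (n + max_copy * eps)
                   for j in range(max_copy)]
        expected = sum(p[j] * k * l
                       for p, l in zip(post, lengths)
                       for j, k in enumerate(ks))
        lam = sum(counts) / expected
    # Most probable integer copy number per contig.
    calls = [1 + max(range(max_copy), key=p.__getitem__) for p in post]
    return lam, calls, post


def cnv_probability(post_a, post_b):
    """P(copy numbers differ), assuming the two posteriors are independent."""
    return 1.0 - sum(a * b for a, b in zip(post_a, post_b))


# Toy data: read counts for eight contigs, each 1000 bases long.
counts = [100, 98, 103, 101, 99, 200, 205, 297]
lengths = [1000] * 8
lam, calls, post = fit_copy_numbers(counts, lengths)
print(lam, calls)
```

In this sketch, `cnv_probability` would be applied to the posteriors of the same contig in two samples of a co-assembly; here it is only defined on two posterior vectors. Because scaling lam down and all copy numbers up gives a similar fit, the result depends on the starting value of lam.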