Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2017 Sep 15;18(1):411.
doi: 10.1186/s12859-017-1816-4.

mmquant: how to count multi-mapping reads?

Affiliations

mmquant: how to count multi-mapping reads?

Matthias Zytnicki. BMC Bioinformatics. .

Abstract

Background: RNA-Seq is currently used routinely, and it provides accurate information on gene transcription. However, the method cannot accurately estimate duplicated genes expression. Several strategies have been previously used (drop duplicated genes, distribute uniformly the reads, or estimate expression), but all of them provide biased results.

Results: We provide here a tool, called mmquant, for computing gene expression, included duplicated genes. If a read maps at different positions, the tool detects that the corresponding genes are duplicated; it merges the genes and creates a merged gene. The counts of ambiguous reads is then based on the input genes and the merged genes.

Conclusion: mmquant is a drop-in replacement of the widely used tools htseq-count and featureCounts that handles multi-mapping reads in an unabiased way.

Keywords: Multi-mapping reads; Quantification; RNA-Seq.

PubMed Disclaimer

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The author declares that he has no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Fig. 1
Possible bioinformatics pipe-line for RNA-Seq differential expression analysis. The analysis starts with several FASTQ files, produced by the sequencing of several replicates of two conditions (e.g. wild type vs mutant). When a genome is available, the reads are mapped, e.g. with STAR [15], and the corresponding positions are stored into a BAM file. A quantification tool, such as mmquant, presented here, counts the number of reads per gene. Statistical test for differential expression is performed by a third tool, like DESeq2 [6]
Fig. 2
Fig. 2
Overview of the method, on an example. a: A toy configuration, with three genes in black: A, B, and C. Notice that A and B overlap. Six reads have been mapped to the genome, some of them (reads 2, 4, and 6) map at two different locations. If a read maps unambiguously to a unique locus and matches a unique gene (like read 1), we attribute the corresponding gene to the read (here, gene A). If a read, like read 2, matches two different genes, we create a “merged” gene, here A–C, and attribute this merged gene to the read. If a read (read 3) does not match any gene, it is not used. If a read (read 4) matches a gene (gene A) and an intergenic region, the read is attributed to the gene only. If a read (read 5) matches two different genes because the genes overlap, the read is also attributed to the merged gene (gene A–B). Similarly, if a read (read 6) matches two overlapping genes and an other gene, the three genes are merged (A–B–C). Table b provides the attributed gene for each read. c is the quantification table and the output of the tool for this configuration. Genes are sorted in lexicographical order, as shown in the example. As a consequence, merged gene (A, B) will always be displayed as A–B, and never as B–A
Fig. 3
Fig. 3
Ambiguous reads resolution. a: The read is included in both genes A and B. Here, the resolution cannot be solved, and the read will be attributed to gene A–B. b: The read is not totally included in gene A, neither in gene B. n A nucleotides of the read overlap with gene A, and n B overlap with gene B, and n A>n B. If n An B, we may attribute the read to gene A only. However, if n An B, the ambiguity cannot be resolved, and the read is attributed to A–B. The two following cases show the rules to resolve ambiguity. c: We suppose here that n A>n B+N, where N is a parameter set by the user (default: 30). In this case, mmquant will attribute the read to gene A only. d: We suppose here that n A>n B×P, where P is given by the user (default: 2). The read will be attributed uniquely to gene A. e: Here, the single end read contains an intron. Exon-wise, the read can be attributed to gene A or B. In case of ambiguity, introns are compared. The intron of the read matches the intron of gene A, whereas gene B has no intron there. The read is thus attributed to A. f: The read is ambiguous exon-wise. We compute nA′ and nB′, the number of nucleotides shared by the intron of the read and the introns of genes A and B respectively. Ambiguity is solved using nA′, nB′ and the rules given in c and d. g: The read is paired-end. In case of ambiguity, n A and n B are computed as the sums of the overlapping bases between the two reads and the gene A and B respectively. The rules presented in c and d apply next

Similar articles

Cited by

References

    1. Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bionformatics. 2014;30(7):923–30. doi: 10.1093/bioinformatics/btt656. - DOI - PubMed
    1. Robert C, Watson M. Errors in RNA-Seq quantification affect genes of relevance to human disease. Genome Biol. 2015;16:177. doi: 10.1186/s13059-015-0734-x. - DOI - PMC - PubMed
    1. Anders S, Pyl PT, Huber W. HTSeq–a Python framework to work with high-throughput sequencing data. Bionformatics. 2015;31(2):166–9. doi: 10.1093/bioinformatics/btu638. - DOI - PMC - PubMed
    1. Akula N, Barb J, Jiang X, Wendland JR, Choi KH, Sen SK, Hou L, Chen DTW, Laje G, Johnson K, Lipska BK, Kleinman JE, Corrada-Bravo H, Detera-Wadleigh S, Munson PJ, McMahon FJ. RNA-sequencing of the brain transcriptome implicates dysregulation of neuroplasticity, circadian rhythms and GTPase binding in bipolar disorder. Mol Psychiatry. 2014;19(11):1179–85. doi: 10.1038/mp.2013.170. - DOI - PMC - PubMed
    1. Geschwind DH, Konopka G. Neuroscience: Genes and human brain evolution. Nature. 2012;486:481–2. doi: 10.1038/nature11380. - DOI - PMC - PubMed