Genetics. 2014 Sep;198(1):59-73.
doi: 10.1534/genetics.114.165886.

RNA-Seq Alignment to Individualized Genomes Improves Transcript Abundance Estimates in Multiparent Populations

Free PMC article

Steven C Munger et al. Genetics. 2014.

Abstract

Massively parallel RNA sequencing (RNA-seq) has yielded a wealth of new insights into transcriptional regulation. A first step in the analysis of RNA-seq data is the alignment of short sequence reads to a common reference genome or transcriptome. Genetic variants that distinguish individual genomes from the reference sequence can cause reads to be misaligned, resulting in biased estimates of transcript abundance. Fine-tuning of read alignment algorithms does not correct this problem. We have developed Seqnature software to construct individualized diploid genomes and transcriptomes for multiparent populations and have implemented a complete analysis pipeline that incorporates other existing software tools. We demonstrate in simulated and real data sets that alignment to individualized transcriptomes increases read mapping accuracy, improves estimation of transcript abundance, and enables the direct estimation of allele-specific expression. Moreover, when applied to expression QTL mapping we find that our individualized alignment strategy corrects false-positive linkage signals and unmasks hidden associations. We recommend the use of individualized diploid genomes over reference sequence alignment for all applications of high-throughput sequencing technology in genetically diverse populations.

Keywords: Diversity Outbred (DO); Diversity Outbred mice; MPP; Multiparent Advanced Generation Inter-Cross (MAGIC); Multiparental populations; QTL mapping; RNA-seq; expression QTL; haplotype reconstruction; high-density genotyping; mixed models.

Figures

Figure 1
Flowchart showing the RNA-seq analysis pipeline and Seqnature tool. A Diversity Outbred mouse sample is shown as an example. Genomic DNA is genotyped at 7664 SNPs, which are then input into a hidden Markov model to impute 36-state founder strain genotypes. Seqnature (highlighted in blue) infers genotype transitions by calculating the smallest number of recombinations necessary to produce the observed 36-state patterns and outputs two 8-state genotype transition files. Seqnature constructs two haploid genomes by incorporating founder strain SNPs and indels into the reference genome according to the genotype transition files and creates two gene annotation files with adjusted coordinates (to offset insertions and deletions) and founder strain appended to feature identifiers. The two genomes and annotation files are merged, and then individualized diploid isoform sequences (individualized transcriptome) are constructed and indexed. Sample RNA-seq data are aligned with Bowtie to the individualized transcriptome, and allele-, isoform-, and gene-level abundances are estimated using an EM algorithm (RSEM) to resolve multimapped reads.
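The coordinate adjustment mentioned in the Figure 1 caption (offsetting annotations after insertions and deletions are written into the reference) can be illustrated with a minimal Python sketch. This is not Seqnature's code. The function names `apply_variants` and `lift`, the VCF-like `(position, ref, alt)` tuples, and the toy sequence are assumptions made for illustration, and the variant list is assumed to have been selected already for the founder haplotype that applies to the region.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical variant record: (0-based reference position, ref allele, alt allele).
Variant = Tuple[int, str, str]
Offsets = List[Tuple[int, int]]


def apply_variants(reference: str, variants: List[Variant]) -> Tuple[str, Offsets]:
    """Write SNPs and indels into a reference sequence.

    Returns the modified sequence and a list of (reference_start, shift)
    breakpoints: from reference_start onward, coordinates move by `shift`.
    """
    pieces = []
    offsets: Offsets = [(0, 0)]
    cursor = 0  # next reference position not yet copied
    shift = 0   # cumulative length change so far
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants are not handled in this sketch")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"ref allele mismatch at position {pos}")
        pieces.append(reference[cursor:pos])
        pieces.append(alt)
        cursor = pos + len(ref)
        if len(alt) != len(ref):  # only indels change downstream coordinates
            shift += len(alt) - len(ref)
            offsets.append((cursor, shift))
    pieces.append(reference[cursor:])
    return "".join(pieces), offsets


def lift(pos: int, offsets: Offsets) -> int:
    """Map a reference coordinate to the modified sequence.

    Positions inside a deleted span have no true counterpart; this sketch
    does not treat them specially.
    """
    starts = [start for start, _ in offsets]
    index = bisect_right(starts, pos) - 1
    return pos + offsets[index][1]


# Toy example: one SNP, one 2-bp insertion, one 2-bp deletion.
sequence, offsets = apply_variants(
    "ACGTACGTAC",
    [(1, "C", "T"), (3, "T", "TGG"), (6, "GTA", "G")],
)
print(sequence)          # ATGTGGACGC
print(lift(5, offsets))  # 7
```

Running the same routine once per haplotype, each with its own variant list, would give the two haploid sequences and two coordinate maps that the caption describes being merged into a diploid reference.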
Figure 2
Read alignment to an individualized diploid transcriptome yields accurate allelic abundance estimates. Estimated allele frequency (y-axis) is plotted against the ground-truth allele frequency (x-axis) for the 5270 genes that, in the simulated data set of 10 million DO reads, were robustly expressed (sum of allele counts ≥100) and had at least 5 uniquely aligned reads that differentiated the two gene alleles. Allele-level gene abundances are strongly correlated with the ground-truth values (r = 0.82), with the estimated frequency of the lower-expressed allele differing on average by <7% (median = 4%) from the ground-truth value. Most genes have ground-truth and estimated allele frequencies near 0.5 (red and orange regions), and some estimates show absolute allele-specific expression (i.e., 0 or 1) while the ground truth is somewhere in between (horizontal lines of dots at top and bottom).
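The quantities in the Figure 2 caption can be written out as a short Python sketch. It assumes that "allele frequency" means one allele's count divided by the sum of both allele counts, and that the expression filter is applied per gene. The function names and the toy numbers are hypothetical and are not taken from the published pipeline.

```python
from math import sqrt
from typing import Sequence


def allele_frequency(count_a: float, count_b: float) -> float:
    """Fraction of a gene's allele-level counts assigned to allele A."""
    total = count_a + count_b
    if total == 0:
        raise ValueError("gene has no allele-level counts")
    return count_a / total


def passes_filter(count_a: float, count_b: float, unique_informative_reads: int,
                  min_total: float = 100, min_unique: int = 5) -> bool:
    """Thresholds quoted in the caption: total count >= 100 and at least
    5 uniquely aligned reads that distinguish the two alleles."""
    return (count_a + count_b) >= min_total and unique_informative_reads >= min_unique


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between ground-truth and estimated frequencies."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


# Toy data: (truth_a, truth_b, est_a, est_b, unique_informative_reads) per gene.
genes = [(50, 50, 48, 52, 12), (80, 20, 70, 30, 9), (30, 70, 35, 65, 6), (40, 40, 41, 39, 20)]
kept = [g for g in genes if passes_filter(g[2], g[3], g[4])]
truth = [allele_frequency(g[0], g[1]) for g in kept]
estimate = [allele_frequency(g[2], g[3]) for g in kept]
print(len(kept), round(pearson_r(truth, estimate), 3))
```

The fourth toy gene is dropped because its estimated counts sum to 80, below the threshold of 100.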
Figure 3
Gene-level abundance estimates in real data are improved by the individualized alignment strategy. (A) Gene-level abundance estimates are plotted for one CAST sample after alignment to the NCBIM37 (x-axis) and CAST transcriptomes (y-axis). Points are colored based on the difference between alignments and the results of the simulation study (n = 11,964 total genes). Gray circles denote genes with abundance estimates that differ by <10% between alignment strategies (n = 8980). Green denotes genes that differ in the real data by >10% between alignment strategies and for which the alignment to CAST improved the abundance estimate in the simulation study (n = 2242). Red denotes genes that differ by >10% in the real data and for which alignment to NCBIM37 improved the abundance estimate in the simulation study (n = 439). Black denotes genes that differ by >10% in the real data but for which the two alignment strategies yielded the same abundance estimates in the simulation study (n = 71). (B) The differences in gene-level abundance estimates between alignment strategies in the real CAST data are plotted as a stacked histogram. The percentage of difference between CAST and NCBIM37 alignments is plotted on the x-axis, and the total number of genes with that difference is plotted on the y-axis. The same coloring conventions are used as in A. White bars denote genes that differ by >10% in the real data but that were not expressed above threshold in the simulated data set (n = 232). Differences were scaled to a maximum value of 100%. (C) Gene-level abundance estimates are plotted for one DO sample after read alignment to the NCBIM37 (x-axis) and individualized transcriptomes (y-axis). 
A total of 714 genes in the real data differ by >10% between alignment strategies (n = 714/12,248), of which 432 gene estimates were improved by alignment to the individualized transcriptome in the simulation study (green circles), 124 were improved by alignment to NCBIM37 in the simulation (red circles), and 16 yielded the same gene estimate by both alignment strategies in the simulation study (black circles). (D) The differences in gene-level abundance estimates between alignment strategies in the real DO data are plotted as a stacked histogram. The percentage of difference between DO and NCBIM37 alignments is plotted on the x-axis, and the total number of genes with that difference is plotted on the y-axis. The same coloring conventions are used as in C. White bars denote genes that differ by >10% in the real data but that were not expressed above threshold in the simulation study (n = 142).
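The color coding used in Figure 3 amounts to a small decision rule, sketched here in Python. The caption does not give the exact formula for the percentage difference, so dividing by the larger of the two estimates (which bounds the value at 100%) is an assumption. So is the rule that a strategy "improved" an estimate in the simulation when it landed closer to the simulated truth. All function names are hypothetical.

```python
from typing import Optional


def percent_difference(reference_est: float, individualized_est: float) -> float:
    """Signed difference between strategies, as a percentage of the larger
    estimate (assumed definition; bounded between -100 and 100)."""
    larger = max(reference_est, individualized_est)
    if larger == 0:
        return 0.0
    return 100.0 * (individualized_est - reference_est) / larger


def simulation_outcome(truth: float, sim_reference: float, sim_individualized: float) -> str:
    """Which strategy came closer to the simulated ground truth."""
    error_reference = abs(sim_reference - truth)
    error_individualized = abs(sim_individualized - truth)
    if error_individualized < error_reference:
        return "individualized"
    if error_reference < error_individualized:
        return "reference"
    return "same"


def color(reference_est: float, individualized_est: float,
          outcome: Optional[str], threshold: float = 10.0) -> str:
    """Assign the plotting color described in the caption.

    `outcome` is None for genes not expressed above threshold in the
    simulation. Genes differing by exactly the threshold are treated as
    gray here; the caption does not specify that boundary case.
    """
    if abs(percent_difference(reference_est, individualized_est)) <= threshold:
        return "gray"
    if outcome is None:
        return "white"
    return {"individualized": "green", "reference": "red", "same": "black"}[outcome]


# Toy usage with made-up abundance values.
outcome = simulation_outcome(truth=10.0, sim_reference=8.0, sim_individualized=10.0)
print(color(100.0, 200.0, outcome))  # green
print(color(100.0, 105.0, outcome))  # gray
```

Under these assumptions, the per-color gene counts reported in the caption would correspond to tallying the return value of `color` over all expressed genes.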
Figure 4
Alignment of Diversity Outbred mice to individualized transcriptomes (DO IRGs) reveals significant local eQTL and reduces the number of spurious pseudogene eQTL. (A) An example of a local eQTL unmasked by alignment to individualized transcriptomes. Expression estimates for Hebp1 do not appear linked to local genotype when reads are aligned to the common reference (red line). Accounting for individual genetic variation in the alignment step uncovers a strong local eQTL with a peak centered at the gene (blue line; black arrow denotes gene location). (B) Venn diagram showing the overlap of local eQTL from the individualized or common reference alignment strategy. Local eQTL are identified for a majority of expressed genes by one or both alignment strategies. Alignment to individualized transcriptomes (DO IRGs) identifies 2900 novel local associations. Even in the case of the 6097 local eQTL that are identified as significant by both alignments (overlapping region), LOD significance scores are generally higher after alignment to individualized transcriptomes (y-axis in scatterplot) compared to NCBIM37 (x-axis). (C) Alignment to individualized transcriptomes reduces the number of spurious distant eQTL at pseudogenes. Accounting for segregating founder strain polymorphisms in the parent protein-coding gene Rps12 ablates the distant Chr 10 eQTL peak for the pseudogene Rps12-ps2 (compare blue to red lines) located on Chr 14.
Figure 5
Liver expression patterns observed in the DO founder strains suggest that novel local eQTL are real. (A) Alignment to individualized transcriptomes (DO IRGs, blue line) reveals a strong local eQTL for the lincRNA Gm12976 on Chr 4. The eight founder strain coefficients inferred from the additive mapping model are plotted in the inset and show that DO animals that derive this region of Chr 4 from the 129S1/SvImJ strain have higher expression of Gm12976. (B) Allele-level abundance estimates in the DO population show that the 129S1 allele of Gm12976 is high expressing, confirming that the local eQTL is due to cis-acting variation. Founder strain origin is listed on the x-axis, and Gm12976 allelic abundance (upper quartile normalized, square-root transformed) is plotted on the y-axis. (C) This inferred DO strain pattern of Gm12976 expression is concordant with that observed in the eight founder strains. Strains are listed on the x-axis, and Gm12976 abundance (upper quartile normalized, square-root transformed) is plotted on the y-axis.
