Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2020 Jun 29:9:e54603.
doi: 10.7554/eLife.54603.

Coverage-dependent bias creates the appearance of binary splicing in single cells

Affiliations

Coverage-dependent bias creates the appearance of binary splicing in single cells

Carlos F Buen Abad Najar et al. Elife. .

Abstract

Single-cell RNA sequencing provides powerful insight into the factors that determine each cell's unique identity. Previous studies led to the surprising observation that alternative splicing among single cells is highly variable and follows a bimodal pattern: a given cell consistently produces either one or the other isoform for a particular splicing choice, with few cells producing both isoforms. Here, we show that this pattern arises almost entirely from technical limitations. We analyze alternative splicing in human and mouse single-cell RNA-seq datasets, and model them with a probabilistic simulator. Our simulations show that low gene expression and low capture efficiency distort the observed distribution of isoforms. This gives the appearance of binary splicing outcomes, even when the underlying reality is consistent with more than one isoform per cell. We show that accounting for the true amount of information recovered can produce biologically meaningful measurements of splicing in single cells.

Keywords: alternative splicing; chromosomes; computational biology; gene expression; human; mouse; scRNA-seq; single cell; systems biology.

PubMed Disclaimer

Conflict of interest statement

CB, NY, LL No competing interests declared

Figures

Figure 1.
Figure 1.. Bimodal vs unimodal models of cassette exon splicing.
In the bimodal model, some cells consistently splice in the exon, while others consistently skip it. After mRNA capture and sequencing, observations of Ψ are almost exclusively binary. In the unimodal model, individual cells express some mRNAs that splice in the cassette exon and some that skip it. Low mRNA capture dramatically reduces the number of cells in which both isoforms are observed, artificially inflating binary Ψ values.
Figure 2.
Figure 2.. Splice junction read coverage is correlated with unimodality of splicing distributions.
(a) Comparison of splice junction read coverage and observed Ψ^ for three cassette exons in the Chen dataset, with low (Cadm1 exon 8; chr9: 47829377–47829409), medium (Thyn1 exon 6, chr9: 27006801–27006951), and high coverage (Rbm39 exon 3; chr2: 156178880–156178952). Each dot represents the Ψ^ of that exon in one cell. (b) Ψ^ distribution of the 300 highest coverage cassette exons with intermediate splicing (average Ψ^ between 0.2 and 0.8) in each of the six analyzed datasets. Each row in the heatmap shows the distribution of Ψ^ for one exon across all cells. (c) Relationship between the average read coverage and proportion of binary observations for each cassette exon in the Chen dataset. (d) Correlation between the total number of splice junction reads captured in each cell, and proportion of cassette exons with intermediate splicing that show binary Ψ^ in that cell.
Figure 2—figure supplement 1.
Figure 2—figure supplement 1.. Junction read coverage determines the proportion of binary observations in all analyzed datasets.
(a) As in Figure 2c, the splice junction read coverage of an intermediate exon was anti-correlated with the proportion of cells in which it shows binary Ψ^ (only one isoform is observed) in all datasets. Barplot: Pearson correlation between read coverage and proportion of binary observations of each cassette exon. (b) As in Figure 2d, the total number of splice junction reads in a cell was inversely proportional to the fraction of exons that have binary values in the cell in all datasets. Barplot: Pearson correlation between the total number of captured reads in each cell and the proportion of cassette exons with binary Ψ^. (c) Relationship between transcriptional burst frequency and size, and the proportion of binary observations of an exon. (d) R2 of the linear regression prediction of the proportion of binary observations per exon based on burst frequency, size, number of reads, and a combination of these features.
Figure 3.
Figure 3.. Simulations show that gene expression and capture efficiency influence the observed distribution of splicing.
(a) Simulations of alternative splicing and scRNA-seq under the binary-bimodal model, in which each cell produces one isoform or the other, but rarely both. As in Figure 2b, each row of the histogram shows Ψ^ for one intermediate exon across all cells. The observed Ψ^ distribution is similar to the true Ψ distribution, and its shape is largely unaffected by capture efficiency. (b) Simulations with the non-binary, unimodal model, in which most cells present a mixture of the two alternative isoforms. Exons with high expression have a unimodal distribution of true Ψ. Low capture efficiency results in an increase in binary observations (only one isoform observed), leading to a distortion of the observed distribution of Ψ^ to look bimodal. Only a handful of the highest expressed exons maintain a unimodal distribution of Ψ^. Fewer exons show unimodal splicing as the capture efficiency is reduced. (c) Under the binary-bimodal model, exons with high coverage have slightly fewer binary Ψ observations, and (d) simulated cells with a high number of total splice junction reads have slightly fewer exons with binary Ψ^. (e) Under the unimodal model, exons with intermediate splicing show a strong decrease in binary observations as coverage increases, as seen in real data (Figure 2c). (f) Similarly, simulated cells with high read coverage have a decrease of the proportion of binary Ψ^. (g) Effect of capture efficiency on the proportion of binary observations of cassette exons with underlying Ψ = 0.5. (h) Effect of the initial number of mRNA molecules and underlying Ψ on the proportion of binary Ψ^ observations.
Figure 3—figure supplement 1.
Figure 3—figure supplement 1.. Theoretical calculations and simulations of the effect of biological and technical factors in splicing observations.
(a) Theoretical likelihood of capturing only mRNAs representing one isoform of an alternatively spliced gene in a single cell, determined by the total number of mRNAs in the cell, the Ψ of the isoforms, and a 10% capture efficiency. (b) Probability of observing a binary Ψ^ (only one isoform observed) given the number of mRNA molecules of that gene present in the cell, when the underlying Ψ is fixed at 0.5. (c) Effect of the number of captured mRNA molecules on the uncertainty of Ψ^ observations. Ψ^ is understood as the maximum likelihood estimate of the underlying Ψ. (d) The posterior probability of having an observed Ψ^ within 0.1 of the underlying Ψ. (e) Distribution of the relative error between the average observed Ψ^, and the average true Ψ, in our simulations under different capture efficiency rates. The relative error is calculated as the difference between the average Ψ^ and the average true Ψ, scaled by the average true Ψ. (f) Distribution of the relative error between the variance in the observed Ψ^, and the variance in the true Ψ, in our simulations under different capture efficiency rates.
Figure 3—figure supplement 2.
Figure 3—figure supplement 2.. Schematic of scRNA-seq splicing simulator.
Simulator of scRNA-seq splicing data. Green elements represent biological variables; blue elements represent technical processes. The underlying Ψ is drawn from a Beta distribution with either a bimodal or unimodal shape. Individual gene expression determines the total number of mRNAs, and these mRNAs are spliced stochastically according to Ψ, producing isoforms that splice in or skip the exon. mRNAs are captured with a probability drawn from a normal distribution. Sequencing produces splice junction reads, which determine the observed Ψ^.
Figure 4.
Figure 4.. Accounting for coverage biases reveals unimodal splicing distributions and differential splicing.
(a) Estimated total mRNA molecules captured per cell. (b) Estimated number of recovered mRNAs vs splice junction reads for cassette exons, averaged across cells. Each dot corresponds to an exon, and its color indicates the proportion of cells in which it has a binary observation (only one isoform observed). We analyzed exons with average Ψ^ between 0.05 and 0.95. (c) Per-cell splice junction coverage rate in each dataset. (d) Cadm1 exon 8 alternative splicing appears binary in many cells in the Chen dataset. Correlation with lineage pseudotime: Spearman’s rs = 0.1. (e) Cadm1 exon after removing cells with fewer than 10 recovered Cadm1 mRNA molecules and fewer splice junction reads than expected from 10 mRNAs (grey). Spearman’s rs = 0.52. (f) PCA projection and clustering of single cells in the Chen dataset, showing differentiation of mouse ES cells into neurons. Red line, lineage inferred with Slingshot. (g) Number of cassette exons with observations from at least 10 mRNA reads in at least 50% of cells in any cluster. (h) Stacked histograms showing the distribution of observed Ψ^ of exons as in (g), in each cell cluster of the Chen dataset. Observations with fewer than 10 mRNA molecules were removed. We show exons with average Ψ^ ranging from 0.1 to 0.9 per cluster. (i) QQ-plot comparing the quantiles of a uniform distribution (x-axis) with the quantiles of the distributions of p-values from the Kruskal-Wallis test (y-axis). A diagonal line (gray dotted line) would mean the p-values are uniformly distributed. A lower area under the curve indicates enrichment for low p-values. The point on the x-axis at which each line crosses the dotted red line indicates the proportion of p-values that are below 0.05 in the distribution. (j) Fold enrichment of exons with a Kruskal-Wallis p < x in the set of exons selected with the mRNA-based filter (blue), and exons selected with a flat read minimum filter (red). (k) Significance p-value of the enrichment, estimated with the hypergeometric test and adjusted for FDR. (l) Example exons that pass the overall filter criteria in the Chen dataset and have p<0.05 in the Kruskal-Wallis test.
Figure 4—figure supplement 1.
Figure 4—figure supplement 1.. Relationship between read coverage, captured mRNAs, and binarity in single cell datasets.
(a) Average number of splice junction reads required to have a 50% likelihood of observing both isoforms, from simulations in Figure 3j. (b) Total mRNAs required to have a 50% likelihood of observing both isoforms, from simulations in Figure 3k. (c) Total mapped reads per cell for each dataset. (d) Estimated average number of recovered mRNAs vs. average number of splice junction reads for each exon. (e) Expression of pluripotency and neuron differentiation marker genes in the Chen dataset; expression is scaled by each row’s maximum and minimum. Cells are ordered by inferred pseudotime. Top color key, clusters from agglomerative clustering; bottom color key, labels from Chen et al. The red box highlights two sub-groups of neurons identified by clustering. (f) Visualization of the distribution of Ψ^ observations toward extreme values in the first cluster of cells of each dataset. For each x cutoff in the x-axis, the y-axis corresponds to the proportion of cells that have Ψ^x or Ψ^(1-x). Blue lines, exons passing the mRNA-based filter; red lines, exons removed by filter. (g) Stacked histograms, as in Figure 4h, for other datasets. The Fletcher dataset is excluded because only five exons pass the mRNA filter. (h) Stacked histograms of exons that do not pass the filter in the Chen dataset; exons were subsampled to match the number that do pass the filter in each cell type cluster. (i) Sex-dependent distribution of Ψ^ of the two bimodal exons of the Chen dataset after cell-type stratification: Smarcad1 exon 3 (chr6: 65043836–65044108) in early Epi cells, Nsfl1c exon 4 (chr2: 151502455–151502460) in late Epi cells.
Figure 4—figure supplement 2.
Figure 4—figure supplement 2.. Analysis of differential splicing among selected exons with the Kruskal-Wallis analysis of variance.
(a) Comparison of the precision of selecting exons with low p-values, for the mRNA-based filter versus selecting exons on a flat minimum of 10 reads. (b) Recall. (c) F1-score. (d) Specificity. (e) Accuracy. (f) Number of exons with px in each dataset.
Figure 4—figure supplement 3.
Figure 4—figure supplement 3.. Analysis of differential splicing among selected exons with the autocorrelation test.
(a) Number of exons with p x in each dataset. (b) QQ-plot comparing the quantiles of a uniform distribution, with the quantiles of the distributions of p-values from the autocorrelation test. (c) Fold enrichment of exons with an autocorrelation p < x in the set of exons selected with the mRNA-based filter (blue), and exons selected with a flat read minimum filter (red). (d) Significance p-value of the enrichment shown in (c). For each p-value limit x, we estimate the significance with the hypergeometric test, and correct for multiple testing using the Benjamini-Hochberg FDR adjustment. (e) F1-score. (f) Specificity. (g) Accuracy.
Figure 4—figure supplement 4.
Figure 4—figure supplement 4.. Distribution of Ψ^ of example exons.
(a) Skipped Cadm1 exon 8 (chr9: 47829377–47829409). We show the relationship between pseudotime and observed Ψ^; as well as violinplots of the distribution of Ψ^ in each cluster before filtering, and after filtering. (b) Violinplots of the distribution of Ψ^ in each cluster after filtering for the exons in Figure 4k: Nsfl1c exon 4 (chr2: 151502455–151502460), Rpn2 exon 16 (chr2: 157323223–157323270), Tecr exon 4 (chr8: 83573411–83573455), Zfp207 exon 8 (chr11: 80393084–80393176).

Similar articles

Cited by

References

    1. Arzalluz-Luque Á, Conesa A. Single-cell RNAseq for the study of isoforms—how is that possible? Genome Biology. 2018;19:110. doi: 10.1186/s13059-018-1496-z. - DOI - PMC - PubMed
    1. Barash Y, Blencowe BJ, Frey BJ. Model-based detection of alternative splicing signals. Bioinformatics. 2010;26:i325–i333. doi: 10.1093/bioinformatics/btq200. - DOI - PMC - PubMed
    1. Buen Abad Najar CF, Lareau L. sc_binary_splicing. 5c2c825GitHub. 2020 https://github.com/lareaulab/sc_binary_splicing
    1. Busskamp V, Lewis NE, Guye P, Ng AH, Shipman SL, Byrne SM, Sanjana NE, Murn J, Li Y, Li S, Stadler M, Weiss R, Church GM. Rapid neurogenesis through transcriptional activation in human stem cells. Molecular Systems Biology. 2014;10:760. doi: 10.15252/msb.20145508. - DOI - PMC - PubMed
    1. Chen G, Schell JP, Benitez JA, Petropoulos S, Yilmaz M, Reinius B, Alekseenko Z, Shi L, Hedlund E, Lanner F, Sandberg R, Deng Q. Single-cell analyses of X chromosome inactivation dynamics and pluripotency during differentiation. Genome Research. 2016;26:1342–1354. doi: 10.1101/gr.201954.115. - DOI - PMC - PubMed

Publication types

Associated data