BMC Genomics. 2019 Mar 12;20(1):209.
doi: 10.1186/s12864-019-5556-x.

Linear models enable powerful differential activity analysis in massively parallel reporter assays

Leslie Myint et al. BMC Genomics.

Abstract

Background: Massively parallel reporter assays (MPRAs) have emerged as a popular means for understanding noncoding variation in a variety of conditions. While a large number of experiments have been described in the literature, analysis typically uses ad-hoc methods. There has been little attention to comparing performance of methods across datasets.

Results: We present the mpralm method and show, by analyzing its performance on multiple MPRA datasets, that it is calibrated and powerful. In the first comprehensive evaluation of statistical methods on several datasets, we show that it outperforms existing statistical methods for this data type. We investigate theoretical and real-data properties of barcode summarization methods and show that the choice of summarization method has an unappreciated impact for some datasets. Finally, we use our model to conduct a power analysis for this assay and show substantial improvements in power from performing up to 6 replicates per condition, whereas sequencing depth has a smaller impact; we recommend always using at least 4 replicates. An R package is available from the Bioconductor project.

Conclusions: Together, these results inform recommendations for differential analysis, general group comparisons, and power analysis and will help improve design and analysis of MPRA experiments.

Keywords: Enhancer; Massively parallel reporter assays; Statistics.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Structure of MPRA data. Thousands of putative regulatory elements can be assayed at a time in an MPRA experiment. Each element is linked to multiple barcodes. A plasmid library containing these barcoded elements is transfected into several cell populations (samples). Cellular DNA and RNA can be isolated and sequenced. The barcodes associated with each putative regulatory element can be counted to obtain relative abundances of each element in DNA and RNA. The process of aggregation sums counts over barcodes for each element in each sample. Aggregation is one method for summarizing barcode-level data into element-level data
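The aggregation step described in this caption can be sketched in a few lines. This is a minimal illustration only, not the mpralm implementation: the function name `aggregate_counts`, the tuple layout, and the toy counts are all assumptions made for the example.

```python
from collections import defaultdict


def aggregate_counts(barcode_counts):
    """Sum barcode-level counts into element-level counts per sample.

    barcode_counts: iterable of (element_id, sample_id, count) tuples,
    one tuple per barcode (hypothetical layout chosen for this sketch).
    Returns a dict mapping (element_id, sample_id) to the summed count.
    """
    totals = defaultdict(int)
    for element, sample, count in barcode_counts:
        totals[(element, sample)] += count
    return dict(totals)


# Toy RNA counts: two elements, two samples, several barcodes each.
rna = [
    ("elem1", "s1", 10), ("elem1", "s1", 5), ("elem1", "s2", 7),
    ("elem2", "s1", 3), ("elem2", "s2", 0), ("elem2", "s2", 4),
]
rna_agg = aggregate_counts(rna)
```

The same function would be applied separately to the DNA counts, giving one RNA total and one DNA total per element and sample.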
Fig. 2
Variability of MPRA activity measures depends on element copy number. For multiple publicly available datasets we compute activity measures of putative regulatory elements as the log2 ratio of aggregated RNA counts over aggregated DNA counts. Each panel shows the relationship between variability (across samples) of these activity measures and the average log2 DNA levels (across samples). Smoothed relationships are lowess curves representing the local average variability. The last plot shows all lowess curves on the same figure
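For a single element, the two quantities plotted against each other here can be computed as below. This is a hedged sketch: the function name `activity_summary` is hypothetical, the standard deviation is only one possible measure of "variability", and all counts are assumed to be strictly positive so that no pseudocount is needed.

```python
import math
from statistics import mean, stdev


def activity_summary(rna_by_sample, dna_by_sample):
    """Return (mean log2 DNA, SD of log2(RNA/DNA)) across samples.

    Inputs are aggregated counts for one element, one value per sample.
    Assumes at least two samples and strictly positive counts.
    """
    activities = [math.log2(r / d) for r, d in zip(rna_by_sample, dna_by_sample)]
    mean_log_dna = mean(math.log2(d) for d in dna_by_sample)
    return mean_log_dna, stdev(activities)


# Toy element: activities are log2(2), log2(4), log2(8) = 1, 2, 3.
mean_log_dna, activity_sd = activity_summary([64, 128, 256], [32, 32, 32])
```

Plotting `activity_sd` against `mean_log_dna` over all elements, then smoothing, would give a curve of the kind the caption describes.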
Fig. 3
Comparison of detection rates and p-value calibration over datasets. (a) QQ-plots (row 1), and (b) density plots (rows 2 and 3) for p-values for all datasets, including a zoom of the [0,0.1] interval for some datasets (row 3). Over all datasets, most methods show p-values that closely follow the classic mixture of uniformly distributed p-values with an enrichment of low p-values for differential elements. For the datasets which had barcode level counts (Inoue, Ulirsch, and Shen), we used two types of estimators of the activity measure (log-ratio of RNA/DNA) with mpralm, shown in light and dark blue
Fig. 4
Empirical type I error rates. Type I error rates were estimated for all methods with simulated null data (Methods). For the datasets which had barcode level counts (Inoue, Ulirsch, and Shen), we used two types of estimators of the activity measure (aggregate and average estimator) with mpralm, shown in dark and light blue
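An empirical type I error rate of this kind is the fraction of null p-values that fall at or below the nominal level. The sketch below assumes the p-values from the simulated null data are already available as a list; the function name and the toy values are illustrative only, and the simulation itself (described in the paper's Methods) is not reproduced.

```python
def empirical_type1_error(null_pvalues, alpha=0.05):
    """Fraction of p-values from null data that are <= alpha."""
    if not null_pvalues:
        raise ValueError("need at least one null p-value")
    return sum(p <= alpha for p in null_pvalues) / len(null_pvalues)


# Toy null p-values: two of the ten fall at or below 0.05.
null_p = [0.01, 0.2, 0.5, 0.04, 0.9, 0.6, 0.3, 0.7, 0.8, 0.95]
rate = empirical_type1_error(null_p, alpha=0.05)
```

A calibrated method should give a rate close to `alpha`; a rate well above it indicates an anti-conservative test.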
Fig. 5
Number of rejections as a function of observed error rate. To compare the observed detection (rejection) rates of the methods fairly, we compare them at the same observed type I error rates, estimated in Fig. 4. The bottom two rows are zoomed-in versions of the top row. We see that mpralm, edgeR, and DESeq2 consistently have the highest detection rates
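One way to compare methods at the same observed error rate is to pick, for each method, the p-value cutoff that yields the target rejection fraction on its null p-values, and then count rejections on the real data at that cutoff. The sketch below illustrates that idea under stated assumptions: the function names are hypothetical, ties among null p-values are handled conservatively, and this is not claimed to be the exact procedure behind the figure.

```python
import math


def cutoff_for_error_rate(null_pvalues, target_rate):
    """P-value cutoff c such that rejecting p < c rejects at most
    floor(target_rate * n) of the n null p-values."""
    if not 0 <= target_rate < 1:
        raise ValueError("target_rate must be in [0, 1)")
    ordered = sorted(null_pvalues)
    # Small epsilon guards against float error in target_rate * n.
    allowed = math.floor(target_rate * len(ordered) + 1e-9)
    return ordered[allowed]


def rejections_at_matched_error(null_pvalues, observed_pvalues, target_rate):
    """Count observed p-values rejected at the error-matched cutoff."""
    cutoff = cutoff_for_error_rate(null_pvalues, target_rate)
    return sum(p < cutoff for p in observed_pvalues)


null_p = [0.02, 0.1, 0.3, 0.5, 0.7, 0.9, 0.15, 0.45, 0.6, 0.8]
observed_p = [0.001, 0.05, 0.12, 0.2, 0.6]
n_rejected = rejections_at_matched_error(null_p, observed_p, target_rate=0.2)
```

Here the cutoff is 0.15, which rejects exactly two of the ten null p-values, so three of the observed p-values are rejected.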
Fig. 6
Estimated FDR. For each dataset and method, the false discovery rate is estimated as a function of the number of rejections. This requires estimation of the proportion of true null hypotheses (Methods). The bottom row is a zoomed-in version of the top row
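A common way to estimate FDR as a function of the number of rejections k is pi0 * m * p_(k) / k, where p_(k) is the k-th smallest of m p-values and pi0 is the estimated proportion of true nulls. The sketch below uses a Storey-type pi0 estimate with a fixed threshold `lam`. Both choices are assumptions made for illustration; the estimator actually used is given in the paper's Methods and may differ.

```python
def estimate_pi0(pvalues, lam=0.5):
    """Storey-type estimate of the proportion of true null hypotheses."""
    m = len(pvalues)
    return min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))


def estimated_fdr_curve(pvalues, pi0):
    """Estimated FDR when rejecting the k smallest p-values, k = 1..m.

    Entry k-1 of the result is min(1, pi0 * m * p_(k) / k).
    """
    ordered = sorted(pvalues)
    m = len(ordered)
    return [min(1.0, pi0 * m * p / k) for k, p in enumerate(ordered, start=1)]


pvals = [0.001, 0.004, 0.01, 0.02, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9]
pi0_hat = estimate_pi0(pvals)           # 4 of 10 exceed 0.5, so 4 / (10 * 0.5)
fdr_curve = estimated_fdr_curve(pvals, pi0_hat)
```

The raw curve is not forced to be monotone in k; a q-value style running minimum could be applied on top if a monotone curve is wanted.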
Fig. 7
Distribution of quantities related to statistical inference in top ranked elements with mpralm and t-test. MPRA elements that appear in the top 200 elements with one method but not the other are examined here. For these uniquely top ranking elements, the DNA, RNA, and log-ratio percentiles are shown in the first three rows. The effect sizes (difference in mean log-ratios) and residual standard deviations are shown in the last two rows. Overall, uniquely top ranking elements for the t-test tend to have lower log-ratio activity measures, effect sizes, and residual standard deviations
Fig. 8
Distribution of quantities related to statistical inference in top ranked elements with mpralm and edgeR. Similar to Fig. 7
Fig. 9
Distribution of quantities related to statistical inference in top ranked elements with mpralm and DESeq2. Similar to Fig. 7
Fig. 10
Comparison of the average and aggregate estimators. For the three datasets containing barcode-level information, we compare the effect sizes (log fold changes in activity levels) resulting from use of the aggregate and average estimators. The y=x line is shown in red
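The difference between the two estimators can be illustrated for one element in one sample. In this sketch the aggregate estimator sums counts over barcodes before taking the log ratio, while the average estimator takes a log ratio per barcode and then averages. The pseudocount of 1, the function names, and the omission of any library-size normalization are assumptions for the example, not details confirmed by the captions.

```python
import math


def aggregate_estimator(rna, dna, pseudocount=1):
    """log2 ratio of summed RNA counts over summed DNA counts."""
    return math.log2((sum(rna) + pseudocount) / (sum(dna) + pseudocount))


def average_estimator(rna, dna, pseudocount=1):
    """Mean over barcodes of the barcode-level log2(RNA/DNA) ratios."""
    ratios = [
        math.log2((r + pseudocount) / (d + pseudocount))
        for r, d in zip(rna, dna)
    ]
    return sum(ratios) / len(ratios)


# Toy element with two barcodes in a single sample.
rna_counts = [7, 31]
dna_counts = [3, 3]
agg = aggregate_estimator(rna_counts, dna_counts)  # log2(39 / 7)
avg = average_estimator(rna_counts, dna_counts)    # (log2(2) + log2(8)) / 2
```

The two values differ here because the aggregate estimator is dominated by the high-count barcode, whereas the average estimator weights every barcode equally. An effect size would then be the difference of such activity measures between conditions.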
Fig. 11
Power analysis. Variance and power calculated based on our theoretical model. (a) Variance of the aggregate estimator depends on library size and, to a much lesser extent, on the true unknown activity level. (b)-(f) Power curves as a function of library size for different effect sizes and sample sizes. Effect sizes are log2 fold-changes
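To give a feel for how power responds to effect size and the number of replicates, the sketch below uses a generic two-sided, two-sample normal approximation. It is not the paper's theoretical model: the per-sample standard deviation `sigma` is supplied directly as an assumed input, whereas in the paper the variance of the estimator is derived from library size and activity level.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test.

    effect_size: true difference in mean log2 activity between groups.
    sigma: assumed per-sample SD of the activity measure (hypothetical input).
    n_per_group: number of replicates in each condition.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect_size / (sigma * sqrt(2 / n_per_group))
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Power for a log2 fold-change of 0.5 with 2, 4, and 6 replicates per group.
powers = [approx_power(0.5, sigma=0.3, n_per_group=n) for n in (2, 4, 6)]
```

Under this approximation power rises with the number of replicates for a fixed effect size, and it equals `alpha` when the effect size is zero.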
Fig. 12
Effect size distributions across datasets. Effect sizes in MPRA differential analysis are the (precision-weighted) differences in activity scores between groups, also called log2 fold-changes. The distribution of log2 fold-changes resulting from using mpralm with the aggregate estimator is shown here
