BMC Bioinformatics. 2006 May 19;7:261.
doi: 10.1186/1471-2105-7-261.

Empirical Array Quality Weights in the Analysis of Microarray Data


Matthew E Ritchie et al. BMC Bioinformatics. 2006.

Abstract

Background: Assessment of array quality is an essential step in the analysis of data from microarray experiments. Once detected, less reliable arrays are typically excluded or "filtered" from further analysis to avoid misleading results.

Results: In this article, a graduated approach to array quality is considered based on empirical reproducibility of the gene expression measures from replicate arrays. Weights are assigned to each microarray by fitting a heteroscedastic linear model with shared array variance terms. A novel gene-by-gene update algorithm is used to efficiently estimate the array variances. The inverse variances are used as weights in the linear model analysis to identify differentially expressed genes. The method successfully assigns lower weights to less reproducible arrays from different experiments. Down-weighting the observations from suspect arrays increases the power to detect differential expression. In smaller experiments, this approach outperforms the usual method of filtering the data. The method is available in the limma software package which is implemented in the R software environment.
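
As a concrete illustration of this pipeline, the sketch below (R, using limma) estimates the empirical array weights and passes them to a weighted linear model fit. The data object MA and the matrix design are placeholders for a normalised expression data set and its design matrix, not objects defined in this article.

    ## Minimal sketch: empirical array quality weights in limma.
    ## `MA` is a normalised expression data object and `design` its design
    ## matrix (both assumed; substitute your own data).
    library(limma)
    aw  <- arrayWeights(MA, design)         # one weight per array (estimated inverse variance)
    fit <- lmFit(MA, design, weights = aw)  # weighted least-squares fit, gene by gene
    fit <- eBayes(fit)                      # moderated t-statistics
    topTable(fit, coef = 2)                 # top-ranked genes for the second coefficient

In contrast to filtering, a down-weighted array still contributes some information to the fit rather than being discarded entirely.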

Conclusion: This method complements existing normalisation and spot quality procedures, and allows poorer quality arrays, which would otherwise be discarded, to be included in an analysis. It is applicable to microarray data from experiments with some level of replication.

Figures

Figure 1
Design of the METH experiment. The METH experiment compared 3 mRNA sources of interest (0 mM, 1 mM and 3 mM) directly on the first 4 arrays and indirectly via a Panel reference on a further 6 arrays. The arrays are numbered from 1 to 10 in the order they were hybridised. Each arrow indicates a direct comparison made within an array, and points from the Cy3 labelled sample towards the Cy5 labelled sample.
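
A design matrix for a mixed direct/reference design of this kind can be built from the dye assignments of each array. The sketch below uses limma's modelMatrix with hypothetical Cy3/Cy5 assignments; the true pairings are those indicated by the arrows in the figure, so treat this purely as an illustration of the mechanics.

    ## Sketch: design matrix for a mixed direct/reference two-colour design.
    ## The dye assignments below are hypothetical, for illustration only.
    library(limma)
    targets <- data.frame(
      Cy3 = c("0mM", "1mM", "0mM", "3mM", rep("Panel", 6)),
      Cy5 = c("1mM", "0mM", "3mM", "0mM", "0mM", "0mM", "1mM", "1mM", "3mM", "3mM")
    )
    design <- modelMatrix(targets, ref = "Panel")  # one column per non-reference sample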
Figure 2
Number of false discoveries from simulated data sets with 3 arrays. For each of the 3-array simulations in Table 2 (simulations 1, 2 and 3), the average false discovery rates calculated using ordinary t-statistics (solid lines) or moderated t-statistics (dashed lines) are given. Panels (a), (c) and (e) show the results for normal data under simulations 1, 2 and 3 respectively, while panels (b), (d) and (f) display the corresponding results for non-normal data. Black lines plot the false discovery rates when the most unreliable array is removed from the analysis. Light gray lines are the results obtained using equal weights and dark gray lines show the false discovery rates recovered with array weights. Each line is the average of 1000 simulated data sets. In nearly all cases, the use of array weights in the analysis gives fewer false positives than the other methods. Simulation 1 is the exception, with equal weighting and array weighting producing similar false discovery rates (overlapping curves) for both normal (a) and non-normal (b) data.
Figure 3
Number of false discoveries from simulated data sets with 5 arrays. For each of the 5-array simulations in Table 2 (simulations 4, 5 and 6), the average false discovery rates calculated using ordinary t-statistics (solid lines) or moderated t-statistics (dashed lines) are given. Panels (a), (c) and (e) show the results for normal data under simulations 4, 5 and 6 respectively, while panels (b), (d) and (f) display the corresponding results for non-normal data. Black lines plot the false discovery rates when the two most unreliable arrays are removed from the analysis. Light gray lines are the results obtained using equal weights and dark gray lines show the false discovery rates recovered with array weights. Each line is the average of 1000 simulated data sets. In all cases, the use of array weights in the analysis gives fewer false positives than the other methods.
Figure 4
Array weights for the QC LMS and METH data sets. Array weights (vj) calculated for the QC LMS data set (a) and the METH experiment (b). The arrays are ordered by time of hybridisation, and the dashed lines show the unit weight. In each experiment, there are arrays which receive higher and lower relative weights, corresponding to arrays which are more or less reproducible.
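
Given the weight vector returned by arrayWeights, a display of this kind takes only a couple of lines; a minimal sketch, assuming aw holds the weights from the earlier snippet:

    ## Sketch: array weights in hybridisation order, with the unit weight marked.
    ## `aw` is assumed to be the weight vector from arrayWeights.
    barplot(aw, names.arg = seq_along(aw), xlab = "Array", ylab = "Weight")
    abline(h = 1, lty = 2)  # dashed line at the unit weight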
Figure 5
MA-plots of QC LMS controls for two arrays. MA-plots for arrays 91 (a) and 19 (b) which were assigned the highest and lowest quality weights respectively. Here M = log2(R/G) is the spot log-ratio and A = (log2G + log2R)/2 is the spot log-intensity. Dashed lines show the theoretical M values from Table 1. The controls from array 91 consistently recover the true spike-in log-ratios and are assigned high weight (v91 = 3.68), whereas the log-ratios from array 19 are considerably more variable, resulting in a very low weight (v19 = 0.11).
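
M and A are simple transformations of the two channel intensities, so they can be computed directly; a self-contained sketch with made-up intensities:

    ## Sketch: M and A values from hypothetical red (Cy5) and green (Cy3)
    ## spot intensities (made-up numbers, for illustration only).
    R <- c(1200, 560, 3400, 150)    # Cy5 intensities
    G <- c(1100, 620,  850, 160)    # Cy3 intensities
    M <- log2(R / G)                # spot log-ratio
    A <- (log2(G) + log2(R)) / 2    # spot log-intensity
    plot(A, M, pch = 16)
    abline(h = 0, lty = 2)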
Figure 6
Ordinary t-statistics for the QC LMS controls. The t-statistics were calculated using either equal weights or the array weights (vj) from Figure 4(a). Using array weights in the analysis results in more extreme t-statistics for the known differentially expressed controls (D03, D10, U03, U10) which represents a gain in power.
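
This comparison amounts to fitting the same linear model with and without the weights. A sketch, reusing the assumed objects MA, design and aw from the earlier snippets:

    ## Sketch: ordinary t-statistics from equal-weight and weighted fits.
    fit.eq <- lmFit(MA, design)                 # equal weights
    fit.aw <- lmFit(MA, design, weights = aw)   # empirical array weights
    ## Ordinary t = coefficient / (unscaled standard deviation * residual SD)
    t.eq <- fit.eq$coefficients / (fit.eq$stdev.unscaled * fit.eq$sigma)
    t.aw <- fit.aw$coefficients / (fit.aw$stdev.unscaled * fit.aw$sigma)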
Figure 7
Estimated versus actual array variance parameters from simulated data. The gene-by-gene update algorithm was used to estimate the array variance parameters using 100, 1000 and 10000 genes from a simulated data set with 10 arrays. The estimates (γ̂j) are plotted against the true values (γj). As more genes are included in the iterations, the accuracy of the estimates obtained from the gene-by-gene update algorithm improves, although with as few as 100 genes, the values recovered are broadly correct.
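
A check of this kind can be reproduced in outline: simulate arrays whose noise differs by known per-array scale factors, estimate the weights, and compare. The setup below is an assumption for illustration and need not match the paper's exact parameterisation of γj; since the weights act as inverse variances, 1/sqrt(weight) estimates each array's scale up to a constant.

    ## Sketch: recovering known per-array scale factors from simulated data.
    ## The simulation setup is illustrative, not the one used in the paper.
    library(limma)
    ngenes  <- 1000
    narrays <- 10
    g.true  <- seq(0.5, 2, length.out = narrays)  # true per-array SD factors
    y <- matrix(rnorm(ngenes * narrays, sd = rep(g.true, each = ngenes)),
                nrow = ngenes, ncol = narrays)    # column j has SD g.true[j]
    aw <- arrayWeights(y)                         # weights ~ inverse variances
    plot(g.true, 1 / sqrt(aw),
         xlab = "True scale", ylab = "Estimated scale (1/sqrt weight)")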


