BMC Bioinformatics. 2007 May 1;8:142.
doi: 10.1186/1471-2105-8-142.

Pre-processing Agilent Microarray Data

Free PMC article

Marianna Zahurak et al. BMC Bioinformatics.

Abstract

Background: Pre-processing methods for two-sample long oligonucleotide arrays, specifically the Agilent technology, have not been extensively studied. The goal of this study is to quantify some of the sources of error that affect measurement of expression using Agilent arrays and to compare Agilent's Feature Extraction software with pre-processing methods that have become the standard for normalization of cDNA arrays. These include log transformation followed by loess normalization, with or without background subtraction, and often a between-array scale normalization procedure. The larger goal is to define the best study design and pre-processing practices for Agilent arrays, and we offer some suggestions.
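
As a rough illustration of the log transformation and loess normalization described above, the following sketch computes M and A values from raw two-channel intensities and subtracts an intensity-dependent trend from M. It is not the authors' implementation: the function names (ma_values, loess_fit, loess_normalize), the span parameter frac, and the plain tricube-weighted local linear fit are assumptions chosen for illustration, and no background subtraction or robustness iterations are applied.

```python
import math

def ma_values(red, green):
    """Return (M, A): M = log2(R/G), A = mean of log2(R) and log2(G)."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a

def loess_fit(x, y, frac=0.4):
    """Tricube-weighted local linear regression, evaluated at every x.

    Simplified stand-in for loess (assumption): one pass, no robustness
    iterations, O(n^2) cost, so only suitable for small examples.
    """
    n = len(x)
    k = max(2, int(math.ceil(frac * n)))  # number of neighbours per local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def loess_normalize(red, green, frac=0.4):
    """Return (normalized M, A) after removing the fitted M-versus-A trend."""
    m, a = ma_values(red, green)
    trend = loess_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)], a

if __name__ == "__main__":
    # Synthetic intensities with an intensity-dependent dye bias (made up).
    log_green = [6 + 0.16 * i for i in range(50)]
    green = [2 ** g for g in log_green]
    red = [2 ** (g + 0.5 - 0.03 * g) for g in log_green]
    m_norm, a = loess_normalize(red, green)
    print(max(abs(v) for v in m_norm))
```

In practice a tested loess implementation from a statistics package would normally be preferred; the hand-written fit is used here only to keep the example self-contained.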

Results: Simple loess normalization without background subtraction produced the lowest variability. However, without background subtraction, fold changes were biased towards zero, particularly at low intensities. ROC analysis of a spike-in experiment showed that differentially expressed genes are most reliably detected when background is not subtracted. Loess normalization and no background subtraction yielded an AUC of 99.7% compared with 88.8% for Agilent processed fold changes. All methods performed well when error was taken into account by t- or z-statistics (AUCs ≥ 99.8%). A substantial proportion of genes, 43% (99% CI: 39%, 47%), showed dye effects. However, these effects were generally small regardless of the pre-processing method.
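
The AUC values above summarize how well a score ranks spiked-in probes ahead of all other probes. A minimal sketch of that calculation follows, assuming the score is the absolute log fold change and that spike-in status is known for each probe. The function name roc_auc and the example values are hypothetical, not data from the study.

```python
def roc_auc(scores, is_spike):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen spike-in probe scores
    higher than a randomly chosen non-spike-in probe (ties count one half).
    """
    pos = [s for s, flag in zip(scores, is_spike) if flag]
    neg = [s for s, flag in zip(scores, is_spike) if not flag]
    if not pos or not neg:
        raise ValueError("need at least one spike-in and one other probe")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    # Made-up log2 fold changes; the first two probes play the spike-ins.
    log_fc = [2.1, 1.2, 0.1, -0.2, 0.05, -1.5]
    spike = [True, True, False, False, False, False]
    print(roc_auc([abs(v) for v in log_fc], spike))
```

Ranking by a t- or z-statistic instead of the fold change only changes the scores passed in; the AUC calculation itself stays the same.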

Conclusion: Simple loess normalization without background subtraction resulted in low-variance fold changes that more reliably ranked gene expression than the other methods. While t-statistics and other measures that take variation into account, including Agilent's z-statistic, can also be used to reliably select differentially expressed genes, fold changes are a standard measure of differential expression for exploratory work, cross-platform comparison, and biological interpretation, and cannot be entirely replaced. Although dye effects are small for most genes, many array features are affected. Therefore, an experimental design that incorporates dye swaps or a common reference could be valuable.
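
One way to see why dye swaps help is a simple additive model, which is an assumption made here for illustration and is not stated in the abstract: on the forward array a probe's log ratio is the true log fold change plus a dye effect, and on the swapped array it is minus the true log fold change plus the same dye effect. Half the difference then estimates the fold change and half the sum estimates the dye effect. The function name dye_swap_decompose and the numbers below are hypothetical.

```python
def dye_swap_decompose(m_forward, m_swapped):
    """Split paired dye-swap log ratios into fold change and dye effect.

    Assumes (illustrative model): m_forward = fold + dye and
    m_swapped = -fold + dye for each probe.
    """
    fold = [(f - s) / 2.0 for f, s in zip(m_forward, m_swapped)]
    dye = [(f + s) / 2.0 for f, s in zip(m_forward, m_swapped)]
    return fold, dye

if __name__ == "__main__":
    # Made-up log2 ratios for three probes on a forward and a swapped array.
    forward = [1.2, -0.4, -0.3]
    swapped = [-0.8, 0.6, -0.3]
    fold, dye = dye_swap_decompose(forward, swapped)
    print(fold)
    print(dye)
```

Under this model the third probe would show a pure dye effect with no differential expression, which a single array could not distinguish from a real fold change.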

Figures

Figure 1
Spike-in array. This scatterplot shows the design of the spike-in experiment. The log fold change for each probe (M) is shown as a function of the mean single channel intensity (A). This array was pre-processed using loess normalization without background subtraction. R and G indicate spike-in concentration levels in the red and green channels. Observed fold changes reflect spike-in concentrations with the exception of the blue points in the lower left and to some extent the orange points in the same area. This array is representative of the spike-in arrays in this experiment.
Figure 2
Mean vs median dye swap plots. Each panel shows log2 fold changes from the same two dye-swapped arrays. The effect of choosing mean or median to summarize spot intensity is seen by comparing across rows. Background correction methods can be compared across columns. Dye swap correlation is similar whether the mean or the median is used. Both pre- and post-normalization, the methods rank from highest to lowest correlation as follows: no background adjustment, minimal constant adjustment, and local background subtraction.
Figure 3
Post-normalization MA plots. MA plots show log fold changes (M) as a function of the mean single channel intensity (A). Columns show the effect of pre-processing methods. The two arrays shown in the rows are typical of results seen with Agilent chips. The Agilent normalized MA plot exhibits large variability at low intensities and a low intensity bias toward positive fold changes. Background subtracted, loess normalized MA plots of median or mean fold changes are more variable than the same MA plots when background is not subtracted.
Figure 4
Post-normalization dye swap plots. Each row shows log2 fold changes from the same two dye-swapped arrays. Pre-processing methods are compared across columns. The arrays shown in the rows are representative of results seen with Agilent chips. Agreement across the hybridizations is best for probes that are more than two-fold differentially expressed and when no background subtraction is used.
Figure 5
Variance, bias and background subtraction. Boxplots of the spike-in probes from the self-self experiment show increased variability with Agilent processing and with background subtraction. Each spiked-in probe is spotted 30 times on the array. All 30 replicates on each of the 4 arrays are individually plotted as well. Horizontal "jitter" was added to separate overlaid points. Bias, the distance from the reference line, is greatest for no background subtraction. The first two plots in the top row show lower intensity probes, and this is where the increased variability with background subtraction is most apparent. Intensity increases for the remaining plots and the difference is minimal in the lower set of plots.
Figure 6
Single array ROC curves for fold change and spike-in probes. These ROC curves display the ability to identify differentially expressed probes based on the fold change value. In the spike-in experiment, only the spiked-in probes are present in different quantities in the two channels. The spike-in probes are identified better using median fold changes either with or without background subtraction compared to Agilent processing. Black, red and blue lines show Agilent processing, median with background subtraction, and median without background subtraction, respectively.
Figure 7
Multiple array moderated t-statistic ROC curves. These ROC curves display the ability to identify differentially expressed probes based on the moderated t-statistic. In the spike-in experiment, only the spiked-in probes are present in different quantities in the two channels. This t-statistic takes the variance of the fold changes into account in a way that borrows strength across all genes. This is necessary with a small number of arrays to reduce the possibility that genes having extremely small variances, by chance alone, are identified as significant. Here the three pre-processing methods performed equally well. The inset of this figure is the same graph where the x axis has been truncated and 1-specificity has been replaced by counts. This shows the very minor improvement using loess normalization with or without background subtraction compared to Agilent normalization. Black, red and blue lines show Agilent processing, median with background subtraction, and median without background subtraction, respectively.
Figure 8
Dye effects: non spike-in probes. This figure shows the distribution of dye effects. The actual distributions are shown in blue and null distributions are shown in black. Pre-processing methods are compared across columns. In the top row moderated t-statistics are used to measure dye effect. The null distribution was obtained by randomly changing the sign of half of the log ratios for each gene before calculating the t-statistic. Gene specific dye effects are indicated by the heavy tails on the blue curves compared to the black curves of the null distribution. There is also a slight shift to the right for the Agilent processed data and the loess normalized, background subtracted data. In the bottom row, observed distributions of dye effects, as measured by the mean fold changes, are shown to give a sense of their scale. The relationship between null and observed distributions is similar to that seen with moderated t-statistics and is not shown here. Dye effects tend to be small regardless of the pre-processing method used.
Figure 9
Agilent error model. This figure illustrates Agilent's universal error model. The scatterplot shows log fold changes (M) as a function of the mean single channel intensity (A) for a single Agilent processed array. Every log fold change (black point) is matched with a blue point of the same mean intensity level which shows the error in the log fold change as estimated by Agilent's universal error model. For log fold changes that were negative, the error estimate was multiplied by -1 before plotting. These error estimates capture the large global variation that characterizes low intensity genes after Agilent pre-processing. This array is representative of results seen with Agilent arrays.
Figure 10
Accuracy of Agilent universal error model. As part of the Agilent chip design, a set of 100 oligonucleotide sequences are each represented on the array by 10 separate spots. These are useful for evaluating errors in the estimation of intensity. The vertical axis of each panel shows the mean error, measured across replicate spots, using Agilent's universal error model. The horizontal axis shows the observed standard deviation for the same replicates. Each column shows results for a single array. Some probe to probe differences are captured with the Agilent error model for low intensity probes (blue spots). The bottom row gives a close-up of the high intensity end of each array. For high intensity probes (black), where the level of error is small, the universal error model also predicts low error. However, probe to probe differences are not correlated. These arrays are representative of results seen with Agilent arrays.
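
The null distribution described in the Figure 8 caption (randomly changing the sign of half of the log ratios for each gene before calculating the t-statistic) can be sketched as follows. This is an illustration under stated assumptions: it uses an ordinary one-sample t-statistic rather than the moderated t-statistic used in the figure, it assumes each gene has a list of log ratios whose mean estimates its dye effect, and the function names (t_statistic, sign_flip_half, null_t_statistics) and the seed parameter are hypothetical.

```python
import math
import random

def t_statistic(values):
    """Ordinary one-sample t-statistic (mean divided by its standard error)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two log ratios per gene")
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    if var == 0:
        return 0.0 if mean == 0 else math.copysign(math.inf, mean)
    return mean / math.sqrt(var / n)

def sign_flip_half(values, rng):
    """Return a copy with the sign changed on a random half of the entries."""
    flipped = list(values)
    for i in rng.sample(range(len(values)), len(values) // 2):
        flipped[i] = -flipped[i]
    return flipped

def null_t_statistics(log_ratios_by_gene, seed=0):
    """One sign-flipped t-statistic per gene, forming a null distribution."""
    rng = random.Random(seed)
    return [t_statistic(sign_flip_half(v, rng)) for v in log_ratios_by_gene]

if __name__ == "__main__":
    # Made-up log ratios for two genes across four arrays.
    genes = [[0.30, 0.31, 0.29, 0.30], [0.05, -0.04, 0.02, -0.03]]
    print([t_statistic(g) for g in genes])  # observed statistics
    print(null_t_statistics(genes))         # null statistics
```

Comparing the observed statistics with the sign-flipped ones across many genes gives the kind of heavy-tail contrast the caption describes for gene-specific dye effects.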
