Gene expression microarrays are a relatively young technology, dating back only a few years, yet they have already become a widely used tool in biology and have evolved to serve a range of applications well beyond their original design intent. However, while the use of microarrays has expanded and performance optimization has been intensively studied, the fundamental issue of data integrity management has largely been ignored. Now that performance has improved so greatly, the shortcomings of data integrity control methods account for a growing share of the obstacles investigators face. Microarray data are cumbersome, and the rule to date has largely been one of hands-on transformations, leading to human errors that often have dramatic consequences. We show in this review that the time lost to such mistakes is enormous and can substantially distort results; they should therefore be mitigated in every way possible. We outline the scope of the data integrity problem, survey some of the most common and dangerous data transformations along with their shortcomings, and review several case studies as illustrations. We then examine the work done by the research community on this issue, which is admittedly meager to date. Some data integrity problems will always be difficult, while others will become easier; one of our goals is to expedite the adoption of integrity control methods. Finally, we present some preliminary guidelines and specific approaches that we believe should be the focus of future research.
Copyright 2003 Wiley Periodicals, Inc.