
BENCHMARKS

42 BioTechniques Vol. 35, No. 1 (2003)

Microarray experiments are providing a huge amount of genome-wide data on gene expression. Many prior expression analyses have focused on inferring functional relationships (1–7); however, the quality control and normalization of the raw data that result from microarrays have received less attention. Here we address a systematic error that arises from microarrays and discuss current methods to resolve the problem.

It is well known that the data from high-throughput experiments embody a significant component of measurement error that must be removed before any analysis can be applied to the data. An intuitive idea is to repeat the experiments and decrease the noise by averaging the measurements from replicates (8). Unfortunately, microarrays are still difficult to repeat; in most cases, researchers do not have many replicates for analysis. A Bayesian probabilistic approach has been proposed to address the problem of the small repetition number for microarray experiments (9). While random error can be canceled by replicate experiments, systematic error will not diminish by averaging replicates. For example, a notorious systematic error in microarray experiments is that the expression ratio of a particular gene at different conditions is a function of its absolute expression levels. If one uses a simple fold-change cutoff, then the genes with low expression levels tend to numerically meet the given cutoff, even though they are not truly differentially expressed. Different methods have been proposed to deal with this problem (10–15).
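The fold-change problem can be illustrated with a small simulation. This is only a sketch under an assumed noise model (additive Gaussian measurement noise of fixed size on both conditions, which the text does not specify); the function name and all parameter values are hypothetical.

```python
import random


def fold_change_false_positive_rate(level, noise_sd=20.0, cutoff=2.0,
                                    n_genes=2000, seed=0):
    """Fraction of unchanged genes that pass a simple fold-change cutoff.

    Every simulated gene has the same true level in both conditions, so any
    gene that passes the cutoff is a false positive. The additive Gaussian
    noise is an assumption made for illustration only.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_genes):
        # Clamp at 1.0 so the ratio is always defined and positive.
        a = max(level + rng.gauss(0.0, noise_sd), 1.0)
        b = max(level + rng.gauss(0.0, noise_sd), 1.0)
        ratio = a / b
        if ratio >= cutoff or ratio <= 1.0 / cutoff:
            hits += 1
    return hits / n_genes


low = fold_change_false_positive_rate(level=50.0)
high = fold_change_false_positive_rate(level=5000.0)
print(f"weakly expressed genes passing the cutoff:  {low:.1%}")
print(f"highly expressed genes passing the cutoff: {high:.1%}")
```

Under these assumptions, a noticeable share of the weakly expressed genes meets the two-fold cutoff by chance, while none of the highly expressed genes does, even though no gene is truly differentially expressed.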

In this review, we want to direct attention toward a type of systematic error that is manifested by the strong interaction between neighboring spots on the array. If the replicate experiments are performed on arrays with the same chip geometry, then these interactions will not be canceled by the replicates. We will first demonstrate this noise via a case study, and then we will discuss the possible source of these artifacts. Finally, we will discuss current methods to solve the problem, in particular, a local averaging approach called standardization and normalization of microarray data (SNOMAD) (16). We examined several different yeast microarray data sets: diauxic shift, α-factor-arrested cell cycle, cdc15-arrested cell cycle, and cdc28-arrested cell cycle (17–19).

To demonstrate the systematic error in the microarray data, we offer the following evidence. The relationship between gene expression and physical chip distance can be revealed by comparing the chip distance map (Figure 1A) to an expression correlation coefficient map (Figure 1B). The horizontal and vertical axes of these two maps represent the positions of the genes along a chromosome. The colors on the distance and correlation maps represent the chip distance and expression correlation coefficient between gene pairs, respectively. Interestingly, the highly correlated gene expression regions (Figure 1B, red blocks) always correspond to the short chip distance regions (Figure 1A, red blocks), indicating that the apparent coexpression of two genes could be mainly due to their short physical distance on the chip.

Identification and correction of spurious spatial correlations in microarray data

Jiang Qian, Yuval Kluger, Haiyuan Yu, and Mark Gerstein
Yale University, New Haven, CT, USA

BioTechniques 35:42-48 (July 2003)

Figure 1. Distance map and expression correlation coefficient map. Both maps were produced using the yeast α-factor-arrested cell-cycle data set; the x-axis and y-axis represent the first 100 open reading frames on chromosome IV. (A) Distance map. The color on each spot represents the distance between the gene on the x-axis and the gene on the y-axis. (B) Expression correlation coefficient map. The color represents the correlation coefficient between the gene pair. The color codes are to the right of the maps. For a detailed analysis, please refer to Yu et al. (manuscript in preparation).

We also calculated the average correlation coefficient of gene expression profiles as a function of the physical chip distance between two genes. Figure 2 shows the result for a microarray data set of the yeast α-factor-arrested cell cycle. Without an artifact, the average correlation coefficient should be independent of the chip distance. However, Figure 2 shows that the closer two genes are on the chip, the higher their average correlation coefficient is. This indicates that this data set contains a large proportion of artifacts. Actually, this phenomenon is not unique to microarrays; we performed the same calculation on cell-cycle data from the GeneChip® (Affymetrix, Santa Clara, CA, USA) (1) and obtained similar results with respect to the artifact.
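The calculation of average correlation as a function of chip distance can be sketched as follows. This is a minimal illustration, not the authors' code: the distance metric (the larger of the row and column offsets, counted in spots), the function names, and the toy chip with a shared 2 x 2 block artifact are all assumptions introduced here.

```python
import math
import random
from collections import defaultdict


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_correlation_by_distance(profiles, positions):
    """Average pairwise correlation, grouped by chip distance in spots."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    genes = sorted(profiles)
    for i, gene_a in enumerate(genes):
        for gene_b in genes[i + 1:]:
            (r1, c1), (r2, c2) = positions[gene_a], positions[gene_b]
            # Assumed metric: larger of the row and column offsets.
            distance = max(abs(r1 - r2), abs(c1 - c2))
            totals[distance] += pearson(profiles[gene_a], profiles[gene_b])
            counts[distance] += 1
    return {d: totals[d] / counts[d] for d in sorted(totals)}


def toy_chip(n_rows=6, n_cols=6, n_samples=40, seed=0):
    """Hypothetical chip: independent genes plus an artifact per 2x2 block."""
    rng = random.Random(seed)
    block_artifact = {}
    profiles, positions = {}, {}
    for r in range(n_rows):
        for c in range(n_cols):
            block = (r // 2, c // 2)
            if block not in block_artifact:
                block_artifact[block] = [rng.gauss(0.0, 2.0)
                                         for _ in range(n_samples)]
            gene = f"gene_{r}_{c}"
            positions[gene] = (r, c)
            # True expression is pure noise, so genes are uncorrelated.
            profiles[gene] = [rng.gauss(0.0, 1.0) + a
                              for a in block_artifact[block]]
    return profiles, positions


profiles, positions = toy_chip()
by_distance = mean_correlation_by_distance(profiles, positions)
for distance, mean_r in by_distance.items():
    print(f"distance {distance}: mean correlation {mean_r:+.2f}")
```

In this toy chip the genes are independent by construction, so any excess correlation at short distances comes from the simulated artifact alone, which is the signature the distance profile is meant to expose.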

For microarray experiments, there is additional evidence for an artifact. In the cell-cycle experiments, researchers measured the gene expression ratio between the different cell-cycle stages, using asynchronous cultures of the same cells as a control sample. This control sample is labeled with Cy3 (green). Ideally, the expression profiles of the green signal for all genes should be proportional. Thus, the average correlation coefficient for the green signal should be 1. However, according to Figure 3, we found a pattern similar to that of Figure 2. This is a clear manifestation of an artifact.

From this analysis, we can see that the artifact phenomenon is significant and exists in many chips. Thus, the artifact must be taken into account before any conclusion can be drawn based on the raw, uncorrected expression data. The following example illustrates this point. A naïve analysis of α-factor-arrested yeast cell-cycle data suggests that chromosomal spatial organization affects gene expression in a systematic way, as displayed in the distribution of highly correlated gene pairs as a function of the relative pair chromosomal distance. That distribution shows that (i) adjacent gene pairs tend to have high correlation coefficients, which is consistent with findings by Cohen et al. (20), and (ii) genes that are not in the same vicinity on the chromosome are more likely to be coexpressed if their spacing is a multiple of 22 open reading frames (ORFs) in microarray experiments. Given that many chips, including this particular microarray, are printed according to a simple transformation of the gene order on the chromosome, the observed long-range correlation could be associated with an inherent chip artifact.

The source of the artifact is unknown, but it might be related to the following processes or their combinations: (i) the spotting of DNA probes on the chips; (ii) plate effects; (iii) the washing of cDNA after hybridization; (iv) cross hybridization; or (v) image scanning. The effect of spotting DNA probes on the chip is also called the print tip effect. A systematic difference may exist between the print tips and lead to spatial bias between the sectors on the chip. The analysis of variance (ANOVA) method (11) or MA-plot (19) allows for the detection of this spatial bias, and a lowess normalization approach was proposed to correct the systematic bias (15). The plate effect originates from PCR amplification bias between different plates. This effect would also introduce further variability in the measurement of gene expression. Nonspecific probes were used to correct the effect, based on the assumption that DNA concentration results in this bias. These effects are not easy to separate. Furthermore, some of the assumptions, such as equal variance between different sectors, may be invalid. This makes the detection and correction of these effects even more difficult.

It would be most desirable, of course, to completely correct for the artifact after determining its source. However, in practice, it is difficult to correct for all the spatial bias. For example, Yang's normalization method is able to correct the spatial artifact due to print tips (15). However, the unit corrected here is the chip "sector" (or block), and this is a rather coarse division; the spatial bias from other sources may still exist within sectors after Yang's sector-based normalization. We believe that even without fully understanding the source of the problem, researchers are still able to improve the signal quality.

Figure 2. Average correlation coefficient distribution as a function of the distance of gene pairs on the chip. The distance between genes is measured in terms of the number of spots on the chip. This distribution is calculated using the α-factor-arrested cell-cycle data set. The black line is the distribution for the raw data. The red line is the distribution for the data after the standardization and normalization of microarray data (SNOMAD). The green line is the distribution for the data after deconvolution.

Figure 3. Average correlation coefficient distribution for green signals. The black line is the distribution for the raw data. The red line is the distribution for the data after the standardization and normalization of microarray data (SNOMAD).

Here we discuss a popular method of "spatial lowess" in detail (16). This local normalization method allows for the detection and/or correction of spatially systematic artifacts in microarray data without exactly attributing the artifacts to certain sources. We applied the method from Colantuoni et al. (also called SNOMAD) (16) to the α-factor-arrested yeast cell-cycle data and found that their method reduced the artifact effect but failed to remove it completely. The red line in Figure 2 shows the average correlation coefficient as a function of the chip distance after local normalization. The problem is clearly diminished relative to the situation before normalization, shown by the black line in Figure 2.

Figure 4 shows the self-correlation of genes that were printed twice on the chip. Without the normalization procedure, the mode of the distribution of all self-correlations is approximately 0.1, whereas, in an ideal situation, it should be 1. The application of SNOMAD drives this distribution to the right, with a slightly higher mode at 0.15, which is evidence that SNOMAD improves data quality. The red line in Figure 3 shows that after local normalization, the pair correlation function of the green signals is still not homogeneous, which means the method cannot remove the artifact completely. Surprisingly, it even produces correlation coefficients between green signals close to 0. These correlations should ideally be 1 because we used only local normalization when processing the green signals with SNOMAD. The results are similar for several normalization methods (data not shown).

An important assumption of SNOMAD is that the artifact is isotropic on the chip, which is actually untrue in most cases. For example, we calculated the distribution of the average correlation coefficient for all the gene pairs in the same rows and a similar distribution for those pairs in the same columns. Figure 5 shows the results for the α-factor-arrested yeast cell cycle, and it is clear that the artifact along the x-axis is quite different from that along the y-axis.
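The idea behind local normalization can be sketched with a plain moving-window mean. This is a simplified stand-in, not the SNOMAD implementation: the square window, its radius, the truncation at chip edges, and the toy chip are assumptions made here for illustration.

```python
def local_mean_normalize(log_ratios, radius=1):
    """Subtract from each spot the mean log-ratio of its square neighborhood.

    `log_ratios` is a list of rows, one value per spot. Windows are simply
    truncated at the chip edges, which is an assumption of this sketch.
    """
    n_rows, n_cols = len(log_ratios), len(log_ratios[0])
    corrected = []
    for r in range(n_rows):
        row_out = []
        for c in range(n_cols):
            window = [
                log_ratios[i][j]
                for i in range(max(0, r - radius), min(n_rows, r + radius + 1))
                for j in range(max(0, c - radius), min(n_cols, c + radius + 1))
            ]
            row_out.append(log_ratios[r][c] - sum(window) / len(window))
        corrected.append(row_out)
    return corrected


# Toy chip: a smooth top-to-bottom gradient (the artifact) plus one
# strongly induced gene at spot (3, 3).
chip = [[0.1 * r for _ in range(7)] for r in range(7)]
chip[3][3] += 5.0
corrected = local_mean_normalize(chip, radius=1)

print(f"interior background spot: {corrected[3][1]:+.3f}")
print(f"induced gene:             {corrected[3][3]:+.3f}")
print(f"corner spot:              {corrected[0][0]:+.3f}")
```

On this toy chip the gradient vanishes at interior spots, but a residual remains at the corner where the window is truncated, and the induced gene loses part of its own signal to its neighborhood mean. A square window also treats rows and columns alike, which is the isotropy assumption questioned above.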

We propose a deconvolution approach to address the artifact problem in chip experiments. This approach is more general than local normalization and may be able to take into account anisotropic effects. We assume that for each sample indexed by the letter $t$, the measured signal $\psi_t$ at a chip location $\vec{x} \equiv (x, y)$ can be expressed as a convolution of the true signals $\phi_t$:

$$\psi_t(\vec{x}) = \sum_{\vec{u}} c(\vec{u})\,\phi_t(\vec{x} - \vec{u}),$$

where the deviation of $c$ from a $\delta$ function represents the extent of the chip artifact. (Note that $\phi_t$ is the ratio of the red and green channels.) Thus, neighboring and non-neighboring spots affect the signal measured at the point $\vec{x}$. According to the convolution theorem, the Fourier transform of $\psi_t(\vec{x})$ is given by $\psi_t(\vec{k}) = c(\vec{k})\,\phi_t(\vec{k})$. Because the true signals $\phi_t$ and the envelope $c(\vec{k})$ that represents the artifact are unknown, we inspect the artifact-free ratios $R_t(\vec{k}) \equiv \psi_t(\vec{k})/\psi_*(\vec{k}) = \phi_t(\vec{k})/\phi_*(\vec{k})$ for all samples $t$, where $\psi_*(\vec{k})$ is some reference measurement, such as the average of $\psi_t(\vec{k})$ across all samples. A preliminary result from this idea is illustrated in Figure 2, where we show average correlation coefficients as a function of the physical distance of gene pairs on the chip. These distributions were calculated using the α-factor-arrested cell-cycle data set. Clearly, SNOMAD fails to remove all the artificial components in the expression profiles. In contrast, the distribution after our proposed deconvolution method is no longer distance-dependent. The inverse Fourier transforms of these ratios, denoted by $R_t(\vec{x})$, have no straightforward biological interpretation. Nevertheless, under the assumption that the convolution model is adequate, replacing the measured signal $\psi_t(\vec{x})$ with these ratios and re-applying the three pieces of evidence described above (each designed to reveal the artifact) allow us to diminish chip artifact effects.
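The cancellation of the artifact in the Fourier-space ratios can be checked numerically on a toy example. The sketch below makes several simplifying assumptions that are not part of the text: a one-dimensional ring of spots instead of a two-dimensional chip, circular convolution, noise-free measurements, a hypothetical smoothing kernel, and a reference taken as the average spectrum across samples.

```python
import cmath
import random


def dft(values):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(values)
    return [sum(v * cmath.exp(-2j * cmath.pi * k * j / n)
                for j, v in enumerate(values))
            for k in range(n)]


def circular_convolve(kernel, signal):
    """measured(x) = sum over u of kernel(u) * signal(x - u), on a ring."""
    n = len(signal)
    return [sum(kernel[u] * signal[(x - u) % n] for u in range(n))
            for x in range(n)]


def fourier_ratios(samples):
    """Ratio of each sample's spectrum to the average spectrum."""
    spectra = [dft(s) for s in samples]
    n = len(spectra[0])
    reference = [sum(sp[k] for sp in spectra) / len(spectra)
                 for k in range(n)]
    return [[sp[k] / reference[k] for k in range(n)] for sp in spectra]


rng = random.Random(1)
n_spots, n_samples = 8, 3
# Hypothetical true signals, one list per sample.
true_signals = [[rng.uniform(0.5, 2.0) for _ in range(n_spots)]
                for _ in range(n_samples)]
# Hypothetical artifact: each spot leaks into its two neighbors.
kernel = [0.6, 0.2] + [0.0] * (n_spots - 3) + [0.2]
measured = [circular_convolve(kernel, s) for s in true_signals]

ratios_measured = fourier_ratios(measured)
ratios_true = fourier_ratios(true_signals)
max_gap = max(abs(a - b)
              for rm, rt in zip(ratios_measured, ratios_true)
              for a, b in zip(rm, rt))
print(f"largest difference between measured and true ratios: {max_gap:.2e}")
```

Because the same kernel multiplies every sample's spectrum, it cancels in the ratio, so the ratios computed from the smeared measurements agree with those of the true signals up to rounding error. The cancellation relies on the kernel's spectrum being nonzero at every frequency, which holds for the kernel chosen here.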

Figure 4. Distribution of correlation coefficient for the duplicated genes. The x-axis represents the correlation coefficient between a duplicated gene pair. The y-axis represents the number of duplicated gene pairs. The black line is the distribution for the raw data, and the red line is the distribution for the data after the standardization and normalization of microarray data (SNOMAD).

Figure 5. Average correlation coefficient distributions along the x-axis and y-axis on the chip. The distributions are calculated using the raw α-factor-arrested cell-cycle data set. The black line is the distribution along the y-axis, and the red line is the distribution along the x-axis. C.C., correlation coefficient.

To understand the effect of the chip artifact, we propose printing a unique sequence in each spot of the array. Hybridization with the corresponding cDNA will allow us to study the spatial variations of the red and green signals and their ratios, where the dyes represent two distinct or equivalent cDNA samples. Replicating this experiment using different spot spacing will allow us to study the spot-to-spot interaction effect and the associated correlation length. To remove the spatial variation effect, we propose a variant of this approach. Instead of printing a unique sequence at each spot, one can link it to each of the DNA targets. The red channel can then be assigned to the mRNA samples, and the green channel can be assigned to the unique sequence. By applying the same amounts of the single-sequence cDNA (green dye) and sample cDNA (red dye) and normalizing the signal of the red channel by the green channel signal, one can partially remove the chip artifact (because this normalization also transforms the green signal to a constant value at all spots). Moreover, these ratios are proportional to mRNA concentration. Thus, one can compare expression levels between different genes, as with the GeneChips, and not simply compare the relative variation of the expression of a gene across experimental conditions. Note that placing the different probes that correspond to a single gene in random locations on the chip [as is done in the new U133 Affymetrix chips (1)] and estimating their average intensity do not eliminate this artifact.
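The arithmetic behind the proposed red-to-green normalization can be sketched in a few lines. The example assumes an idealized, purely multiplicative artifact that affects both channels identically at each spot; a real artifact need not behave this way, which is consistent with the text describing the removal as partial. All values and names below are hypothetical.

```python
def normalize_by_control(red, green):
    """Divide the sample (red) signal by the control (green) signal per spot."""
    return [r / g for r, g in zip(red, green)]


# Hypothetical per-spot artifact factors and true mRNA concentrations.
artifact = [1.0, 1.4, 0.7, 1.1]
mrna = [2.0, 2.0, 5.0, 0.5]
control_amount = 3.0  # same amount of single-sequence cDNA at every spot

# Assumed model: both channels are scaled by the same artifact factor.
red = [a * m for a, m in zip(artifact, mrna)]
green = [a * control_amount for a in artifact]

ratios = normalize_by_control(red, green)
for spot, (r, m) in enumerate(zip(ratios, mrna)):
    print(f"spot {spot}: ratio {r:.3f}, true mRNA {m:.1f}")
```

Under this assumed model the ratio at each spot equals the true concentration divided by the constant control amount, so the first two spots, which carry the same concentration but different artifact factors, receive the same ratio.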

In summary, we demonstrate a systematic spatial artifact that arises from microarray experiments. The source of the artifact is not fully understood. We show that a local mean normalization method is useful but cannot completely solve the problem. Finally, we propose experimental and analytical procedures to quantify and manage this artifact.

ACKNOWLEDGMENTS

J.Q., Y.K., and H.Y. contributed equally to this work. The authors are grateful to Dr. Nicholas Luscombe and Dr. Mike Snyder for useful discussion and acknowledge support from the National Institutes of Health (grant no. NP50 HG02357-01).

REFERENCES

1. Alter, O., P.O. Brown, and D. Botstein. 2000. Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci. USA 97:10101-10106.

2. Ben-Dor, A., R. Shamir, and Z. Yakhini. 1999. Clustering gene expression patterns. J. Comput. Biol. 6:281-297.

3. Eisen, M.B., P.T. Spellman, P.O. Brown, and D. Botstein. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95:14863-14868.

4. Heyer, L.J., S. Kruglyak, and S. Yooseph. 1999. Exploring expression data: identification and analysis of coexpressed genes. Genome Res. 9:1106-1115.

5. Tamayo, P., D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E.S. Lander, and T.R. Golub. 1999. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA 96:2907-2912.

6. Tavazoie, S., J.D. Hughes, M.J. Campbell, R.J. Cho, and G.M. Church. 1999. Systematic determination of genetic network architecture. Nat. Genet. 22:281-285.

7. Toronen, P., M. Kolehmainen, G. Wong, and E. Castren. 1999. Analysis of gene expression data using self-organizing maps. FEBS Lett. 451:142-146.

8. Dudoit, S., Y.H. Yang, M. Callow, and T. Speed. 2000. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical report no. 578 (http://stat-www.berkeley.edu/users/terry/zarray/Html/matt.html).

9. Baldi, P. and A.D. Long. 2001. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 17:509-519.

10. Fielden, M.R., R.G. Halgren, E. Dere, and T.R. Zacharewski. 2002. GP3: GenePix post-processing program for automated analysis of raw microarray data. Bioinformatics (Oxford) 18:771-773.

11. Finkelstein, D., R. Ewing, J. Gollub, F. Sterky, J.M. Cherry, and S. Somerville. 2002. Microarray data quality analysis: lessons from the AFGC project. Arabidopsis Functional Genomics Consortium. Plant Mol. Biol. 48:119-131.

12. Mutch, D.M., A. Berger, R. Mansourian, A. Rytz, and M.A. Roberts. 2002. The limit fold change model: a practical approach for selecting differentially expressed genes from microarray data. BMC Bioinformatics 3:17.

13. Tran, P.H., D.A. Peiffer, Y. Shin, L.M. Meek, J.P. Brody, and K.W. Cho. 2002. Microarray optimizations: increasing spot accuracy and automated identification of true microarray signals. Nucleic Acids Res. 30:e54.

14. Tseng, G.C., M.K. Oh, L. Rohlin, J.C. Liao, and W.H. Wong. 2001. Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variations and assessment of gene effects. Nucleic Acids Res. 29:2549-2557.

15. Yang, Y.H., S. Dudoit, P. Luu, D.M. Lin, V. Peng, J. Ngai, and T.P. Speed. 2002. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 30:e15.

16. Colantuoni, C., G. Henry, S. Zeger, and J. Pevsner. 2002. Local mean normalization of microarray element signal intensities across an array surface: quality control and correction of spatially systematic artifacts. BioTechniques 32:1316-1320.

17. DeRisi, J., V. Iyer, and P. Brown. 1997. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278:680-686.

18. Spellman, P., G. Sherlock, M. Zhang, V. Iyer, K. Anders, M. Eisen, P. Brown, D. Botstein, and B. Futcher. 1998. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 9:3273-3297.

19. Zhu, G., P.T. Spellman, T. Volpe, P.O. Brown, D. Botstein, T.N. Davis, and B. Futcher. 2000. Two yeast forkhead genes regulate the cell cycle and pseudohyphal growth. Nature 406:90-94.

20. Cohen, B., R. Mitra, J. Hughes, and G. Church. 2000. A computational analysis of whole-genome expression data reveals chromosomal domains of gene expression. Nat. Genet. 26:183-186.

Received 12 November 2002; accepted 11 April 2003.

Address correspondence to Mark Gerstein, MB&B Department, Bass 432A, 266 Whitney Avenue, Yale University, New Haven, CT 06520, USA. e-mail: [email protected]