Question

Changing results depending on context

0

Entering edit mode

January Weiner ▴ 370

@january-weiner-3999

Last seen 9.6 years ago

Dear all, I am evaluating a collection of two-color arrays. There are several sets corresponding to different tissues, there is a treatment (common for each tissue), and the treatment effect is analysed independently in each of the tissues. I am analysing the data using limma. The results in terms of the number of significant genes vary widely depending whether the tissues are analysed separately or jointly. Since the matter is even a bit more complicated, I include only pseudocode below. I guess that this depends on the background variation estimation in eBayes, but the trend is not consistent, e.g. for one tissue, more (many more!) significant results are found when analysed separately, and for another tissue the results are much "better" (ie. more singificant results) if analysed along other data. targets <- readTargets( "targets.txt" ) # Agilent two-color microarrays rg.1 <- read.maimages( targets, columns= list( G= "gMedianSignal", Gb= "gBGMedianSignal", R="rMedianSignal", Rb = "rBGMedianSignal"), annotation = c("Row", "Col","FeatureNum", "ControlType","ProbeName", "GeneName", "SystematicName", "Description" )) rg.2 <- backgroundCorrect( rg.1, method= "normexp", offset=50 ) rg.3 <- normalizeWithinArrays( rg.2, method= "loess" ) rg <- normalizeBetweenArrays( rg.3, method= "quantile" ) rg <- rg[ rg$genes$ControlType == 0, ] # analysing all data together: t2 <- targetsA2C( targets ) design <- model.matrix( ~ 0 + t2$Target ) colnames( design ) <- levels( t2$Target ) corfit <- intraspotCorrelation( rg, design ) fit <- lmscFit( rg, design, correlation= corfit$consensus.correlation ) cmtx <- makeContrasts( "A.T1-A.T2", "B.T1-B.T2", levels= design ) fit <- contrasts.fit( fit, cmtx ) fit <- eBayes( fit ) Above, A and B are tissues, T1 and T2 are treatments. Here, some exemplary results showing the number of significant genes: adj.P.Val < 0.05: A.T1-A.T2: 5601 B.T1-B.T2: 3914 adj.P.Val < 1e-5 A.T1-A.T2: 672 B.T1-B.T2: 758 Now, I analyse the data separately: A.targets <- targets[ 1:16, ] A.rg <- rg[ 1:16, ] A.t2 <- targetsA2C( A.targets ) A.design <- model.matrix( ~ 0 + A.t2$Target ) colnames( A.design ) <- levels( A.t2$Target ) A.corfit <- intraspotCorrelation( A.rg, A.design ) A.fit <- lmscFit( A.rg, A.design, correlation= A.corfit$consensus.correlation ) A.cmtx <- makeContrasts( "A.T1-A.T2", levels= A.design ) A.fit <- contrasts.fit( A.fit, cmtx ) A.fit <- eBayes( A.fit ) B.targets <- targets[ 17:32, ] B.rg <- rg[ 17:32, ] B.t2 <- targetsA2C( B.targets ) B.design <- model.matrix( ~ 0 + B.t2$Target ) colnames( B.design ) <- levels( B.t2$Target ) B.corfit <- intraspotCorrelation( B.rg, B.design ) B.fit <- lmscFit( B.rg, B.design, correlation= B.corfit$consensus.correlation ) B.cmtx <- makeContrasts( "B.T1-B.T2", levels= B.design ) B.fit <- contrasts.fit( B.fit, cmtx ) B.fit <- eBayes( B.fit ) Numbers of significant data are now much, much different: adj.P.Val < 0.05: A.T1-A.T2: 3347 B.T1-B.T2: 5443 adj.P.Val < 1e-5 A.T1-A.T2: 108 B.T1-B.T2: 1102 six times less for tissue A, but almost 50% more for tissue B! How can this be? The log fold changes are stable (i.e., they do not change between the various sets), which is to be expected -- changing the context influences the moderated t statistics, but not the estimation of log fold change. However, the p-values are not simply smaller or larger, there is little correlation of the p-values from one context to another. I'm a bit lost in here. Kind regards, January -- -------- Dr. January Weiner 3 -------------------------------------- Max Planck Institute for Infection Biology Charit?platz 1 D-10117 Berlin, Germany Web?? : www.mpiib-berlin.mpg.de Tel? ?? : +49-30-28460514 Fax ? ?: +49-30-28450505

limma limma • 718 views

ADD COMMENT • link updated 11.9 years ago by James W. MacDonald 65k • written 11.9 years ago by January Weiner ▴ 370

score 0 · Answer 1 · 2012-06-12

Hi January, When you analyze all data together, the denominator of your t-statistic is based on an average of the intra-group variability. In general this will tend to make an analysis more powerful, because you are using more data to compute the variance estimate (which is not a very efficient statistic, so more data tend to improve the estimate). Conversely, when you analyze the two groups separately, the denominator of your t-statistic is based only on the intra-group variance for the two groups under consideration. Without having you data in hand, I can only guess what is going on. However, to me the most likely cause of this is a high variance in one or more of your 'A' groups. This would explain why when you combine the analyses you get fewer genes in the 'B' group and more genes in the 'A' group, and vice versa when you do things separately. There are several ways to see if you have one or more problem chips in the A group. You could look at MA plots using plotMA(), you could use arrayWeights() or arrayWeightsSimple() to see if some of the arrays get severely down-weighted. You could also do a PCA plot of the normalized M values to look at the overall grouping structure. Best, Jim On 6/12/2012 7:15 AM, January Weiner wrote: > Dear all, > > I am evaluating a collection of two-color arrays. There are several > sets corresponding to different tissues, there is a treatment (common > for each tissue), and the treatment effect is analysed independently > in each of the tissues. I am analysing the data using limma. > > The results in terms of the number of significant genes vary widely > depending whether the tissues are analysed separately or jointly. > Since the matter is even a bit more complicated, I include only > pseudocode below. I guess that this depends on the background > variation estimation in eBayes, but the trend is not consistent, e.g. > for one tissue, more (many more!) significant results are found when > analysed separately, and for another tissue the results are much > "better" (ie. more singificant results) if analysed along other data. > > targets<- readTargets( "targets.txt" ) > # Agilent two-color microarrays > rg.1<- read.maimages( targets, columns= list( G= "gMedianSignal", > Gb= "gBGMedianSignal", R="rMedianSignal", Rb = "rBGMedianSignal"), > annotation = c("Row", "Col","FeatureNum", "ControlType","ProbeName", > "GeneName", "SystematicName", "Description" )) > rg.2<- backgroundCorrect( rg.1, method= "normexp", offset=50 ) > rg.3<- normalizeWithinArrays( rg.2, method= "loess" ) > rg<- normalizeBetweenArrays( rg.3, method= "quantile" ) > rg<- rg[ rg$genes$ControlType == 0, ] > > # analysing all data together: > t2<- targetsA2C( targets ) > design<- model.matrix( ~ 0 + t2$Target ) > colnames( design )<- levels( t2$Target ) > corfit<- intraspotCorrelation( rg, design ) > fit<- lmscFit( rg, design, correlation= corfit$consensus.correlation ) > cmtx<- makeContrasts( "A.T1-A.T2", "B.T1-B.T2", levels= design ) > fit<- contrasts.fit( fit, cmtx ) > fit<- eBayes( fit ) > > Above, A and B are tissues, T1 and T2 are treatments. > > Here, some exemplary results showing the number of significant genes: > > adj.P.Val< 0.05: > A.T1-A.T2: 5601 > B.T1-B.T2: 3914 > > adj.P.Val< 1e-5 > A.T1-A.T2: 672 > B.T1-B.T2: 758 > > Now, I analyse the data separately: > > A.targets<- targets[ 1:16, ] > A.rg<- rg[ 1:16, ] > A.t2<- targetsA2C( A.targets ) > A.design<- model.matrix( ~ 0 + A.t2$Target ) > colnames( A.design )<- levels( A.t2$Target ) > A.corfit<- intraspotCorrelation( A.rg, A.design ) > A.fit<- lmscFit( A.rg, A.design, correlation= A.corfit$consensus.correlation ) > A.cmtx<- makeContrasts( "A.T1-A.T2", levels= A.design ) > A.fit<- contrasts.fit( A.fit, cmtx ) > A.fit<- eBayes( A.fit ) > > > B.targets<- targets[ 17:32, ] > B.rg<- rg[ 17:32, ] > B.t2<- targetsA2C( B.targets ) > B.design<- model.matrix( ~ 0 + B.t2$Target ) > colnames( B.design )<- levels( B.t2$Target ) > B.corfit<- intraspotCorrelation( B.rg, B.design ) > B.fit<- lmscFit( B.rg, B.design, correlation= B.corfit$consensus.correlation ) > B.cmtx<- makeContrasts( "B.T1-B.T2", levels= B.design ) > B.fit<- contrasts.fit( B.fit, cmtx ) > B.fit<- eBayes( B.fit ) > > Numbers of significant data are now much, much different: > > adj.P.Val< 0.05: > A.T1-A.T2: 3347 > B.T1-B.T2: 5443 > > adj.P.Val< 1e-5 > A.T1-A.T2: 108 > B.T1-B.T2: 1102 > > six times less for tissue A, but almost 50% more for tissue B! How can this be? > > The log fold changes are stable (i.e., they do not change between the > various sets), which is to be expected -- changing the context > influences the moderated t statistics, but not the estimation of log > fold change. However, the p-values are not simply smaller or larger, > there is little correlation of the p-values from one context to > another. I'm a bit lost in here. > > Kind regards, > > January > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099