limma, analysis on subset of data gives completely different results

0

Entering edit mode

Arvid Sondén ▴ 40

@arvid-sonden-6581

Last seen 11.4 years ago

Dear all, I am currently working with gene expression analysis in limma. I have a total of 146 samples divided into 21 groups. What I want to do is pairwise comparisons between one group (the control group) and the others. The following code shows this for the first pairwise comparison between group B and the control group, also adding batch effects to the model. All groups are included in the "Group" variable. design <- model.matrix(~0+Group+Batch) fit<-lmFit(y$E,design) cont <- makeContrasts( " GroupB- GroupControl", levels=design) fit <- contrasts.fit(fit, cont) fit <- eBayes(fit) tt <- topTable(fit, adjust="BH", coef=" GroupB-GroupControl ", genelist=y$genes, number=Inf) >From the beginning I was only working with this first comparison, and was only using the data from group B and the control group. Now I have extended this to all the data and all the pairwise comparisons. Since I am using all of the data in the lmFit function the fit is different from before when I was only using a part of the data. What makes me confused is that the difference is quite large. Now I have 1378 significant genes compared to 203 before for the GroupB-GroupControl comparison after the BH correction. Is there a possible limma specific explanation for this? I have read the documentation on the functions, and the limma user's guide, but I can't say that I have fully understood what is going on inside the lmFit function. On a more conceptual level I understand that the linear model will change when I add new data and new variables, but it seems to be a too large change in my eyes since the actual comparison is still the same. Best regards, Arvid

limma limma • 4.2k views

ADD COMMENT • link updated 22 months ago by Gordon Smyth 53k • written 11.5 years ago by Arvid Sondén ▴ 40

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 9 hours ago

United States

Hi Arvid, On 7/17/2014 9:59 AM, Arvid Sond?n wrote: > Dear all, > > I am currently working with gene expression analysis in limma. I have a total of 146 samples divided into 21 groups. What I want to do is pairwise comparisons between one group (the control group) and the others. The following code shows this for the first pairwise comparison between group B and the control group, also adding batch effects to the model. All groups are included in the "Group" variable. > > design <- model.matrix(~0+Group+Batch) > fit<-lmFit(y$E,design) > cont <- makeContrasts( " GroupB- GroupControl", levels=design) > fit <- contrasts.fit(fit, cont) > fit <- eBayes(fit) > tt <- topTable(fit, adjust="BH", coef=" GroupB-GroupControl ", genelist=y$genes, number=Inf) > >>From the beginning I was only working with this first comparison, and was only using the data from group B and the control group. Now I have extended this to all the data and all the pairwise comparisons. Since I am using all of the data in the lmFit function the fit is different from before when I was only using a part of the data. What makes me confused is that the difference is quite large. Now I have 1378 significant genes compared to 203 before for the GroupB- GroupControl comparison after the BH correction. > > Is there a possible limma specific explanation for this? I have read the documentation on the functions, and the limma user's guide, but I can't say that I have fully understood what is going on inside the lmFit function. On a more conceptual level I understand that the linear model will change when I add new data and new variables, but it seems to be a too large change in my eyes since the actual comparison is still the same. > The eBayes() step may have an effect since you are using a larger number of samples, and hypothetically the prior would be more accurate. But I would imagine that has little to do with it, as the eBayes() step is primarily intended to help smooth the variance when you have very few replicates. Once you get past a certain number of samples I would imagine it has a diminishing return. On the other hand, you are increasing your degrees of freedom markedly by including all the other groups, so your variance estimates will be much more accurate (and you are then borrowing information from all groups to estimate the intra-group variance, which is how ANOVA works, and isn't something specific to limma). I would think this has the largest effect on your results. Best, Jim > Best regards, > > Arvid > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 11.5 years ago James W. MacDonald 68k

0

Entering edit mode

On Thu, Jul 17, 2014 at 11:21 AM, James W. MacDonald <jmacdon@uw.edu> wrote: > On the other hand, you are increasing your degrees of freedom markedly by > including all the other groups, so your variance estimates will be much > more accurate (and you are then borrowing information from all groups to > estimate the intra-group variance, which is how ANOVA works, and isn't > something specific to limma). I would think this has the largest effect on > your results. > The dark side of this notion is that if your groupB happens to have higher true variance than the other groups (i.e. heteroscedastic), then by including the other groups you've shrunken the variance too much, and inflated significance. ANOVA (and limma) assumes all groups have the same intra-group variance. We see this when we (mistakenly) co-model data from different groups for genes that are "off" in many groups (and thus have much smaller variance driven by measurement noise) but "on" and more biologically variable in a subset of groups. You can use your two fits to compare how much the estimated variance has decreased with the complete dataset, especially for your newly significant genes, and whether the mean expression of those genes is also much lower in the complete dataset. -Aaron [[alternative HTML version deleted]]

ADD REPLY • link 11.5 years ago Aaron Mackey ▴ 200

0

Entering edit mode

Dear All, The dark side of this notion is that if your groupB happens to have higher > true variance than the other groups (i.e. heteroscedastic), then by > including the other groups you've shrunken the variance too much, and > inflated significance. This appears to be an issue not only with "lmFit", but also with camera. Although I have observed that including additional data increased p-values for the genes and gene sets, respectively (for all the groups). Marcin [[alternative HTML version deleted]]

ADD REPLY • link 11.5 years ago Marcin Cieślik ▴ 20

0

Entering edit mode

Axel Klenk ★ 1.1k

@axel-klenk-3224

Last seen 9 hours ago

Barcelona, Spain

Dear Arvid, you can check for yourself: if what you see is a result of limma's eBayes() step then the logFC's of your two analyses should remain the same while t-statistics and p-values etc. should improve when using all of your data. Cheers, - axel On Thu, Jul 17, 2014 at 3:59 PM, Arvid SondÃ©n <arvid.sonden@gu.se> wrote: > Dear all, > > I am currently working with gene expression analysis in limma. I have a > total of 146 samples divided into 21 groups. What I want to do is pairwise > comparisons between one group (the control group) and the others. The > following code shows this for the first pairwise comparison between group B > and the control group, also adding batch effects to the model. All groups > are included in the "Group" variable. > > design <- model.matrix(~0+Group+Batch) > fit<-lmFit(y$E,design) > cont <- makeContrasts( " GroupB- GroupControl", levels=design) > fit <- contrasts.fit(fit, cont) > fit <- eBayes(fit) > tt <- topTable(fit, adjust="BH", coef=" GroupB-GroupControl ", > genelist=y$genes, number=Inf) > > >From the beginning I was only working with this first comparison, and was > only using the data from group B and the control group. Now I have extended > this to all the data and all the pairwise comparisons. Since I am using all > of the data in the lmFit function the fit is different from before when I > was only using a part of the data. What makes me confused is that the > difference is quite large. Now I have 1378 significant genes compared to > 203 before for the GroupB-GroupControl comparison after the BH correction. > > Is there a possible limma specific explanation for this? I have read the > documentation on the functions, and the limma user's guide, but I can't say > that I have fully understood what is going on inside the lmFit function. On > a more conceptual level I understand that the linear model will change when > I add new data and new variables, but it seems to be a too large change in > my eyes since the actual comparison is still the same. > > Best regards, > > Arvid > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Axel Klenk Research Informatician Information Management Drug Discovery Actelion Pharmaceuticals Ltd. â¢ Gewerbestrasse 16 â¢ CH-4123 Allschwil â¢ Switzerland H91.O1.W01.2 axel.klenk@actelion.com â¢ www.actelion.com Address for visitors: Hegenheimermattweg 91 -- The information of this email and in any file transmitted with it is strictly confidential and may be legally privileged. It is intended solely for the addressee. If you are not the intended recipient, any copying, distribution or any other use of this email is prohibited and may be unlawful. In such case, you should please notify the sender immediately and destroy this email. The content of this email is not legally binding unless confirmed by letter. Any views expressed in this message are those of the individual sender, except where the message states otherwise and the sender is authorised to state them to be the views of the sender's company. For further information about Actelion please see our website at http://www.actelion.com [[alternative HTML version deleted]]

ADD COMMENT • link 11.5 years ago Axel Klenk ★ 1.1k

0

Entering edit mode

Gordon Smyth 53k

@gordon-smyth

Last seen 13 minutes ago

WEHI, Melbourne, Australia

Dear Arvid,

As Jim has indicated, you have increased the number of samples you are using to estimate the standard errors more than ten-fold (from 2 groups to 21 groups). You can now call DE results with more confidence, even though the fold changes themselves remain the same.

From what you say, the results appear to be not "completely different", but simply more significant than before, this is hardly surprising given the huge increase in residual degrees of freedom.

Best wishes
Gordon

ADD COMMENT • link 11.5 years ago • updated 22 months ago Gordon Smyth 53k

0

Entering edit mode

Gordon Smyth 53k

@gordon-smyth

Last seen 13 minutes ago

WEHI, Melbourne, Australia

Dear All,

The dark side of this notion is that if your groupB happens to have higher true variance than the other groups (i.e. heteroscedastic), then by including the other groups you've shrunken the variance too much, and inflated significance.

This appears to be an issue not only with "lmFit", but also with camera. Although I have observed that including additional data increased p-values for the genes and gene sets, respectively (for all the groups).

Marcin

As Jim had already pointed out (in comments included in Aaron's post but deleted from Marcin's), the assumption of equal variances is a property of linear models and anova in general, not specific to limma let alone to lmFit.

If you consider that groups have substantially difference variances, then limma provides functions arrayWeights() and voomaByGroup() to deal with this.

However the linear regression methods are not particulary sensitive to unequal variances, and results usually become more conservative in this situation rather than the other way around. You would only try to fix if the problem is substantial or there are plenty of replicates in each group.

Best wishes
Gordon

ADD COMMENT • link 11.5 years ago • updated 22 months ago Gordon Smyth 53k

Login before adding your answer.