What causes the difference in the topTable results between the two limma analyses?
Rao,Xiayu ▴ 550
@raoxiayu-6003
Last seen 8.9 years ago
United States

Hello,

I have a design like the one below, and I would like to compare (1) cont1 vs. treat1 and (2) cont2 vs. treat2. It is interesting to see that the results differ depending on which of the following two ways I use to analyze the data; for example, the topTables for cont1 vs. treat1 are different between the two. Could you please let me know what causes the difference, and which is the correct way to go?

1. Use a design matrix built from all 12 samples, i.e. model.matrix(~0+type), set up the contrasts diff1 = treat1 - cont1 and diff2 = treat2 - cont2, and look at the results of topTable(fit, coef=1) and topTable(fit, coef=2) (a code sketch follows the table below).

samples    type      Experiment
1          cont1     1
2          cont1     1
3          cont1     1
4          treat1    1
5          treat1    1
6          treat1    1
7          cont2     2
8          cont2     2
9          cont2     2
10         treat2    2
11         treat2    2
12         treat2    2
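For concreteness, a minimal sketch of this first approach, assuming a normalized log-expression matrix expr and a factor type with levels cont1, treat1, cont2, treat2 (these object names are placeholders, not from the original post):

    library(limma)

    # One design column per group, no intercept.
    design <- model.matrix(~0 + type)
    colnames(design) <- levels(type)

    # Fit all 12 samples at once, then extract the two contrasts.
    fit <- lmFit(expr, design)
    cm  <- makeContrasts(diff1 = treat1 - cont1,
                         diff2 = treat2 - cont2,
                         levels = design)
    fit2 <- eBayes(contrasts.fit(fit, cm))

    topTable(fit2, coef = 1)  # cont1 vs. treat1
    topTable(fit2, coef = 2)  # cont2 vs. treat2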

 

2. Subset the data to the first 6 samples, use a design matrix based on those 6 samples only, and look at the result of topTable(fit); the 6 samples from experiment 2 are analyzed separately in the same way (a code sketch follows the table below).

samples    type      Experiment
1          cont1     1
2          cont1     1
3          cont1     1
4          treat1    1
5          treat1    1
6          treat1    1
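A matching sketch of this second approach, under the same assumptions, restricted to experiment 1 (experiment 2 would be handled the same way on samples 7-12):

    # Keep only the 6 samples from experiment 1.
    keep  <- type %in% c("cont1", "treat1")
    type1 <- droplevels(type[keep])

    design1 <- model.matrix(~0 + type1)
    colnames(design1) <- levels(type1)

    fit1 <- lmFit(expr[, keep], design1)
    cm1  <- makeContrasts(treat1 - cont1, levels = design1)
    fit1 <- eBayes(contrasts.fit(fit1, cm1))
    topTable(fit1)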

 

Thank you very much!

Best regards,

Xiayu

limma topTable
ADD COMMENT
Aaron Lun ★ 28k
@alun
Last seen 20 hours ago
The city by the bay

The first approach uses the entire dataset and will give you more degrees of freedom for variance estimation. This will result in a more precise variance estimate for each gene. Ultimately, this should result in greater power for detecting DE genes across each of your contrasts.

The flipside is that the first approach is only sensible when the variance of each gene is similar between the first and second experiments. This is because limma will only estimate a single variance for each gene. If the variances of the two experiments are systematically different, the single estimated value will not accurately model the data from either experiment.

It's generally recommended to use limma on the entire dataset (i.e., your first approach). The second approach should only be considered if there is substantial evidence for different variances between experiments.
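To put a number on the degrees-of-freedom point, here is a quick check, assuming fit_full and fit_sub are the fitted model objects from the first and second approaches above (placeholder names):

    # 12 samples minus 4 fitted group means = 8 residual df per gene;
    # 6 samples minus 2 fitted group means = 4 residual df per gene.
    fit_full$df.residual[1]  # 8
    fit_sub$df.residual[1]   # 4

The empirical Bayes moderation in eBayes() then adds the prior df on top of these, but the full fit still starts from twice as many residual df.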

ADD COMMENT

Thank you very much! That's clear. 

ADD REPLY

Rigging up some analyses to identify this "breaking point" (i.e., when to use all the data vs. a subset) would be of great interest, I think. Are you aware of any publications/tutorials/vignettes/whatever that touch on this at all?

ADD REPLY

I'm not aware of any studies on this issue. I agree that it would be useful to know, especially when deciding how to analyze large datasets. Generally speaking, though, I deal with smaller datasets where there aren't many replicates, so I take all the residual degrees of freedom I can get.

ADD REPLY

The short answer is that the breaking point is so high that in such cases the samples would usually be best considered as arising from different experiments.

The subsets have to be very different indeed (e.g., completely different cell types) before analyzing them separately is beneficial. If all the arrays truly arise from the same experiment on the same or similar cell types, then it is usually best to include them all in the linear model fits. If there is reason to think that one group (tumors, perhaps) is more variable than the others, then this can be accommodated in limma, for example by using the voomByGroup function, rather than by breaking the dataset up.

ADD REPLY

Thanks for chiming in here -- did you mean voomaByGroup, or are you guys working on a voomByGroup, too?

Also, out of curiosity: I've seen the vooma and voomaByGroup functions hiding in the limma package, but haven't had much cause to play with them as I don't work on microarray data (much) these days. Are you folks working on a publication for those, or a more technical writeup somewhere?

ADD REPLY

Ah, yes, thanks, I did mean voomaByGroup. For RNA-seq data with the same design as above, one could get the same effect as voomaByGroup simply by running voom on subsetted data and cbinding the outputs. We will probably have a more elegant interface for this sort of thing in the package before the next Bioc release -- it will be a generalization of the sample quality weights function.

Would be nice to write it up, but time is the enemy. We'll see.
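A rough sketch of the voom-and-cbind idea mentioned above, assuming an RNA-seq DGEList dge whose columns follow the sample order in the question, plus the type and Experiment factors (all object names are placeholders):

    # Run voom separately within each experiment so that each one gets
    # its own mean-variance trend and observation-level weights.
    keep1 <- Experiment == 1
    keep2 <- Experiment == 2
    g1 <- droplevels(type[keep1])
    g2 <- droplevels(type[keep2])
    v1 <- voom(dge[, keep1], model.matrix(~0 + g1))
    v2 <- voom(dge[, keep2], model.matrix(~0 + g2))

    # Combine the per-experiment EList objects (limma provides a cbind
    # method) and fit the full four-group design as usual.
    v <- cbind(v1, v2)
    design <- model.matrix(~0 + type)
    colnames(design) <- levels(type)
    fit <- lmFit(v, design)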

ADD REPLY
