Hello,
I have the design shown below, and I would like to compare (1) cont1 vs. treat1 and (2) cont2 vs. treat2. Interestingly, the results differ depending on which of the following two ways I analyze the data; for example, the topTable results for cont1 vs. treat1 are not the same under the two approaches. Could you please let me know what causes the difference, and which is the correct way to go?
1. Use a design matrix built from all 12 samples, i.e. model.matrix(~0+type), set up the contrasts diff1=treat1-cont1 and diff2=treat2-cont2, and look at topTable(fit, coef=1) and topTable(fit, coef=2). (A sketch of this follows the table below.)
samples  type    Experiment
1        cont1   1
2        cont1   1
3        cont1   1
4        treat1  1
5        treat1  1
6        treat1  1
7        cont2   2
8        cont2   2
9        cont2   2
10       treat2  2
11       treat2  2
12       treat2  2
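For reference, a minimal sketch of way 1, assuming the normalized log-expression matrix is called y and the group factor is coded in sample order as below (the object names are assumptions, not from the original post):

    library(limma)

    # y: matrix of normalized log-expression values (rows = genes, columns = the 12 samples)
    type <- factor(rep(c("cont1", "treat1", "cont2", "treat2"), each = 3))
    design <- model.matrix(~ 0 + type)
    colnames(design) <- levels(type)

    fit <- lmFit(y, design)
    contr <- makeContrasts(diff1 = treat1 - cont1, diff2 = treat2 - cont2, levels = design)
    fit <- eBayes(contrasts.fit(fit, contr))

    topTable(fit, coef = 1)  # diff1: treat1 vs cont1
    topTable(fit, coef = 2)  # diff2: treat2 vs cont2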
2. Subset the data to the first 6 samples, use a design matrix based on those 6 samples only, and look at the result of topTable(fit). The other 6 samples from experiment 2 are analyzed separately in the same way. (A sketch of this follows the table below.)
samples  type    Experiment
1        cont1   1
2        cont1   1
3        cont1   1
4        treat1  1
5        treat1  1
6        treat1  1
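And a corresponding sketch of way 2 for experiment 1, again assuming the full expression matrix is called y:

    library(limma)

    # experiment 1 only: keep the first 6 samples of the same matrix y
    y1 <- y[, 1:6]
    type1 <- factor(rep(c("cont1", "treat1"), each = 3))
    design1 <- model.matrix(~ 0 + type1)
    colnames(design1) <- levels(type1)

    fit1 <- lmFit(y1, design1)
    fit1 <- contrasts.fit(fit1, makeContrasts(diff1 = treat1 - cont1, levels = design1))
    fit1 <- eBayes(fit1)
    topTable(fit1)  # treat1 vs cont1, estimated from the experiment 1 samples only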
Thank you very much!
Best regards,
Xiayu
Thank you very much! That's clear.
Rigging up some analyses to identify this "breaking point" (i.e. when to use all the data vs. a subset) would be of great interest, I think. Are you aware of any publications/tutorials/vignettes/whatever that touch on this at all?
I'm not aware of any studies on this issue. I agree that it would be useful to know, especially when deciding how to analyze large datasets. Generally speaking, though, I deal with smaller datasets where there aren't many replicates, so I take all the residual degrees of freedom I can get.
The short answer is that the breaking point is so high that in such cases the samples would usually be best considered as arising from different experiments.
The subsets have to be very different indeed (e.g. completely different cell types) before analyzing them separately is beneficial. If all the arrays truly arise from the same experiment on the same or similar cell types, then it is usually best to include them all in the linear model fit. If there is reason to think that one group (tumors perhaps) is more variable than the others, then this can be accommodated in limma, for example by using the voomByGroup function, rather than by breaking the dataset up.
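A minimal sketch of this group-wise weighting idea, using limma's voomaByGroup (the function name is clarified in the follow-up below); the object y, the grouping by type, and the contrasts are assumptions carried over from the original post:

    library(limma)

    # y: matrix of normalized log-expression values for all 12 samples
    type <- factor(rep(c("cont1", "treat1", "cont2", "treat2"), each = 3))
    design <- model.matrix(~ 0 + type)
    colnames(design) <- levels(type)

    # estimate a separate mean-variance trend, and hence separate observation
    # weights, for each group instead of splitting the dataset
    v <- voomaByGroup(y, group = type, design = design)

    fit <- lmFit(v, design)
    contr <- makeContrasts(diff1 = treat1 - cont1, diff2 = treat2 - cont2, levels = design)
    fit <- eBayes(contrasts.fit(fit, contr))
    topTable(fit, coef = 1)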
Thanks for chiming in here -- did you mean voomaByGroup, or are you guys working on a voomByGroup, too? Also, out of curiosity: I've seen the vooma and voomaByGroup functions hiding in the limma package, but haven't had much cause to play with them as I don't work on microarray data (much) these days. Are you folks working on a publication for those, or a more technical writeup somewhere?

Ah, yes, thanks, I did mean voomaByGroup. For RNA-seq data with the same design as above, one could get the same effect as voomaByGroup simply by running voom on subsetted data and cbinding the outputs. We will probably have a more elegant interface for this sort of thing in the package before the next Bioc release -- it will be a generalization of the sample quality weights function.
Would be nice to write it up, but time is the enemy. We'll see.
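For illustration, a minimal sketch of the voom-then-cbind approach described above, assuming the counts for all 12 samples are in a DGEList called dge with samples in the order shown earlier (the object names are assumptions):

    library(limma)
    library(edgeR)

    # dge: a DGEList of counts for all 12 samples, in the sample order shown above
    type <- factor(rep(c("cont1", "treat1", "cont2", "treat2"), each = 3))
    in1 <- type %in% c("cont1", "treat1")   # experiment 1 samples
    in2 <- type %in% c("cont2", "treat2")   # experiment 2 samples

    # run voom separately so that each experiment gets its own mean-variance trend
    v1 <- voom(dge[, in1], model.matrix(~ 0 + droplevels(type[in1])))
    v2 <- voom(dge[, in2], model.matrix(~ 0 + droplevels(type[in2])))

    # combine the two EList objects and fit one linear model to all 12 samples
    v <- cbind(v1, v2)
    design <- model.matrix(~ 0 + type)
    colnames(design) <- levels(type)
    fit <- lmFit(v, design)
    contr <- makeContrasts(diff1 = treat1 - cont1, diff2 = treat2 - cont2, levels = design)
    fit <- eBayes(contrasts.fit(fit, contr))

Here the cbind step simply binds the two voom outputs (log-CPM values and observation weights) back into one EList, so a single lmFit can be run across all 12 samples while each experiment keeps the trend estimated from its own samples.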