Search
Question: DESeq2 unequal sample sizes
0
14 months ago by

Hi,

I posted this question three weeks ago on BioStars, but now I'm posting it here too, hoping for some advice. I'm using DESeq2 to look for differentially expressed genes. I have three groups with different individuals: Treatment A, Treatment B, and Control.

Sample sizes (individuals) differ between the groups: A=9, B=5, Control=5.

If I want to test A directly against B, the sample size difference is of no problem, I understand that (asked already in this question).

But, when I want to test A vs Control and see how that differs from B vs Control, I worry about the unequal sample sizes, because I detect more significant genes in A just because of more accurate dispersion due to larger sample size.

What are the suggestions to deal with this problem? Should I randomly subsample 5 individuals from A (i.e. discard 4) and only use those in the comparison with the Controls? Or should I use all individuals in A but call significance with a stricter FDR threshold? If the latter, how do I know which FDR is the most suitable?

edit: Since Michael asked for more explicit description of the dataset:

There's treatment A, treatment B, Control (as described above) and at four time points (including day0 (before treatment)). I want to compare A-vs-Ctrl at day0, day1, day2, day3, (and over time with LRT). Then I want to do the same separately with B-vs-Ctrl.  Finally I want to see how AvsCtrl differed from BvsCtrl at each time point. I expect these to differ a lot, for example A may have 10 times as many significant genes during day1 as B did.

modified 14 months ago • written 14 months ago by glados0

You will have to tell us more about the biology of the experiment to get a good answer. Hence I start with a comment, or a question:

Why do you think that looking for differences between A-Ctrl and B-Ctrl gives you an answer better fitting your biological question than simply comparing A to B? (What *is* your question exactly?)

Let's say, A and B are some treatments.

Then the contrast A-Ctrl answers the question: For which genes to I have evidence that their expression is affected by to treatment A? And the contrast A-to=B answers: For which genes do I have evidence that their expression is affected differently by B than by A?

A and B are two different treatments with different individuals. A vs Ctrl will tell me which genes are different in treatment A, and B vs Ctrl will tell me which genes are different in treatment B.  I expect (and do get) quite different results from the two treatments, i.e. A vs Ctrl provides lots of sig genes, whereas not much is happening with B.

Comparing A vs B directly only tells me the difference between the two treatments, not how they compare and differ to the controls, which is the important question. I need to analyse A vs Ctrl separately and B vs Ctrl separately over time, and discuss how the two treatments are different to Ctrls in their responses.

As we had in a recent post on the support site: "Note that (C-A) - (B-A) = C - B."

Sure, but that doesn't make sense in my case since A and B are very different and I have several time points as well. So I need to analyse A vs Ctrl separately over time, and then B vs Ctrl separately over time. These two analyses will be my main results. How A and B compare to the controls. Only in the end, I intend to compare the treatment differences.

0
14 months ago by
Michael Love20k
United States
Michael Love20k wrote:

"I have several time points as well"

Can you revise your original post to fully describe your dataset? What are the time points, and what comparisons are you interested in making? Can you post the entire colData, e.g.: as.data.frame(colData(dds))

You shouldn't down-sample groups that have more samples with DESeq2, the method is designed to handle groups having different "sizes.

Sure, I have edited the post now, but I don't think the time points are really relevant to the question, so in order to keep it simple I only mentioned the groups. Since A and B are two very different treatments I need to analyse them separately and describe their results separately. In the end I want to compare how they differed. If I down-sample treatment A I retrieve about 40% fewer significant genes compared to using all individuals in A, so I definitely need to either do that or use a different FDR threshold.

It's definitely relevant to the question. You should use a totally different design formula if you have 3 groups, or if you have 3 groups each sampled at 4 time points, and want to control for time=0 expression levels in the non-control group.

I'd recommend starting with this example (which you will have a slight difference, because you have three conditions instead of two):

http://www.bioconductor.org/help/workflows/rnaseqGene/#time

When you say "Finally I want to see how AvsCtrl differed from BvsCtrl at each time point", do you mean more specifically that you want to compare B and A at each time point after time=0, controlling for differences in expression between B, A and Ctrl present at time=0?

Thank you very much for the advice. Are you saying I should put all three treatment groups into this formula?

DESeq(ddsTC, test="LRT", reduced = ~ treatment + time)

I understand that significant genes from this LRT will show those that changed during some time point in one of the groups, but it just feels very counter-intuitive because I have two very different treatments and controls, these three groups can't just be lumped together into one treatment group testing between time points? For example, resultnames like these don't make sense to me: "minute_30_vs_0", since it would need to be tested within each group separately. The resultnames would have to be: "day1_treatmentA_vs_day0_treatmentA" for it to make sense to me (alternatively "day1_treatmentA_vs_day1_control).

Are you advising not to compare to the controls at all? But instead only compare to time point 0 separately in each treatment group? That could potentially work for my question, but still, even if I did only that, wouldn't I expect to find many more significant genes in the group with large sample size? (since I get 40% more sig genes than expected in that group with Wald test).

When it comes to comparing A vs B, my intention has always been to evaluate separately what results they got compared to controls at each time point. I.e. treatment B has these 50 sig genes during day1, while treatment A has the same 50 sig genes+ 30 more during that same day. One worry I have related to not comparing to Ctrls: if geneX becomes slightly upregulated in A (but non-significantly), and slightly more upregulated in B, I believe I'd find this geneX significant when comparing B vs Ctrls, but comparing just B vs A would likely cause it to become non-significant, and therefore in that example "(C-A) - (B-A) = C - B" would not be true. Would you agree with that?

I'm going to recommend you to collaborate with a statistician or someone familiar with linear models, because I'm pressed for time right now. The interaction terms in the model can provide you with what you want (differences relative to the control time series, or differences across treatments relative to the control time series). See this diagram in the vignette for how to interpret the interaction terms. You can pull out and test individual interaction terms with 'name' and test="Wald", or you can compare them with contrast=list("...", "...") and test="Wald".

You can additionally perform a LRT for all the differences associated with one or the other treatment by forming full and reduced model.matrix(), and then passing these matrices to the 'full' and 'reduced' arguments (these arguments can take formula or matrices).