My experimental design is the following:
Replicate | Tissue_type | Genotype |
1 | A | HE |
2 | A | DI |
3 | A | HE |
1 | B | HE |
2 | B | HE |
3 | B | DI |
4 | B | DI |
I am interested in obtaining D.E. genes for the comparison HE vs DI within a given tissue type. I chose
to merge the tissue_type and genotype columns for the design, as recommended by the DESeq2 authors.
However, the number of samples is unbalanced in my experimental design. That means that, for instance,
one tissue type may have twice as many replicates as the other. I assume this implies that the comparison
HE vs DI would yield more D.E. genes at a given threshold for the tissue type with more replicates.
However, I want to know whether a given gene is D.E. in HE vs DI for tissue type A only, for tissue type B only, or for both.
So, I am wondering:
- Should I balance the sample sizes by randomly selecting replicates from the tissue type that contains more?
- Should I introduce an interaction term, so that my formula would become: ~ tissue_type + genotype + tissue_type:genotype?
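To make the imbalance concrete, the merged tissue/genotype grouping from the table above can be tabulated. This is a minimal Python sketch for illustration only, not DESeq2 code: the `samples` list is copied from the table, and the `A_HE`-style label format is an assumption.

```python
from collections import Counter

# Sample table copied from the post: (replicate, tissue_type, genotype).
samples = [
    (1, "A", "HE"), (2, "A", "DI"), (3, "A", "HE"),
    (1, "B", "HE"), (2, "B", "HE"), (3, "B", "DI"), (4, "B", "DI"),
]

# Merge tissue type and genotype into one label per sample.
# The "A_HE" label format is an illustrative assumption.
groups = [f"{tissue}_{genotype}" for _, tissue, genotype in samples]

# Count replicates per merged group to see where the design is unbalanced.
group_sizes = Counter(groups)
for group, n in sorted(group_sizes.items()):
    print(group, n)
```

In this table, the DI genotype in tissue A has a single replicate, while every other tissue/genotype combination has two.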
Thanks for the help!
Thanks Michael!
Hello! I am facing a similar situation: in a comparison of 15 vs 3 samples, I am getting many more genes than I would expect biologically. If my larger group has more statistical power, could that explain part of the number of genes, and if so, how should the results be interpreted? Could filtering on a higher logFC (in addition to padj) also help? Many thanks!
I don’t recommend changing the analysis in any way for an unbalanced design.
Michael Love, can you please elaborate on what you mean by there is more statistical power (sensitivity) for the groups that have more samples? What are the effects of having more statistical power for the larger group? Thank you in advance for your clarification.
It is well known that sensitivity increases with sample size. The OP had genotypes across different tissues. If certain tissues have more samples, they will have more power for the within-tissue genotype DE question.
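The relationship between sample size and sensitivity can be illustrated with a textbook approximation. The sketch below computes the power of a two-sided, two-sample z-test with known standard deviation. This is a deliberate simplification and not the negative binomial model DESeq2 fits; the function name `approx_power` and the effect size, standard deviation, and alpha values are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(n1, n2, effect=1.0, sd=1.0, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test with known sd.

    A simplification for illustration; not the test DESeq2 performs.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Standard error of the difference in group means.
    se = sd * sqrt(1 / n1 + 1 / n2)
    # Expected value of the test statistic under the alternative.
    shift = effect / se
    # Probability that the statistic falls in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# Group sizes taken from the designs discussed in this thread.
for n1, n2 in [(2, 1), (2, 2), (3, 3), (15, 3)]:
    print(f"n = {n1} vs {n2}: power ~ {approx_power(n1, n2):.2f}")
```

For the same true effect, the comparison with more samples has a higher chance of detecting it, so a tissue with more replicates is expected to yield more DE calls at a fixed threshold even when the underlying biology is the same.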