DESeq2 unbalanced sample size
casikecola • 0
@casikecola-16424

My experimental design is the following:

Replicate Tissue_type Genotype
1 A HE
2 A DI
3 A HE
1 B HE
2 B HE
3 B DI
4 B DI

I am interested in obtaining D.E. genes for the question HE vs DI within a given tissue type. I chose to merge the tissue_type and genotype columns into a single factor for the design, as recommended by the DESeq2 authors.

However, the number of samples is unbalanced in my experimental design. That means that, for instance, tissue type A may have twice as many replicates as tissue type B. This would imply that the question HE vs DI (for tissue type A) would yield more D.E. genes at a given threshold than for tissue type B.

However, I want to know whether a given gene is D.E. in HE vs DI for tissue type A but not for tissue type B, in both, and so on.

So, I am wondering:

- Should I balance the sample sizes by randomly selecting replicates from the tissue type that has more?

- Should I introduce an interaction term so my formula would become: ~ tissue_type + genotype + tissue_type:genotype

Thanks for the help!

deseq2 bioconductor bioinformatics statistics • 961 views
@mikelove

The unbalanced group size will affect power but I don't think that you should down-sample the larger group. You should just be aware that there is more statistical power (sensitivity) for the groups that have more samples.
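As a rough heuristic for why group size drives power: this is an illustrative assumption based on a plain two-group comparison of means with a common per-sample variance $\sigma^2$, not DESeq2's actual Wald statistic, which comes from a negative binomial GLM with shrunken dispersion estimates. Under that simplification the standard error of the difference between groups of size $n_1$ and $n_2$ is:

```latex
\[
  \mathrm{SE}(\bar{x}_1 - \bar{x}_2) \;=\; \sigma \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}
\]
```

For the sample table in the question, tissue A compares 2 HE against 1 DI, giving $\sqrt{1/2 + 1/1} \approx 1.22\,\sigma$, while tissue B compares 2 HE against 2 DI, giving $\sqrt{1/2 + 1/2} = 1\,\sigma$. The same true effect is therefore somewhat harder to detect in the comparison with fewer samples, and the smaller group dominates the sum.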

The interaction term is only useful here if you want to test for the interaction. If you want to test for DE within each tissue, then your approach is the simplest (this is discussed in the vignette section on interactions, with a diagram).
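The two designs discussed above could be set up roughly as follows. This is a minimal sketch, not a definitive analysis: the sample table is copied from the question, but the count matrix is simulated with `makeExampleDESeqDataSet()` purely as a placeholder, and the object names (`coldata`, `group`, `res_A`, `res_B`, `res_int`) and the 0.1 cutoff are assumptions chosen for illustration.

```r
library(DESeq2)

# Sample table from the question (7 samples, unbalanced groups).
coldata <- data.frame(
  tissue_type = factor(c("A", "A", "A", "B", "B", "B", "B")),
  genotype    = factor(c("HE", "DI", "HE", "HE", "HE", "DI", "DI"),
                       levels = c("DI", "HE"))  # DI as reference level
)
# Merged factor with levels A_DI, A_HE, B_DI, B_HE.
coldata$group <- factor(paste(coldata$tissue_type, coldata$genotype, sep = "_"))

# Placeholder counts (simulated, assumption): replace with the real matrix.
set.seed(1)
counts <- counts(makeExampleDESeqDataSet(n = 1000, m = nrow(coldata)))
rownames(coldata) <- colnames(counts)

# Design 1: merged factor, one HE vs DI contrast per tissue.
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
                              design = ~ group)
dds <- DESeq(dds)
res_A <- results(dds, contrast = c("group", "A_HE", "A_DI"))
res_B <- results(dds, contrast = c("group", "B_HE", "B_DI"))

# Cross-tabulate significance calls per tissue (NA = filtered genes).
sig_table <- table(A = res_A$padj < 0.1, B = res_B$padj < 0.1,
                   useNA = "ifany")

# Design 2: interaction model, only needed to test whether the
# HE vs DI effect itself differs between tissue B and tissue A.
dds_int <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
                                  design = ~ tissue_type + genotype +
                                    tissue_type:genotype)
dds_int <- DESeq(dds_int)
res_int <- results(dds_int, name = "tissue_typeB.genotypeHE")
```

When reading `sig_table`, a gene called significant in one tissue and not in the other is not by itself evidence that the effect differs between tissues, since the comparison with fewer samples has less power. The interaction result `res_int` is the direct test of that difference.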


Thanks Michael! 


Hello! I am facing a similar situation: in a comparison of 15 vs 3 samples I am getting many more genes than I would expect biologically. If my larger group has more statistical power, could part of that number of genes come from this, and if so, how should the results be interpreted? Could filtering by a higher logFC threshold (in addition to padj) also help? Many thanks!


I don’t recommend changing the analysis in any way for an unbalanced design.
