Question

DESeq2 unbalanced sample size

casikecola • 0

My experimental design is the following:

Replicate	Tissue_type	Genotype
1	A	HE
2	A	DI
3	A	HE
1	B	HE
2	B	HE
3	B	DI
4	B	DI

I am interested in obtaining D.E. genes for the comparison HE vs DI within a given tissue type. I chose to merge the tissue_type and genotype columns for the design, as recommended by the DESeq2 authors.

However, the number of samples is unbalanced in my experimental design. That means that, for instance, tissue type A may have twice as many replicates as tissue type B. This would imply that the comparison HE vs DI for tissue type A would yield more D.E. genes at a given threshold than for tissue type B.

However, I want to know whether a given gene is D.E. in HE vs DI for tissue type A but not for tissue type B, in both tissues, and so on.

So, I am wondering:

- Should I balance the sample sizes by randomly selecting replicates from the tissue type that has more?

- Should I introduce an interaction term, so that my formula becomes ~ tissue_type + genotype + tissue_type:genotype?

Thanks for the help!

deseq2 bioconductor bioinformatics statistics • 2.7k views

ADD COMMENT • link updated 14 months ago by Michael Love 41k • written 5.4 years ago by casikecola • 0

score 0 · Answer 1 · 2018-12-19

Michael Love 41k

The unbalanced group size will affect power but I don't think that you should down-sample the larger group. You should just be aware that there is more statistical power (sensitivity) for the groups that have more samples.

The interaction term is only useful here if you want to test for the interaction. If you want to test for DE within each tissue, then your approach is the simplest (this is discussed in the vignette section on interactions, with a diagram).
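As a rough illustration of the merged-column approach described in the question, the sketch below rebuilds the sample table from the original post in plain Python, merges tissue_type and genotype into a single label, and lists one HE vs DI comparison per tissue together with its replicate counts. It is not DESeq2 code: the label format (`A_HE`, etc.) and the variable names are assumptions made only for this example.

```python
from collections import Counter

# Sample table copied from the question: (replicate, tissue_type, genotype).
samples = [
    (1, "A", "HE"), (2, "A", "DI"), (3, "A", "HE"),
    (1, "B", "HE"), (2, "B", "HE"), (3, "B", "DI"), (4, "B", "DI"),
]

# Merge tissue_type and genotype into one grouping label per sample.
# The "tissue_genotype" label format is an assumption for this sketch.
groups = [f"{tissue}_{genotype}" for _, tissue, genotype in samples]

# Replicates per merged group; unequal counts mean unequal power per comparison.
group_sizes = Counter(groups)

# One HE-vs-DI comparison per tissue, written as (numerator, denominator).
tissues = sorted({tissue for _, tissue, _ in samples})
contrasts = [(f"{t}_HE", f"{t}_DI") for t in tissues]

for numerator, denominator in contrasts:
    print(numerator, "vs", denominator,
          "n =", group_sizes[numerator], "vs", group_sizes[denominator])
```

In DESeq2 itself, a pair like this would typically be passed as the `contrast` argument of `results()`, for example `contrast = c("group", "A_HE", "A_DI")`, assuming the merged column is named `group`.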

ADD COMMENT • link 5.4 years ago Michael Love 41k

Thanks Michael!

ADD REPLY • link 5.3 years ago casikecola • 0

Hello! I am facing a similar situation: in a comparison of 15 vs 3 samples, I am getting many more genes than I would expect biologically. If the larger group has more statistical power, could part of that gene count come from this, and if so, how should the results be interpreted? Could filtering by a higher logFC threshold (in addition to padj) also help? Many thanks!

ADD REPLY • link 4.2 years ago ch_el ▴ 10

I don’t recommend changing the analysis in any way for an unbalanced design.

ADD REPLY • link 4.2 years ago Michael Love 41k

Michael Love, can you please elaborate on what you mean by "there is more statistical power (sensitivity) for the groups that have more samples"? What are the effects of having more statistical power for the larger group? Thank you in advance for your clarification.

ADD REPLY • link 14 months ago Cen • 0

It is well known that sensitivity increases with sample size. The OP had genotypes across different tissues. If certain tissues have more samples, they will have more power for the within-tissue genotype DE question.
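To make the sample-size point concrete, here is a minimal sketch that computes the power of a two-sided two-sample z-test for a few group sizes. This is a textbook normal approximation with a known, equal standard deviation, not the negative binomial model DESeq2 fits; the function name `approx_power` and the default effect size, sigma, and alpha are assumptions chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n1, n2, effect=1.0, sigma=1.0, alpha=0.05):
    """Power of a two-sided two-sample z-test with known, equal sigma.

    Illustrative normal approximation only; not DESeq2's test.
    """
    std_normal = NormalDist()
    # Standard error of the difference between the two group means.
    se = sigma * sqrt(1.0 / n1 + 1.0 / n2)
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    shift = effect / se
    # Probability that the test statistic falls in either rejection tail.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Group sizes taken from the thread (2 vs 1, 2 vs 2, 15 vs 3) plus two
# balanced reference points.
for n1, n2 in [(2, 1), (2, 2), (3, 3), (15, 3), (15, 15)]:
    print(f"{n1} vs {n2}: power ~ {approx_power(n1, n2):.2f}")
```

Under this approximation the standard error shrinks as either group grows, so the same true effect is detected more often. Adding samples to only one group helps, but the smaller group limits the gain: 15 vs 3 sits between 3 vs 3 and 15 vs 15.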

ADD REPLY • link 14 months ago Michael Love 41k