I'd like to apply DESeq2 to breast cancer RNA-seq expression data to compare metastasis vs non-metastasis patient groups. I have 880 samples in the non-metastasis group and only 20 samples in the metastasis group. I have been looking into whether it makes sense to use DESeq2 (or any other differential expression analysis) with such a difference in sample sizes, but I could not find many resources to justify my study.
I only came across a few Biostars and Bioconductor posts questioning, for example, the use of 15 vs 3 samples. As far as I understand, DESeq2 generally works fine with unbalanced sample sizes, but does that hold for a 20 vs 880 comparison?
I also did a PubMed search, and as far as I can see there are no studies tackling this problem.
Thank you in advance.
Michael, thank you so much for your quick reply.
Hi, I have a similar problem in my analysis: 35 vs 800 samples. I found several posts where you commented that "There is no problem with the balance" for DESeq2. I found this figure comparing 3 vs 3 and 2 vs 3 samples in one of your replies, but do you have a literature reference supporting your statement for highly imbalanced datasets? Thank you in advance.
It's just that there is no breakdown point for linear models with imbalanced data. The estimates are not biased, although you lose efficiency (power).
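To illustrate the point, here is a minimal simulation sketch. It is not DESeq2: it assumes a plain two-group Gaussian linear model with equal variance in both groups (DESeq2 itself fits a negative binomial GLM), and the function names, the true difference of 1.0, and sigma = 1.0 are arbitrary choices made for this example. It estimates the group difference many times for a 20 vs 880 design and compares the spread of those estimates with the textbook standard error sigma * sqrt(1/n_small + 1/n_large).

```python
import math
import random
import statistics


def simulate_difference(n_small, n_large, true_diff=1.0, sigma=1.0,
                        n_reps=500, seed=0):
    """Repeatedly estimate a two-group mean difference.

    Assumption for illustration only: Gaussian data with the same
    standard deviation in both groups, not count data as in DESeq2.
    Returns (mean of the estimates, standard deviation of the estimates).
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_reps):
        # Small group (e.g. metastasis) is shifted by the true difference.
        small = [rng.gauss(true_diff, sigma) for _ in range(n_small)]
        # Large group (e.g. non-metastasis) is the reference at mean 0.
        large = [rng.gauss(0.0, sigma) for _ in range(n_large)]
        estimates.append(statistics.fmean(small) - statistics.fmean(large))
    return statistics.fmean(estimates), statistics.stdev(estimates)


def theoretical_se(n_small, n_large, sigma=1.0):
    """Standard error of the difference in means for two independent groups."""
    return sigma * math.sqrt(1.0 / n_small + 1.0 / n_large)


if __name__ == "__main__":
    for n_small, n_large in [(20, 880), (450, 450)]:
        mean_est, sd_est = simulate_difference(n_small, n_large)
        print(f"{n_small} vs {n_large}: "
              f"mean estimate = {mean_est:.3f}, "
              f"empirical SE = {sd_est:.3f}, "
              f"theoretical SE = {theoretical_se(n_small, n_large):.3f}")
```

Under these assumptions the average estimate should sit close to the true difference in both designs, while the standard error for 20 vs 880 (about 0.23 with sigma = 1) is dominated by the 20-sample group and is roughly three times larger than for a balanced 450 vs 450 split (about 0.07). That is the loss of efficiency mentioned above, with no bias in the estimate itself.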