I'd like to apply DESeq2 to breast cancer RNA-seq expression data to compare metastasis and non-metastasis patient groups. I have 880 samples in the non-metastasis group and only 20 in the metastasis group. I searched for whether it makes sense to use DESeq2 (or any other differential expression method) with such a difference in sample size, but could not find many resources to justify my study.

I only came across a few Biostars and Bioconductor posts asking about, for example, 15 vs 3 samples. As far as I understand, DESeq2 generally works fine with unbalanced sample sizes, but does that hold for a 20 vs 880 comparison?

I also did a PubMed search and, as far as I can see, there are no studies tackling this problem.

Thank you in advance.

There is no problem with the balance, but I would tend to use limma-voom for analyses with hundreds of bulk RNA-seq samples, as it is much faster. I like DESeq2 in particular for its Bayesian moderation of fold changes, but that is not relevant with a sample size this high.
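
To give a rough feel for why the imbalance itself is not harmful, here is a minimal Python sketch. It is not the DESeq2 or limma-voom model; it assumes a plain two-group comparison of means with equal per-sample variance, where the standard error of the difference scales with sqrt(1/n_a + 1/n_b). The function names are made up for this illustration.

```python
import math


def se_factor(n_a: int, n_b: int) -> float:
    """Scale factor of the standard error of a difference in group means.

    Assumption: both groups share the same per-sample variance, so the
    standard error is proportional to sqrt(1/n_a + 1/n_b).
    """
    return math.sqrt(1.0 / n_a + 1.0 / n_b)


def equivalent_balanced_n(n_a: int, n_b: int) -> float:
    """Per-group size of a balanced design with the same standard error.

    This is the harmonic mean of the two group sizes.
    """
    return 2.0 / (1.0 / n_a + 1.0 / n_b)


if __name__ == "__main__":
    for n_a, n_b in [(20, 20), (20, 880), (450, 450)]:
        print(
            f"{n_a:>4} vs {n_b:>4}: "
            f"SE factor = {se_factor(n_a, n_b):.4f}, "
            f"equivalent balanced n per group = {equivalent_balanced_n(n_a, n_b):.1f}"
        )
```

Under these simplified assumptions, 20 vs 880 behaves roughly like a balanced design with about 39 samples per group: the extra non-metastasis samples do no harm and help a little, but precision is still limited by the 20 metastasis samples.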

Michael, thank you so much for your quick reply.


