DESeq2: DE Analysis with very imbalanced samples per condition
thanos5541 • 0

Hello everyone,

My group has been conducting a large-scale analysis using TCGA data. I'm using the expression results to identify DE genes, following the DESeq2 vignette along with lfcShrink (apeglm). I run the analysis between healthy and diseased samples for multiple organs.

However, the healthy samples for almost every organ number only about 1-15% of the diseased samples (e.g. 44 healthy vs 525 diseased, 130 vs 903, or even 3 vs 309!). I do get results for almost every organ studied, but I am skeptical about the actual statistical significance of those results and about how much bias such a large difference in sample numbers per condition introduces.

Should I do something differently in the analysis because of this imbalance in samples per condition, or is such an analysis pointless? Are results with an adjusted p-value < 0.1 still considered significant, as indicated by DESeq2? Should I lower the adjusted p-value cutoff to below 0.05, or find a formula for the significance cutoff?

I have searched for similar cases online, but I could not find any as extremely imbalanced as ours, which is why I am asking here. I have read that DESeq2 does not need equal numbers of samples per condition to provide significant results, but I am not sure whether that covers extreme cases like ours.

Thanks in advance

deseq2 cancer • 415 views

Nothing in the DESeq2 code needs to change because of a large imbalance in sample numbers.

I will mention that you can also easily rely on nonparametric tests such as Wilcoxon and permutation for FDR computation.
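To make the nonparametric route concrete, here is a minimal sketch of one way it could look: a Wilcoxon rank-sum statistic per gene, a p-value obtained by permuting the group labels, and Benjamini-Hochberg adjustment across genes. This is only one reading of "Wilcoxon and permutation for FDR computation", not a prescribed procedure. The function names (`midranks`, `rank_sum_permutation_pvalue`, `benjamini_hochberg`), the toy values, and the choice of Benjamini-Hochberg are all assumptions made for illustration.

```python
import random


def midranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_permutation_pvalue(values, in_group, n_perm=2000, rng=None):
    """Two-sided permutation p-value for the Wilcoxon rank-sum statistic.

    `values` holds one gene's expression per sample and `in_group` flags
    the samples of one condition (for example the healthy ones).
    """
    rng = rng or random.Random(0)
    ranks = midranks(values)
    n_group = sum(in_group)
    expected = n_group * (len(values) + 1) / 2  # null mean of the rank sum
    observed = abs(sum(r for r, g in zip(ranks, in_group) if g) - expected)
    hits = 0
    for _ in range(n_perm):
        # Drawing n_group ranks at random is equivalent to shuffling labels.
        permuted = abs(sum(rng.sample(ranks, n_group)) - expected)
        if permuted >= observed - 1e-9:
            hits += 1
    # The +1 terms keep the estimate away from an impossible p-value of 0.
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example (made-up numbers): 3 healthy vs 27 diseased samples, one gene.
healthy = [1.0, 1.5, 2.0]
diseased = [5.0 + 0.1 * i for i in range(27)]
toy_values = healthy + diseased
toy_labels = [True] * len(healthy) + [False] * len(diseased)
toy_p = rank_sum_permutation_pvalue(toy_values, toy_labels)
print(toy_p)
```

In a real analysis you would compute one p-value per gene and pass the whole list to `benjamini_hochberg`. Note that the smallest p-value this permutation scheme can return is 1 / (n_perm + 1), so adjusting across many thousands of genes needs a correspondingly large number of permutations.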


I see, thank you very much for your quick response!

