Question: DESeq2: any problem with an unbalanced number of samples in a normal/tumor study?
bharata180320 wrote:

Hello,

I have downloaded TCGA datasets (HTSeq count files) for several cancer types. I noticed that each dataset has a large number of tumor samples but far fewer normal samples: for example, only 60 normal samples versus ~500 or more tumor samples. Will this imbalance cause any problems if I use DESeq2 to identify differentially expressed genes? Thank you very much.

modified 11 months ago by Michael Love19k • written 11 months ago by bharata180320

I don't believe there will be any major problem due to imbalance; I'd be more worried about lack of matched tumour:normal samples (seems unlikely that they've taken 9 tumour samples from each patient providing a normal), but that's the nature of public clinical data.

Michael Love wrote:

It's not a problem for DESeq2 to have unbalanced sample sizes.

Note that with more than 100 samples per group, there is a substantial speed-up from using a linear model, such as limma-voom, instead of a generalized linear model. I tend to use limma when I have hundreds of samples per group.
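For readers who have not used limma-voom, here is a minimal sketch of that route on an unbalanced two-group design. The counts are simulated and the group sizes (6 normal, 50 tumor) are made up purely for illustration; with real TCGA data you would substitute your own count matrix and sample table, and likely filter low-count genes first.

```r
library(edgeR)  # DGEList() and calcNormFactors()
library(limma)  # voom(), lmFit(), eBayes(), topTable()

set.seed(1)

# Hypothetical unbalanced design: 6 normal vs 50 tumor samples
group <- factor(rep(c("normal", "tumor"), times = c(6, 50)),
                levels = c("normal", "tumor"))

# Simulated negative binomial counts (illustrative only, no true signal)
n_genes <- 500
counts <- matrix(rnbinom(n_genes * length(group), mu = 100, size = 5),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("sample", seq_along(group))))

# Normalize library sizes, then model log-CPM with precision weights
dge <- DGEList(counts = counts, group = group)
dge <- calcNormFactors(dge)
design <- model.matrix(~ group)
v <- voom(dge, design)

# Linear model fit with empirical Bayes moderation of the variances
fit <- eBayes(lmFit(v, design))

# Tumor vs normal results for all genes
res <- topTable(fit, coef = "grouptumor", number = Inf)
head(res)
```

Nothing in this workflow requires equal group sizes; the design matrix simply encodes which samples belong to which group.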

I am not in a hurry and my computer is quite good. For almost 600 samples, it took around 1 hour, so I don't think that is a problem. Getting the log-transformed read counts for the expression levels, however, might take a really long time. In this post: "DESeq2 rlog function takes too long", I asked about this problem and you suggested a tweak. I tried that code a long time ago and saw some increase in speed. I will try it again now. Thank you.

That tweak is now a fully supported function (I'll make a note on that post):

vsd <- vst(dds, blind=FALSE)
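
As a self-contained sketch of how that call fits into a session, the example below builds a small simulated dataset with DESeq2's `makeExampleDESeqDataSet()` (standing in for a real `dds`; the sizes are arbitrary) and extracts the transformed matrix:

```r
library(DESeq2)

set.seed(1)

# Simulated stand-in for a real DESeqDataSet (2000 genes, 12 samples)
dds <- makeExampleDESeqDataSet(n = 2000, m = 12)

# Variance stabilizing transformation; blind = FALSE uses the design
# when estimating the dispersion trend
vsd <- vst(dds, blind = FALSE)

# Matrix of transformed values (genes x samples) on a log2-like scale
mat <- assay(vsd)
dim(mat)
```

The resulting matrix can then be passed to downstream steps such as clustering or PCA.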