Question regarding class imbalance and low sample size for DESeq2
@munnauppal-18487

Hi all,

I have a question about how robust DESeq2 is to large class imbalances in differential gene expression analysis. I am currently analyzing RNA-seq data from the GTEx database, and I have gone through a workflow that classifies tissue samples as "hot" or "cold" in terms of immune infiltration. This workflow yields a (somewhat expected) gross class imbalance between "hot" and "cold" samples, with the latter outnumbering the former in some tissues by up to a couple of orders of magnitude (e.g., 233 cold vs. 6 hot in adipose tissue). I've found some of Michael Love's commentary on Bioconductor very helpful regarding how well DESeq2 can handle both low sample sizes and large class imbalances:

DEseq2: any problem with unbalanced number of sample in normal/tumor study?

DESeq2 with unbalanced experimental design

If I have interpreted his comments correctly, DESeq2 is quite robust in these situations: classes that are very imbalanced, or that contain as few as 2-3 samples, are not an issue on their own. However, my question concerns how well DESeq2 performs when both occur together. Can I perform differential expression when the class imbalance is 151 vs. 1? Should I require a minimum number of samples in each class, say at least 5, before performing differential expression?

Further, according to Kevin Blighe in this thread (https://www.biostars.org/p/273086/), profound class imbalance can result in a large number of genes passing the FDR q-value threshold. Is there a systematic way to choose the FDR threshold in this setting so that I can have confidence in the list of differentially expressed genes?

Thanks in advance for all your help!

@mikelove

My issue with this kind of analysis is that comparing, e.g., a single sample against a larger reference group means you don't get to observe how variable that group of samples could be: you observe just the one sample and hope that it is representative. DESeq2 assumes that all groups share the same dispersion parameter for a given gene; when the counts are sufficiently high, the square root of the dispersion can be thought of as the coefficient of variation (CV = SD/mean). So if the out group (the one with few samples) does in fact follow an NB (negative binomial) distribution with the same dispersion parameter as the large reference group, there's no problem with the inference. If, however, the out group has a much higher CV but you have greatly under-sampled it, this cannot be learned from the data, and the inference will suffer (you will get false positives and false negatives). This is not easily solved by more sophisticated methods (although software that estimates a separate dispersion per group might help; DESeq2 does not do this), and it's not solved by reducing the sample size of the large group (that will just increase false negatives).
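
To make the dispersion/CV relationship concrete, here is a minimal Python sketch. It is not DESeq2 code; it only evaluates the standard negative binomial mean-variance relationship, Var = mu + alpha * mu^2, so that CV = sqrt(1/mu + alpha). The function names and the two dispersion values are hypothetical and chosen purely for illustration.

```python
import math


def nb_variance(mean: float, dispersion: float) -> float:
    """Variance of a negative binomial count: Var = mu + alpha * mu^2."""
    return mean + dispersion * mean ** 2


def nb_cv(mean: float, dispersion: float) -> float:
    """Coefficient of variation, SD / mean, which equals sqrt(1/mu + alpha)."""
    return math.sqrt(nb_variance(mean, dispersion)) / mean


# Hypothetical dispersions (assumptions, not estimates from any real data set).
shared_alpha = 0.04  # what a shared-dispersion model would learn from the large group
hot_alpha = 0.25     # a larger true dispersion in the small, under-sampled group

for mu in (10, 100, 10_000):
    print(
        f"mean={mu:>6}  CV(shared)={nb_cv(mu, shared_alpha):.3f}  "
        f"CV(hot)={nb_cv(mu, hot_alpha):.3f}"
    )
```

As the mean grows, the 1/mu term vanishes and the CV approaches sqrt(alpha), i.e. 0.2 and 0.5 for the two values above. With only one or a handful of samples in the out group, there is no way to tell from the data which of these two regimes that group is actually in.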
