I'm trying to do differential expression analysis on label-imbalanced data; my case:control ratio is 2:1. I know that regression is at the core of the DESeq2 machinery, and I know regression functions have a built-in way of coping with such imbalances. Specifically, you can down-weight observations from the over-represented group to force an equal contribution to the fit. Is observation weighting available to the user in DESeq2?
# group 'a' over-represented in learning (effect = c('a','a','a','a','b','b'))
glm(gene_i ~ effect, data = data)
# group 'a' and 'b' equally represented in learning
glm(gene_i ~ effect, data = data, weights = c(.5, .5, .5, .5, 1, 1))
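For reference, here is a minimal base-R sketch of what such weights do in a two-group fit. It uses simulated Poisson counts and plain `glm`, not DESeq2, and every name and value in it is made up for illustration. In a model with only a group effect, giving all samples in a group the same weight should leave the coefficients unchanged, because each group's fitted mean is still that group's own mean; only the standard errors move.

```r
set.seed(1)

# Simulated counts for one hypothetical gene: 4 samples in 'a', 2 in 'b'
effect <- factor(c('a', 'a', 'a', 'a', 'b', 'b'))
gene_i <- rpois(6, lambda = ifelse(effect == 'a', 100, 50))
d <- data.frame(gene_i, effect)

# Unweighted fit: group 'a' contributes twice as many observations
fit_unweighted <- glm(gene_i ~ effect, family = poisson, data = d)

# Weighted fit: down-weight 'a' so both groups carry equal total weight
w <- c(.5, .5, .5, .5, 1, 1)
fit_weighted <- glm(gene_i ~ effect, family = poisson, data = d, weights = w)

# Largest absolute difference between the two sets of coefficients
coef_diff <- max(abs(coef(fit_unweighted) - coef(fit_weighted)))
print(coef_diff)

# The standard errors, by contrast, are expected to differ
print(summary(fit_unweighted)$coefficients[, "Std. Error"])
print(summary(fit_weighted)$coefficients[, "Std. Error"])
```

If the two coefficient vectors agree, that suggests group-level weighting by itself would not shift the sign or size of the estimated log fold change in a design like this one.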
I'm asking because all of my differentially expressed genes have a negative log fold change. I've controlled for RIN, rRNA, library size/size factor, and a bunch of other things, to no avail. My last best guess is that this is due to the class imbalance.
Thanks in advance,
Ben
I looked at the first 5,000 genes. There is decent (not exceptional) correlation between the DESeq2-estimated logFC (deseq_lFC) and the t-test-estimated logFC (t_lFC).
Probing further, I looked at the distribution of differences between the lFC estimated by DESeq2 and by the t-test. When the estimates are very similar (diff<.1), deseq_lFC tends to be slightly larger than t_lFC. On the other hand, when there is a large disparity (diff>.1) between deseq_lFC and t_lFC, deseq_lFC tends to be less than t_lFC. I think this is consistent with my concern that DESeq2 may incur a systematic bias given a label imbalance. Certainly [250-300]/5000 isn't catastrophic, but it is reasonably concerning.
You can't conclude anything about systematic bias in DESeq2 based on the distribution of differences here. The differences occur because one method is fitting a normal distribution to the residuals, while the other is fitting a negative binomial. The negative binomial is an asymmetric distribution, while the normal is symmetric, so a bias in the differences is expected. But neither one is guaranteed to be less or more biased than the other relative to the true values.
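The point about the two methods disagreeing by construction can be illustrated with a small base-R simulation. This is only a sketch under stated assumptions: the means, dispersions, and pseudocount are invented, and the "GLM-like" estimate below is just the log ratio of group means, not DESeq2's actual estimator. Both groups have the same true mean, so the true log fold change is zero, but the groups differ in dispersion.

```r
set.seed(42)
n <- 10000

# Two hypothetical groups with the SAME mean but different dispersion
a <- rnbinom(n, mu = 100, size = 2)   # high dispersion
b <- rnbinom(n, mu = 100, size = 20)  # low dispersion

# Count-scale view: log of the ratio of group means
lfc_glm_like <- log2(mean(a)) - log2(mean(b))

# Log-scale view (what a t-test on log values compares):
# difference of mean logs, with a pseudocount of 1 to handle zeros
lfc_ttest_like <- mean(log2(a + 1)) - mean(log2(b + 1))

print(c(glm_like = lfc_glm_like, ttest_like = lfc_ttest_like))
```

Because the mean of the logs is always at most the log of the mean, and the gap grows with the spread of the counts, the log-scale estimate comes out negative here even though the count-scale estimate stays near zero. A skewed distribution of differences between the two methods therefore says nothing on its own about which one is closer to the truth.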
Nothing you've shown here so far indicates to me a bias in parameter estimation, and nothing of what I know about linear models and GLMs suggests that having unequal class sizes can result in biased parameter estimates, in the absence of additional confounding factors. If you still don't trust your results, try selecting the same number of samples from each group and repeating the analysis, and see if the bias that you're seeing toward large negative logFC values remains. (I would repeat the sampling several times to ensure that you don't get fooled by sampling bias.) Again, though, based on the histogram totals that you posted earlier, I don't see any evidence of an imbalance in logFC values, so it seems to me like you are searching for an explanation for an effect that isn't even present.
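The repeated balanced subsampling suggested above could be set up roughly as follows. This is a sketch: `balanced_subsample` is a hypothetical helper, the labels are simulated to match the 71:29 design mentioned in this thread, and `dds` in the closing comment is an assumed, undefined DESeqDataSet.

```r
set.seed(123)

# Hypothetical sample labels matching a 71:29 case:control design
condition <- factor(c(rep("case", 71), rep("control", 29)))

# Draw the same number of sample indices from each group (hypothetical helper)
balanced_subsample <- function(condition, n_per_group) {
  idx_by_group <- split(seq_along(condition), condition)
  picked <- lapply(idx_by_group, sample, size = n_per_group)
  sort(unlist(picked, use.names = FALSE))
}

# Ten independent 25:25 draws, so that no single draw drives the conclusion
subsamples <- replicate(10, balanced_subsample(condition, 25), simplify = FALSE)

# Check the group sizes of the first draw
print(table(condition[subsamples[[1]]]))

# With a DESeqDataSet `dds` (assumed, not defined here), each draw could then
# be analysed with something like:
#   res <- results(DESeq(dds[, idx]))
```

Comparing the sign distribution of the significant logFC values across the draws then shows whether the skew toward negative values persists in balanced data.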
I realize that the normal and negative binomial are different distributions. That is why I was comparing log(FPKM) rather than FPKM. The log of a negative binomial variable is approximately normally distributed, so the results should be very similar.
But I agree: rather than theorizing, just run the numbers. Here is DE with 2:1 case:control (n=71:29). There appear to be nearly 2x the number of DE genes.
And here are 10 samplings of DE with 1:1 case:control (n=25:25).
So it seems that resolving the label imbalance does not resolve the enrichment for negative lFC. But now I seem to have a new problem: the differential expression results are inconsistent across samplings. Have you seen anything like this, Ryan?
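One way to put a number on that inconsistency is to count how often each gene is called DE across the samplings and to compute the pairwise overlap between the call sets. The sketch below uses made-up gene IDs and only three hypothetical runs; the real input would be the significant gene IDs from each subsampled analysis.

```r
# Hypothetical DE calls from three subsamplings (made-up gene IDs)
de_calls <- list(
  run1 = c("g1", "g2", "g3"),
  run2 = c("g2", "g3", "g4"),
  run3 = c("g3", "g5")
)

# How many runs call each gene DE
call_freq <- sort(table(unlist(lapply(de_calls, unique))), decreasing = TRUE)
print(call_freq)

# Jaccard index: shared calls divided by all calls made in either run
jaccard <- function(x, y) length(intersect(x, y)) / length(union(x, y))

# Jaccard index for every pair of runs
run_pairs <- combn(names(de_calls), 2)
pairwise_jaccard <- apply(run_pairs, 2, function(p) {
  jaccard(de_calls[[p[1]]], de_calls[[p[2]]])
})
names(pairwise_jaccard) <- apply(run_pairs, 2, paste, collapse = "_vs_")
print(pairwise_jaccard)
```

Low overlap between draws of 25:25 would not be surprising, since each draw uses half the samples of the full analysis and has correspondingly less power. Genes that are called in most draws are probably the more trustworthy ones.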