Coping with a label/class imbalance in DESeq2
2
0
Entering edit mode
bkellman • 0
@bkellman-11553
Last seen 4.2 years ago

I'm trying to do differential expression on label imbalanced data; my case:control ratio is 2:1. I know that regressions are at the core of DESeq2 machinery and I know regressions have internal machinery for coping with such imbalances. Specifically, you can down-weight observations from the over-represented group to force an equal contribution to the learning. Is observation weighting available to the user in DESeq2?

### group 'a' over-represented in learning

glm( gene_i ~ effect , data[,c('a','a','a','a','b','b')] )

### group 'a' and 'b' equal representation in learning

glm( gene_i ~ effect , data[,c('a','a','a','a','b','b')] , weight=c(.5,.5,.5,.5,1,1) )

I'm asking because all of my differentially expressed genes have a negative log fold change. I've controlled for rin, rrna, library size/size-factor, and a bunch of other things to no avail. My last best guess is that this is due to a class imbalance.

Ben

deseq2 rnaseq r regression • 1.8k views
0
Entering edit mode

I looked at the 1st 5k genes. It looks like there is decent (not exceptional) correlation between the deseq estimated logFC (deseq_lFC) and the t.test estimated lFC (t_lFC).

> range=1:5000
> t_test = apply(fpkm(dds)[range,],1,function(x) t.test(log(x) ~ dds$dx_clean) ) ) > df=data.frame(deseq_lFC=res[[1]]$log2FoldChange[range],t_lFC=unlist(lapply(t_test,function(x) log(x$estimate[2]/x$estimate[1],2) ))
> cor(df,method='pearson'); cor(df,method='spearman')
deseq_lFC    t_lFC
deseq_lFC  1.000000 0.868438
t_lFC      0.868438 1.000000
deseq_lFC     t_lFC
deseq_lFC 1.0000000 0.9679607
t_lFC     0.9679607 1.0000000

​Probing further, I looked at the distribution of differences between lFC estimated by deseq and t-test. It appears that when the estimations are very similar (diff<.1) there is a bias towards deseq_lFC to be slightly larger than t_lFC. On the other hand, when there is a large disparity (diff>.1) between lFC and t, there is a bias for deseq_lFC to be less than t_lFC.  I think this is consistent with my concern that deseq may incur a systematically bias given a label imbalance. Certainly [250-300]/5000 isn't catastrophic but it is reasonably concerning.

> df$diff = df$deseq_lFC - df$t_lFC > hist(df$diff,plot=F)[1:2]
$breaks [1] -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 0.4$counts
[1]    1    0    0    1    5    9   16   40  210 2056 2597   53   11    1
0
Entering edit mode

You can't conclude anything about systematic bias in DESeq2 based on the distribution of differences here. The differences occur because one method is fitting a normal distribution to the residuals, while the other is fitting a negative binomial. The negative binomial is an asymmetric distribution, while the normal is symmetric, so a bias in the differences is expected. But neither one is guaranteed to be less or more biased than the other relative to the true values.

Nothing you've shown here so far indicates to me a bias in parameter estimation, and nothing of what I know about linear models and GLMs suggests that having unequal class sizes can result in biased parameter estimates, in the absence of additional confounding factors. If you still don't trust your results, try selecting the same number of samples from each group and repeating the analysis, and see if the bias that you're seeing toward large negative logFC values remains. (I would repeat the sampling several times to ensure that you don't get fooled by sampling bias.) Again, though, based on the histogram totals that you posted earlier, I don't see any evidence of an imbalance in logFC values, so it seems to me like you are searching for an explanation for an effect that isn't even present.

0
Entering edit mode

I realize that normal and nbiom are different distributions. That is why I was comparing the log(fpkm) rather than the fpkm. Log(nbiom) approximates a normal distributions so the results should be very similar.

But I agree, rather than theorizing just run the numbers. Here is DE with 2:1 case:control (n=71:29). There appear to be nearly 2x the number of DE genes

> summary(res_all)

out of 12029 with nonzero total read count
LFC > 0 (up)     : 72, 0.6%
LFC < 0 (down)   : 114, 0.95%
outliers [1]     : 0, 0%
low counts [2]   : 700, 5.8%
(mean count < 33)

and here is 10 samplings of DE 1:1 case:control (n=25:25)

# fdr<.1, basemean>10, mean lFC for each sampling
> unlist(lapply(res_sel,function(x) mean( na.omit(x$log2FoldChange[x$baseMean>10 & x$padj<.1] )))) [1] -1.37326400 -0.21707750 -3.04876002 -3.38185159 -0.08706219 -2.61225541 [7] NaN -2.45890173 -0.81794748 -0.54061660 # lFC distributions for each sampling >tmp=do.call(rbind,lapply(res_sel,function(x) hist( na.omit(x$log2FoldChange[x$baseMean>10 & x$padj<.1] ) ,plot=F,breaks=seq(-4,4,.25))[[2]]))
colnames(tmp)=paste('<',seq(-4,4,.25)[-1])

> tmp
< -3.75 < -3.5 < -3.25 < -3 < -2.75 < -2.5 < -2.25 < -2 < -1.75 < -1.5
[1,]       0      1       0    0       0      0       0    0       0      1
[2,]       1      0       0    0       0      0       0    3       1      4
[3,]       0      0       0    1       0      0       0    0       0      0
[4,]       0      0       1    0       0      0       0    0       0      0
[5,]       0      0       0    2       0      0       1    0       2      0
[6,]       0      0       1    0       0      0       3    0       0      0
[7,]       0      0       0    0       0      0       0    0       0      0
[8,]       0      0       0    0       0      0       2    0       0      0
[9,]       0      0       0    1       0      0       3    0       0      1
[10,]       0      0       0    1       0      0       2    1       1      0

< -1.25 < -1 < -0.75 < -0.5 < -0.25 < 0 < 0.25 < 0.5 < 0.75 < 1 < 1.25
[1,]       1    1       1      0       0   0      0     1      0   0      0
[2,]       4    8      21     62     138  44    145    38      6   2      1
[3,]       0    0       0      0       0   0      0     0      0   0      0
[4,]       0    0       0      0       0   0      0     0      0   0      0
[5,]       4    5      23     44     122  78    187    51     13  12     11
[6,]       0    0       0      0       0   0      0     0      0   0      0
[7,]       0    0       0      0       0   0      0     0      0   0      0
[8,]       0    0       0      0       0   0      0     0      0   0      0
[9,]       0    0       0      0       0   0      4     2      1   0      0
[10,]       0    2       3      4       5   2      4     6      0   0      1

< 1.5 < 1.75 < 2 < 2.25 < 2.5 < 2.75 < 3 < 3.25 < 3.5 < 3.75 < 4
[1,]     0      0   0      0     0      0   0      0     0      0   0
[2,]     1      0   0      0     0      0   0      0     0      0   0
[3,]     0      0   0      0     0      0   0      0     0      0   0
[4,]     0      0   0      0     0      0   0      0     0      0   0
[5,]     1      0   0      0     0      0   0      0     0      0   0
[6,]     0      0   0      0     0      0   0      0     0      0   0
[7,]     0      0   0      0     0      0   0      0     0      0   0
[8,]     0      0   0      0     0      0   0      0     0      0   0
[9,]     0      0   0      0     0      0   0      0     0      0   0
[10,]     0      0   0      0     0      0   0      0     0      0   0

So it seems that resolving the label imbalance does not resolve the issue of enrichment for negative lFC. But now I seem to have a new problem: the inconsistency of differential expression. Have you seen anything like this Ryan?

1
Entering edit mode
@mikelove
Last seen 55 minutes ago
United States

I'm having a bit of a hard time following exactly what the concern is. Having results that look like this:

out of 12029 with nonzero total read count
LFC > 0 (up)     : 72, 0.6%
LFC < 0 (down)   : 114, 0.95% 

is not wrong or strange and doesn't indicate a technical problem. It just means that you found 72 DEG with positive and 114 DEG with negative LFC. There is not a systematic imbalance or anything. Note that this is < 2% of the nonzero genes.

Having a 2:1 ratio of classes is not a problem for DESeq2, and nothing special needs to be done to analyze the data.

0
Entering edit mode

Great! I have no empirical means to disagreeing. But I am still stuck on the theory. In the Github repo, line 2018-2020 of core.R says:

mu <- normalizationFactors * t(exp(modelMatrix %*% t(betaRes$beta_mat))) # establish library adjusted betas (I assume?) logLike <- nbinomLogLike(counts(object), mu, dispersions(object)) Looking at that, it seems like twice the number of cases would contribute twice as much to the likelihood. Therefore, during a likelihood assessment, the over-represented group would contribute more to the likelihood. Therefore DE would be more reflective of the overrepresented group. The core issue of interest, bias during learning, can be found starting on lines 298 and 329 of DESeq2.cpp: for (int j = 0; j < y_m; j++) { // note the order for Rf_dnbinom_mu: x, sz, mu, lg dev = dev + -2.0 * Rf_dnbinom_mu(yrow[j], 1.0/alpha_hat[i], mu_hat[j], 1); } conv_test = fabs(dev - dev_old)/(fabs(dev) + 0.1); As far as I'm able to decern, each sample contributes equally to the estimation of the deviance. If that is true, then if one group is over-represented they will contribute more to the deviance that decides when convergence is achieved. This seems analogous to the issue of resampling bias that is resolved in Generalized Estimating Equations by a sample-correlation-based normalization of the error during learning. (my understanding of GEE is limited) Does this make sense? Again, I don't have empirical evidence supporting my concern, only a theoretical concern I'm struggling to resolve. Thanks for your time, Ben ADD REPLY 0 Entering edit mode I'm not sure what you mean by "DE would be more reflective of the overrepresented group". DE between two groups is reflective of both groups equally. As far as I can tell, the only possible issue in the convergence code you show would be if the larger group somehow converged faster than the smaller group, so that the overall convergence threshold was reached and the loop terminated before the smaller group finished converging. However, I exepct this would only be a major issue for group sizes differing by orders of magnitude, e.g. 3 samples vs 500 samples. If you suspect that convergence issues are biasing your results, you could download the DESeq2 source code and modify it to use a more stringent threshold for convergence. But I doubt you will see a measurable difference at any level that matters if you do so. Furthermore, in a case where convergence did become a problem, I wouldn't expect it to consistently bias the estimation in the same direction across many genes. ADD REPLY 0 Entering edit mode I see what you are saying, I do think that speed of convergence could be an issue. Though I don't have an intuition about what scale of class imbalance would be necessary to see a significant difference in convergence speeds. I think I have found code (line 317, DESeq.cpp) that gets at the root of my concern: w = diagmat(mu_hat/(1.0 + alpha_hat[i] * mu_hat)); z = arma::log(mu_hat / nfrow) + (yrow - mu_hat) / mu_hat; solve(beta_hat, x.t() * w * x + ridge, x.t() * w * z); I believe the last line is the update rule during the fitting of beta. In bold is code assessing the divergence between mu and the observed counts. This error assessment is typically used to specify which parameters should change and by how much in the next iteration of the parameter being fit. From this code, it looks like z is being used to determine how much beta should change. My concern is that if more samples are from one group, that group will have more "voting power" in determining when and by how much beta should change thereby resulting in a beta that is biased towards a better prediction of the larger group than the smaller group. That said, I'm not familiar enough with the solve function used here to determine exactly the math at play. If x (the design) is being used to segregate information contributing to the update between groups, then there may be no issue of bias at all. I can't figure out the math in the last line. Can you figure out if the next beta is being determined by the consensus of error or segregated error? ADD REPLY 0 Entering edit mode Honestly, we're going off into the woods a bit in this thread. This is the standard IRLS aglorithm to fit a GLM. ADD REPLY 0 Entering edit mode In the typical notation, beta refers to the model coefficients. So in a design of ~group, each group's coefficient is being estimated from just the samples in that group. The beta for group A isn't somehow going to jump the rails and converge toward the samples in group B, even if group A has 2 samples and group B has thousands. I think you misunderstood what I was saying about convergence speeds. If you could run the algorithm through an infinite number of iterations, all group coefficients would converge exactly, and you'd have no problems. But you have to run a finite number of iterations, so you need some criterion to decide when the coefficients have "stopped" changing. Now, I should say that I'm not exactly sure what this criterion is in DESeq2. If this criterion is based on, say, the average change of all the residuals in the model, then the average could theoretically be small while the residuals for the smaller group are still changing by larger amounts. However, if the convergence criterion was based on the maximum change rather than the average, then there would be no problem, because it wouldn't stop until all the residuals stopped changing. My guess is that, given the availability of processing power these days, most such algorithms probably err on the side of running too many iterations rather than too few. And while I don't know the specifics, I know that the specific algorithms chosen by edgeR and DESeq2 are tuned to be effective on RNA-seq data. So if you think the glm fitting algorithm is causing problems with your data, I'm almost certain you're barking up the wrong tree. ADD REPLY 0 Entering edit mode " Therefore DE would be more reflective of the overrepresented group." No this doesn't really make sense in this case. The only concern I have about groups is if they were to have *very* different true dispersions (think coefficient of variation), while our model has a single dispersion parameter per gene. You can assess this yourself by examining a PCA plot, but it's rarely a concern. You can just use the functions as is. ADD REPLY 0 Entering edit mode I have had concerns about the assumption of consistent variance within each gene across groups but over time I've come to accept it as respectably conservative. ADD REPLY 0 Entering edit mode Just a note: the dispersion is closer to the coefficient of variation ( SD / mean ). The assumption doesn't fix the variance of the counts across groups, just the relationship of the variance to the mean. Large counts have larger variance, and the model accommodates this. ADD REPLY 0 Entering edit mode If you really want a heteroskedastic model, you can do it with limma using voomWithQualityWeights. But if you don't have some reason to suspect sample heteroskedasticity in your data a priori, it's probably not going to make much of a difference in the results (compared to regular voom without quality weights). I have asked about the equal-dispersion assumption before, and the answer from Gordon Smyth (one of the edgeR authors) was that in the NB GLM context, fitting an equal-dispersion model to a data set that actually has unequal dispersions results in a more conservative test, not a liberal one. So the resulting FDR values are still valid as an upper bound. ADD REPLY 0 Entering edit mode @ryan-c-thompson-5618 Last seen 12 months ago Scripps Research, La Jolla, CA Using weights for this purpose is not really appropriate. Weights are designed for the case where some samples have better measurement precision than others, or some groups have a higher variance than others. Unless you have prior information to that effect, you should not be adding weights. (Alternatively, you can use limma's voomWithQualityWeights to have limma attempt to determine the appropriate precision weights automatically.) If one group has more samples, then the mean for that group will be more precise. This is an inherent property of statistics, and if you fight it by arbitrarily forcing equal weights at the group level, you will simply lose statistical power. This is not the purpose of weights in a (generalized) linear model. If the control mean is 1 and the case mean is 2, having more cases won't somehow push the case group's mean up to 3 or down to -1. It will just be a more precise estimate of 2. So class size imbalance cannot explain your observed imbalance between upregulated and downregulated genes. If you have already controlled for every other possible cause of such imbalance, you might have to consider the possibility that your treatment actually causes more downregulation than upregulation. There are certainly treatments where such an imbalance is expected. For example, miRNA/siRNA should primarily downregulate their targets, while treatment with a growth factor might cause an upregulation of a large number of growth and proliferation-related genes. ADD COMMENT 0 Entering edit mode As far as I understand, coping with sample accuracy is only one use of regression penalties/weights. It may also be used to cope with a class imbalance (combating-imbalance). The issue is that when the parameters are being learned, the over-represented group will drive the learning more since learning in a regression is typically error driven (every next iteration uses an updated weighting on the observations according to which observations had the highest error). This means an over-represented class can drive the final parameter in one direction or another. If possible I'd like to stick with DESeq2 to minimize change to my code base. To say that everything is up/downregulated seems very strange. miRNA are specific to their targets and growth factors may cause a lot of upregulation but not exclusively upregulation. ADD REPLY 0 Entering edit mode That article is referring to a machine learning context, where class imbalance can indeed cause issues if not handled properly. Specifically, prediction of the class of a new sample can be biased toward the class with more samples in it. However, you are not trying to classify samples, you are looking for differentially expressed genes. So the possible bias of the fitted model with respect to sample classification is not relevant. In a simple 2-class model, you are effectively doing something analogous to a t-test but using a different distribution instead of the normal. The control sample mean is computed from the control samples, and the case mean is calculated from the case samples. Then the (log) fold change is the (log) ratio of the two means. Having more samples in one group or the other will not bias this measure. If you doubt this, compute the log CPM for each individual sample and for each group and manually verify that for each gene, the group means correspond closely with the mean of the samples for that group. (They won't correspond exactly, since the arithmetic mean assumes a normal distribution, not a negative binomial.) ADD REPLY 0 Entering edit mode That is the goal of DESeq but the method used to get there, parameterizing a regression, uses (it should) an error based method. This means it should be sensitive to class imbalance. At the very least, it is weird that I have tons of low logFC upregulation and yet almost exclusively high logFC downregulation: > hist(res[[1]]$log2FoldChange,plot=F,breaks=seq(-3,3,.5))[1:2]

$breaks [1] -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0$counts

[1]    1    2    2    5   74 5800 6052   90    3    0    0    0

I'll check the FPKM (readily available) as you suggested.

0
Entering edit mode

If I'm reading this correctly, you have 10 genes below -1 logFC and 3 genes above +1 logFC. And for logFC between -1 and +1, there are more positives logFC than negative. This does not strike me as a bias toward large negative logFC.

0
Entering edit mode

It is more clear without the gene filter I'm currently using. The low base mean genes (mean(log(fpkm))<.5) are much more biased towards negative logFC. Also, I accidentally responded below.

1
Entering edit mode

This, I can believe. I have seen similar effects in many data sets. The reason is that you are running up against the noise floor in one group but not in the other. This commonly occurs when one group has substantially more sequencing depth (and therefore a lower noise floor) than the other. Since one of your groups has more samples than the other, it likely also has more total reads. To avoid this, you must set your detection threshold at least as high as the noise floor in the least-sequenced group. In my experience, an FPKM of 0.5 is generally not considered to be expressed at a detectable level.

0
Entering edit mode

Yea, that is what I did, I increased my minimum base mean to mean(log(fpkm_gene))<.5. But after increasing the minimum fpkm, I still appear to be somewhat skewed to lower logFC.

(log (fpkm) not fpkm, either way, not a ton of expression)

0
Entering edit mode

This is an interesting insight. I haven't heard the term "noise floor" before. Where is that from?

0
Entering edit mode

"Noise floor" a general term: https://en.wikipedia.org/wiki/Noise_floor

In the context of RNA-seq, there are a couple of noise floors in play, although one of them typically dominates the others in a given experiment. Based on sequencing depth, you have a "measurement noise floor", since low counts carry very little information about a gene's expression. In the limiting case where the expected count of a gene is less than one, the count carries almost no information. If different samples have different sequencing depths, then they also have different measurement noise floors. Note that this noise floor operates on the raw count scale.

In addition, you have a biological noise floor, which is the expected count for a gene that is not expressed. This may be nonzero due to technical phenomena such as DNA contamination or biological phenomena such as "leaky transcription" of non-expressed genes. Any measurement below this noise floor cannot be confidently said to represent real, regulated gene expression, even if you have sufficient sequencing depth to confidently say that the measurement is nonzero. If different samples have different levels of DNA contamination, they will have different biological noise floors. Note that this noise floor operates on the scale of normalized expression levels (measured by e.g. TPM values).

Lastly, depending on which statistical model you are using, there may be a count level below which the model breaks down and gives absurd answers. For example, the limma voom method interprets a gene with all zero counts or nearly all zero counts as having near zero variance, which causes a downward distortion at the low end of mean-variance trend that voom is trying to estimate. This distortion can cause voom to give spuriously high weights to very low-count measurements. So it is important when using voom to inspect the plot and make sure the curve is well-behaved (generally, the left side of the mean-variance curve should be monotonically decreasing). This threshold is not, strictly speaking, a "noise floor", since it represents a break down of the assumptions of the statistical method rather than a lack of information in the data. However, it is similar to a noise floor, in that it represents a limit below which you cannot do useful statistics. This threshold is less of a problem for DESeq or edgeR, which model the counts directly as counts, but in general, when analyzing count data, the greatest violations of the model assumptions are likely to occur at the lowest counts.