DESeq2: Poor dispersion fit, even when a local or custom fit is used
1
0
Entering edit mode
@frederik-ziebell-14676
Last seen 9 hours ago
Heidelberg, Germany

I have a RNASeq data set with ~650 samples. When I run DESeq2, I get a very poor fit of the mean-dispersion trend:

Here is also a plot of the original trend (without final dispersion estimates), and color coded the number of dispersion iterations:

This is the output of DESeq:

​In DESeqDataSet(se, design = ~condition + run) :
some variables in design formula are characters, converting to factors
estimating size factors
estimating dispersions
gene-wise dispersion estimates: 64 workers
mean-dispersion relationship
-- note: fitType='parametric', but the dispersion trend was not well captured by the
function: y = a/x + b, and a local regression fit was automatically substituted.
specify fitType='local' or 'mean' to avoid this message next time.
final dispersion estimates, fitting model and testing: 64 workers
-- replacing outliers and refitting for 8367 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing

I also tried to supply a custom fit as described in the vignette:

dds <- estimateSizeFactors(dds)
dds <- estimateDispersionsGeneEst(dds)
useForMedian <- 1e-7 < mcols(dds)$dispGeneEst & mcols(dds)$dispGeneEst < 1e1
medianDisp <- median(mcols(dds)$dispGeneEst[useForMedian], na.rm=TRUE) dispersionFunction(dds) <- function(mu) medianDisp dds <- estimateDispersionsMAP(dds) dds <- nbinomWaldTest(dds) But this resulted in an even worse fit: Do you have any suggestions how to improve the fit? Thanks, Frederik deseq2 dispersion • 3.6k views ADD COMMENT 0 Entering edit mode Hi Mike, indeed there are batch effects in the data. The way samples are processed is that 48 samples are always multiplexed together. So for these samples, library prep is performed, followed by sequencing and the combination of both is encoded as a "run". Here is a PCA of the vst-transformed counts (with blind=TRUE), numbers idicate the different runs: ... and the same after batch correction (subtracting the run effects from the respective samples): You are probably right that there are still batch effects remaining, so you would suggest to detect remaining effects with SVA/RUV and see if adding these to the design improves the situation? ADD REPLY 0 Entering edit mode If I were you, I'd try to figure out what is going on with PC1 there on the original data. That is a massive batch effect you would at least want to know about for removing such problems on future datasets. I've seen such differences before when the samples were in fact different tissues. I think you should investigate such hypotheses. Otherwise, yes, I would try to detect remaining effects with SVA or RUV and add these to the model. You'll have to figure out how many more factors are needed. ADD REPLY 0 Entering edit mode @mikelove Last seen 20 hours ago United States hi Frederik, My first guess here is that there are some outlier samples or residual unwanted variation in here (e.g. batch effects). Do you have a PCA plot using the vst() transformation? I see you have "run" in the design, is this sequencing run? I've looked in depth into batch effects, and various steps of RNA-seq protocol, and how they affect quantification, and the sequencing run itself is not the biggest contributor to unwanted variation. Instead it is library preparation batch. Even if you don't have library preparation batch information, you can use SVA or RUV to detect sources of unwanted variation and control for them in the design. ADD COMMENT 0 Entering edit mode Hi Mike, I followed up with your suggestion and figured out the source of the batch effect. It was a different read-length that affected the mapping, since I was using STAR with splice junction overhang equal to read length minus 1. After trimming all reads to the same length, the effect is gone: However, the problem of the poor dispersion fit still persists: It turns out that almost all genes either get the same very high dispersion estimate of ~140 or the default minDisp value of 1e-8. What is left are only ~400 genes that do not get one of those two extreme dispersion estimates: I tried your suggestion using SVA and adding the first few SVs to the design or even removing a few outlier samples based on the SVA output, but without any improvements. I don't know if the problem still is remaining variation not captured in the design. I compared a few genes from the minimum and maximum dispersion groups. The left (right) column shows a gene from the minimum (maximum) group. Plotted are the different conditions versus the normalized count (top) or versus the residuum of predicted - observed normalized count (bottom). All three pairs show genes with similar baseMean and variance of the residuals, but one with minimum and one with maximum dispersion estimate. And finally, an overview of all genes with the baseMean plotted against the variance of the residuals. The groups that get the minimum and maximum dispersion estimate intermingle and the few genes that do not get an extreme value are somehow in between: Do you have any suggestions how to continue? ADD REPLY 0 Entering edit mode hi Frederik, Can you try this again, but first removing the genes with very low counts for nearly all samples, before you run DESeq()? Having so many genes with 0 counts for nearly all samples seems to be throwing off the dispersion fit, and you'll do much better to just filter these out for this dataset with so many samples. You can try something like: dds <- estimateSizeFactors(dds) keep <- rowSums(counts(dds, normalized=TRUE) >= 10) >= n dds <- dds[keep,] Where you pick n to be a number of samples that might have DE compared to the rest, e.g. the smallest biological group. ADD REPLY 0 Entering edit mode Hi Mike, I tried several filtering methods but none seemed to resove the problem. I made the problem smaller by just keeping two genes in the data with similar baseMean and variance. If I use ~condition+run as design, one gene gets a very low dispersion estimate and the other a very high one (upper row is normalized count, low row the residuals, factes the different runs): Now comes the interesting part: If I either drop a few of the runs or change the design to ~1, both genes get a more meaningful dispersion estimate. Dropped runs (both genes get the exact same dispGeneEst): and only intercept model: I have a hard time finding out what goes "wrong" in estimateDispersionsGeneEst. How can I continue? ADD REPLY 0 Entering edit mode Can you show me the dispersion plot when you use a filtering method? E.g. n set to the number of samples in a condition, or in a run? ADD REPLY 0 Entering edit mode Here is the plot when setting n to the samples in a condition, which makes up a biological group: Even when I use a more aggressive filter and set n to the number of samples in a run (48), the problem still remains: ADD REPLY 0 Entering edit mode dear Frederik, I have some ideas but maybe it would be easier for me to take a look at some of the data. Could you send me a random subset of the dds? E.g. 1000 genes from the set with mean counts > 10? You can remove any sample information except run and condition. You can email to michaelisaiahlove at gmail. ADD REPLY 0 Entering edit mode hi Frederik, I took a look, and an issue here is that there is confounding between run and your biological condition. Here's a heatmap to visualize this, where the light blue blocks are n=3. library(pheatmap) with(colData(dds), pheatmap(table(condition, run), scale="none", show_rownames=FALSE)) This means that you can't reliably separate the "run" and the "condition" effect on counts, although it is not perfect confounding, or else DESeq2 would have thrown an error. There seems to be a single biological condition at the top which spans runs, and this means it's not perfect confounding, but nearly so. It's still a problem though for estimation, even though the software didn't print an error. Since the biological groups are for the most part nested within runs, I think you need to just use a design of ~condition, or else you risk the "run" term removing the true biological effects. ADD REPLY 0 Entering edit mode Hi Mike, thank you for having a look at my data! I'm also aware of the problem of "almost" confounding, but thought I could get away with it, since most runs share two conditions with other runs. We are currently setting up new runs to re-measure previous conditions and de-confound the design. However, if I just use ~condition as design, I still get high dispersion estimates: Do you think this is now because of the batch effect I do not take into consideration? Is there anything else I can try in the meantime than just wait for more data to arrive? ADD REPLY 0 Entering edit mode hi Frederik, I'm not sure why it trips up on the dispersion estimation with ~condition, but I did notice that you have 48 samples in one condition. This group is going to dominate any dispersion estimation task, as most of the other groups have 3 samples and so aren't providing as much information with respect to the dispersion estimates. I'd use the following approach then: dds2 <- dds[,dds$condition == "WT_k1_c0"]
dds2 <- estimateDispersions(dds2)
plotDispEsts(dds2)
dispersions(dds) <- dispersions(dds2)
dds <- nbinomWaldTest(dds)

This gives dispersion estimates which look reasonable on the dispersion plot, and because of the size of this group, it's going to dominate the final estimates even if you used all the small conditions as well. Remember, the DESeq2 model assumes that the dispersion is the same across all conditions, so picking one group doesn't violate the model at all, and you've got plenty of precision from n=48.

Still, I'd recommend to use ~condition so you don't remove biological differences due to the strong partial confounding of run and condition.

0
Entering edit mode

One more note. estimateDispersions() will be fast, you don't need to parallelize. You only need to run estimateSizeFactors() before this, which is also fast.

nbinomWaldTest() is still slow, because you have 100s of conditions and 600 samples.You can easily split across multiple cores and recombine with BiocParallel:

idx <- factor(sort(rep(seq_len(nworkers), length.out=nrow(dds))))
dds <- do.call(rbind,
bplapply(levels(idx), function(l) {
nbinomWaldTest(dds[idx == l,])
}, BPPARAM=BPPARAM)
)