Question: DESeq2: Poor dispersion fit, even when a local or custom fit is used
Frederik Ziebell (EMBL Heidelberg) wrote, 3 days ago:

I have an RNA-seq data set with ~650 samples. When I run DESeq2, I get a very poor fit of the mean-dispersion trend:

Here is also a plot of the original trend (without final dispersion estimates), with the number of dispersion iterations color-coded:
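(For reference, a mean-dispersion plot like the one described can be produced with DESeq2's plotDispEsts(); a minimal sketch, assuming dds has already been run through DESeq() or estimateDispersions():)

```r
library(DESeq2)

# dds is an existing DESeqDataSet with dispersions estimated
plotDispEsts(dds)  # black: gene-wise estimates, red: fitted trend, blue: final (MAP) estimates
```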

This is the output of DESeq:

In DESeqDataSet(se, design = ~condition + run) :
  some variables in design formula are characters, converting to factors
estimating size factors
estimating dispersions
gene-wise dispersion estimates: 64 workers
mean-dispersion relationship
-- note: fitType='parametric', but the dispersion trend was not well captured by the
   function: y = a/x + b, and a local regression fit was automatically substituted.
   specify fitType='local' or 'mean' to avoid this message next time.
final dispersion estimates, fitting model and testing: 64 workers
-- replacing outliers and refitting for 8367 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing


I also tried to supply a custom dispersion fit, as described in the vignette:

dds <- estimateSizeFactors(dds)
dds <- estimateDispersionsGeneEst(dds)  # gene-wise maximum-likelihood estimates
# use only genes whose estimates are away from the boundaries of the estimation grid
useForMedian <- 1e-7 < mcols(dds)$dispGeneEst & mcols(dds)$dispGeneEst < 1e1
medianDisp <- median(mcols(dds)$dispGeneEst[useForMedian], na.rm=TRUE)
# constant "trend": shrink every gene's estimate toward the median dispersion
dispersionFunction(dds) <- function(mu) medianDisp
dds <- estimateDispersionsMAP(dds)
dds <- nbinomWaldTest(dds)


But this resulted in an even worse fit:

Do you have any suggestions for how to improve the fit?



modified 3 days ago • written 3 days ago by Frederik Ziebell

Hi Mike,

Indeed, there are batch effects in the data. Samples are processed in multiplexed batches of 48: library prep is performed on each batch, followed by sequencing, and the combination of both is encoded as a "run". Here is a PCA of the vst-transformed counts (with blind=TRUE); numbers indicate the different runs:

... and the same after batch correction (subtracting the run effects from the respective samples):

You are probably right that batch effects remain, so would you suggest detecting remaining effects with SVA/RUV and seeing whether adding them to the design improves the situation?
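(For reference, a run-labeled PCA like the one described can be made as follows; a sketch assuming "run" is a column of colData(dds):)

```r
library(DESeq2)

# blind=TRUE: the transformation ignores the design, so the PCA is not
# biased toward the modeled variables
vsd <- vst(dds, blind = TRUE)
plotPCA(vsd, intgroup = "run")  # color points by sequencing/library-prep run
```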

modified 3 days ago • written 3 days ago by Frederik Ziebell

If I were you, I'd try to figure out what is going on with PC1 there in the original data. That is a massive batch effect that you would at least want to know about, so you can avoid such problems in future datasets. I've seen differences like this before when the samples were in fact different tissues. I think you should investigate such hypotheses.

Otherwise, yes, I would try to detect remaining effects with SVA or RUV and add these to the model. You'll have to figure out how many more factors are needed.

written 3 days ago by Michael Love
Michael Love (United States) wrote, 3 days ago:

hi Frederik,

My first guess here is that there are some outlier samples or residual unwanted variation in here (e.g. batch effects). Do you have a PCA plot using the vst() transformation?

I see you have "run" in the design; is this the sequencing run? I've looked in depth at batch effects, at the various steps of the RNA-seq protocol, and at how they affect quantification, and the sequencing run itself is not the biggest contributor to unwanted variation; library preparation batch is. Even if you don't have library preparation batch information, you can use SVA or RUV to detect sources of unwanted variation and control for them in the design.
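(The SVA approach can be sketched roughly as below, following the pattern in the DESeq2 workflow. This is only a sketch under assumptions: "condition" is the variable of interest, n.sv = 2 is an arbitrary choice of the number of surrogate variables, and whether to also keep "run" in the design is a judgment call.)

```r
library(DESeq2)
library(sva)

# normalized counts, filtered to reasonably expressed genes
dat <- counts(dds, normalized = TRUE)
dat <- dat[rowMeans(dat) > 1, ]

mod  <- model.matrix(~ condition, colData(dds))  # full model: biology of interest
mod0 <- model.matrix(~ 1, colData(dds))          # null model: intercept only

# estimate surrogate variables; with n.sv = NULL, svaseq estimates the number itself
svseq <- svaseq(dat, mod, mod0, n.sv = 2)

# add the surrogate variables to the design and refit
ddsSva <- dds
ddsSva$SV1 <- svseq$sv[, 1]
ddsSva$SV2 <- svseq$sv[, 2]
design(ddsSva) <- ~ SV1 + SV2 + condition
ddsSva <- DESeq(ddsSva)
```

After refitting, plotDispEsts(ddsSva) should show whether removing the unwanted variation has tightened the mean-dispersion trend.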

written 3 days ago by Michael Love

