I have a RNASeq data set with ~650 samples. When I run DESeq2, I get a very poor fit of the mean-dispersion trend:
Here is also a plot of the original trend (without final dispersion estimates), and color coded the number of dispersion iterations:
This is the output of DESeq:
In DESeqDataSet(se, design = ~condition + run) : some variables in design formula are characters, converting to factors estimating size factors estimating dispersions gene-wise dispersion estimates: 64 workers mean-dispersion relationship -- note: fitType='parametric', but the dispersion trend was not well captured by the function: y = a/x + b, and a local regression fit was automatically substituted. specify fitType='local' or 'mean' to avoid this message next time. final dispersion estimates, fitting model and testing: 64 workers -- replacing outliers and refitting for 8367 genes -- DESeq argument 'minReplicatesForReplace' = 7 -- original counts are preserved in counts(dds) estimating dispersions fitting model and testing
I also tried to supply a custom fit as described in the vignette:
dds <- estimateSizeFactors(dds) dds <- estimateDispersionsGeneEst(dds) useForMedian <- 1e-7 < mcols(dds)$dispGeneEst & mcols(dds)$dispGeneEst < 1e1 medianDisp <- median(mcols(dds)$dispGeneEst[useForMedian], na.rm=TRUE) dispersionFunction(dds) <- function(mu) medianDisp dds <- estimateDispersionsMAP(dds) dds <- nbinomWaldTest(dds)
But this resulted in an even worse fit:
Do you have any suggestions how to improve the fit?
Thanks,
Frederik
Hi Mike,
indeed there are batch effects in the data. The way samples are processed is that 48 samples are always multiplexed together. So for these samples, library prep is performed, followed by sequencing and the combination of both is encoded as a "run". Here is a PCA of the vst-transformed counts (with blind=TRUE), numbers idicate the different runs:
... and the same after batch correction (subtracting the run effects from the respective samples):
You are probably right that there are still batch effects remaining, so you would suggest to detect remaining effects with SVA/RUV and see if adding these to the design improves the situation?
If I were you, I'd try to figure out what is going on with PC1 there on the original data. That is a massive batch effect you would at least want to know about for removing such problems on future datasets. I've seen such differences before when the samples were in fact different tissues. I think you should investigate such hypotheses.
Otherwise, yes, I would try to detect remaining effects with SVA or RUV and add these to the model. You'll have to figure out how many more factors are needed.