Question: DESeq2 Dispersion Per Condition
dsg160 wrote (3.1 years ago):

Dear Michael,

I am hoping to use DESeq2 to analyze the sequencing results of a Massively Parallel Reporter Assay (MPRA) that I performed. I was wondering if you might be able to help me answer a question that's been stumping me.

In this assay, we transfect cells with a diverse library of plasmids and then perform RNA-seq to assess the expression of each plasmid. My ultimate goal is to determine if a given sequence takes up a greater (or lesser) fraction in the RNA-seq library than it takes up in the plasmid library (and thus was differentially expressed).

I performed 6 biological replicates of RNA-seq (independent transfections), and 6 replicates of sequencing library preparation from the plasmid library that was used for transfection. The issue I'm running into now is that the dispersion in the RNA-seq replicates is much higher than the dispersion in the plasmid library prep replicates. I know that the original DESeq calculated dispersion for each condition separately. My question is:

Does DESeq2 still calculate dispersion for each condition separately, such that the low dispersion of my plasmid reps will not artificially lower the overall dispersion?

I would greatly appreciate any help you might be able to provide. Thanks a bunch!

-Dustin

Michael Love wrote (3.1 years ago):

DESeq2 calculates a single dispersion value for each gene. This means that if one group has a higher dispersion than the other, the gene-wise estimate will be somewhere in the middle. Remember, too, that in the last step, information is shared across all genes to moderate the dispersion estimates toward the trend for genes with a similar mean (see the DESeq2 paper).

I would guess that having the dispersion value in the middle is not a big issue for sensitivity and specificity. There may be a small gain for modeling each condition with its own dispersion, but the big gain in performance for DE methods comes from sharing information about dispersion across genes.

You can estimate dispersions for each group separately (build a DESeqDataSet for each group with design ~1 and run estimateDispersions). You could then compare these gene-wise estimates (mcols(dds.sub)$dispGeneEst) to see how different they really are. You could also compare the results obtained with the overall dispersion estimate against those obtained with the dispersion estimate from the group that you suspect has higher dispersion. Note you can do:

dispersions(dds) <- dispersions(dds.sub)
dds <- nbinomWaldTest(dds)
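
The comparison described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the counts are simulated with makeExampleDESeqDataSet() rather than taken from a real plasmid/RNA experiment, the helper groupDispersions() and the object names dds.A, dds.B, and dds.alt are made up for this example, and the filter on zero counts is one possible way to avoid NA dispersions in a subset.

```r
# Sketch only: simulated counts stand in for the real plasmid and RNA libraries.
library(DESeq2)

set.seed(1)
# Simulated data with a two-level factor 'condition' (levels A and B), 6 samples each
dds <- makeExampleDESeqDataSet(n = 1000, m = 12, betaSD = 1)

# Keep genes with at least one read in each group, so that neither subset
# contains all-zero rows (those rows would get NA dispersions)
isB <- dds$condition == "B"
keep <- rowSums(counts(dds)[, !isB]) > 0 & rowSums(counts(dds)[, isB]) > 0
dds <- dds[keep, ]
dds <- estimateSizeFactors(dds)  # computed once on all samples, inherited by the subsets

# Hypothetical helper: dispersions within one group, using an intercept-only design
groupDispersions <- function(object, level) {
  sub <- object[, object$condition == level]
  design(sub) <- ~ 1
  estimateDispersions(sub)
}
dds.A <- groupDispersions(dds, "A")
dds.B <- groupDispersions(dds, "B")

# How different are the gene-wise estimates? (log2 ratio, B over A)
print(summary(log2(mcols(dds.B)$dispGeneEst / mcols(dds.A)$dispGeneEst)))

# Usual analysis: one dispersion per gene, estimated from both groups
dds <- estimateDispersions(dds)
res.pooled <- results(nbinomWaldTest(dds))

# Alternative: substitute the final dispersions from group B before testing
dds.alt <- dds
dispersions(dds.alt) <- dispersions(dds.B)
res.alt <- results(nbinomWaldTest(dds.alt))

# Number of genes called at 10% FDR under each choice
print(c(pooled = sum(res.pooled$padj < 0.1, na.rm = TRUE),
        groupB = sum(res.alt$padj < 0.1, na.rm = TRUE)))
```

Because the simulation uses the same dispersion in both groups, the two sets of estimates should look similar here; with real data the log2 ratios would show how much the groups actually differ.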

"There may be a small gain for modeling each condition with its own dispersion"

--> If you expect broad differences in gene expression between samples (e.g., healthy embryo vs sick liver?), don't you think the gain in power would be substantial?

Do you mean broad differences in dispersion (CV)? Again, a single dispersion per gene does not mean the model assumes constant variance; the variance still changes with the mean:

boxplot(rnbinom(100, mu=rep(c(5,100),each=50), size=1/.1) ~ factor(rep(1:2,each=50)))
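
To put numbers on the boxplot above, here is a small sketch using the same parameterization (size = 1/dispersion), in which the negative binomial variance is mu + alpha * mu^2. The variable names and the number of simulated draws are arbitrary choices for illustration.

```r
set.seed(1)
alpha <- 0.1          # dispersion shared by both groups, as in the boxplot
mu <- c(5, 100)       # the two group means

# Theoretical variance and coefficient of variation under the NB model
var_theory <- mu + alpha * mu^2
cv_theory <- sqrt(var_theory) / mu

# Empirical variance from a large simulated sample for each mean
x <- lapply(mu, function(m) rnbinom(1e5, mu = m, size = 1 / alpha))
var_sim <- sapply(x, var)

print(rbind(mu = mu, var_theory = var_theory, var_sim = var_sim, cv_theory = cv_theory))
```

With the same dispersion of 0.1, the theoretical variance is 7.5 at a mean of 5 and 1100 at a mean of 100, so groups with very different expression levels get very different variances without needing separate dispersions.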

dsg160 wrote (3.1 years ago):

Thanks so much for confirming this, and for responding so quickly! I will do comparisons between the different methods you suggested to see what I am working with.