Question: Estimating group-specific dispersion in DESeq2
0
2.8 years ago by
bkellman0
bkellman0 wrote:

Is there a reason that the 'per-sample' dispersion calculation was discontinued for DESeq2? I am working with ASD samples which have been observed [ http://dx.doi.org/10.1371/journal.pone.0016715 ] to exhibit lower variance. I need to estimate the ASD and typical dispersion separately. Can I do this with DESeq2?

I can run DESeq::estimateDispersions(...) but that produces two dispersion vectors. How I'm not sure how to integrate these dispersion vectors into DESeq2. It seems like there is only room from one dispersion vector in the analysis.

Ben

modified 2.8 years ago by Michael Love24k • written 2.8 years ago by bkellman0
Answer: Estimating group-specific dispersion in DESeq2
0
2.8 years ago by
Michael Love24k
United States
Michael Love24k wrote:

Ok, so let's link to the original post as well:

C: Estimation of dispersion in DeSeq and DeSeq2

As I said, DESeq2 only supports a single dispersion value per gene, and this was for generalizing the code base to all designs. Usually it's fine to use a single dispersion value per gene, because the condition with the larger dispersion (if they are truly different across groups) will raise the single estimated value. Can you give an example of a gene in which the dispersion estimates are very discordant, maybe show the plotCounts() for this gene? Note that the dispersion is not the same as the variance. The single dispersion value allows for different groups to have different variance.

Or you can go ahead and use DESeq (1), or I believe baySeq, to have different dispersion values per gene.

I ran a quick preliminary check. It looks like the correlation of dispersions diverges quickly when the min rowSums is raised from 1,10,100 (for 100 samples: 70 ASD, 30 TD). So for genes with a base mean > 1 there is little cohesion between the separate assessments. Let me get an example gene.

> cutoff=1
> tmp0=DESeq::newcountdataset ( conditions=dds_tmp$dx_clean)(counts(dds_tmp)[rowSums(counts(dds_tmp))>cutoff,]) > tmp0=DESeq::estimateSizeFactors(tmp0) > tmp12 = DESeq::estimateDispersions( tmp0 ,method = "per-condition" ) > tmp22 = DESeq::estimateDispersions( tmp0 ,method = "pooled" ) >cor(data.matrix(data.frame(disp1=tmp12@featureData@data$disp_1,disp2=tmp12@featureData@data$disp_2,pooled=tmp22@featureData@data$disp_pooled)),method='pearson')

disp1     disp2    pooled
disp1  1.0000000 0.9672386 0.9907651
disp2  0.9672386 1.0000000 0.9759009
pooled 0.9907651 0.9759009 1.0000000

>
> cutoff=10
> tmp0=DESeq::newcountdataset ( conditions=dds_tmp$dx_clean)(counts(dds_tmp)[rowSums(counts(dds_tmp))>cutoff,]) > tmp0=DESeq::estimateSizeFactors(tmp0) > tmp12 = DESeq::estimateDispersions( tmp0 ,method = "per-condition" ) > tmp22 = DESeq::estimateDispersions( tmp0 ,method = "pooled" ) cor(data.matrix(data.frame(disp1=tmp12@featureData@data$disp_1,disp2=tmp12@featureData@data$disp_2,pooled=tmp22@featureData@data$disp_pooled)),method='pearson')
disp1     disp2    pooled
disp1  1.0000000 0.8870488 0.9557692
disp2  0.8870488 1.0000000 0.9216594
pooled 0.9557692 0.9216594 1.0000000

>
> cutoff=100
> tmp0=DESeq::newcountdataset ( conditions=dds_tmp$dx_clean)(counts(dds_tmp)[rowSums(counts(dds_tmp))>cutoff,]) > tmp0=DESeq::estimateSizeFactors(tmp0) > tmp12 = DESeq::estimateDispersions( tmp0 ,method = "per-condition" ) > tmp22 = DESeq::estimateDispersions( tmp0 ,method = "pooled" ) >cor(data.matrix(data.frame(disp1=tmp12@featureData@data$disp_1,disp2=tmp12@featureData@data$disp_2,pooled=tmp22@featureData@data$disp_pooled)),method='pearson')

disp1     disp2    pooled
disp1  1.0000000 0.6627738 0.8129001
disp2  0.6627738 1.0000000 0.8345221
pooled 0.8129001 0.8345221 1.0000000

I'd use DESeq2 to estimate the gene's dispersions for each group, these are more accurate per gene than DESeq. See instructions in my previous answer. And then I'm mostly interested in the difference for individual genes and the plotCounts() figure.

Hi Michael - Is there a cut off for differences in dispersions as to whether the contrast function is suitable to use over testing pairwise groups by inputting the pairs of data in independently?

No strict cutoff. Take a look at the PCA plot and see if the pairs are very different. You can make the plot and post an image here (using imgur.com for example) if you like.

​This was the pca created from logtransformed data using blind = FALSE to make sure biological variation from treatment was not smoothed out excessively. Colours are the experimental treatment groups (i.e. 5 replicates per treatment). I think that the dispersal is too varied and independent pairwise comparison would be best. Thank you again for all your advice.