Question

Estimating group-specific dispersion in DESeq2

0

Entering edit mode

bkellman • 0

@bkellman-11553

Last seen 8.6 years ago

Is there a reason that the 'per-sample' dispersion calculation was discontinued for DESeq2? I am working with ASD samples which have been observed [ http://dx.doi.org/10.1371/journal.pone.0016715 ] to exhibit lower variance. I need to estimate the ASD and typical dispersion separately. Can I do this with DESeq2?

I can run DESeq::estimateDispersions(...) but that produces two dispersion vectors. How I'm not sure how to integrate these dispersion vectors into DESeq2. It seems like there is only room from one dispersion vector in the analysis.

Ben

deseq2 deseq estimatedispersions • 3.9k views

ADD COMMENT • link updated 9.3 years ago by Michael Love 43k • written 9.3 years ago by bkellman • 0

score 0 · Answer 1 · 2016-11-04

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 3 days ago

United States

Ok, so let's link to the original post as well:

C: Estimation of dispersion in DeSeq and DeSeq2

As I said, DESeq2 only supports a single dispersion value per gene, and this was for generalizing the code base to all designs. Usually it's fine to use a single dispersion value per gene, because the condition with the larger dispersion (if they are truly different across groups) will raise the single estimated value. Can you give an example of a gene in which the dispersion estimates are very discordant, maybe show the plotCounts() for this gene? Note that the dispersion is not the same as the variance. The single dispersion value allows for different groups to have different variance.

Or you can go ahead and use DESeq (1), or I believe baySeq, to have different dispersion values per gene.

ADD COMMENT • link 9.3 years ago Michael Love 43k

0

Entering edit mode

I ran a quick preliminary check. It looks like the correlation of dispersions diverges quickly when the min rowSums is raised from 1,10,100 (for 100 samples: 70 ASD, 30 TD). So for genes with a base mean > 1 there is little cohesion between the separate assessments. Let me get an example gene.

> cutoff=1
> tmp0=DESeq::newcountdataset ( conditions=dds_tmp$dx_clean)(counts(dds_tmp)[rowSums(counts(dds_tmp))>cutoff,])
> tmp0=DESeq::estimateSizeFactors(tmp0)
> tmp12 = DESeq::estimateDispersions( tmp0 ,method = "per-condition" )
> tmp22 = DESeq::estimateDispersions( tmp0 ,method = "pooled" )
>cor(data.matrix(data.frame(disp1=tmp12@featureData@data$disp_1,disp2=tmp12@featureData@data$disp_2,pooled=tmp22@featureData@data$disp_pooled)),method='pearson')

           disp1     disp2    pooled
disp1  1.0000000 0.9672386 0.9907651
disp2  0.9672386 1.0000000 0.9759009
pooled 0.9907651 0.9759009 1.0000000

>
> cutoff=10
> tmp0=DESeq::newcountdataset ( conditions=dds_tmp$dx_clean)(counts(dds_tmp)[rowSums(counts(dds_tmp))>cutoff,])
> tmp0=DESeq::estimateSizeFactors(tmp0)
> tmp12 = DESeq::estimateDispersions( tmp0 ,method = "per-condition" )
> tmp22 = DESeq::estimateDispersions( tmp0 ,method = "pooled" )
cor(data.matrix(data.frame(disp1=tmp12@featureData@data$disp_1,disp2=tmp12@featureData@data$disp_2,pooled=tmp22@featureData@data$disp_pooled)),method='pearson')
           disp1     disp2    pooled
disp1  1.0000000 0.8870488 0.9557692
disp2  0.8870488 1.0000000 0.9216594
pooled 0.9557692 0.9216594 1.0000000

>
> cutoff=100
> tmp0=DESeq::newcountdataset ( conditions=dds_tmp$dx_clean)(counts(dds_tmp)[rowSums(counts(dds_tmp))>cutoff,])
> tmp0=DESeq::estimateSizeFactors(tmp0)
> tmp12 = DESeq::estimateDispersions( tmp0 ,method = "per-condition" )
> tmp22 = DESeq::estimateDispersions( tmp0 ,method = "pooled" )
>cor(data.matrix(data.frame(disp1=tmp12@featureData@data$disp_1,disp2=tmp12@featureData@data$disp_2,pooled=tmp22@featureData@data$disp_pooled)),method='pearson')

           disp1     disp2    pooled
disp1  1.0000000 0.6627738 0.8129001
disp2  0.6627738 1.0000000 0.8345221
pooled 0.8129001 0.8345221 1.0000000

ADD REPLY • link 9.3 years ago bkellman • 0

0

Entering edit mode

I'd use DESeq2 to estimate the gene's dispersions for each group, these are more accurate per gene than DESeq. See instructions in my previous answer. And then I'm mostly interested in the difference for individual genes and the plotCounts() figure.

ADD REPLY • link 9.3 years ago Michael Love 43k

0

Entering edit mode

Hi Michael - Is there a cut off for differences in dispersions as to whether the contrast function is suitable to use over testing pairwise groups by inputting the pairs of data in independently?

ADD REPLY • link 8.8 years ago bekah ▴ 40

0

Entering edit mode

No strict cutoff. Take a look at the PCA plot and see if the pairs are very different. You can make the plot and post an image here (using imgur.com for example) if you like.

ADD REPLY • link 8.8 years ago Michael Love 43k

0

Entering edit mode

This was the pca created from logtransformed data using blind = FALSE to make sure biological variation from treatment was not smoothed out excessively. Colours are the experimental treatment groups (i.e. 5 replicates per treatment). I think that the dispersal is too varied and independent pairwise comparison would be best. Thank you again for all your advice.