Skip estimateDispersions() DESeq2
@nickydriedonks-10918

Dear all,

In my experimental design, I have 20 libraries, coming from 20 F2 interspecific hybrids, meaning every genotype is different. 10 individuals are tolerant and the other 10 are sensitive to a specific type of stress.

I’m interested in which genes are differentially expressed between these two groups of plants (tolerant vs sensitive).

Because each individual has a unique genotype, I notice that the variance within a group (tolerant or sensitive) is large. As I do have a biological explanation for this, I don’t want to correct for this variance.

This is why I’d like to skip the estimateDispersions() function in DESeq2 and continue with the nbinomWaldTest() function directly after library normalization. However, this function requires the dispersion estimates produced by estimateDispersions().

> cds<-estimateSizeFactors(dds)
> skipdisp<-nbinomWaldTest(cds)
Error in nbinomWaldTest(cds) :
  testing requires dispersion estimates, first call estimateDispersions()

I hope someone can help me out, or has other suggestions for analyzing these data.

Much appreciated,

Nicky

I’m using

R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

‘DESeq2’ version 1.10.1

1
@mikelove
United States

You need to have a sense of the variability within the groups in order to perform inference on whether the groups are different. The dispersion estimation is just that: seeing how different the individuals within a group are from each other, so that we can determine how likely it is that we would see these 10 with so much higher or lower values than those other 10, under the assumption that the groups actually have equal means (the "null hypothesis").
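As a minimal sketch of the call order the error message asks for, the following uses simulated counts from makeExampleDESeqDataSet() as a stand-in for the real 10 vs 10 data. The simulated data set, its size, and the seed are assumptions for illustration, not the poster's data.

```r
library(DESeq2)

set.seed(1)
# Simulated stand-in for the real experiment: 500 genes, 20 samples.
# makeExampleDESeqDataSet() assigns 10 samples each to conditions A and B.
dds <- makeExampleDESeqDataSet(n = 500, m = 20)

dds <- estimateSizeFactors(dds)   # library size normalization
dds <- estimateDispersions(dds)   # within-group variability, required before testing
dds <- nbinomWaldTest(dds)        # Wald test on the condition coefficient

res <- results(dds)
head(res[order(res$pvalue), ])
```

Calling DESeq(dds) runs the same three steps in sequence.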

1
@ryan-c-thompson-5618
Scripps Research, La Jolla, CA

Any test of differential expression is necessarily a comparison between intra-group variation and inter-group variation, which in this case are dispersion and log fold change, respectively. You're asking to compute a fraction using only a numerator and no denominator - it's a nonsensical idea. If your experimental system has inherently high variance, you can't just eliminate that variance by pretending it doesn't exist. You could simply compute the fold change for each gene and use that to rank your genes, but you will have no estimate of significance for each gene and your gene list will contain such a high proportion of false positives as to be useless.
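To illustrate why ranking by fold change alone is risky when within-group variance is high, here is a rough base R simulation. It assumes negative binomial counts with the same true mean in both groups, so every gene is null; the mean of 100, the two dispersion values, and the 1.5-fold cutoff are arbitrary choices for illustration.

```r
set.seed(42)
n_genes <- 2000
n_per_group <- 10
mu <- 100  # same true mean in both groups, so every gene is null

# Log2 fold change between two groups drawn from the same
# negative binomial distribution with the given dispersion.
null_lfc <- function(dispersion) {
  # rnbinom() is parameterized with size = 1 / dispersion
  a <- matrix(rnbinom(n_genes * n_per_group, mu = mu, size = 1 / dispersion),
              nrow = n_genes)
  b <- matrix(rnbinom(n_genes * n_per_group, mu = mu, size = 1 / dispersion),
              nrow = n_genes)
  # 0.5 pseudocount guards against taking the log of zero
  log2((rowMeans(a) + 0.5) / (rowMeans(b) + 0.5))
}

lfc_low  <- null_lfc(0.01)  # tight groups
lfc_high <- null_lfc(1)     # highly variable groups

# Spread of fold changes when nothing is truly different
sd(lfc_low)
sd(lfc_high)

# Fraction of null genes exceeding a 1.5-fold change in either direction
mean(abs(lfc_low)  > log2(1.5))
mean(abs(lfc_high) > log2(1.5))
```

With high dispersion, a sizeable share of null genes shows large fold changes by chance, which is exactly what the dispersion estimate lets the test account for.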

The only possible approach I can think of would be to do multiple replicates for each genotype and then use limma's duplicateCorrelation to model the inter-genotype variation as a random effect. (I'm not a plant biologist, so I'm not sure how you'd do replicates of individual F2 plants. I guess you'd have to break off parts of the plant and re-plant them to obtain multiple clones of the same genotype?) Obviously this isn't something you can do with the samples you have now, though.

1

Agree.

It's worth questioning, though, whether the proposed extra effort helps with the question at hand. Generating more replicates for each genotype won't necessarily give you a substantially different result in your comparison of these 10 genotypes vs those 10 genotypes. It will add precision to the individual genotype estimates, but it is the within-group, across-genotype variability that drives the significance of the test across groups.

0

That's a good point: the within-genotype variance that would be estimated by duplicateCorrelation is not very informative for the comparison in question.

0
@nickydriedonks-10918

Thanks a lot for your quick response!

With respect to generating more replicates, I'm afraid it's not possible to do this as it's pretty costly.

This is why we were thinking of treating the 10 plants within each group as "replicates", although they are not true replicates genotype-wise.

So if I understand correctly, this function only provides estimations. However, I thought it also performs shrinkage, as I thought this is done automatically in the DESeq() function.

That is what concerns me with respect to these genes that are highly different between the samples from the same group.

0

"this function only provides estimations"

This is the same as in any parametric statistical test: you use the data to estimate parameters, and then compute probabilities using the estimated parameters.

"I thought it also performs shrinkage"

Yes, you shrink the dispersion estimates to generate more reliable estimates using the information from all genes (see the DESeq2 paper for more details).

"That is what concerns me with respect to these genes that are highly different between the samples"

This is all automatically handled by the statistical model. Genes are allowed to have different dispersion values. And as you increase the sample size, the shrinkage diminishes. So a 3 vs 3 experiment will exhibit more shrinkage than the same experiment after you gathered more samples until it was 10 vs 10.
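One way to look at this yourself is to compare the gene-wise dispersion estimates with the final shrunken values at two sample sizes. The sketch below uses simulated data from makeExampleDESeqDataSet(); the helper shrinkage_amount() and the summary it computes (median absolute log2 ratio) are illustrative choices, not part of DESeq2.

```r
library(DESeq2)

# Median absolute log2 ratio between the final (shrunken) dispersion
# and the gene-wise estimate, for a simulated data set with m samples.
# shrinkage_amount() is a hypothetical helper written for this example.
shrinkage_amount <- function(m) {
  dds <- makeExampleDESeqDataSet(n = 1000, m = m)
  dds <- estimateSizeFactors(dds)
  dds <- estimateDispersions(dds)
  gene_est <- mcols(dds)$dispGeneEst  # per-gene estimate before shrinkage
  final    <- dispersions(dds)        # value used for testing
  # genes with all-zero counts have NA estimates and are dropped here
  median(abs(log2(final / gene_est)), na.rm = TRUE)
}

set.seed(1)
shrinkage_amount(6)    # 3 vs 3
shrinkage_amount(20)   # 10 vs 10
```

If the model behaves as described above, the second value should generally come out smaller than the first. On real data, plotDispEsts(dds) shows the gene-wise and final estimates against the fitted trend.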

0

What you are describing is an experimental tradeoff you have made - you can't get true replicates, so you are substituting something else. This has nothing to do with how you fit the model, but it certainly has an impact on how you interpret your results. In other words, you have to fit the model the way Mike and Ryan keep telling you, or it just won't work. And there are some downsides to doing that with replicates that aren't really replicates. But that is an unfortunate aspect of making tradeoffs - you are implicitly trading some interpretability for the ability to actually do the experiment.