I have different species, and I'm interested both in DE and in exploring the variance of expression of their genes, which I calculate from the rlog values; e.g., higher variance in the genes of one species might reflect its adaptation to a heterogeneous environment.
I am trying to understand what the blind parameter actually does. The current recommendation is to set blind=FALSE for downstream analysis, that is, NOT to blind the transform to the experimental design (which in my case is species). From the help file:
blind: logical, whether to blind the transformation to the experimental design. blind=TRUE should be used for comparing samples in a manner unbiased by prior information on samples, for example to perform sample QA (quality assurance). blind=FALSE should be used for transforming data for downstream analysis, where full use of the design information should be made. blind=FALSE will skip re-estimation of the dispersion trend, if this has already been calculated. If many genes have large differences in counts due to the experimental design, it is important to set blind=FALSE for downstream analysis.
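For reference, the two settings amount to this (a minimal sketch, assuming an existing DESeqDataSet called dds with design ~species):

```r
library(DESeq2)
# blind=TRUE: re-estimates dispersions under an intercept-only design (sample QA)
rld_qa <- rlog(dds, blind = TRUE)
# blind=FALSE: uses the design (~species) when fitting the dispersion trend
rld <- rlog(dds, blind = FALSE)
```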
I have also read this discussion: https://stat.ethz.ch/pipermail/bioconductor/2014-February/057965.html, where Mike Love said:
When blind=FALSE, the experimental design is only used by the VST and rlog transformations in calculating the gene-wise dispersion estimates, in order to fit a trend line through the dispersions over the mean. Only the trend line is then used by the transformations, not the gene-wise estimates. Therefore, for visualization, clustering, or machine learning applications I tend to recommend blind=FALSE.
The downside of setting blind=TRUE is that large differences due to the experimental design (e.g., cell types in your case, or different water columns in the linked discussion above) will inflate the gene-wise dispersion estimates. When most of the genes contain such large differences across conditions, this will raise the trend line, and then the transformed values will be greatly shrunken toward each other for most genes, which is an undesirable loss of signal.
Since I'm interested in the gene-wise dispersion itself, it's not clear to me how the transform affects the signal I want to detect. From the discussion I linked, it appears I should set blind=FALSE to avoid shrinking the values toward each other? But I'm not sure how the transform affects the variance.
Or should I calculate the variances simply from normTransform, CPM, or FPKM values?
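For example, this is roughly what I mean (a sketch; normTransform gives log2 of normalized counts plus a pseudocount of 1, and the 'species' column and its levels are assumptions):

```r
library(DESeq2)
ntd <- assay(normTransform(dds))   # log2(normalized counts + 1)
# per-gene variance within each species
is_a <- colData(dds)$species == "speciesA"
var_a <- apply(ntd[, is_a], 1, var)
var_b <- apply(ntd[, !is_a], 1, var)
```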
Yeah, I agree with Kevin that you'd be better off comparing dispersion values. You can make two datasets with design=~1 and then estimate the dispersion for each one and compare. I would recommend filtering on baseMean then, as the dispersion estimate is noisy for small counts (see plotDispEsts, for example).
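For example, a sketch of that workflow (hypothetical object names: counts1/coldata1 and counts2/coldata2 are the per-species count matrices and sample tables, and the baseMean cutoff of 10 is arbitrary):

```r
library(DESeq2)
# one dataset per species, with an intercept-only design so that
# biological variability is not absorbed by a species term
dds1 <- DESeqDataSetFromMatrix(counts1, coldata1, design = ~ 1)
dds2 <- DESeqDataSetFromMatrix(counts2, coldata2, design = ~ 1)
dds1 <- DESeq(dds1)
dds2 <- DESeq(dds2)
# filter on baseMean, then pull the gene-wise dispersion estimates
keep1 <- mcols(dds1)$baseMean > 10
keep2 <- mcols(dds2)$baseMean > 10
disp1 <- mcols(dds1)$dispGeneEst[keep1]
disp2 <- mcols(dds2)$dispGeneEst[keep2]
plotDispEsts(dds1)   # visualize how noisy the gene-wise estimates are
```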
Thank you for the suggestion! The idea I had was to calculate the variance of each gene for each group and run a difference-in-variance statistical test between groups for each gene.
Michael, can you explain what you mean by two datasets? Do you mean a dataset per species, estimating the dispersion for each? How would I take batch effects into account? Should it be design=~batch instead of design=~1?
Is this the baseMean filtering?
Yeah, if you have batches, then design=~batch. I just meant, after running either estimateSizeFactors or DESeq, to do the baseMean filtering.
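A minimal sketch of that filtering (assuming size factors have already been estimated; the cutoff of 10 is an arbitrary example):

```r
# mean of normalized counts as a baseMean-style filter
keep <- rowMeans(counts(dds, normalized = TRUE)) > 10
dds <- dds[keep, ]
```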
I don't have a statistical test in DESeq2; this was more a suggestion for a qualitative comparison. There are other packages for differential variance testing for RNA-seq, but I don't know offhand which are the best. Those likely take counts as input, though, not rlog or VST data.
Thank you. There is extensive literature on testing differences in scale (as opposed to differences in location), so I was planning to implement a number of those tests.
Do you have the names of the packages that do differential variance testing for RNA-seq? I have not seen them before and did not find them when I googled. I can look at the algorithms they use; I know that many tests assume similar means.
Edited: In this case, is it suitable to use dispersion as the measure of difference in scale, and would I be able to run the difference-in-scale tests myself on the extracted dispersions? (https://support.bioconductor.org/p/75260/)
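For what it's worth, a sketch of the kind of per-gene scale test I have in mind, using fligner.test from base R as one example (the Fligner-Killeen test is robust to differences in location; the 'species' column name is an assumption):

```r
library(DESeq2)
rld <- assay(rlog(dds, blind = FALSE))    # genes x samples matrix
species <- factor(colData(dds)$species)   # assumed grouping column
# Fligner-Killeen test of homogeneity of variances, run per gene
pvals <- apply(rld, 1, function(x) fligner.test(x, species)$p.value)
padj <- p.adjust(pvals, method = "BH")
```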
There is some discussion of methods here (although this particular method is for DNA methylation):
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0465-4
This method for difference in distributions should work if you have sufficient sample size per group (which you would need anyway to reliably test for differences in variance):
https://www.bioconductor.org/packages/release/bioc/html/scDD.html
I don't have any recommendation for statistical testing of a difference in the dispersion parameter; I was suggesting a qualitative comparison only. For a test, I'd go with a method that has been benchmarked. I found the ones above just by googling "difference in variance/distribution rna-seq" for a few minutes.
Thanks Michael. I found MDSeq (https://academic.oup.com/nar/article/45/13/e127/3848359), which, like diffVar for DNA methylation, uses an implementation of the Levene test.
The Lepage test simultaneously tests both location and scale; if I explore that test, would it be appropriate to use the means of the rlog values together with the dispersion values?
I haven't really explored this, so you're on your own. Rlog or VST seems like a good idea.