Hi!
On top of a standard DE test with limma-voom, I want to carry out a differential variability analysis between two conditions (diets 1 and 2), controlling for categorical batches (e.g., sequencing days) and surrogate variables explaining non-condition/batch structure in the data. The goal is to identify differences in variability that may not purely be a consequence of differences in mean (e.g., control of transcription/degradation rather than increased transcription). My dataset has >100 biological replicates per condition so I got good power to detect them if they exist.
My main questions are if the vst()-based Levene's test approach below is correctly coded and/or sensible?
counts # count matrix
all_covariates # dataframe with conditions (1 or 2), seqday (1 or 2 or 3) and SVs (continuous)
conditions<- as.factor(all_covariates$conditions)
seqday <- as.factor(all_covariates $seqday)
sv <- as.numeric(all_covariates $sv)
data <- DESeqDataSetFromMatrix(countData = counts,
colData = all_covariates,
design = ~conditions)
vst.data = vst(data, blind=F)
contrasts(seqday) <- contr.sum(levels(seqday))
batches <- model.matrix(~seqday+sv)
vst.counts.bc <- limma::removeBatchEffect(vst.counts, covariates = batches[,-1], design = ~conditions)
pvalues <- sapply(1:nrow(vst.counts.bc), function(i){
data <- cbind.data.frame(gene = as.numeric(t(vst.counts.bc[i,])), conditions)
result <- leveneTest(gene~conditions, data, center='median')
p <- result$`Pr(>F)`[1]
return(p)
})
bh <- p.adjust(pvalues, method = "BH")
For some extra context and correct me if I'm wrong - my reading of the source code is that the vst() corrects for library sizes and a mean-variance relationship that varies from gene to gene per condition, in particular addressing the dispproportionately high variances of genes that are either super-low and super-high in mean expression. Furthermore, the median-centred Levene's test should be robust to deviations from normality. Hence, the above approach feels sound. Yet, I haven't seen a paper go for this. Instead, they use packages to fit negative binomials models on TMM+CPM data that model changes in variability as an overdispersion parameter obeying a quadratic mean-variance relationship (e.g., https://journals.physiology.org/doi/full/10.1152/physiolgenomics.00128.2018?), in spite of some papers reporting higher-order polynomial mean-variance relationships in RNAseq data (https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-020-00842-z).
Thanks for the quick and reassuring response Mike! Some follow-ups if I may on the second point.
1) Any comment on the appropriate summary statistic to report increases and decreases in variability? I've been using the fold change in MAD estimated on condition-specific subsets of vst() data so far. I take it the NB method would be the same but for the dispersion?
2) I take it I'm not supposed to use the vst() data for the NB GLM but TMM-adjusted CPM?
3) As such, would the dispersions estimated in each method be different? And thus have an impact on the genes called DV and/or DE?
See Gordon's reply below for an implementation you can use directly.