Question

Differential variability modelling with vst() expression

0

Entering edit mode

jtsy6383 • 0

@58973972

Last seen 10 hours ago

Germany

Hi!

On top of a standard DE test with limma-voom, I want to carry out a differential variability analysis between two conditions (diets 1 and 2), controlling for categorical batches (e.g., sequencing days) and surrogate variables explaining non-condition/batch structure in the data. The goal is to identify differences in variability that may not purely be a consequence of differences in mean (e.g., control of transcription/degradation rather than increased transcription). My dataset has >100 biological replicates per condition so I got good power to detect them if they exist.

My main questions are if the vst()-based Levene's test approach below is correctly coded and/or sensible?

    counts # count matrix
    all_covariates # dataframe with conditions (1 or 2), seqday (1 or 2 or 3) and SVs (continuous)
    conditions<- as.factor(all_covariates$conditions)
    seqday <- as.factor(all_covariates $seqday)
    sv <- as.numeric(all_covariates $sv)

    data <- DESeqDataSetFromMatrix(countData = counts,
                                   colData = all_covariates,
                                   design =  ~conditions) 
    vst.data = vst(data, blind=F) 

    contrasts(seqday) <- contr.sum(levels(seqday))
    batches <- model.matrix(~seqday+sv)
    vst.counts.bc  <- limma::removeBatchEffect(vst.counts, covariates = batches[,-1],  design = ~conditions) 

    pvalues <- sapply(1:nrow(vst.counts.bc), function(i){
      data <- cbind.data.frame(gene = as.numeric(t(vst.counts.bc[i,])), conditions)
      result <- leveneTest(gene~conditions, data, center='median')
      p <- result$`Pr(>F)`[1]
      return(p)
    })
    bh <- p.adjust(pvalues, method = "BH")

For some extra context and correct me if I'm wrong - my reading of the source code is that the vst() corrects for library sizes and a mean-variance relationship that varies from gene to gene per condition, in particular addressing the dispproportionately high variances of genes that are either super-low and super-high in mean expression. Furthermore, the median-centred Levene's test should be robust to deviations from normality. Hence, the above approach feels sound. Yet, I haven't seen a paper go for this. Instead, they use packages to fit negative binomials models on TMM+CPM data that model changes in variability as an overdispersion parameter obeying a quadratic mean-variance relationship (e.g., https://journals.physiology.org/doi/full/10.1152/physiolgenomics.00128.2018?), in spite of some papers reporting higher-order polynomial mean-variance relationships in RNAseq data (https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-020-00842-z).

limma DESeq2 • 60 views

ADD COMMENT • link updated 7 hours ago by Gordon Smyth 52k • written 10 hours ago by jtsy6383 • 0

score 0 · Answer 1 · 2025-01-09

Looking at differential variance across groups, where you regress out orthogonal nuisance factors from VST data, seems correct and would point you to the types of gene you are looking for.

However, you could also consider fitting dispersion in a NB GLM, and then comparing dispersion. You could test constant dispersion vs per-group with a likelihood ratio test.

score 0 · Answer 2 · 2025-01-09

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 7 hours ago

WEHI, Melbourne, Australia

See

Phipson, B., Oshlack, A. DiffVar: a new method for detecting differential variability with application to methylation in cancer and aging. Genome Biol 15, 465 (2014). https://doi.org/10.1186/s13059-014-0465-4

which uses Levene's test.

ADD COMMENT • link 7 hours ago Gordon Smyth 52k