batch effect and size factor differences: vst vs. rlog
Hannah • 0

I have gene expression data from symbiont critters responding to temperature stress, with the temperature treatments replicated across two types of experiments (in culture and in symbiosis). Both experiments were sequenced using TagSeq, but at different facilities and at different times. Because the experiments were distinct and were sequenced at different facilities, the size factors differ greatly between the two experiments. See here:

The size factors are split by experiment type

My current course of action for analyzing this data is as follows: raw counts -> batch correction for experiment type using ComBat-seq to get modified counts -> differential expression using DESeq2 to get dds -> rlog transformation -> PCAs.

My question is whether an rlog transformation is appropriate here, specifically considering the different size factors across samples. Additionally, should I be using a blind = FALSE argument in the rlog transformation to account for these size factor differences, since they are associated with the experimental design?

You can see here that, regardless of whether I use vst or rlog, or plotPCA or prcomp, we get separation by experiment type and by temperature treatment. However, I just want to make sure that I am most appropriately making the comparison, since these data are from two separate experiments.

Different normalization methods (vst vs. rlog) produce different results

DESeq2 • 398 views

I usually recommend VST for its speed and simplicity, but here rlog may be a better choice because it more directly accounts for size factor differences. I don't think blind=FALSE is a problem; it only affects the estimation of the dispersion trend.
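As a rough illustration of the rlog call discussed above, here is a minimal sketch on simulated data. The dataset comes from DESeq2's makeExampleDESeqDataSet, and the size factors of 0.5 and 2 are made-up values chosen only to mimic size factors that are confounded with condition; none of this is the poster's actual data or design.

```r
library(DESeq2)

set.seed(1)

# Simulated counts: 1000 genes x 12 samples. The first 6 samples are
# condition "A" and the last 6 are condition "B". The size factors are
# hypothetical and deliberately confounded with condition.
dds <- makeExampleDESeqDataSet(
  n = 1000, m = 12,
  sizeFactors = rep(c(0.5, 2), each = 6)
)

# Estimated size factors should split by condition, as in the question.
sf <- sizeFactors(estimateSizeFactors(dds))

# blind = FALSE lets the design (~ condition) be used when estimating
# the dispersion trend.
rld <- rlog(dds, blind = FALSE)

# PCA coordinates computed from the rlog-transformed values.
pca <- plotPCA(rld, intgroup = "condition", returnData = TRUE)
```

On real data, dds would be the object built from the (batch-adjusted) counts, and intgroup would name the relevant columns of colData(dds).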

For DE analysis, in this particular case of perfect confounding of size factor with condition, I would recommend removing features unless they reach a minimum count in (nearly) all samples, e.g. keeping only rows where rowSums(counts(dds) >= 10) >= 40, i.e. a count of at least 10 in at least 40 samples. We do not usually filter at such a high level, but with this confounding you won't be able to tell the difference between a zero caused by lower expression and a zero caused by lower sequencing depth.
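To make the filtering rule concrete, here is a small base-R sketch on a toy count matrix. The counts and the thresholds min_count and min_samples are made up for illustration; with a real DESeqDataSet, the matrix would be counts(dds) and the subset would be dds[keep, ].

```r
# Toy count matrix: 4 genes (rows) x 8 samples (columns); values are made up.
cts <- matrix(
  c( 50,  60, 55,  70, 12, 15, 11, 18,   # gene1: >= 10 in all 8 samples
     30,  25, 40,  35,  0,  0,  1,  0,   # gene2: >= 10 in only 4 samples
    100, 120, 90, 110, 25, 30,  9, 28,   # gene3: >= 10 in 7 samples
      5,   8,  3,   6,  0,  1,  0,  2),  # gene4: never reaches 10
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:8))
)

min_count   <- 10  # hypothetical minimum count per sample
min_samples <- 7   # hypothetical "nearly all" of the 8 samples

# For each gene, count the samples meeting the minimum count, then keep
# genes that meet it in at least min_samples samples.
keep <- rowSums(cts >= min_count) >= min_samples
filtered <- cts[keep, , drop = FALSE]

print(rownames(filtered))
```

Here gene2 is dropped even though it is well expressed in half of the samples, which is the intended behaviour: its zeros in the other half cannot be attributed to either expression or depth.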


Ok great, thank you so much for your input Michael!

