Question

batch effect and size factor differences: vst vs. rlog

Hannah • 0

I have gene expression data from some symbiont critters in response to temperature stress, where we replicated the temperature treatments across two types of experiments (in culture and in symbiosis). Both experiments were sequenced using TagSeq, but at different facilities and at different times. Because the experiments were distinct and were sequenced at different facilities, the size factors differ greatly between the two experiments. See here:

Figure: The size factors are split by experiment type

My current course of action for analyzing this data is as follows: raw counts -> batch correction for experiment type using ComBat-seq to get modified counts -> differential expression using DESeq2 to get dds -> rlog transformation -> PCAs.
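A minimal sketch of this workflow on toy data, assuming the sva package supplies ComBat_seq() and that temperature is the variable of interest. The simulated counts, the column names experiment and temperature, and the design formula are hypothetical stand-ins, not my actual data or script:

```r
library(sva)     # provides ComBat_seq()
library(DESeq2)

set.seed(1)

# Hypothetical sample table: two experiments crossed with two temperatures
coldata <- data.frame(
  experiment  = factor(rep(c("culture", "symbiosis"), each = 6)),
  temperature = factor(rep(c("control", "heat"), times = 6))
)

# Toy raw counts: 500 genes x 12 samples, one mean per gene (recycled down columns)
raw <- matrix(
  rnbinom(500 * 12, mu = 2^runif(500, 4, 10), size = 5),
  nrow = 500,
  dimnames = list(paste0("gene", 1:500), paste0("s", 1:12))
)

# Batch-correct for experiment type while protecting the temperature effect
adj <- ComBat_seq(raw, batch = coldata$experiment, group = coldata$temperature)
mode(adj) <- "integer"  # DESeq2 expects integer counts

# Differential expression on the adjusted counts
dds <- DESeqDataSetFromMatrix(adj, colData = coldata, design = ~ temperature)
dds <- DESeq(dds)

# rlog transformation, then PCA coloured by both factors
rld <- rlog(dds, blind = FALSE)
plotPCA(rld, intgroup = c("experiment", "temperature"))
```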

My question is whether an rlog transformation is appropriate here, specifically considering the different size factors across samples. Additionally, should I be using a blind = FALSE argument in the rlog transformation to account for these size factor differences, since they are associated with the experimental design?

You can see here that, regardless of whether I use vst or rlog, or plotPCA or prcomp, we get separation by experiment type and by temperature treatment. However, I want to make sure that I am making the comparison in the most appropriate way, since these data come from two separate experiments.

Figure: Different normalization methods (vst vs. rlog) produce different results

DESeq2 • 949 views

score 2 · Accepted Answer · 2021-07-23

I usually recommend VST for its speed and simplicity, but here rlog may be a better choice because it more directly accounts for size factor differences. I don't think blind=FALSE is a problem; it only affects the estimation of the dispersion trend.
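One way to check how the two transformations behave when sequencing depth is confounded with the design is to compare them on simulated data. This is only a sketch: the data come from makeExampleDESeqDataSet(), and the size factors of 0.2 and 2 are made-up values chosen to mimic a shallow and a deep batch.

```r
library(DESeq2)

set.seed(1)

# Simulated data with no true condition effect (betaSD defaults to 0).
# Samples 1-4 are condition A and shallow, samples 5-8 are condition B and deep,
# so depth is perfectly confounded with condition (hypothetical values).
dds <- makeExampleDESeqDataSet(
  n = 3000, m = 8,
  sizeFactors = rep(c(0.2, 2), each = 4)
)

dds <- estimateSizeFactors(dds)
print(round(sizeFactors(dds), 2))  # estimated size factors split into two groups

# blind = FALSE uses the design when estimating the dispersion trend
rld <- rlog(dds, blind = FALSE)
vsd <- vst(dds, blind = FALSE)

# PCA coordinates for each transformation; any separation by condition here
# would reflect depth rather than biology, since no true effect was simulated
pca_rlog <- plotPCA(rld, intgroup = "condition", returnData = TRUE)
pca_vst  <- plotPCA(vsd, intgroup = "condition", returnData = TRUE)
```

Plotting PC1 against PC2 from pca_rlog and pca_vst side by side shows how much of the depth difference each transformation leaves in the data.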

For DE analysis, in this particular case of perfect confounding of size factor with condition, I would recommend removing features unless they have a minimum count in (nearly) all samples, e.g. rowSums(counts(dds) >= 10) >= 40. We do not usually filter at such a high level, but when size factor and condition are perfectly confounded, you won't be able to tell a 0 caused by lower expression from a 0 caused by lower sequencing depth.
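A short sketch of that filter, again on simulated data. The 48 samples and the size factors are hypothetical; the thresholds of 10 counts and 40 samples are the ones from the example above and should be adapted to the actual number of samples.

```r
library(DESeq2)

set.seed(1)

# Hypothetical dataset: 48 samples, half shallow and half deep
dds <- makeExampleDESeqDataSet(
  n = 2000, m = 48,
  sizeFactors = rep(c(0.2, 2), each = 24)
)
n_before <- nrow(dds)

# Keep genes with a count of at least 10 in at least 40 samples
keep <- rowSums(counts(dds) >= 10) >= 40
dds  <- dds[keep, ]
n_after <- nrow(dds)

message(n_after, " of ", n_before, " genes pass the filter")
```

Because 40 exceeds the number of deep samples (24), a gene can only pass if it is also detected in many of the shallow samples.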