Question

Normalization by variance stabilizing transformation (VST)


kristoffersandas • 0

@6c372dab

Last seen 3.2 years ago

Sweden

Hello!

I am a bit confused about the normalization that DESeq2's varianceStabilizingTransformation() and vst() functions perform in addition to the variance stabilization itself. My understanding is that dividing by size factors (which are calculated automatically?) corrects for both library size and library composition. However, the reference manual specifically states that it corrects for library size and says nothing about library composition. Is there something I'm missing here? I plan to use the variance-stabilized data for PCA and heatmap plotting.

Finally, am I correct in assuming that the design parameter only affects the variance stabilization in the vst() function, not the additional normalization? That is, the subsetting and stabilization happen first, and then the data are normalized as in varianceStabilizingTransformation()?

Thanks! Kris

library composition vst DESeq2 Normalization • 5.2k views

ADD COMMENT • link 4.0 years ago kristoffersandas • 0

score 0 · Answer 1 · 2021-11-29


Michael Love 43k

@mikelove

Last seen 9 days ago

United States

Can you say what you mean by library composition? Can you give an example of what you want to control for?

The design is used only for estimating the parameters of the transformation (it is needed to assess the amount of within-group dispersion). Afterwards, the same transformation is applied to all the samples, so the sample grouping plays no role in applying the transformation. Sample group information is also not used in the size factor calculation.

ADD COMMENT • link 4.0 years ago Michael Love 43k


By library composition I mean correcting for genes that have vastly different expression in one sample compared to the others, or that are expressed only in certain samples. My understanding is this: suppose a specific gene X is much more highly expressed in sample A than in sample B, and you correct only for sequencing depth, for example by calculating CPM. The remaining genes in sample A will then appear to be much less expressed than in sample B, where gene X has not taken up such a big chunk of the counts, whereas in reality the only DE gene might be gene X.

I believe this is what the median-of-ratios method does, but I might be wrong.
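
To make the concern concrete, here is a minimal sketch in plain Python with made-up counts. The sample names, gene names, and numbers are all hypothetical and chosen only for illustration: both samples have the same total count, gene X takes 80% of the reads in sample A, and the other two genes are assumed to have the same true expression in both samples.

```python
def cpm(sample):
    """Counts per million: scale each count by the sample's total count."""
    total = sum(sample.values())
    return {gene: count * 1_000_000 / total for gene, count in sample.items()}

# Hypothetical counts; both samples sum to 10,000 reads.
# Assumption: geneA and geneB have the same true expression in A and B,
# and only geneX is truly differentially expressed.
sample_a = {"geneX": 8000, "geneA": 1000, "geneB": 1000}
sample_b = {"geneX": 0, "geneA": 5000, "geneB": 5000}

cpm_a = cpm(sample_a)
cpm_b = cpm(sample_b)

# geneA looks several-fold lower in sample A purely because geneX
# consumed most of that sample's reads.
apparent_fold_change = cpm_b["geneA"] / cpm_a["geneA"]
print("CPM of geneA in A:", cpm_a["geneA"])
print("CPM of geneA in B:", cpm_b["geneA"])
print("Apparent fold change (B over A):", apparent_fold_change)
```

Under these assumptions, scaling by the total count alone makes geneA and geneB look differentially expressed even though only geneX changed.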

ADD REPLY • link 4.0 years ago kristoffersandas • 0


DESeq2 (and DESeq, and other methods on Bioconductor) uses a robust estimator for scaling counts that is not affected by these situations. To compute the size factor for a sample, we use the median feature (in terms of its LFC to a reference), rather than just the ratio of the sample's total count to the total count of a reference.
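
The idea of taking the median log ratio to a reference can be sketched in plain Python as follows. This is an illustrative re-implementation under stated assumptions, not DESeq2's actual code: the reference is assumed to be the per-gene geometric mean across samples, genes with a zero count in any sample are assumed to be left out of the calculation, and the function name `median_of_ratios_size_factors` and the counts are hypothetical.

```python
import math
from statistics import median

def median_of_ratios_size_factors(counts):
    """Size factor per sample: exp(median log ratio to a per-gene reference).

    counts maps a sample name to a list of counts, genes in the same order.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Reference: log geometric mean of each gene across samples.
    # Genes with a zero in any sample are skipped (assumption of this sketch).
    log_reference = {}
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_reference[g] = sum(math.log(v) for v in values) / len(values)

    factors = {}
    for s in samples:
        log_ratios = [math.log(counts[s][g]) - ref for g, ref in log_reference.items()]
        factors[s] = math.exp(median(log_ratios))
    return factors

# Hypothetical counts, gene order: geneX, g1, g2, g3, g4.
# Sample B has twice the count of sample A for every gene except geneX,
# which is hugely overrepresented in sample A.
counts = {
    "A": [8000, 500, 400, 600, 500],
    "B": [20, 1000, 800, 1200, 1000],
}

factors = median_of_ratios_size_factors(counts)
total_ratio = sum(counts["B"]) / sum(counts["A"])

print("Size factors:", factors)
print("Size factor ratio B/A:", factors["B"] / factors["A"])
print("Total count ratio B/A:", total_ratio)
```

In this toy data the total-count ratio suggests that sample B is the smaller library, because geneX inflates the total of sample A, while the median-based size factors recover the two-fold difference shared by the other four genes.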