Hello!
I am a bit confused about the normalization that the DESeq2 varianceStabilizingTransformation() and vst() functions perform in addition to the actual variance stabilization. My understanding is that dividing by size factors (which are calculated automatically?) corrects for both library size and library composition. But the reference manual specifically states that it corrects for library size and says nothing about library composition. Is there something I'm missing here? I am using the variance-stabilized data for PCA and heatmap plotting.
Finally, am I correct in assuming that the design parameter only affects the variance stabilization in the vst() function, not the additional normalization? That is, do the subsetting and stabilization happen first, with the data then normalized as in varianceStabilizingTransformation()?
Thanks! Kris
By library composition I mean correcting for genes that have vastly different expression in one sample compared to the others, or that are expressed in some samples but not in others. My understanding is this: suppose a specific gene X is much more highly expressed in sample A than in sample B, and you correct only for sequencing depth, for example by calculating cpm. The remaining genes in sample A will then appear to be much less expressed than in sample B, where gene X has not taken up such a big chunk of the counts, whereas in reality the only DE gene might be gene X.
I believe this is what the median-of-ratios method corrects for, but I might be wrong.
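To make the concern concrete, here is a minimal Python sketch of the cpm scenario described above. The counts and gene names are made up for illustration (hypothetical toy data, not from any real experiment): three genes have identical raw counts in both samples, and only geneX differs.

```python
# Hypothetical toy counts: g1..g3 are truly unchanged between samples,
# geneX is much more highly expressed in sample A than in sample B.
counts = {
    "A": {"g1": 100, "g2": 200, "g3": 300, "geneX": 9400},
    "B": {"g1": 100, "g2": 200, "g3": 300, "geneX": 400},
}


def cpm(sample_counts):
    """Counts per million: scale every gene by the sample's total count."""
    total = sum(sample_counts.values())
    return {gene: c * 1e6 / total for gene, c in sample_counts.items()}


cpm_a = cpm(counts["A"])
cpm_b = cpm(counts["B"])

# The unchanged genes now look lower in A, purely because geneX
# takes up most of sample A's total count.
for gene in counts["A"]:
    print(f"{gene}: cpm A = {cpm_a[gene]:.0f}, cpm B = {cpm_b[gene]:.0f}")
```

With these numbers sample A has ten times the total count of sample B, so g1, g2 and g3 come out ten times lower in A after cpm scaling even though their raw counts are identical.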
DESeq2 (and DESeq, and other methods on Bioconductor) uses a robust estimator for scaling counts that won't be affected by these situations. We use the median feature (in terms of its log fold change, LFC, to a reference) to compute size factors for a sample, rather than just using the ratio of the total count to the total count of a reference.
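As a rough illustration of the median-based estimator described here, the following Python sketch computes size factors from a small table of made-up counts. It is a simplified sketch of the idea, not DESeq2's actual code; the function name, the toy data, and the choice of the per-gene geometric mean across samples as the reference are assumptions for this example.

```python
import math
from statistics import median

# Hypothetical toy counts: g1..g3 are unchanged, geneX is far higher in A.
counts = {
    "A": {"g1": 100, "g2": 200, "g3": 300, "geneX": 9400},
    "B": {"g1": 100, "g2": 200, "g3": 300, "geneX": 400},
}


def median_of_ratios_size_factors(counts):
    """Size factor per sample = exp(median log ratio to a per-gene reference)."""
    samples = list(counts)
    genes = list(counts[samples[0]])

    # Reference per gene: mean of log counts across samples (the log of the
    # geometric mean). Genes with a zero count in any sample are skipped.
    log_ref = {}
    for gene in genes:
        values = [counts[s][gene] for s in samples]
        if all(v > 0 for v in values):
            log_ref[gene] = sum(math.log(v) for v in values) / len(values)

    # Taking the median over genes means a few extreme genes cannot
    # move the size factor.
    return {
        s: math.exp(median(math.log(counts[s][g]) - log_ref[g] for g in log_ref))
        for s in samples
    }


size_factors = median_of_ratios_size_factors(counts)
total_ratio = sum(counts["A"].values()) / sum(counts["B"].values())

print("median-of-ratios size factors:", size_factors)
print("total-count ratio A/B:", total_ratio)
```

In this toy case both size factors come out as 1, because the median gene is unchanged between the samples, while the ratio of total counts is 10 and is driven entirely by geneX.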
Ok, that settles it. Thanks a lot!