An unexplained phenomenon using variance stabilizing transformation for downstream analysis
@lirongrossmann-13938
Last seen 4.2 years ago

Hi Everyone,

I am using the variance stabilizing transformation (vsd from now on) for normalization, in order to perform downstream analysis on a raw count expression matrix. To be specific, I have two groups (say group A and group B) that I want to separate based on the expression levels of certain genes. I found several genes that separate the two groups (using DESeq2) and want to test my hypothesis on an independent set of samples. When I apply vsd to the ENTIRE test set (group A + group B), these genes separate the two groups with a certain accuracy. When I apply vsd to each group of the test set separately (i.e. vsd on group A and vsd on group B), the two groups are separated even better based on these genes.

Why do I get different results when I run vsd on A+B than when I run vsd on A and vsd on B separately? I assume vsd takes the interaction between the samples into account, so is there a way to eliminate that? Should I use a different normalization method, and if so, which one is recommended?

Thanks!

variancestabilizingtransformation normalization
@wolfgang-huber-3550
Last seen 3 months ago
EMBL European Molecular Biology Laboratory

The variance stabilizing transformation in DESeq2 is not a normalization method, it is (as the name says) a transformation. For normalization, it uses the usual DESeq2 estimateSizeFactors normalization.

It is normal and expected that the results differ when you call estimateSizeFactors on the complete matrix versus on the A and B subsets separately. The latter is wrong, since it defeats the purpose of normalization, so you can ignore the results from that analysis.
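For illustration, a minimal sketch of the two approaches, assuming a DESeqDataSet dds that contains all samples and a hypothetical group column in colData (the names are illustrative, not taken from your setup):

library(DESeq2)

# size factors estimated once, on the complete matrix, then the transformation
dds <- estimateSizeFactors(dds)
vsd <- varianceStabilizingTransformation(dds)

# normalizing and transforming each subset on its own gives each group its own
# size factors and dispersion fit, so the resulting values are not comparable
# across groups -- this is the problematic analysis
vsdA <- varianceStabilizingTransformation(estimateSizeFactors(dds[, dds$group == "A"]))
vsdB <- varianceStabilizingTransformation(estimateSizeFactors(dds[, dds$group == "B"]))

Only the first version gives values that are comparable across the two groups.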

As always, posting reproducible code examples and session_info would help.

 

@lirongrossmann-13938
Last seen 4.2 years ago

Thank you very much for the clarification!!

My biggest issue is which values to use to train the model. Should I use the normalized counts, or the transformed values (which, if I understand correctly, are also normalized)? I used "mat" to train my model on the "Response" classification (see code below).

Here is my code:


library(DESeq2)

# read the raw count matrix (genes x samples) and the sample annotation
ep <- read.table("exp.train.txt", header = TRUE, row.names = 1)
cp <- read.csv("train.csv")

# build the DESeqDataSet with Response as the design variable
dds <- DESeqDataSetFromMatrix(countData = ep, colData = cp, design = ~ Response)

# drop genes with essentially no counts, normalize, and transform
dds <- dds[rowSums(counts(dds)) > 1, ]
dds <- estimateSizeFactors(dds)
vsd <- varianceStabilizingTransformation(dds)
mat <- assay(vsd)
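
As a minimal sketch, here is how the two kinds of values can be extracted from the objects above, and one possible way to reuse the training-set transformation on independent samples; ddsTest is hypothetical and stands for a DESeqDataSet built from the test-set counts in the same way as dds:

# the normalized counts (counts divided by size factors), if needed
norm_counts <- counts(dds, normalized = TRUE)

# ddsTest is hypothetical: a DESeqDataSet built from the test-set counts.
# Freezing the dispersion trend learned on the training data keeps the test
# samples on the same transformed scale as the training samples.
dds <- estimateDispersions(dds)
ddsTest <- estimateSizeFactors(ddsTest)
dispersionFunction(ddsTest) <- dispersionFunction(dds)
matTest <- assay(varianceStabilizingTransformation(ddsTest, blind = FALSE))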

