an unexplained phenomenon using variance stabilizing transformation for downstream analysis
@lirongrossmann-13938
Last seen 4.2 years ago

Hi Everyone,

I am using the variance stabilizing transformation (vsd from now on) for normalization in order to perform downstream analysis on a raw count expression matrix. To be specific, I have two groups (say group A and group B) that I want to separate based on the expression levels of certain genes. I found several genes that separate the two groups (using DESeq2) and want to test my hypothesis using an independent set of samples. When I apply vsd to the ENTIRE test set (group A + group B), the genes separate the two groups with a certain accuracy. When I apply vsd to each group of the test set separately (i.e. vsd on group A and vsd on group B), the two groups are separated even better based on these genes.

Why do I get different results when I run vsd on A+B than when I run vsd on A and vsd on B separately? I assume vsd takes the interaction between the samples into account, so is there a way to eliminate it? Should I use a different normalization method? If so, which one is recommended?

Thanks!

variancestabilizingtransformation normalization • 1.8k views
@wolfgang-huber-3550
Last seen 3 months ago
EMBL European Molecular Biology Laborat…

The variance stabilizing transformation in DESeq2 is not a normalization method; it is (as the name says) a transformation. For normalization, it uses the usual DESeq2 estimateSizeFactors normalization.

It is normal and expected that the results differ if you call estimateSizeFactors on the complete matrix, versus if you call it on the A and B subsets separately. The latter is wrong since it defeats the purpose of normalization. So you can ignore the result from that analysis.
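
To illustrate why the two calls give different size factors, here is a minimal base-R sketch of the median-of-ratios idea. It is a simplified stand-in, not DESeq2's actual implementation: the toy `counts` matrix is made up, the helper `size_factors` is a hypothetical name, and it assumes there are no zero counts.

```r
# Toy counts (made-up numbers): 4 genes x 4 samples, two samples per group.
counts <- matrix(
  c(10,  20, 100, 200,
    50, 100,  40,  80,
    30,  60,  30,  60,
     5,  10,  80, 160),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), c("A1", "A2", "B1", "B2"))
)

# Hypothetical helper: simplified median-of-ratios size factors.
# Each sample is compared against a per-gene reference (the geometric mean
# across whatever samples are in m), so the reference depends on the sample set.
size_factors <- function(m) {
  log_geo_mean <- rowMeans(log(m))
  apply(m, 2, function(col) exp(median(log(col) - log_geo_mean)))
}

sf_all <- size_factors(counts)                  # reference built from A and B
sf_A   <- size_factors(counts[, c("A1", "A2")]) # reference built from A only
sf_B   <- size_factors(counts[, c("B1", "B2")]) # reference built from B only

print(rbind(together = sf_all, separate = c(sf_A, sf_B)))
```

Because the per-gene reference changes with the set of samples, the same sample receives a different size factor in the two settings, and values scaled against separate references are no longer comparable between A and B.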

As always, posting reproducible code examples and session_info would help.

 

@lirongrossmann-13938
Last seen 4.2 years ago

Thank you very much for the clarification!!

My biggest issue is which values to use to train the model. Should I use the normalized values, or should I use the transformed values (which, if I understand correctly, are also normalized)? I used "mat" to train my model based on the "Response classification" (see code below).

Here is my code:


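library(DESeq2)

# Raw counts (genes x samples) and the sample annotation table
ep <- read.table("exp.train.txt", header = TRUE, row.names = 1)
cp <- read.csv("train.csv")

dds <- DESeqDataSetFromMatrix(countData = ep, colData = cp, design = ~ Response)

# Drop genes with at most one read in total across all samples
dds <- dds[rowSums(counts(dds)) > 1, ]

dds <- estimateSizeFactors(dds)
vsd <- varianceStabilizingTransformation(dds)

# Matrix of transformed values used to train the model
mat <- assay(vsd)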
