Question: an unexplained phenomenon using variance stabilizing transformation for downstream analysis
11 months ago, lirongrossmann wrote:

Hi Everyone,

I am using the variance stabilizing transformation (vsd from now on) for normalization in order to perform downstream analysis on a raw count expression matrix. To be specific, I have two groups (say group A and group B) that I want to separate based on the expression levels of certain genes. I found several genes that separate the two groups (using DESeq2) and want to test my hypothesis on an independent set of samples. When I apply vsd to the ENTIRE test set (group A + group B), these genes separate the two groups with a certain accuracy. When I apply vsd to each group of the test set separately (i.e. vsd on group A and vsd on group B), the two groups are separated even better based on these genes.

Why do I get different results when I run vsd on A+B than when I run vsd on A and on B separately? I assume vsd takes the interaction between the samples into account, so is there a way to eliminate it? Should I use a different normalization method? If so, which one is recommended?


11 months ago, Wolfgang Huber (EMBL European Molecular Biology Laboratory) wrote:

The variance stabilizing transformation in DESeq2 is not a normalization method; it is (as the name says) a transformation. For normalization, it relies on the usual DESeq2 size factors, as computed by estimateSizeFactors.
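To make "variance stabilizing" concrete, here is a minimal base R sketch. It uses simulated Poisson counts and a plain square root, which is the textbook stabilizer for Poisson data; this is an illustrative stand-in only, since DESeq2 derives its transformation from a fitted dispersion-mean trend for negative binomial counts. The means and sample size are arbitrary choices.

```r
set.seed(1)

# Three hypothetical "genes" with very different mean expression
means <- c(10, 100, 1000)

# 10,000 Poisson draws per mean; one column per mean
x <- sapply(means, function(m) rpois(10000, lambda = m))

# On the raw scale the variance grows roughly in proportion to the mean
var_raw <- apply(x, 2, var)

# After the square root the variance is roughly constant (close to 1/4)
var_sqrt <- apply(sqrt(x), 2, var)

print(round(var_raw, 1))
print(round(var_sqrt, 3))
```

Note that nothing in this step corrects for sequencing depth; that is the separate job of the size factors.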

It is normal and expected that the results differ if you call estimateSizeFactors on the complete matrix, versus if you call it on the A and B subsets separately. The latter is wrong since it defeats the purpose of normalization, which is to put all samples on a common scale so that they can be compared. So you can ignore the result from that analysis.
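A small base R sketch of why this matters. It assumes a simplified median-of-ratios estimator (the idea behind estimateSizeFactors, here without any handling of zero counts) and a made-up toy matrix in which group B differs from group A only by sequencing depth. It shows one possible way separate normalization can create apparent group separation; it is not a diagnosis of the data in the question.

```r
# Simplified median-of-ratios size factors.
# Assumption: every gene has a non-zero count in every sample.
size_factors <- function(counts) {
  log_geo_means <- rowMeans(log(counts))  # per-gene log geometric mean
  apply(counts, 2, function(col) exp(median(log(col) - log_geo_means)))
}

# Toy data: 4 genes, 2 samples per group. The B samples were sequenced
# twice as deeply; there is no biological difference between the groups.
base <- c(g1 = 10, g2 = 50, g3 = 100, g4 = 400)
counts <- cbind(A1 = base, A2 = base, B1 = 2 * base, B2 = 2 * base)

# Size factors from the complete matrix vs. from each group on its own
sf_joint <- size_factors(counts)
sf_split <- c(size_factors(counts[, c("A1", "A2")]),
              size_factors(counts[, c("B1", "B2")]))

norm_joint <- sweep(counts, 2, sf_joint, "/")
norm_split <- sweep(counts, 2, sf_split, "/")

print(norm_joint)  # A and B columns agree: the depth difference is removed
print(norm_split)  # B is still twice A: a purely technical "group effect"
```

With joint size factors the two groups become comparable. With per-group size factors each group is only scaled relative to itself, so the depth difference between A and B survives and can look like better separation.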

As always, posting reproducible code examples and session_info would help.


11 months ago, lirongrossmann wrote:

Thank you very much for the clarification!!

My biggest question is which values to use to train the model. Should I use the normalized values, or should I use the transformed values (which, if I understand correctly, are also normalized)? I used "mat" to train my model based on the "Response classification" (see code below).

Here is my code:

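library(DESeq2)

# Raw count matrix: genes in rows, samples in columns
ep <- read.table("exp.train.txt", header = TRUE, row.names = 1)

# cp is the sample table (one row per sample, with a Response column)
dds <- DESeqDataSetFromMatrix(countData = ep, colData = cp, design = ~ Response)

# Drop genes with at most one read in total across all samples
dds <- dds[rowSums(counts(dds)) > 1, ]

dds <- estimateSizeFactors(dds)

vsd <- varianceStabilizingTransformation(dds)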
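
For reference, a minimal sketch of how the two kinds of values mentioned above can be extracted from DESeq2 objects. It runs on simulated counts from makeExampleDESeqDataSet rather than the data in this post, and it assumes that "mat" corresponds to assay(vsd), which the posted code does not confirm.

```r
library(DESeq2)

set.seed(1)
# Simulated counts standing in for a real training set (1000 genes, 12 samples)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)
dds <- estimateSizeFactors(dds)

# Normalized values: raw counts divided by each sample's size factor
norm_counts <- counts(dds, normalized = TRUE)

# Transformed values: size-factor normalized and variance-stabilized
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)
mat <- assay(vsd)  # plain numeric matrix, genes x samples

dim(norm_counts)
dim(mat)
```

Both matrices have the same shape. The transformed values already incorporate the size factors, so they are normalized as well.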

