choosing the normalization method (rlog, variance stabilizing transformation)
@lirongrossmann-13938

Hi everyone,

I was hoping to get an answer on an issue I have been struggling with for a while.

I have raw count data from an RNA-seq experiment and want to develop a model for separating two groups of samples. I used DESeq2 to select my top genes, then trained and tested the model on the dataset after applying the variance stabilizing transformation (vsd).

To make sure my model is robust, I also applied rlog and other normalization methods (TPM, RPKM) to the raw count matrix, keeping the same set of selected genes.

My problem: the performance of my model (its accuracy) changes depending on the normalization method, even between rlog and vsd. Of note, just by looking at the normalized matrices, I can see substantial differences in the values produced by the different methods. For example, for one of the selected genes, the normalized value for one sample is 4.328 using vsd and 0.02 using RPKM. I am not sure I fully understand where this big difference is coming from.
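To illustrate why the same count can yield such different numbers, here is a minimal Python sketch. It assumes hypothetical values for the count, gene length, and library size, and it uses a plain `log2(count + 1)` as a stand-in for a log2-scale transform. It is not DESeq2's `vst` or `rlog`, which additionally model the mean-variance trend; it only shows that RPKM lives on a linear, length- and depth-scaled axis while vsd and rlog values are on a log2-like axis.

```python
import math


def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)


def log2_normalized(count, size_factor=1.0, pseudocount=1.0):
    """Plain log2 of a size-factor-scaled count plus a pseudocount.

    Stand-in only: this is NOT DESeq2's vst or rlog.
    """
    return math.log2(count / size_factor + pseudocount)


# Hypothetical numbers, chosen only for illustration.
count = 20                       # raw reads assigned to the gene
gene_length_bp = 5_000           # assumed gene length in base pairs
total_mapped_reads = 20_000_000  # assumed library size

rpkm_value = rpkm(count, gene_length_bp, total_mapped_reads)
log2_value = log2_normalized(count)

print(f"RPKM:       {rpkm_value:.3f}")
print(f"log2 scale: {log2_value:.3f}")
```

With these assumed inputs the RPKM value is well below 1 while the log2-scale value is above 4, so the two matrices are not expected to agree numerically even though they come from the same raw count.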

Has anyone encountered a similar situation? Any help would be appreciated.

Thanks!

@mikelove

The variance stabilizing transformations are very different from TPM and RPKM. The latter normalizations allow values to be compared across genes, because they are proportional to the original counts of transcripts. However, you will see that they are not variance stabilizing: distances between samples will be heavily weighted by contributions from the genes with the highest TPM. In the DESeq and DESeq2 papers we recommend using variance stabilization when comparing samples, e.g. with a distance metric, because it takes into account the precision of the measurements and reduces the contribution of noise from genes with low counts.
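The point about distances can be sketched in a few lines of Python. The two expression profiles below are made-up numbers, and `log2(x + 1)` is used only as a rough stand-in for a log-like scale. It is not DESeq2's `vst` or `rlog`, and unlike those it does not damp noise from low-count genes. The sketch only shows how much each gene contributes to a squared Euclidean distance on a linear scale versus a log-like scale.

```python
import math


def squared_diff_shares(a, b):
    """Fraction of the squared Euclidean distance contributed by each gene."""
    squared = [(x - y) ** 2 for x, y in zip(a, b)]
    total = sum(squared)
    return [s / total for s in squared]


def log2p1(values):
    """log2(x + 1) per gene: a stand-in for a log-like scale, not vst/rlog."""
    return [math.log2(v + 1) for v in values]


# Hypothetical TPM-like values for four genes in two samples.
sample_a = [10_000, 50, 20, 5]
sample_b = [12_000, 25, 40, 10]

linear_shares = squared_diff_shares(sample_a, sample_b)
log_shares = squared_diff_shares(log2p1(sample_a), log2p1(sample_b))

for i, (lin, lg) in enumerate(zip(linear_shares, log_shares), start=1):
    print(f"gene {i}: linear share = {lin:.4f}, log-scale share = {lg:.4f}")
```

On the linear scale almost all of the distance comes from the single most highly expressed gene, whereas on the log-like scale the other genes contribute most of it. A classifier or clustering method that relies on such distances can therefore behave differently depending on the scale of the input matrix.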


Dear Michael, so in a survival analysis with Kaplan-Meier curves stratifying patients by high/low expression of a gene, which unit of measurement in your opinion would be the most appropriate to use? Thank you! Ale
