Question

choosing the normalization method (rlog, variance stabilizing transformation)

lirongrossmann ▴ 80

@lirongrossmann-13938

Hi everyone,

I was hoping to get an answer on an issue I have been struggling with for a while.

I have raw count data from an RNA-seq experiment and want to develop a model for separating two groups of samples. I used DESeq2 to select my top genes, then trained and tested the model on the dataset using the variance stabilizing transformation.

To make sure my model is robust, I tried to use rlog and other normalization methods (TPM, RPKM) on the raw count matrix with the same set of selected genes.

My problem: the performance of my model (its accuracy) differs depending on the normalization method, even between rlog and vsd. Of note, just by looking at the values of the normalized matrix, I can see a substantial difference in the normalized values between the methods. For example, for one of the selected genes the normalized value for one sample is 4.328 using vsd and 0.02 using RPKM. I am not sure I fully understand where this big difference is coming from.
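To make the scale question concrete, here is a minimal sketch of why the two kinds of values are not directly comparable. RPKM is a linear-scale quantity divided by gene length and library depth, whereas vsd and rlog values sit on a log2-like scale. The numbers below are hypothetical (they are not taken from the data in this post), and `log2_pseudocount` is only a simple stand-in for a log-scale transform, not DESeq2's vst or rlog.

```python
import math


def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads (linear scale)."""
    length_kb = gene_length_bp / 1e3
    depth_millions = total_mapped_reads / 1e6
    return count / (length_kb * depth_millions)


def log2_pseudocount(count, pseudocount=1.0):
    """Plain log2(count + pseudocount); a stand-in only, NOT DESeq2's vst/rlog."""
    return math.log2(count + pseudocount)


# Hypothetical gene: 2 reads, 5 kb long, library of 20 million mapped reads.
count, length_bp, depth = 2, 5_000, 20_000_000

rpkm_value = rpkm(count, length_bp, depth)   # 2 / (5 kb * 20 M) = 0.02
log_value = log2_pseudocount(count)          # log2(3), roughly 1.58

print(f"RPKM: {rpkm_value:.3f}")
print(f"log2(count + 1): {log_value:.3f}")
```

The same low-count gene gets a small fraction on the RPKM scale and a value above 1 on a log2-like scale, so a difference in magnitude between the two matrices is expected rather than a sign of an error.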

Has anyone encountered a similar situation? Any help would be appreciated.

Thanks!

variancestabilizingtransformation rlog transformation deseq2 • 6.5k views

ADD COMMENT • link updated 2.2 years ago by Alessandro Silvestris • 0 • written 8.1 years ago by lirongrossmann ▴ 80

score 1 · Answer 1 · 2017-12-12

Michael Love 43k

@mikelove

United States

The variance stabilizing transformations are very different from TPM and RPKM. These latter normalizations allow for comparison of values across genes, because they are proportional to the original counts of transcripts. However, you will see that they are not variance stabilizing: distances between samples will be heavily weighted by contributions from the genes with the highest TPM. In the DESeq and DESeq2 papers we recommend using variance stabilization when comparing samples, e.g. using a distance metric, as it takes into account the precision of the measurements and reduces contributions of noise from genes with low counts.
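As a rough illustration of the point about distances, the sketch below compares how much each gene contributes to the squared Euclidean distance between two samples on a linear (TPM-like) scale and on a log2 scale. The abundances are hypothetical, and `log2p1` is a plain log2(x + 1), used here only to show the effect of the scale. It is not DESeq2's vst or rlog and does not reproduce their handling of low-count noise.

```python
import math


def squared_contributions(a, b):
    """Fraction of the squared Euclidean distance contributed by each gene."""
    squared = [(x - y) ** 2 for x, y in zip(a, b)]
    total = sum(squared)
    return [s / total for s in squared]


def log2p1(values):
    """Plain log2(x + 1); a stand-in for a log-scale transform, NOT vst/rlog."""
    return [math.log2(v + 1.0) for v in values]


# Hypothetical TPM-like values for three genes in two samples.
# Every gene differs by the same 2-fold change between the samples.
sample_a = [10000.0, 100.0, 1.0]
sample_b = [20000.0, 200.0, 2.0]

linear_share = squared_contributions(sample_a, sample_b)
log_share = squared_contributions(log2p1(sample_a), log2p1(sample_b))

print("linear scale:", [round(s, 4) for s in linear_share])
print("log2 scale:  ", [round(s, 4) for s in log_share])
```

On the linear scale nearly all of the distance comes from the most abundant gene, even though all three genes change by the same fold. On the log scale the contributions are much more even, which is the reason sample-to-sample distances behave so differently between TPM/RPKM and the transformed values.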

ADD COMMENT • link 8.1 years ago Michael Love 43k

Dear Michael, so in a survival analysis with Kaplan-Meier curves stratifying patients by high/low expression of a gene, which unit of measurement in your opinion would be the most appropriate to use? Thank you! Ale

ADD REPLY • link 2.2 years ago Alessandro Silvestris • 0