Question

Why do rlog and vst use normalized counts?

Jon Bråte ▴ 250

@jon-brate-6263

Norway

Hi,

For RNA-seq data I have used DESeq2's rlog-transformed counts for making exploratory plots and for quality assessment of my dataset. But I didn't realize that it uses normalized counts. I am afraid that this will "force" the samples to look more similar, e.g. in a boxplot of the counts for each sample. Am I right about this? And what is the reason for using normalized counts? Is it better to use log-transformed raw counts if one wants to check whether some samples deviate strongly from the others, e.g. in boxplots?

Jon

deseq2 • 2.8k views

ADD COMMENT • link updated 7.4 years ago by Steve Lianoglou ★ 13k • written 7.4 years ago by Jon Bråte ▴ 250

score 0 · Answer 1 · 2016-12-21

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

United States

Presumably you want your exploratory plots to reveal something about the underlying biology of your experiment. In this case you absolutely want to use some form of normalized counts, and (likely almost) never the raw data itself.

To a first approximation, using the normalized counts sidesteps the technical artifact that arises when two different samples are sequenced to different depths, i.e. if we have two replicate libraries from the same condition and one library produced 5 million reads while the other produced 20 million, the raw count of each gene will be roughly 4x higher in library 2 than in library 1.

Is that a situation that you want to identify, or one that you want to control for?
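To make the depth effect concrete, here is a minimal sketch in plain Python (not DESeq2 code) of median-of-ratios size factors, the kind of depth correction that the normalized counts rely on. The toy count table and the helper name `size_factors` are assumptions made up for illustration.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors; counts[g][s] is the raw count of gene g in sample s."""
    n_samples = len(counts[0])
    # Use only genes with a nonzero count in every sample so the log is defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Per-gene reference: the geometric mean across samples, kept on the log scale.
    log_ref = [sum(math.log(c) for c in row) / n_samples for row in usable]
    # Per-sample factor: the median ratio of that sample's counts to the reference.
    return [
        math.exp(median(math.log(row[s]) - ref for row, ref in zip(usable, log_ref)))
        for s in range(n_samples)
    ]

# Hypothetical toy data: library 2 is library 1 sequenced 4x deeper.
raw = [[10, 40], [100, 400], [50, 200], [5, 20]]
sf = size_factors(raw)
normalized = [[c / f for c, f in zip(row, sf)] for row in raw]
```

In this toy case the second size factor is four times the first, and after dividing by the size factors the two replicates carry the same values for every gene, so the remaining differences in a real dataset are the ones not explained by depth.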

ADD COMMENT • link 7.4 years ago Steve Lianoglou ★ 13k

Thanks for your answer.

Now, this might not be an exploratory analysis, but to give a concrete example: I have single-cell transcriptome data and I know that all the samples were sequenced to an approximately equal depth. But I have no idea how PCR bias has influenced the gene expression or what the distribution of reads across genes is. If I make a boxplot of log2-transformed counts I get this, and with rlog the samples look much more uniform. I know that single-cell data is perhaps not the best example, but I am afraid that using rlog conceals the variation. Or am I mistaken about the use of this transformation/normalization?

ADD REPLY • link 7.4 years ago Jon Bråte ▴ 250

The fact that you are explicitly talking about single-cell RNA-seq data is probably worth pointing out in an update to your question :-)

Unfortunately I haven't had the opportunity to really dig into the universe of single-cell RNA-seq. The first concern I'd have with using DESeq2's rlog or vst transforms is that I'm not sure they're well suited to deal with the drop-out effects one observes in single-cell RNA-seq data ... perhaps Mike or Wolfgang can chime in here with their thoughts.

If it were me, I'd first approach the analysis of single cell rnaseq data using Aaron Lun's f1000 workflow, and then deviate from that only when I feel more comfortable with how the scRNA-seq data that I get behaves.

ADD REPLY • link 7.4 years ago Steve Lianoglou ★ 13k

hi,

The rlog won't perform well with data which strongly deviates from negative binomial, which single cell RNA-seq certainly does, because you often have genes which are mostly 0's but then highly expressed in a minority of cells. The rlog will likely overshrink these differences for these genes. It has to do with the construction of the rlog.

The vst() is just a monotonic function applied to normalized counts, so this is safer.

Or you could use normTransform() which is log2(normalized counts + 1), perhaps with a higher pseudocount to help stabilize the variance (see meanSdPlot as in the DESeq2 vignette).
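As a rough illustration of why a higher pseudocount helps, the sketch below applies log2(normalized count + pseudocount) in plain Python. It is not DESeq2's implementation; the helper names `log2_shifted` and `spread` and the two toy genes are assumptions for the example only.

```python
import math

def log2_shifted(values, pseudocount=1.0):
    """log2(value + pseudocount) for each already-normalized count."""
    return [math.log2(v + pseudocount) for v in values]

def spread(values, pseudocount):
    """Range of the transformed values, a crude stand-in for their variability."""
    transformed = log2_shifted(values, pseudocount)
    return max(transformed) - min(transformed)

# Hypothetical normalized counts for one gene in two samples.
low_gene = [1.0, 3.0]         # low counts, where sampling noise dominates
high_gene = [1000.0, 3000.0]  # same 3-fold ratio, at a high expression level

low_pc1, low_pc8 = spread(low_gene, 1.0), spread(low_gene, 8.0)
high_pc1, high_pc8 = spread(high_gene, 1.0), spread(high_gene, 8.0)
```

Raising the pseudocount from 1 to 8 shrinks the gap between the two low counts considerably while leaving the high-count gene almost unchanged, which is the sense in which a larger pseudocount damps the noisy low end.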

ADD REPLY • link 7.4 years ago Michael Love 41k

Re peculiarities of single cell data:

"...you often have genes which are mostly 0's but then highly expressed in a minority of cells": these are likely not really highly expressed genes, but rather, low expressed ones that just happen to have gotten highly (PCR)-amplified in a few cells.
That also means that the term Drop-out is about as misleading as can be. The zeros are likely to be real, the large numbers are the artefact.
The problem goes away with molecule barcodes.

Re normalization in VST, rlog: conceptually, both of these only make sense after correction for technical biases such as sequencing depth; it is unclear how to even define them otherwise. For raw data QC, just use plain old log(x+1) or asinh(x).
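For reference, a minimal plain-Python sketch of the two raw-data QC transforms mentioned above. The count values are made up for illustration; both transforms map zero to zero, and for large counts asinh(x) behaves like log(2x), i.e. log(x) plus a constant.

```python
import math

# Hypothetical raw counts spanning zero to a large value.
raw_counts = [0, 1, 10, 1000]

log1p_values = [math.log1p(x) for x in raw_counts]  # log(x + 1)
asinh_values = [math.asinh(x) for x in raw_counts]  # log(x + sqrt(x^2 + 1))

# For large x, asinh(x) approaches log(2x) = log(x) + log(2).
offset_at_1000 = asinh_values[-1] - math.log(1000)
```

Either transform can be applied directly to the raw count matrix before drawing per-sample boxplots, since neither needs size factors or a fitted model.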

ADD REPLY • link 7.4 years ago Wolfgang Huber ★ 13k