Question

Regularized log transformation: loss of zero values for sparsely expressed genes

longwoodSequencer • 0

I'm using the rlog function in the DESeq2 package and I notice a quirk in the transformed data that I do not know what to make of. For genes that are expressed in a small proportion of samples (say, gene X has non-zero raw counts in 10 of 300 samples), the transformed dataset has no zero values at all; instead, the majority of samples share some other value that is either negative or positive. Negative count doesn't make sense so I could, I suppose, deal with that by zeroing all counts less than 1 in the transformed dataset, but I don't know what to do about the cases where most samples have a positive value, say 3.5, and a small proportion have other, higher values. It's as if the zero level for that gene is shifted to a small positive number. This is seen only with genes expressed in a small proportion of samples, and the amount of shift, positive or negative, varies across genes. I notice the same with the variance-stabilizing transformation, regardless of whether I set blind=FALSE or not.

Have others noticed this in their datasets? If so, how did you deal with it? I don't know how much of an impact this would have on the results of clustering-type exploratory analyses, but I am also not comfortable seeing that a gene that should not be expressed at all in most samples has positive counts for all of them.

deseq2 rlog transformation vst variancestabilizingtransformation rlog • 3.7k views

Written 7.7 years ago by longwoodSequencer; updated 7.7 years ago by Michael Love

Keep in mind that the rlog transformation and VST are both log-like transformations, which means that they can theoretically return any value from -Inf to +Inf, and zero is not a special number in any way.

Reply by Ryan C. Thompson, 7.7 years ago

score 2 · Answer 1 · 2016-08-09

Michael Love 41k

"Negative count doesn't make sense"

First, Ryan is correct that rlog and VST return log2-like values, so negative values are normal, and simply indicate an expected count less than 1. Many samples will have expected counts less than 1 in a very sparse dataset.
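To make the log2 scale concrete, here is a minimal Python sketch. It is not DESeq2 code: the expected counts are made-up values and `log2_scale` is a hypothetical helper. It only shows that on a plain log2 scale a value of 0 corresponds to an expected count of 1, so anything below 1 comes out negative. The rlog and VST do more than take a plain log2, and the sketch does not attempt to reproduce them.

```python
import math

def log2_scale(expected_count):
    """Plain log2 of a positive expected count (illustration only)."""
    return math.log2(expected_count)

# Made-up expected counts for one sparse gene across a few samples.
expected_counts = [0.1, 0.5, 1.0, 4.0]

for mu in expected_counts:
    # Counts below 1 give negative values; a count of exactly 1 gives 0.
    print(f"expected count {mu:>4} -> log2 value {log2_scale(mu):+.2f}")
```

Seen this way, a negative transformed value is not a negative count, only an expected count below 1.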

Second, the rlog and VST may not be optimal for very sparse data. If I were you, I would compare them to other transformations and pick one based on properties such as the stabilization of variance over the mean (see vignette) and preservation of signal (seen for example through a PCA plot).

Answer by Michael Love, 7.7 years ago

Thank you Ryan and Michael for your quick responses. I see how negative values are possible in the transformation but that's easier to deal with since that can be interpreted as '<1'. I am bummed about the positive values though. I've uploaded images with an example of this and also a comparison of log2, vst and rlog in stabilizing variance over means. I have ~18000 genes and ~300 samples.

I realized after reading your comment that I do have a large number of sparse genes in this dataset so I will try next with less sparse genes. But could you elaborate what you mean by 'preservation of signal (seen for example through a PCA plot)'?

http://imgur.com/a/p1c9d

Reply by longwoodSequencer, 7.7 years ago

Here VST and rlog are much better at stabilizing variance than log2(x+1). You might try a higher pseudocount for log2 as well while you are making comparisons.
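As a rough illustration of the pseudocount suggestion, the Python sketch below compares log2(x + pseudocount) for two made-up genes, one sparse and low-count and one highly expressed. The counts, the pseudocount values 1 and 8, and the helper name `sd_after_log2` are all assumptions for illustration; nothing here comes from the dataset in this thread.

```python
import math
import statistics

def sd_after_log2(counts, pseudocount):
    """Standard deviation of log2(count + pseudocount) across samples."""
    transformed = [math.log2(c + pseudocount) for c in counts]
    return statistics.stdev(transformed)

# Made-up counts: a sparse low-count gene and a highly expressed gene.
low_gene = [0, 1, 0, 2, 0, 3, 1, 0]
high_gene = [900, 1100, 1000, 950, 1050, 980, 1020, 1000]

for pc in (1, 8):
    print(
        f"pseudocount {pc}: "
        f"SD low gene = {sd_after_log2(low_gene, pc):.3f}, "
        f"SD high gene = {sd_after_log2(high_gene, pc):.3f}"
    )
```

With the larger pseudocount, the spread of the transformed low-count gene shrinks while the high-count gene is barely affected, so the two genes end up with more similar standard deviations. On real data the same comparison would be made across all genes, for example with a mean versus standard deviation plot.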

What I meant by preservation of signal is to inspect whether the PCA plot shows biologically meaningful separation of groups. A transformation cannot bring out signal that does not exist in the data, but a good transformation and visualization should make the biological signal that is there prominent.

Reply by Michael Love, 7.7 years ago