Voom on TCGA data shifts count distributions towards negative values
3
1
Entering edit mode
paola.ostano ▴ 10
@paolaostano-7124
Last seen 6.8 years ago
Italy

Dear Gordon and dear all,

I'm using the voom transformation on RSEM "raw" counts from TCGA (RNASeq experiments). To avoid problems associated with zero values, I added +1 to the original read counts.

The problem is that count distributions are shifted towards negative values, after voom transformation. And then if I perform a two-class comparison, I obtain some differentially expressed genes (with statistically significant p.values) that have negative values in both classes... I really don't know how to manage these negative values.

The RSEM log2 raw counts have the following distribution (I just selected 10 samples):

http://i.imgur.com/mJ56iNb.png

After voom transformation, the distributions are shifted towards negative values.

http://i.imgur.com/dEvIvOb.png

The code I used is the following:

library (limma)
library (edgeR)

a <- read.table ("RSEM_genes_data_raw_count+1.txt", header=T,sep="\t")
dim(a)
counts <- as.matrix(a[,2:11])
rownames(counts) <-  a$gene_id
design=cbind(NH=c(1,1,1,1,1,0,0,0,0,0),NL=c(0,0,0,0,0,1,1,1,1,1))

v <-  voom (counts, design, plot=TRUE)

Could you please help me in solving my problem?

Thank you very much in advance

Paola

voom TCGA RNAseq • 3.7k views
ADD COMMENT
4
Entering edit mode
@james-w-macdonald-5106
Last seen 58 minutes ago
United States

Hi Paola,

Edited.

This isn't actually a problem at all, and there is no need to 'manage' the negative values. By using voom() you are transforming the data to counts/million (cpm) and taking logs. Any gene with a relatively small number of counts will end up with a cpm value less than one. This will necessarily result in lots of negative values when you take logs, but the end goal is to analyze the data using conventional ANOVA methods, which are quite happy to deal with negative values.

In addition, the resulting values don't 'mean' anything in a conventional sense. In other words, a negative value doesn't mean that the gene is somehow negatively expressed. In fact it doesn't really mean anything until you compare to the value for that same gene in a different sample (which is what you do when you fit a model and make contrasts).

As an example, you could have a gene that has a negative expression value in all samples, and when you fit the model you get a mean expression of say -3 in one group and -4 in another. All this means is that this gene is up-regulated in the first group as compared to the second (-3 - (-4) = 1, and since this is log_2, a difference of 1 is a two-fold difference).

Does that make sense?

ADD COMMENT
0
Entering edit mode

Additionally: I didn't see it in the OP's code, but take care that all "official" voom examples suggest that you should first filter out low count genes before applying voom. I haven't seen any formal analysis on why it must be so, but I believe Gordon previously mentioned it necessary to do so from their empirical testing (Gordon, apologies if I've misinterpreted previous posts and please correct me if I'm wrong).

ADD REPLY
0
Entering edit mode

In general, low count genes don't have enough power to reject the null hypothesis. Now, the absolute size of the counts is lost during log-transformation, but this does not prevent loss of power in the subsequent linear modelling. In particular, the mean-variance trend fitted by voom usually rises at low abundances. This results in a high sample variance which limits power for low count genes. If these genes will not reject, it's best to remove them to reduce the severity of the multiple testing correction later on. In addition, the assumption of normality for log-counts is not accurate when the counts are very low.

ADD REPLY
0
Entering edit mode

I wasn't referring to (the perhaps mild) increase in power you get by not testing on low-expressed genes, but I had thought the issue was more that voom had issues "correctly" modeling the mean-variance trend when you don't use a mild low-expression filter.

For instance, imagine we don't remove low count genes prior to running voom, then post-voom you then remove low count genes before you do any downstream "normal limma" maneuvers (fitting/testing), you would presumably recover the same "power" by removing the low count genes before running the tests -- would you say that these two steps are equally valid?

I, of course, can empirically test that, but I was just curious what the official party line is.

Sorry if I'm making a mountain out of a mole hill here, but my paranoia just comes from the simple statement in section 15.3 of the limma user's guide that reads "The limma-voom method assumes that rows with zero or very low counts have been removed". It's the "assumes" bit that makes me wonder, is all.

ADD REPLY
2
Entering edit mode

There are at least two issues with variance modelling for low counts. The first is that the variances become discrete, such that the loess curve doesn't fit very well. This also affects the estimation of the prior d.f., as the variance of the variances isn't smoothly distributed. For example, you can end up with many variances of zero when all counts are exactly the same.

The other issue is that residual d.f. are lost when you have many zero counts. Specifically, observations with fitted values of (log-)zero don't contribute anything to the total residual deviance of the linear model. The default limma pipeline doesn't account for this and will end up underestimating the variance. This manifests as a drop in the mean-variance trend at low abundances, rather than the expected increase.

Filtering to remove these low or zero counts avoids the worst of these problems.

ADD REPLY
1
Entering edit mode

Steve, you are correct in all respects. The official party line is that filtering of genes with consistently low counts should be done prior to voom, and the reason for this is more fundamental than multiple testing issues. voom fits a trend to the genewise variances. The variances are essentially undefined for very low count genes, and fitting a trend to undefined quantities is not a reliable thing to do. In recent versions, we have robustified voom so that it now protects itself from genes with all zero counts, but the general principle still remains.

Of course, all this goes beyond the scope of the orginal post, which was about a far more basic question.

ADD REPLY
0
Entering edit mode

It is indeed beyond the scope of the OP's question, but thanks to you and Aaron for taking the time to elaborate further.

ADD REPLY
1
Entering edit mode
@gordon-smyth
Last seen 12 hours ago
WEHI, Melbourne, Australia

As Jim has said, there is no problem here.

The log2 counts per million output by voom are on a different scale to log counts, so there is no reason to expect them to be similar.

There is also no need to add 1 to the counts -- better to work with the original counts rather than making ad hoc transformations such as this.

On the other hand, it would be beneficial to filter out genes with very few reads before running voom, for example:

keep <- rowSums(counts) > 20
counts <- counts[keep,]

etc.

ADD COMMENT
0
Entering edit mode
paola.ostano ▴ 10
@paolaostano-7124
Last seen 6.8 years ago
Italy

Dear James and dear Gordon,

Thank you very much for your clarifications and suggestions.

I will filter out low-counts genes.

 

Thank you again!

Paola

ADD COMMENT

Login before adding your answer.

Traffic: 382 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6