Question

proper transformation for differential expression in normalized log expression data

rachel.melamed • 0

@rachelmelamed-7219

United States

Dear R people,

I want to do an analysis of differentially expressed genes between tumors and normals. I was hoping to start from the expression data in TCGA_PANCAN_exp_HiSeqV2_PANCAN. This contains 8415 tumor and normal samples across 27 tumor types, and I was envisioning some sort of anova-like test to find the set of genes differentially expressed in tumors (maybe using edgeR). However, a lot of processing went into this data: apparently each sample started as RSEM expected counts, which were normalized to the sample's 75th percentile, log2(x+1) transformed, and then normalized between cohorts (at the very least). So it is not the count data I've used before. Looking at the mean-variance plot (meanSdPlot(expr, ranks=FALSE)), there's quite a strong pattern (https://dl.dropboxusercontent.com/u/10824188/Screen%20Shot%202015-01-07%20at%2011.01.52%20AM.png). The genes near zero ("average expression", I believe) have the lowest variance. Is this data appropriate for my purpose? Is there any transformation I could do?

Thank you very much,

Rachel

mean vs variance

tcga rnaseq deseq2 vsn edger • 6.3k views

updated 9.3 years ago by Gordon Smyth • written 9.3 years ago by rachel.melamed

score 3 · Answer 1 · 2015-01-08

Gordon Smyth 50k

@gordon-smyth

WEHI, Melbourne, Australia

The RSEM expected counts from the TCGA project will work fine with either limma-voom or edgeR. However, with such a large number of samples, limma-voom is easily the best choice from a computational point of view. (Note I mean voom, not vooma.)

None of the other data columns are usable and you must not do any data transformation.

The two mean-variance plots that you give (from meanSdPlot and vooma) look very bad indeed, nonsense really. There is no way that you should be getting a v-shape on zero as in these plots. You don't say what expression values or what design matrix you used to make these plots but, however it has been done, it looks incorrect.

Answered 9.3 years ago by Gordon Smyth

Thanks for the reply (and for your packages). The data are supplemental material from Hoadley et al., https://tcga-data.nci.nih.gov/docs/publications/TCGApancan_2014/rnaseq_input.txt . They used it to cluster cancers, so maybe it made more sense for their purpose. I have had great results using either limma or edgeR with the RSEM expected counts from TCGA data, so hopefully someday soon I'll have time to wrangle the count data for 8000 tumors and try this analysis again.

Reply by rachel.melamed • 9.3 years ago

On that note: when you say "The RSEM expected counts from the TCGA project will work fine with either limma-voom ...", do you mean the RSEM normalized counts that I pull from TCGA (as opposed to raw counts)? That is precisely the count data I am working with at the moment, and I was trying to figure out the best way to transform it before using the limma package for differential expression testing. I'd appreciate your input on that. Thank you!

Reply by vkartha • 9.2 years ago

For edgeR, you'd need the raw counts (or the expected counts from RSEM). The absolute size of "normalized counts" has little meaning, so the mean-variance relationship assumed by the NB model is no longer well defined. For voom, you could, in theory, use the normalized counts, because the function empirically models whatever mean-variance relationship is present in the data. However, whether this strategy is sensible depends on how the normalization was performed. I think it'd be a lot safer to get something as close to the raw counts as possible, and then normalize within the voom/limma pipeline.
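
To make the point about scaled counts concrete, here is a small arithmetic sketch in plain Python (chosen only so that it runs without any packages; it is not edgeR code). It assumes the usual negative binomial variance function, var = mu + phi * mu^2. The mean `mu`, dispersion `phi`, scaling factor `c`, and both function names are made up for the illustration.

```python
# Illustrative arithmetic only: mu, phi and c are made-up values,
# and nb_variance / scaled_moments are hypothetical helper names.

def nb_variance(mu, phi):
    """Negative binomial variance for mean mu and dispersion phi."""
    return mu + phi * mu ** 2


def scaled_moments(mu, phi, c):
    """Mean and variance of c * Y when Y is negative binomial with (mu, phi)."""
    return c * mu, c ** 2 * nb_variance(mu, phi)


mu, phi = 50.0, 0.1
for c in (1.0, 0.2, 5.0):
    mean_z, var_z = scaled_moments(mu, phi, c)
    # Variance that an NB model with the same dispersion would imply at this mean.
    var_implied = nb_variance(mean_z, phi)
    print(f"c={c}: mean={mean_z:g}  actual var={var_z:g}  NB-implied var={var_implied:g}")
```

With `c = 1` the two variances agree. For any other `c` they differ, which is one way to see why arbitrarily scaled values no longer follow the NB mean-variance relationship.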

Reply by Aaron Lun • 9.2 years ago

score 2 · Answer 2 · 2015-01-07

If you want to use edgeR, the original counts would be the ideal input. You might be able to get away with "expected" counts, but once they have been manipulated with log-transformations and normalization, they can no longer be interpreted as counts. Simply reversing the log-transformation is not sufficient, as the normalization steps will have distorted the absolute size of the resulting values (which affects the mean-variance relationship in edgeR's statistical model).
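
A minimal sketch of why undoing the log is not enough, in plain Python for illustration only. The upper-quartile scaling to a target of 1000, the helper names, and the two toy samples are assumptions made for this example. They are not a description of the exact TCGA processing.

```python
import math
import statistics


def upper_quartile(values):
    """75th percentile of the non-zero values (one common convention)."""
    nonzero = sorted(v for v in values if v > 0)
    return statistics.quantiles(nonzero, n=4, method="inclusive")[2]


def normalize_and_log(counts, target=1000.0):
    """Scale a sample so its upper quartile equals `target`, then take log2(x + 1)."""
    uq = upper_quartile(counts)
    return [math.log2(c / uq * target + 1.0) for c in counts]


def naive_reverse(log_values):
    """Undo only the log2(x + 1) step."""
    return [2.0 ** v - 1.0 for v in log_values]


# Two hypothetical samples with identical relative expression;
# sample_b is sequenced four times deeper than sample_a.
sample_a = [0, 3, 10, 25, 40, 120]
sample_b = [4 * c for c in sample_a]

back_a = naive_reverse(normalize_and_log(sample_a))
back_b = naive_reverse(normalize_and_log(sample_b))

print("original a:", sample_a)
print("original b:", sample_b)
print("reversed a:", [round(v, 3) for v in back_a])
print("reversed b:", [round(v, 3) for v in back_b])
```

Both samples come back as the same scaled values, so the fourfold difference in sequencing depth, and with it the absolute count size that a count model relies on, cannot be recovered from the log-values alone.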

If you can only get access to the log-values, you might want to look into the vooma function from the limma package. This will estimate the mean-variance relationship in order to compute observation-specific precision weights. These weights can then be used for linear modelling of the log-values. That said, the mean-variance relationship that you've shown is quite bizarre and might interfere with proper modelling. I suspect that this is a result of one of the normalization steps, though I'm not familiar enough with TCGA processing to say that with any certainty.
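
The idea of turning a mean-variance trend into precision weights can be sketched as follows, in plain Python. This is a deliberately simplified illustration and not the vooma implementation: vooma fits a lowess trend to residual standard deviations from a linear model and produces observation-level weights, whereas this sketch uses a crude binned average, no design matrix, and one weight per gene. The toy matrix and all function names are hypothetical.

```python
import statistics


def gene_stats(log_expr):
    """Per-gene mean and square-root standard deviation across samples."""
    return [(statistics.fmean(row), statistics.stdev(row) ** 0.5) for row in log_expr]


def binned_trend(stats, n_bins=3):
    """Crude stand-in for a lowess fit: average sqrt-sd within bins of genes ordered by mean."""
    ordered = sorted(range(len(stats)), key=lambda i: stats[i][0])
    size = -(-len(ordered) // n_bins)  # ceiling division
    trend = [0.0] * len(stats)
    for start in range(0, len(ordered), size):
        members = ordered[start:start + size]
        level = statistics.fmean([stats[i][1] for i in members])
        for i in members:
            trend[i] = level
    return trend


def precision_weights(trend):
    """Inverse predicted variance; the trend is on the sqrt-sd scale, so variance = trend ** 4."""
    return [1.0 / t ** 4 for t in trend]


# Toy log-expression matrix: rows are genes, columns are samples (made-up values).
log_expr = [
    [0.1, 0.0, 0.2, 0.1],
    [0.3, 0.2, 0.4, 0.3],
    [5.0, 6.0, 4.0, 5.5],
    [6.0, 7.5, 5.0, 6.5],
    [10.0, 10.5, 9.5, 10.2],
    [11.0, 11.4, 10.8, 11.1],
]

trend = binned_trend(gene_stats(log_expr))
weights = precision_weights(trend)
for row, w in zip(log_expr, weights):
    print(f"mean={statistics.fmean(row):.2f}  weight={w:.2f}")
```

Genes in a high-variance region of the trend receive small weights and genes in a low-variance region receive large ones. A gene with zero variance would make the weight undefined in this sketch, which is one reason a smooth fitted trend is preferable to raw per-bin values.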