Question: proper transformation for differential expression in normalized log expression data
rachel.melamed (United States) wrote, 4.7 years ago:

Dear R people,

I want to do an analysis of differentially expressed genes between tumors and normals. I was hoping to start from the expression data in TCGA_PANCAN_exp_HiSeqV2_PANCAN. This contains 8415 tumor and normal samples across 27 tumor types, and I was envisioning some sort of ANOVA-like test to find the set of genes differentially expressed in tumors (maybe using edgeR). However, a lot of processing went into this data. Apparently each sample went through at least the following: RSEM expected counts; normalization to its 75th percentile; log2(x+1) transformation; normalization between cohorts. So it is not the count data I've used before. Looking at the mean-variance plot (meanSdPlot(expr, ranks=FALSE)), there's quite a strong pattern (https://dl.dropboxusercontent.com/u/10824188/Screen%20Shot%202015-01-07%20at%2011.01.52%20AM.png). The genes near zero ("average expression", I believe) have the lowest variance. Is this data appropriate for my purpose? Is there any transformation I could do?

Thank you very much,

Rachel

mean vs variance

Tags: rnaseq, vsn, edger, deseq2, tcga
Answer: Gordon Smyth (Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia) wrote, 4.7 years ago:

The RSEM expected counts from the TCGA project will work fine with either limma-voom or edgeR. However, with such a large number of samples, limma-voom is easily the best choice from a computational point of view. (Note I mean voom, not vooma.)

None of the other data columns are usable and you must not do any data transformation.
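
The limma-voom route described above could look roughly like the following. This is a minimal sketch, not the answerer's own code: the count matrix is simulated as a stand-in for RSEM expected counts, the two-group design is a simplification of a tumor-versus-normal comparison across many tumor types, and all object names (counts, group, design) are hypothetical.

```r
library(limma)
library(edgeR)

set.seed(1)

# Simulated stand-in for a genes x samples matrix of RSEM expected counts.
# Real expected counts may be non-integer; they are passed in untransformed.
n_genes <- 1000
n_samples <- 12
gene_means <- runif(n_genes, min = 5, max = 500)
counts <- matrix(
  rnbinom(n_genes * n_samples, mu = gene_means, size = 5),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("sample", seq_len(n_samples)))
)

# Hypothetical two-group design: normal versus tumor.
group <- factor(rep(c("normal", "tumor"), each = n_samples / 2))
design <- model.matrix(~ group)

# Filter low-expressed genes and compute normalization factors
# inside the pipeline, rather than using pre-normalized values.
dge <- DGEList(counts = counts)
keep <- filterByExpr(dge, design)
dge <- dge[keep, , keep.lib.sizes = FALSE]
dge <- calcNormFactors(dge)

# voom converts counts to log2-CPM and estimates precision weights
# from the mean-variance trend; plot = TRUE draws that trend.
v <- voom(dge, design, plot = TRUE)

fit <- eBayes(lmFit(v, design))
topTable(fit, coef = "grouptumor")
```

With real data, the voom plot is the place to check that the mean-variance trend is smooth and decreasing rather than v-shaped.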

The two mean-variance plots that you give (from meanSdPlot and vooma) look very bad indeed, nonsense really. There is no way that you should be getting a v-shape on zero as in these plots. You don't say what expression values or what design matrix you used to make these plots but, however it has been done, it looks incorrect.


Thanks for the reply (and for your packages). The data are supplemental from Hoadley et al. (https://tcga-data.nci.nih.gov/docs/publications/TCGApancan_2014/rnaseq_input.txt). They used it to cluster cancers, so maybe it made more sense for their purpose. I have had great results using either limma or edgeR with the RSEM expected counts from TCGA data, so hopefully someday soon I'll have time to wrangle the count data for 8000 tumors and try this analysis again.

written 4.7 years ago by rachel.melamed

On that note: when you say "The RSEM expected counts from the TCGA project will work fine with either limma-voom ..", do you mean the RSEM normalized counts that I pull off of TCGA (vs raw counts)? Because that is precisely the count data I am working with at the moment, and I was trying to figure out the best way to transform the data prior to using the limma package for differential expression testing. I'd appreciate your input on that. Thank you!

written 4.6 years ago by vkartha

For edgeR, you'd need the raw counts (or the expectation thereof from RSEM). The absolute size of "normalized counts" has little meaning, and the mean-variance relationship for the NB model will become undefined. For voom, you could, theoretically, use the normalized counts, because the function will empirically model whatever mean-variance relationship is present in the data. However, whether this strategy is sensible depends on how the normalization was performed. I think it'd be a lot safer to get something as close to the raw counts as possible, and then normalize within the voom/limma pipeline.
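
One way to see the point about the NB mean-variance relationship is the short derivation below. It is a sketch under stated assumptions: Y is a count with mean μ and NB dispersion φ, and c is a hypothetical per-sample scaling factor standing in for whatever normalization was applied.

```latex
\begin{align}
\operatorname{Var}(Y) &= \mu + \phi\,\mu^{2} \\
Z &= cY, \qquad \mu_Z = c\,\mu \\
\operatorname{Var}(Z) &= c^{2}\mu + c^{2}\phi\,\mu^{2}
                       = c\,\mu_Z + \phi\,\mu_Z^{2}
\end{align}
```

Unless c = 1, the linear term is inflated or deflated by c, so the scaled values no longer follow the NB mean-variance form. Because c differs from sample to sample, no single correction restores it.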

modified 4.6 years ago • written 4.6 years ago by Aaron Lun
Answer: Aaron Lun (Cambridge, United Kingdom) wrote, 4.7 years ago:

If you want to use edgeR, the original counts would be the ideal input. You might be able to get away with "expected" counts, but once you start manipulating them with log-transformations and normalization, they're not going to be interpretable as counts anymore. A simple reversal of the log-transformation is not sufficient, as the normalization steps will have distorted the absolute size of the resulting values (which will affect the mean-variance relationship in edgeR's statistical model).
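
A small base-R illustration of why reversing the log is not enough. The numbers are made up, and the processing is only an assumed approximation of the steps listed in the question (scaling by the 75th percentile, then log2(x+1)); the rescaling target of 1000 is an arbitrary choice for illustration.

```r
# Two hypothetical samples with identical relative expression but a
# 4x difference in sequencing depth (made-up numbers, not TCGA data).
base_counts <- c(0, 5, 10, 20, 40, 80, 160, 320)
counts <- cbind(shallow = base_counts, deep = 4 * base_counts)

# Assumed processing: divide each sample by its 75th percentile,
# rescale to an arbitrary target of 1000, then take log2(x + 1).
upper_quartile <- apply(counts, 2, quantile, probs = 0.75)
normalized <- sweep(counts, 2, upper_quartile, "/") * 1000
logged <- log2(normalized + 1)

# Undo only the log transformation.
recovered <- 2^logged - 1

# Compare the two samples after reversal, and the column totals
# before and after processing.
all.equal(recovered[, "shallow"], recovered[, "deep"])
colSums(counts)
colSums(recovered)
```

After reversal the two samples are indistinguishable, so the original difference in depth, which a count model depends on, cannot be recovered from the processed values alone.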

If you can only get access to the log-values, you might want to look into the vooma function from the limma package. This will estimate the mean-variance relationship in order to compute observation-specific precision weights. These weights can then be used for linear modelling of the log-values. That said, the mean-variance relationship that you've shown is quite bizarre and might interfere with proper modelling. I suspect that this is a result of one of the normalization steps, though I'm not familiar enough with TCGA processing to say that with any certainty.
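
The vooma approach could be sketched as follows. This is a hedged illustration only: the log-expression matrix is simulated, the two-group design is a simplification, and the object names (log_expr, group, design) are hypothetical.

```r
library(limma)

set.seed(1)

# Simulated stand-in for a genes x samples matrix of log-expression values.
n_genes <- 500
n_samples <- 10
log_expr <- matrix(
  rnorm(n_genes * n_samples, mean = 8, sd = 1),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("sample", seq_len(n_samples)))
)

# Hypothetical two-group design: normal versus tumor.
group <- factor(rep(c("normal", "tumor"), each = n_samples / 2))
design <- model.matrix(~ group)

# vooma estimates the mean-variance trend of the log-values and
# returns observation-specific precision weights; plot = TRUE draws the trend.
v <- vooma(log_expr, design, plot = TRUE)

# The weights stored in v are used automatically by lmFit.
fit <- eBayes(lmFit(v, design))
topTable(fit, coef = "grouptumor")
```

If the trend in the vooma plot looks as irregular as the one in the question, the resulting weights are unlikely to be trustworthy.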


Thank you! I suspect I'm wasting my time with this data, but all I have is the normalized data. I tried vooma and it doesn't look much better. So limma would probably give me garbage results, right? Maybe some nonparametric test would work.

written 4.7 years ago by rachel.melamed