I want to do an analysis of differentially expressed genes between tumor and normal samples. I was hoping to start from the expression data in TCGA_PANCAN_exp_HiSeqV2_PANCAN. This contains 8415 tumor and normal samples across 27 tumor types, and I was envisioning doing some sort of ANOVA-like test to find the set of genes differentially expressed in tumors (maybe using edgeR). However, a lot of processing went into this data. Apparently each sample was: RSEM expected counts; normalized to its 75th percentile; log2(x+1) transformed; normalized between cohorts (at the very least). So it is not the count data I've used before. Looking at the mean-variance plot (meanSdPlot(expr, ranks=FALSE)), there's quite a strong pattern (https://dl.dropboxusercontent.com/u/10824188/Screen%20Shot%202015-01-07%20at%2011.01.52%20AM.png). The genes near zero ("average expression", I believe) have the lowest variance. Is this data appropriate for my purpose? Is there any transformation I could do?

The RSEM expected counts from the TCGA project will work fine with either limma-voom or edgeR. However, with such a large number of samples, limma-voom is easily the best choice from a computational point of view. (Note I mean voom, not vooma.)
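For anyone following along, a minimal sketch of the limma-voom route on a genes-by-samples matrix of expected counts might look like the code below. The simulated `counts` matrix, the two-level `group` factor and the filtering step are assumptions for illustration, not the actual TCGA data. A real pan-cancer analysis would presumably also need tumor type in the design matrix.

```r
library(edgeR)   # DGEList, filterByExpr, calcNormFactors
library(limma)   # voom, lmFit, eBayes, topTable

set.seed(1)

# Hypothetical stand-in for a genes x samples matrix of RSEM expected counts.
# (Real expected counts are non-integer; voom accepts those as well.)
n_genes <- 500
n_samples <- 12
counts <- matrix(rnbinom(n_genes * n_samples, mu = 100, size = 5),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("s", seq_len(n_samples))))

# Assumed two-group layout: first half normal, second half tumor.
group <- factor(rep(c("normal", "tumor"), each = n_samples / 2),
                levels = c("normal", "tumor"))
design <- model.matrix(~ group)

# Filter low-count genes and compute TMM normalization factors.
dge <- DGEList(counts = counts)
keep <- filterByExpr(dge, design)
dge <- dge[keep, , keep.lib.sizes = FALSE]
dge <- calcNormFactors(dge)

# voom estimates the mean-variance trend and returns log2-CPM with weights.
v <- voom(dge, design, plot = FALSE)
fit <- eBayes(lmFit(v, design))

# Tumor versus normal, all genes, ordered by evidence for differential expression.
res <- topTable(fit, coef = "grouptumor", number = Inf)
head(res)
```

Setting `plot = TRUE` in the `voom` call shows the fitted mean-variance trend, which is a useful sanity check before trusting the results.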

None of the other data columns are usable and you must not do any data transformation.

The two mean-variance plots that you give (from meanSdPlot and vooma) look very bad indeed, nonsense really. There is no way that you should be getting a v-shape on zero as in these plots. You don't say what expression values or what design matrix you used to make these plots but, however it has been done, it looks incorrect.

Thanks for the reply (and for your packages). The data are supplemental from Hoadley, et al., https://tcga-data.nci.nih.gov/docs/publications/TCGApancan_2014/rnaseq_input.txt . They used it to cluster cancers, so maybe it made more sense for their purpose. I have had great results using either limma or edgeR with the RSEM expected counts from TCGA data, so hopefully someday soon I'll have time to wrangle 8000 tumor counts data and try this analysis again.

On that note - When you say "The RSEM expected counts from the TCGA project will work fine with either limma-voom .." - do you mean RSEM Normalized counts that I pull off of TCGA (vs raw counts)? Because that is precisely what count data I am working with at the moment, and was figuring out the best way to transform the data prior to using the limma package for differential expression testing. I'd appreciate your input on that. Thank you!

For edgeR, you'd need the raw counts (or the expectation thereof from RSEM). The absolute size of "normalized counts" has little meaning, and the mean-variance relationship for the NB model will become undefined. For voom, you could, theoretically, use the normalized counts, because the function will empirically model whatever mean-variance relationship is present in the data. However, whether this strategy is sensible depends on how the normalization was performed. I think it'd be a lot safer to get something as close to the raw counts as possible, and then normalize within the voom/limma pipeline.

If you want to use edgeR, the original counts would be the most ideal input. You might be able to get away with "expected" counts, but once you start manipulating them with log-transformations and normalization, they're not going to be interpretable as counts anymore. A simple reversal of the log-transformation is not sufficient, as the normalization steps will have distorted the absolute size of the resulting values (which will affect the mean-variance relationship in edgeR's statistical model).
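To illustrate why undoing the log is not enough, here is a small base-R sketch. It assumes a simplified version of the processing (scale each sample so its upper quartile is 1000, then take log2(x+1)); the real TCGA pipeline may differ, and the simulated samples and the `uq_norm` helper are hypothetical.

```r
set.seed(1)

# Two hypothetical samples with the same expression profile,
# but sample b is sequenced 10x deeper than sample a.
mu <- rexp(1000, rate = 1 / 50)
raw_a <- rpois(1000, mu)
raw_b <- rpois(1000, 10 * mu)

# Assumed processing: scale so the 75th percentile of the
# nonzero values is 1000, then apply log2(x + 1).
uq_norm <- function(x) x / quantile(x[x > 0], 0.75) * 1000
log_a <- log2(uq_norm(raw_a) + 1)
log_b <- log2(uq_norm(raw_b) + 1)

# Naive reversal of the log-transformation.
back_a <- 2^log_a - 1
back_b <- 2^log_b - 1

# The raw totals differ by roughly the depth factor,
# while the back-transformed totals are nearly equal.
depth_ratio_raw <- sum(raw_b) / sum(raw_a)
depth_ratio_back <- sum(back_b) / sum(back_a)
print(c(raw = depth_ratio_raw, back_transformed = depth_ratio_back))
```

The back-transformed values no longer carry the sequencing depth, so a count model would treat a shallow sample and a deep sample as equally precise.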

If you can only get access to the log-values, you might want to look into the vooma function from the limma package. This will estimate the mean-variance relationship in order to compute observation-specific precision weights. These weights can then be used for linear modelling of the log-values. That said, the mean-variance relationship that you've shown is quite bizarre and might interfere with proper modelling. I suspect that this is a result of one of the normalization steps, though I'm not familiar enough with TCGA processing to say that with any certainty.
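A rough sketch of the vooma route, assuming a genes-by-samples matrix of log-expression values and a simple tumor versus normal design. The simulated `logexpr` matrix and the `group` factor are placeholders, not the TCGA data, and the sketch says nothing about whether the trend in your data is sane enough to model.

```r
library(limma)   # vooma, lmFit, eBayes, topTable

set.seed(1)

# Hypothetical stand-in for a genes x samples matrix of log2 expression values.
n_genes <- 500
n_samples <- 12
logexpr <- matrix(rnorm(n_genes * n_samples, mean = 8, sd = 1),
                  nrow = n_genes,
                  dimnames = list(paste0("gene", seq_len(n_genes)),
                                  paste0("s", seq_len(n_samples))))

# Assumed two-group layout: first half normal, second half tumor.
group <- factor(rep(c("normal", "tumor"), each = n_samples / 2),
                levels = c("normal", "tumor"))
design <- model.matrix(~ group)

# vooma estimates the mean-variance trend from the log-values
# and attaches observation-level precision weights.
v <- vooma(logexpr, design, plot = FALSE)

# The weights are then used automatically by lmFit.
fit <- eBayes(lmFit(v, design))
res <- topTable(fit, coef = "grouptumor", number = Inf)
head(res)
```

Setting `plot = TRUE` in `vooma` displays the estimated trend, which is worth inspecting first.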

Thank you! I suspect I'm wasting my time with this data, but all I have is the normalized data. I tried vooma and it doesn't look much better. So limma would probably give me garbage results, right? Maybe some nonparametric test would work.
