Dear Bioconductor users,
I am working with TCGA RNA-seq data. I want to perform unbiased selection of features associated with overall survival of hepatocellular carcinoma patients using elastic net regularized cox regression modeling and then rank them by heir coefficient sizes. Me and my colleagues are so confused with the data should be used and the appropriate normalization strategy. Specifically, I have downloaded the .rsem.genes.results. I've understood that the "raw_count" is the estimated number of fragments derived from a given gene and the "scaled_estimate" is the fraction of transcripts made up by a given gene. The "scaled_estimate" could maybe be used as well, e.g. by multiplying it with 1M to get "transcripts per million" (TPM) which Li and Dewey state should be more comparable across samples. So, which of the following normalized values should I use (and why to use) for cox glmnet analysis (in order to fulfill the statistical assumptions required for linear modeling) : the vst normalized "raw_counts" (using varianceStabilizingTransformation function of DEseq2 package), the voom transformed "raw_counts" (using both TMM normalization and voom tranformation) or the log2(TPM+prior count) (TPM= "scaled_estimate"*10^6)?
I would appreciate hearing your opinion!!!
Thank you very very much in advance!!!
Sincerely,
Panagiotis Mokos
Dear Love,
Thank you very much for your response!
Just a little correction.The phrase in the brackets is not "which to use" but "why to use".
In varianceStabilizingTransformation R documentation is refered that the transformations performed from this function are useful when checking for outliers or as input for machine learning techniques. These function apply only between-sample normalization and further transformation (log-like). Do you beleive that the lack of within-sample normalization (correction of transcript length bias) bias downstream analyses (such as clustering, multiple regression, regularization, etc.). In other words do you beleive that the log (TPM+prior count) is more appropriate metric than vst normalized values to use for downsteam analyses?
Thank you very very much for your time in advance!!
Sincerely,
Panagiotis Mokos
If you use the VST or rlog in DESeq2, and you have used e.g. Salmon, Sailfish or kallisto to quantify the abundances, the normalized, transformed values also take into account within-sample biases such as gene length changes, etc.
If you want a kallisto-processed version of the TCGA RNA-Seq data, you might consider my paper here: http://www.nature.com/articles/srep39259
Dear Love,
Thank you very very much for your quick and informative reply!!
Panagiotis
Dear Love,
I have read the paper here "Charlotte Soneson, Michael I. Love, Mark D. Robinson (2015): Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research" and the tximport vignette. In this pipeline you note that there are two suggested ways of importing estimates for use with gene-level differential expression methods. The first method, recommended for edgeR and DESeq2, is to use the estimated counts from the quantification tools, and additionally to use the transcript-level abundance estimates to calculate an offset that corrects for changes to the average transcript length across samples. The second one (which you recommend for limma-voom as this software does not use the offset matrix stored in
y$offset)
is to use thetximport
argumentcountsFromAbundance="lengthScaledTPM" (
scaled using the average transcript length, averaged over samples and to library size)
or"scaledTPM" (
scaled to library size)
, and then to use the count matrixtxi$counts
directly as a regular count matrix.In .rsem.genes.results.txt file of TCGA RNA-seq data the variable "scaled_estimate" is the fraction of transcripts made up by a given gene. Also the "scaled_estimate" could maybe be used as well, e.g. by multiplying it with 1M to get "transcripts per million" (TPM) as I mentioned above. Could I perform varianceStabilizingTransformation using DEseq2 package to these counts or could i use the v$E (v= result of voom) after voom function as you recommend to use either lengthScaledTPM or scaledTPM counts?
I would appreciate hearing your opinion!!!
Thank you very very much for your time in advance!!!
Sincerely,
Panagiotis Mokos
hi Panagiotis,
You can try different things and compare using meanSdPlot like we do in the DESeq2 vignette.
If you want to use the VST it really only makes sense to apply it to something like counts. You can obtain something on the scale of counts by scaling up TPMs to the typical library size, that is, make the column sums of the matrix close to the average number of fragments in the samples.
Dear Love,
I have just read in detail your RNA-seq workflow from this link http://www.bioconductor.org/help/workflows/rnaseqGene/#the-deseqdataset-sample-information-and-the-design-formula. At some point you refer "While the rlog is on roughly the same scale as the log2 counts, the VST has a upward shift for the smaller values. It is the differences between samples (deviation from y=x in these scatterplots) which will contribute to the distance calculations and the PCA plot." I can't understand why the lower values have an upward shift and I am wondering that these lower, shifted and normalized values bias downstream analyses. Do you beleive that the rlog works better than vst in this example with scatterplots?
I would appreciate hearing your opinion!!!
Thank you very very much for your time in advance!!!
Sincerely,
Panagiotis Mokos
hi Panagiotis,
I wouldn't be worried about the upward shift. The important thing for downstream analysis is the difference between samples. Both rlog and VST shrink differences between samples for genes that have small counts. Can you say more about what situation (e.g. you mention bias) that you are worried about?
Dear Love,
Thank you very much for your response. Could you explain more the differences between the above-mentioned scatterplots (ie between the rlog and VST method)? Also what happens about the variance of lowly expressed genes between rlog and VST transformation (given that the mean values of lowly expressed genes are increased using rlog transformation)? Finally why did you prefer the rlog counts for all downstream analyses in RNA-seq workflow?
Thank you very very much for your time in advance!!
Sincerely,
Panagiotis Mokos
Could you explain more the differences between the above-mentioned scatterplots (ie between the rlog and VST method)?
I don't know if I have more to add than it says in the workflow.
Also what happens about the variance of lowly expressed genes between rlog and VST transformation (given that the mean values of lowly expressed genes are increased using rlog transformation)?
Both VST and rlog shrink the variance of lowly expressed genes. I wouldn't worry much about the mean shift. The VST is shifted such that vst(x) ~= log2(x) for large counts x.
Finally why did you prefer the rlog counts for all downstream analyses in RNA-seq workflow?
No reason in particular, I don't have much more to say on the differences beyond what is said in the vignette section on transformations.
Dear Love,
Thank you very very much for your response!!
Sincerely,
Panagiotis