Question

Problem about use of TCGA rna-seq rsem gene results for regularized cox regression modeling

1

Entering edit mode

panagiotis.mokos ▴ 40

@panagiotismokos-9709

Last seen 7.5 years ago

Dear Bioconductor users,

I am working with TCGA RNA-seq data. I want to perform unbiased selection of features associated with overall survival of hepatocellular carcinoma patients using elastic net regularized cox regression modeling and then rank them by heir coefficient sizes. Me and my colleagues are so confused with the data should be used and the appropriate normalization strategy. Specifically, I have downloaded the .rsem.genes.results. I've understood that the "raw_count" is the estimated number of fragments derived from a given gene and the "scaled_estimate" is the fraction of transcripts made up by a given gene. The "scaled_estimate" could maybe be used as well, e.g. by multiplying it with 1M to get "transcripts per million" (TPM) which Li and Dewey state should be more comparable across samples. So, which of the following normalized values should I use (and why to use) for cox glmnet analysis (in order to fulfill the statistical assumptions required for linear modeling) : the vst normalized "raw_counts" (using varianceStabilizingTransformation function of DEseq2 package), the voom transformed "raw_counts" (using both TMM normalization and voom tranformation) or the log2(TPM+prior count) (TPM= "scaled_estimate"*10^6)?

I would appreciate hearing your opinion!!!

Thank you very very much in advance!!!

Sincerely,

Panagiotis Mokos

tcga rnaseq glmnet voom deseq2 • 4.3k views

ADD COMMENT • link 7.8 years ago panagiotis.mokos ▴ 40

score 0 · Answer 1 · 2017-03-13

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 1 day ago

United States

hi Panagiotis,

I guess "which to use" is in the end up to you as the analyst. The VST is described in the DESeq and DESeq2 paper. The point is to stabilize the variance of the transformed counts across the dynamic range. If you have specific questions about methods I can answer these.

ADD COMMENT • link 7.8 years ago Michael Love 43k

0

Entering edit mode

Dear Love,

Thank you very much for your response!

Just a little correction.The phrase in the brackets is not "which to use" but "why to use".

In varianceStabilizingTransformation R documentation is refered that the transformations performed from this function are useful when checking for outliers or as input for machine learning techniques. These function apply only between-sample normalization and further transformation (log-like). Do you beleive that the lack of within-sample normalization (correction of transcript length bias) bias downstream analyses (such as clustering, multiple regression, regularization, etc.). In other words do you beleive that the log (TPM+prior count) is more appropriate metric than vst normalized values to use for downsteam analyses?

Thank you very very much for your time in advance!!

Sincerely,

Panagiotis Mokos

ADD REPLY • link 7.8 years ago panagiotis.mokos ▴ 40

1

Entering edit mode

If you use the VST or rlog in DESeq2, and you have used e.g. Salmon, Sailfish or kallisto to quantify the abundances, the normalized, transformed values also take into account within-sample biases such as gene length changes, etc.

ADD REPLY • link 7.8 years ago Michael Love 43k

1

Entering edit mode

If you want a kallisto-processed version of the TCGA RNA-Seq data, you might consider my paper here: http://www.nature.com/articles/srep39259

ADD REPLY • link 7.8 years ago Stephen Piccolo ▴ 600

0

Entering edit mode

Dear Love,

Thank you very very much for your quick and informative reply!!

Panagiotis

ADD REPLY • link 7.8 years ago panagiotis.mokos ▴ 40

0

Entering edit mode

Dear Love,

I have read the paper here "Charlotte Soneson, Michael I. Love, Mark D. Robinson (2015): Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research" and the tximport vignette. In this pipeline you note that there are two suggested ways of importing estimates for use with gene-level differential expression methods. The first method, recommended for edgeR and DESeq2, is to use the estimated counts from the quantification tools, and additionally to use the transcript-level abundance estimates to calculate an offset that corrects for changes to the average transcript length across samples. The second one (which you recommend for limma-voom as this software does not use the offset matrix stored in y$offset) is to use the tximport argumentcountsFromAbundance="lengthScaledTPM" (scaled using the average transcript length, averaged over samples and to library size) or "scaledTPM" (scaled to library size), and then to use the count matrix txi$counts directly as a regular count matrix.

In .rsem.genes.results.txt file of TCGA RNA-seq data the variable "scaled_estimate" is the fraction of transcripts made up by a given gene. Also the "scaled_estimate" could maybe be used as well, e.g. by multiplying it with 1M to get "transcripts per million" (TPM) as I mentioned above. Could I perform varianceStabilizingTransformation using DEseq2 package to these counts or could i use the v$E (v= result of voom) after voom function as you recommend to use either lengthScaledTPM or scaledTPM counts?

I would appreciate hearing your opinion!!!

Thank you very very much for your time in advance!!!

Sincerely,

Panagiotis Mokos

ADD REPLY • link 7.8 years ago panagiotis.mokos ▴ 40

0

Entering edit mode

hi Panagiotis,

You can try different things and compare using meanSdPlot like we do in the DESeq2 vignette.

If you want to use the VST it really only makes sense to apply it to something like counts. You can obtain something on the scale of counts by scaling up TPMs to the typical library size, that is, make the column sums of the matrix close to the average number of fragments in the samples.

ADD REPLY • link 7.8 years ago Michael Love 43k

0

Entering edit mode

Dear Love,

I have just read in detail your RNA-seq workflow from this link http://www.bioconductor.org/help/workflows/rnaseqGene/#the-deseqdataset-sample-information-and-the-design-formula. At some point you refer "While the rlog is on roughly the same scale as the log2 counts, the VST has a upward shift for the smaller values. It is the differences between samples (deviation from y=x in these scatterplots) which will contribute to the distance calculations and the PCA plot." I can't understand why the lower values have an upward shift and I am wondering that these lower, shifted and normalized values bias downstream analyses. Do you beleive that the rlog works better than vst in this example with scatterplots?

I would appreciate hearing your opinion!!!

Thank you very very much for your time in advance!!!

Sincerely,

Panagiotis Mokos

ADD REPLY • link 7.8 years ago panagiotis.mokos ▴ 40

0

Entering edit mode

hi Panagiotis,

I wouldn't be worried about the upward shift. The important thing for downstream analysis is the difference between samples. Both rlog and VST shrink differences between samples for genes that have small counts. Can you say more about what situation (e.g. you mention bias) that you are worried about?

ADD REPLY • link 7.8 years ago Michael Love 43k

0

Entering edit mode

Dear Love,

Thank you very much for your response. Could you explain more the differences between the above-mentioned scatterplots (ie between the rlog and VST method)? Also what happens about the variance of lowly expressed genes between rlog and VST transformation (given that the mean values of lowly expressed genes are increased using rlog transformation)? Finally why did you prefer the rlog counts for all downstream analyses in RNA-seq workflow?

Thank you very very much for your time in advance!!

Sincerely,

Panagiotis Mokos

ADD REPLY • link 7.8 years ago panagiotis.mokos ▴ 40

0

Entering edit mode

Could you explain more the differences between the above-mentioned scatterplots (ie between the rlog and VST method)?

I don't know if I have more to add than it says in the workflow.

Also what happens about the variance of lowly expressed genes between rlog and VST transformation (given that the mean values of lowly expressed genes are increased using rlog transformation)?

Both VST and rlog shrink the variance of lowly expressed genes. I wouldn't worry much about the mean shift. The VST is shifted such that vst(x) ~= log2(x) for large counts x.

Finally why did you prefer the rlog counts for all downstream analyses in RNA-seq workflow?

No reason in particular, I don't have much more to say on the differences beyond what is said in the vignette section on transformations.

ADD REPLY • link 7.8 years ago Michael Love 43k

0

Entering edit mode

Dear Love,

Thank you very very much for your response!!

Sincerely,

Panagiotis

ADD REPLY • link 7.8 years ago panagiotis.mokos ▴ 40