Dear Bioconductor users,
I am working with TCGA RNA-seq data. I want to perform an unbiased selection of features associated with overall survival in hepatocellular carcinoma patients using elastic net regularized Cox regression, and then rank the selected features by the size of their coefficients. My colleagues and I are unsure which data to use and which normalization strategy is appropriate.

Specifically, I have downloaded the .rsem.genes.results files. As I understand it, "raw_count" is the estimated number of fragments derived from a given gene, and "scaled_estimate" is the estimated fraction of transcripts made up by a given gene. The "scaled_estimate" could perhaps be used as well, e.g. by multiplying it by 10^6 to get "transcripts per million" (TPM), which Li and Dewey state should be more comparable across samples.

So, which of the following normalized values should I use, and why, for the Cox glmnet analysis (in order to fulfill the statistical assumptions required for linear modeling): (1) the VST-normalized "raw_count" values (using the varianceStabilizingTransformation function of the DESeq2 package), (2) the voom-transformed "raw_count" values (TMM normalization followed by the voom transformation), or (3) log2(TPM + prior count), where TPM = "scaled_estimate" * 10^6?
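
To make the third option concrete, here is a minimal sketch of what I mean by log2(TPM + prior count). It is written in plain Python purely for illustration (my actual analysis is in R), the function name log2_tpm is hypothetical, and the prior count of 1 is only an assumed default, not a recommendation:

```python
import math

def log2_tpm(scaled_estimates, prior_count=1.0):
    """Convert RSEM scaled_estimate fractions to log2(TPM + prior_count).

    scaled_estimates: per-gene fractions for one sample (they sum to ~1).
    prior_count: small offset that avoids log2(0); 1.0 is an assumed default.
    """
    return [math.log2(s * 1e6 + prior_count) for s in scaled_estimates]

# Made-up scaled_estimate values for three genes in one sample
example = [0.0, 1e-6, 3e-6]
print(log2_tpm(example))
```

The choice of prior count mainly affects how strongly low-abundance genes are shrunk toward zero, which is part of what I am unsure about.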
I would appreciate hearing your opinion!!!
Thank you very very much in advance!!!