Question

ABOUT TRANSFORMATION OF RNA-SEQ DATA FOR GLMNET COX SURVIVAL ANALYSIS

panagiotis.mokos ▴ 40

@panagiotismokos-9709

Dear Bioconductor users,

I am working with RNA-seq data (raw counts) and I want to perform regularized Cox regression modelling using the glmnet package. First, I applied the VST transformation, which makes RNA-seq data homoscedastic. Next, do I have to set the glmnet argument standardize = TRUE to standardize the variables (so that all variables have unit variance) prior to fitting the model sequence, and then use the resulting unstandardized coefficients to rank the selected features (genes)? Or, in my case, is the default standardization not appropriate?

Thank you for your time in advance!!

Sincerely,

Panagiotis Mokos

glmnet deseq2 • 3.3k views

updated 8.9 years ago by Michael Love 43k • written 8.9 years ago by panagiotis.mokos ▴ 40

score 3 · Accepted Answer · 2017-02-28

Michael Love 43k

@mikelove

hi Panagiotis,

The glmnet software is optimized for predictors with unit variance, so I can see how you got to this dilemma.

Scaling (for each gene, across samples) and the VST are to some degree at odds. The VST shrinks technical variance so that biological differences are not overwhelmed, and in doing so it outperforms simple transformations such as log(x + 1). But if you then force all genes to have unit variance, you undo that effect and inflate the technical noise that was just shrunk.
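To make the point about scaling concrete, here is a toy base-R sketch. The two "genes" are simulated and purely hypothetical: one with real spread across samples, one that is essentially flat. After scaling, both end up with the same standard deviation, so the flat gene's residual noise is blown up to the same size as the informative gene's signal.

```r
set.seed(42)
n <- 50  # number of samples (arbitrary)

# Hypothetical values on a VST-like scale
signal_gene <- rnorm(n, mean = 10, sd = 2)    # large sample-to-sample spread
noise_gene  <- rnorm(n, mean = 3,  sd = 0.1)  # flat gene, small residual noise only

x <- cbind(signal_gene, noise_gene)
sd_before <- apply(x, 2, sd)   # very different spreads

# Per-gene centering and scaling to unit variance
x_scaled <- scale(x)
sd_after <- apply(x_scaled, 2, sd)  # both equal to 1

print(rbind(sd_before, sd_after))
```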

I'd suggest you use the VST, then apply a variance filter to the VST data to remove genes with minimal variance (take a look at the meanSdPlot to get a sense of which genes likely have no biological signal; see the vignette), then feed the remaining genes to glmnet with standardize=TRUE.
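A rough sketch of that workflow, under several assumptions: the DESeq2 and glmnet packages are installed, the counts and survival outcome are simulated stand-ins (`makeExampleDESeqDataSet`, random `time` and `status`), and the cutoff of 500 genes is arbitrary. With real data you would pick the cutoff after looking at the mean-SD plot.

```r
library(DESeq2)
library(glmnet)

set.seed(1)
# Simulated stand-in for real data: 2000 genes x 60 samples
dds <- makeExampleDESeqDataSet(n = 2000, m = 60)

# Simulated survival outcome (hypothetical; replace with real follow-up data)
time   <- rexp(60, rate = 0.1)
status <- rbinom(60, size = 1, prob = 0.7)

# 1. Variance stabilizing transformation
vsd <- vst(dds, blind = TRUE)
mat <- assay(vsd)                        # genes x samples

# Optional diagnostic (requires the vsn package):
# vsn::meanSdPlot(mat)

# 2. Variance filter: keep the most variable genes (cutoff is an assumption)
n_keep   <- 500
gene_var <- apply(mat, 1, var)
keep     <- order(gene_var, decreasing = TRUE)[seq_len(n_keep)]
x        <- t(mat[keep, ])               # glmnet expects samples x genes

# 3. Penalized Cox regression with glmnet's internal standardization
y     <- cbind(time = time, status = status)
cvfit <- cv.glmnet(x, y, family = "cox", standardize = TRUE)

# Non-zero coefficients at the cross-validated lambda, ordered by magnitude
b        <- as.matrix(coef(cvfit, s = "lambda.min"))[, 1]
selected <- b[b != 0]
selected[order(abs(selected), decreasing = TRUE)]
```

Because the outcome here is random, the selected set may well be empty; the sketch only shows the order of the steps.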

8.9 years ago • Michael Love 43k

Dear Love,

Thank you very much for your useful information!!

Please, could you explain more about this gene filtering (based on variance) or send me a link (the above-mentioned vignette)?

Also, in your opinion, is it better to first standardize the VST-filtered data (to unit variance) and then pass them to the glmnet algorithm with standardize = FALSE? In other words, do you believe that the final coefficient sizes (which will be used to rank the selected features) should reflect the differences in gene variances?

Thank you for your time !!!

Sincerely,

Panagiotis

8.9 years ago • panagiotis.mokos ▴ 40

The DESeq2 vignette is available by typing into R:

vignette("DESeq2")

You should definitely read this over, particularly the part about transformations. It's the detailed user guide for the software, which has grown over 7 years of DESeq1/2.

(All Bioconductor software is required to have a detailed software vignette.)

I don't really have any extra opinion on the downstream usage beyond my suggestion above. If this seems too confusing or difficult, you could just filter out low-count genes based on some heuristic you define and then use glmnet on log counts.
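A minimal sketch of that simpler route, assuming glmnet is installed. The count matrix and outcome are simulated, and the filter rule (at least 10 reads in at least 10 samples) and the log2(count + 1) transform are just example heuristics, not a recommendation. Library-size normalization is skipped here for brevity.

```r
library(glmnet)

set.seed(1)
n_genes   <- 1000
n_samples <- 40

# Simulated raw counts (hypothetical): half the genes low, half well expressed
mu     <- rep(c(1, 100), each = n_genes / 2)
counts <- matrix(rnbinom(n_genes * n_samples, mu = mu, size = 5),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)), NULL))

# Heuristic low-count filter (thresholds are assumptions)
keep <- rowSums(counts >= 10) >= 10

# Log counts with a pseudocount, transposed to samples x genes for glmnet
x <- t(log2(counts[keep, ] + 1))

# Simulated survival outcome (hypothetical)
y <- cbind(time   = rexp(n_samples, rate = 0.1),
           status = rbinom(n_samples, size = 1, prob = 0.7))

fit <- glmnet(x, y, family = "cox", standardize = TRUE)
```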

8.9 years ago • Michael Love 43k

Dear Love,

Thank you very much for your response!!

Panagiotis

8.9 years ago • panagiotis.mokos ▴ 40