Question

Normalization and transformation of RNA-seq raw data for regularized linear regression models

panagiotis.mokos ▴ 40

@panagiotismokos-9709

Dear Bioconductor users,

I am working with RNA-seq data and want to fit regularized (elastic net) linear regression models to it. I have transformed the data with voom and plan to use the log-CPM values in the E component of the resulting EList object. However, the inverse-variance (precision) weights returned by voom form a numeric matrix, whereas the weights argument of glmnet accepts only a vector with one entry per observation. Can I use the voom precision weights as observation-level weights in glmnet, and if so, how?

Thank you very much in advance for your time!

Sincerely,

Panagiotis Mokos

rnaseq glmnet limma voom deseq2 • 2.4k views

score 0 · Answer 1 · 2017-03-11

If you want to model each gene separately, you'll probably need to supply each row of v$E as the y argument in glmnet, along with the corresponding row of v$weights as weights (assuming v is the output of voom). I'm not completely sure that the precision weights from voom are applicable as the "observation weights" mentioned in ?glmnet; it seems that there's weighted least squares in there somewhere, so it's probably fine. Anyway, this is the standard analysis approach where each gene is considered on its own merits (excepting situations involving empirical Bayes, or when learning latent variables). Of course, summarizing the output across thousands of genes is its own challenge; you should know what statistics are of interest a priori (e.g., percentage of genes where a particular variable is non-zero, the ideal lambda from cross-validation).
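The per-gene approach described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a tested pipeline: E and W are simulated stand-ins for v$E and v$weights, X is a hypothetical covariate matrix, and alpha = 0.5 and the use of cv.glmnet to pick lambda are arbitrary choices made for the example.

```r
library(glmnet)

set.seed(1)
n_samples <- 40
n_genes <- 5
n_covariates <- 10

# Simulated stand-ins for voom output (genes in rows, samples in columns).
# In a real analysis these would be v$E and v$weights.
E <- matrix(rnorm(n_genes * n_samples, mean = 5), nrow = n_genes)
W <- matrix(runif(n_genes * n_samples, min = 0.5, max = 2), nrow = n_genes)

# Hypothetical covariate matrix (samples in rows, covariates in columns).
X <- matrix(rnorm(n_samples * n_covariates), nrow = n_samples)

# One cross-validated elastic net per gene: row g of E is the response and
# row g of W supplies the observation weights for that gene.
fits <- lapply(seq_len(n_genes), function(g) {
  cv.glmnet(x = X, y = E[g, ], weights = W[g, ],
            family = "gaussian", alpha = 0.5)
})

# Summaries across genes: the cross-validated lambda for each gene, and the
# proportion of genes in which each covariate has a non-zero coefficient.
best_lambda <- vapply(fits, function(f) f$lambda.min, numeric(1))
coefs <- sapply(fits, function(f) as.matrix(coef(f, s = "lambda.min"))[-1, 1])
prop_nonzero <- rowMeans(coefs != 0)
```

Here coefs has one row per covariate and one column per gene (the intercept is dropped), so prop_nonzero corresponds to the "percentage of genes where a particular variable is non-zero" summary mentioned above. Whether the voom precision weights are truly appropriate as glmnet observation weights remains the open caveat raised in the answer.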

The other option is to model all genes together, presumably using family="mgaussian" where each gene is a separate response. In this setting, it doesn't seem possible to supply a weight matrix to weights, so if you want to do this you'll just have to make do without weighting. Note that coercing v$E and v$weights to vectors and stuffing them into glmnet will almost definitely not do what you want - you'd either be assuming that each variable affects all genes in the same manner (which is very unlikely) or you'd have to include gene-specific variables in the design matrix (in which case you might as well just fit each gene separately).
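For comparison, the joint fit might look something like the sketch below. Again, this rests on assumptions: Y is a simulated stand-in for t(v$E) (samples in rows, one column per gene), X is a hypothetical covariate matrix, and alpha = 0.5 is arbitrary. No weights are passed, since the weights argument takes one value per sample rather than a gene-by-sample matrix.

```r
library(glmnet)

set.seed(2)
n_samples <- 40
n_genes <- 5
n_covariates <- 10

# Hypothetical covariate matrix (samples in rows, covariates in columns).
X <- matrix(rnorm(n_samples * n_covariates), nrow = n_samples)

# Simulated stand-in for t(v$E): samples in rows, one column per gene.
# The first covariate affects every gene, with a gene-specific effect size.
effects <- seq(0.5, 1.5, length.out = n_genes)
Y <- outer(X[, 1], effects) +
  matrix(rnorm(n_samples * n_genes), nrow = n_samples)
colnames(Y) <- paste0("gene", seq_len(n_genes))

# Joint multi-response fit; voom precision weights are not used here.
cv_fit <- cv.glmnet(x = X, y = Y, family = "mgaussian", alpha = 0.5)

# coef() returns a list with one sparse coefficient vector per response.
coef_list <- coef(cv_fit, s = "lambda.min")

# Covariates in rows, genes in columns, intercept dropped.
coef_mat <- sapply(coef_list, function(b) as.matrix(b)[-1, 1])
```

As far as the glmnet documentation describes it, the "mgaussian" family applies a grouped penalty to each variable's coefficients across responses, so a covariate tends to be kept or dropped for all genes at once while its effect size can still differ between genes.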

Of course, the bigger question is why you want to do this in the first place. Are you trying to use elastic nets to learn which variables should be put into a model for a DE analysis? This is generally considered to be a bad idea; automatic model selection methods will happily learn models that provide great predictive power but whose variables cannot be easily interpreted in the context of the questions of interest.

P.S. Split the limma and voom tags in your post, otherwise maintainers won't be notified.

P.P.S. Respond to answers using the "add comment" button; don't add a new answer.