sva :: sva can variable of interest be a matrix itself?
1
0
Entering edit mode
@reyhansonmez-11940
Last seen 7.3 years ago

Hi,

 

 

 

I want to associate gene expression of some people with their metabolite levels. In my case metabolite levels are measured as a function of ~400 features. My question is if I can use sva, if the variable of interest is matrix instead of a vector? In all the exaples I saw variable of interest was always a vector like cancer status, etc..    Thanks!

 

 

 

 

sva variable of ineterest matrix • 1.2k views
ADD COMMENT
1
Entering edit mode
Jeff Leek ▴ 650
@jeff-leek-5015
Last seen 3.1 years ago
United States

This is a great question! SVA isn't specifically designed for this scenario but here is a hack that you could try: 

 

(1) Perform the SVD on the metabolite data, take the top 2-3 singular vectors (the ones that explain 80%+ of the variation)

(2) Treat these as the "primary variables" in the sva analysis of the expression data. 

(3) Get the resulting surrogate variables and include them in the models relating metabolites to expression levels. 

 

There is no theory to guarantee this would work (although probably a good methods paper in there if someone wants to write it), but its pretty clear it would capture most of the variation in the metabolite data and protect you from removing the signal while removing the batch effects. 

 

The only tricky part would be if the metabolites and expression were processed in batches together. Then you would be protecting batch effects from the metabolites in the expression data. 

 

Hope that helps. 

 

Jeff

ADD COMMENT
0
Entering edit mode

 

 

Thanks for the answer Jeff, 

The analysis you suggested makes sense. However I am not sure if I can apply it for the following reason:

As first step you are suggesting to do SVD on metabolite matrix and take first couple of SV that explain %80 of the variance. In my metabolite data it takes first 22 SV to reach the level where %80 of the variance is explained. I could use all these 22 SV as primary variables but I am afraid of over parameterising the model. The metabolite data I am using is standardised across samples and then the features if it makes any difference.. Below are the  cumulative variance explained by first 30 SVs.

 [1]  33.11876  41.41801  49.04494  54.30329  58.29662  61.36211  64.06303  65.89266  67.58878  69.18557
 [11]  70.56121  71.80918  72.98200  73.93578  74.81452  75.68446  76.50778  77.31598  78.08510  78.77824
 [21]  79.45655  80.07261  80.63256  81.17656  81.70586  82.19804  82.67386  83.14807  83.57942  83.99930

If I use 22 SVs as primary variables in my full model, number of unknown effect are estimated to be

412 for expression data

0 for log2(expression data)

2 for scaled(log2(expression data)) *scaling makes each gene to have 0 mean, unit variance

I don't know why the number of surrogate variable estimates are so different..  If I use svaseq with rpkm data I will get 412 surrogate variable estimates which is too high to use as covariables in downstream statistical analysis. If I use sva with scaled and log2 transformed rpkm data I will get 2 covariables to use which is much more feasible for downstream analysis  On the other hand I would like to justify why I choose the latter but I can't.

So overall;

1)Do you think using 22 SV is too much?

2)Would you use sva with scaled and transformed rpkm data? if so why?

Thanks a lot,

Reyhan

 

 

ADD REPLY

Login before adding your answer.

Traffic: 653 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6