Question

SVA package - ERROR: nvobj = sva(edata, mod, mod0, n.sv=n.sv)

1

Entering edit mode

aina.jene ▴ 10

@ainajene-12984

Last seen 8.5 years ago

ISGlobal - Barcelona

Hello,

I am currently working on an RNA-seq ExpressionSet from TCGA and I want to proceed with a linear model analysis using voom, but first I need to remove the batch effect. In order to do so, I am using sva package.

My data is:

In this expression set, the dimensions are:

pheno: 164 3
edata: 20531 164

In pheno we have all samples in rows and each column represents: Call, Status and Id

Call: the individual can present “Loss of Chromosome Y”, “Normal” or “XYY”.
Status: stands for the origin of each sample “tumor” or “normal”.
Id: the id of each sample

Knowing that:

Full model is:

mod = model.matrix(~ Call + Status, data = pheno) [both, Call and Status are class factor]

Null model is just like the example in the vignette:

mod0 = model.matrix(~ 1, data = pheno)

Then, when I try to find the number of surrogate variables (ns <- num.sv(counts, mod, method="leek"), it turns out that there are 160 variable. If I change the method to "be", then the number of surrogate variables decrease to 1.

Finally, when I try the next function:

svobj <- sva(counts, mod, mod0, n.sv=ns)

I ALWAYS get the following error message:

" Number of significant surrogate variables is: 1

Iteration (out of 5): Error in density.default(x, adjust = adj): 'x' contains missing values "

I have already checked the data and there are no missing values at all, and I have also tried to remove those rows with values equal to 0 in order to reduce the dataset but it did not work either.

Could anybody help me to try and find a solution?

Thank you all beforehand,

Aina

sva removebatcheffect R • 2.7k views

ADD COMMENT • link updated 8.7 years ago by sina.nassiri ▴ 130 • written 8.7 years ago by aina.jene ▴ 10

score 1 · Accepted Answer · 2017-05-09

Aina,

You mentioned that you want to proceed with a linear model analysis using "voom" and also referred to your input matrix as "counts", so I'm assuming you have count data in hand. If that's the case, you would want to use svaseq() instead of sva().
In regard to formulating the full model, I just want to add that you need to be careful that you don't leave any potentially important biological factor out of equation. For example, in your full model you're assuming that "Call" and "Status" have a linear additive effect on gene expression. I don't know enough about your data, but could there be interaction between "Call" and "Status"? Keep in mind that any effects that are not explicitly accounted for in the full model may be removed by sva/svaseq as unwanted effects.

mod1 = model.matrix(~ Call + Status, data = pheno)

or

mod2 = model.matrix(~ Call + Status + Call:Status, data = pheno)

In calculating number of surrogate variables, "be" is the recommended default method based on permutation. "leek" is an alternative asymptotic approach, which may perform better only when dealing with large samples.
Finally, regarding the error message, it's hard to tell without having access to your code and input file.
One last point since you mentioned your data is from TCGA, are you familiar with MBatch? MBatch provides visualization tools to assess batch effects in TCGA data. You can also obtain data already adjusted for batch effects using different methods including ComBat from the sva package. It might be worth exploring.

Hope these help!