Question

Calling svaseq() multiple times

0

Entering edit mode

cats_dogs ▴ 20

@cats_dogs-15904

Last seen 6.7 years ago

Hi all,

Is it appropriate to call svaseq() multiple times and then only bind the last resulting surrogate variable to the design matrix in edgeR?

Our samples are really messy, having been collected from plants grown in different greenhouses, during different seasons, etc, which were inescapable constraints. I tried to account for the aforementioned potential sources of variation in prior design matrices, but they didn't result in differentially expressed genes called. I can't block because not every season, greenhouse, etc. is replicated in each condition. Furthermore, it's possible that these confounding variables that we know of are also interacting with confounding variables we don't know. I found that more differentially expressed genes are called with an unsupervised sva, and they seem "reasonable" based on the pathway that was disrupted. We are ordering more replicates to be sequenced, but until then, can I call svaseq() several times or will that result in bias and unusable results?

####design####
#these are biological replicates (different plants, different library preps), 
#so I think duplicateCorrelation() is inappropriate here, I could be wrong?

     wt mu2 mu1
wtA   1   0   0
wtB   1   0   0
wtC   1   0   0
wtD   1   0   0
mu1A  0   0   1
mu1B  0   0   1
mu1C  0   0   1
mu1D  0   0   1
mu2A  0   1   0
mu2B  0   1   0
mu2C  0   1   0
mu2D  0   1   0

#mod & mod0 have intercepted design: mod<-model.matrix(~Genotype); mod0<-model.matrix(~1)


####svaseq#####

svobj = svaseq(as.matrix(dge),mod,mod0)
modSv = cbind (mod, svobj$sv)
mod0Sv = cbind (mod0, svobj$sv)

svobj.1 = svaseq(as.matrix(dge), modSv, mod0Sv)
modSv.1 = cbind(modSv, svobj.1$sv)
mod0Sv.1 = cbind(mod0Sv, svobj.1$sv)

svobj.2 = svaseq(as.matrix(dge1), modSv.1, mod0Sv.1)
modSv.3 = cbind(modSv.1, svobj.2$sv) #for limma
mod0Sv.3 = cbind(mod0Sv.1, svobj.2$sv)

#is this legal?
mod.sv = cbind(mod, svobj.2$sv)
design.sv = cbind(design, svobjjj$sv)

Multiple calls of svaseq have resulted in the number of surrogate variables decreasing from 3 to 1 in this dataset.

Plugging design.sv into edgeR-QLF/mod.sv into limma-voomWithQualityWeights typically results in similar numbers of DE genes called. Plugging one of the earlier SVA calls results in no differentially expressed genes called. LRT and exactTest are more permissive and call more DE genes, but with the large differences in sample quality, I'm wary of relying on this. I'm going to also try RUVSeq and DESEq2 on these data but I think I'm going to have the same question, implementation-wise.

Thank you for your time!

edger sva svaseq limma • 2.2k views

ADD COMMENT • link updated 7.6 years ago by Ryan C. Thompson ★ 7.9k • written 7.6 years ago by cats_dogs ▴ 20

score 2 · Answer 1 · 2018-05-23

2

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 14 months ago

Icahn School of Medicine at Mount Sinai…

What is your goal in running sva multiple times? Are you just trying to estimate additional surrogate variables? SVA tries to estimate the appropriate number of surrogate variables, but If you want more surrogate variables, you can simply override its decision by setting n.sv. I'm not aware of any reason to run sva multiple times in this way.

As to the question of whether it is appropriate to keep estimating more surrogate variables until your results become significant, I'm fairly sure this will result in confirmation bias.

ADD COMMENT • link 7.6 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Thanks for your response, Ryan! I was trying to iterate to convergence. I saw this post and got ideas: using duplicateCorrelation with limma+voom for RNA-seq data

I realize that svaseq() and duplicateCorrelation() don't do the same thing and that the resulting pvalues might mean "less" with regard to the original distributions from a statistical standpoint. But if some of these genes were predictive in the context of an experimental model and validated by qPCR, I wonder if it might be appropriate to report such a dataset explicitly in those terms and in that context. However, if it's statistically untenable, I don't want to go there.

P.S.
I want to run something like:

v <- voomWithQualityWeights(counts, design=mod) svobj <- svaseq(as.matrix(v), mod, mod0) modSv<- cbind(mod, svobj$sv) mod0Sv <-cbind (mod, svobj$sv)

v1 <- voomWithQualityWeights(counts, modSv) #or just voom, at this point, since quality weights are already incorporated? svobj.2 <- svaseq(as.matrix(v1), modSv, mod0Sv)

and then fit and do DE testing with limma. However, svaseq throws out a dimension error with as.matrix(v1), an EList object, but not with as.matrix(dge), a DGEList object. In desperation, I tried to extract sampleWeights from the Elist object as.vector() and input them into the DGEList object, but it threw out a dimension error. This is probably its own question, but it's why I didn't take the approach above.

ADD REPLY • link 7.6 years ago cats_dogs ▴ 20

1

Entering edit mode

I don't believe that sva is an algorithm that can "converge". You can keep adding additional surrogate variables until they eat up all the residual variation and your model has as many coefficients as you have samples, at which point you can no longer fit the model at all. Like I said, you can let sva guess at the appropriate number of surrogate variables, or you can pre-specify it.

You are getting an error when you run svaseq because you are using svaseq incorrectly. The first step in svaseq is to take a (moderated) log transform, but the data in v and v1 have already been log-transformed by voom. So you should just be using sva instead of svaseq.

As for your question about whether to use voomWithQualityWeights the second time, yes, you do need to do this, because neither counts nor modSv contains the weights. If anything, you can use voom the first time, since the weights are not used by sva anyway, so spending extra time calculating them is pointless.