Hi all,

Is it appropriate to call svaseq() multiple times and then only bind the last resulting surrogate variable to the design matrix in edgeR?

Our samples are really messy, having been collected from plants grown in different greenhouses, during different seasons, etc, which were inescapable constraints. I tried to account for the aforementioned potential sources of variation in prior design matrices, but they didn't result in differentially expressed genes called. I can't block because not every season, greenhouse, etc. is replicated in each condition. Furthermore, it's possible that these confounding variables that we know of are also interacting with confounding variables we don't know. I found that more differentially expressed genes are called with an unsupervised sva, and they seem "reasonable" based on the pathway that was disrupted. We are ordering more replicates to be sequenced, but until then, can I call svaseq() several times or will that result in bias and unusable results?

####design#### #these are biological replicates (different plants, different library preps), #so I think duplicateCorrelation() is inappropriate here, I could be wrong? wt mu2 mu1 wtA 1 0 0 wtB 1 0 0 wtC 1 0 0 wtD 1 0 0 mu1A 0 0 1 mu1B 0 0 1 mu1C 0 0 1 mu1D 0 0 1 mu2A 0 1 0 mu2B 0 1 0 mu2C 0 1 0 mu2D 0 1 0 #mod & mod0 have intercepted design: mod<-model.matrix(~Genotype); mod0<-model.matrix(~1) ####svaseq##### svobj = svaseq(as.matrix(dge),mod,mod0) modSv = cbind (mod, svobj$sv) mod0Sv = cbind (mod0, svobj$sv) svobj.1 = svaseq(as.matrix(dge), modSv, mod0Sv) modSv.1 = cbind(modSv, svobj.1$sv) mod0Sv.1 = cbind(mod0Sv, svobj.1$sv) svobj.2 = svaseq(as.matrix(dge1), modSv.1, mod0Sv.1) modSv.3 = cbind(modSv.1, svobj.2$sv) #for limma mod0Sv.3 = cbind(mod0Sv.1, svobj.2$sv) #is this legal? mod.sv = cbind(mod, svobj.2$sv) design.sv = cbind(design, svobjjj$sv)

Multiple calls of svaseq have resulted in the number of surrogate variables decreasing from 3 to 1 in this dataset.

Plugging design.sv into edgeR-QLF/mod.sv into limma-voomWithQualityWeights typically results in similar numbers of DE genes called. Plugging one of the earlier SVA calls results in no differentially expressed genes called. LRT and exactTest are more permissive and call more DE genes, but with the large differences in sample quality, I'm wary of relying on this. I'm going to also try RUVSeq and DESEq2 on these data but I think I'm going to have the same question, implementation-wise.

Thank you for your time!

Thanks for your response, Ryan! I was trying to iterate to convergence. I saw this post and got ideas: using duplicateCorrelation with limma+voom for RNA-seq data

I realize that

`svaseq()`

and`duplicateCorrelation()`

don't do the same thing and that the resulting pvalues might mean "less" with regard to the original distributions from a statistical standpoint. But if some of these genes were predictive in the context of an experimental model and validated by qPCR, I wonder if it might be appropriate to report such a dataset explicitly in those terms and in that context. However, if it's statistically untenable, I don't want to go there.P.S.

I want to run something like:

`v <- voomWithQualityWeights(counts, design=mod)`

svobj <- svaseq(as.matrix(v), mod, mod0)

modSv<- cbind(mod, svobj$sv)

mod0Sv <-cbind (mod, svobj$sv)

`v1 <- voomWithQualityWeights(counts, modSv) #or just voom, at this point, since quality weights are already incorporated?`

svobj.2 <- svaseq(as.matrix(v1), modSv, mod0Sv)

and then fit and do DE testing with

`limma`

. However,`svaseq`

throws out a dimension error with`as.matrix(v1)`

, an EList object, but not with`as.matrix(dge)`

, a DGEList object. In desperation, I tried to extract`sampleWeights`

from the`Elist`

object`as.vector()`

and input them into the`DGEList`

object, but it threw out a dimension error. This is probably its own question, but it's why I didn't take the approach above.I don't believe that sva is an algorithm that can "converge". You can keep adding additional surrogate variables until they eat up all the residual variation and your model has as many coefficients as you have samples, at which point you can no longer fit the model at all. Like I said, you can let sva guess at the appropriate number of surrogate variables, or you can pre-specify it.

You are getting an error when you run svaseq because you are using svaseq incorrectly. The first step in svaseq is to take a (moderated) log transform, but the data in v and v1 have already been log-transformed by voom. So you should just be using sva instead of svaseq.

As for your question about whether to use voomWithQualityWeights the second time, yes, you do need to do this, because neither counts nor modSv contains the weights. If anything, you can use voom the first time, since the weights are not used by sva anyway, so spending extra time calculating them is pointless.