Is it appropriate to call svaseq() multiple times and then only bind the last resulting surrogate variable to the design matrix in edgeR?
Our samples are really messy, having been collected from plants grown in different greenhouses, during different seasons, etc, which were inescapable constraints. I tried to account for the aforementioned potential sources of variation in prior design matrices, but they didn't result in differentially expressed genes called. I can't block because not every season, greenhouse, etc. is replicated in each condition. Furthermore, it's possible that these confounding variables that we know of are also interacting with confounding variables we don't know. I found that more differentially expressed genes are called with an unsupervised sva, and they seem "reasonable" based on the pathway that was disrupted. We are ordering more replicates to be sequenced, but until then, can I call svaseq() several times or will that result in bias and unusable results?
####design#### #these are biological replicates (different plants, different library preps), #so I think duplicateCorrelation() is inappropriate here, I could be wrong? wt mu2 mu1 wtA 1 0 0 wtB 1 0 0 wtC 1 0 0 wtD 1 0 0 mu1A 0 0 1 mu1B 0 0 1 mu1C 0 0 1 mu1D 0 0 1 mu2A 0 1 0 mu2B 0 1 0 mu2C 0 1 0 mu2D 0 1 0 #mod & mod0 have intercepted design: mod<-model.matrix(~Genotype); mod0<-model.matrix(~1) ####svaseq##### svobj = svaseq(as.matrix(dge),mod,mod0) modSv = cbind (mod, svobj$sv) mod0Sv = cbind (mod0, svobj$sv) svobj.1 = svaseq(as.matrix(dge), modSv, mod0Sv) modSv.1 = cbind(modSv, svobj.1$sv) mod0Sv.1 = cbind(mod0Sv, svobj.1$sv) svobj.2 = svaseq(as.matrix(dge1), modSv.1, mod0Sv.1) modSv.3 = cbind(modSv.1, svobj.2$sv) #for limma mod0Sv.3 = cbind(mod0Sv.1, svobj.2$sv) #is this legal? mod.sv = cbind(mod, svobj.2$sv) design.sv = cbind(design, svobjjj$sv)
Multiple calls of svaseq have resulted in the number of surrogate variables decreasing from 3 to 1 in this dataset.
Plugging design.sv into edgeR-QLF/mod.sv into limma-voomWithQualityWeights typically results in similar numbers of DE genes called. Plugging one of the earlier SVA calls results in no differentially expressed genes called. LRT and exactTest are more permissive and call more DE genes, but with the large differences in sample quality, I'm wary of relying on this. I'm going to also try RUVSeq and DESEq2 on these data but I think I'm going to have the same question, implementation-wise.
Thank you for your time!