Question: Applying sva is compatible with limma duplicateCorrelation?
0
24 months ago by
aec40
aec40 wrote:

Dear all,

I have a complex bulk RNA-seq dataset (from a human tissue) with multiple covariates (sex, city, age), different cell type composition, repeated measures (same patient sequenced 4 times --> 4 different locations of the tissue). I am applying sva to correct for different cell type compositions and other unwanted variation due to other factors. limma-voom for normalization and differential expression across groups.This is the code:

y=DGEList(counts=counts)
A<-rowSums(y$counts) isexpr<-A>500 y=y[isexpr,keep.lib.size=FALSE] dim(y) y=calcNormFactors(y) group=factor(info_new$group)
sex=factor(info_new$sex) city=factor(info_new$city)
age=cut(info_new$age,4) levels(age)=c(age1,age2,age3,age4) mod <- model.matrix(~sex+age+city+group) mod0 <- model.matrix(~sex+age+city) v=voom(y,mod) n.sv = num.sv(v$E,mod,method="leek")
sva_obj <- sva(v$E, mod, mod0, n.sv=n.sv) mod1 <- model.matrix(~sex+age+city+group+sva_obj$sv)
v <- voom(counts=y, design = mod1)
fit <- lmFit(v, mod1)
fit2<- eBayes(fit)
summary(decideTests(fit2))

My question is, does sva remove the subject effect as well and it is not necessary to apply duplicateCorrelation?

Here a simple version of the sample information (I have 92 RNA-seq samples, 63 patients, 7 cities, 28 groups, control group only from loc4):

 patient city sex age group 1 city1 M 52 disease1_tissue_loc1 1 city1 M 52 disease1_tissue_loc2 1 city1 M 52 disease1_tissue_loc3 1 city1 M 52 disease1_tissue_loc4 2 city1 F 28 disease2_tissue_loc1 2 city1 F 28 disease2_tissue_loc2 2 city1 F 28 disease2_tissue_loc3 2 city1 F 28 disease2_tissue_loc4 3 city1 M 38 disease2_tissue_loc1 3 city1 M 38 disease2_tissue_loc2 3 city1 M 38 disease2_tissue_loc3 3 city1 M 38 disease2_tissue_loc4 4 city1 F 61 disease1_tissue_loc1 4 city2 F 61 disease1_tissue_loc2 4 city2 F 61 disease1_tissue_loc3 4 city2 F 61 disease1_tissue_loc4 5 city2 F 29 control_tissue_loc4 6 city2 F 53 control_tissue_loc4 7 city2 M 43 control_tissue_loc4
modified 24 months ago by sina.nassiri70 • written 24 months ago by aec40
Answer: Applying sva is compatible with limma duplicateCorrelation?
0
24 months ago by
sina.nassiri70
Switzerland/Lausanne
sina.nassiri70 wrote:

In principle, if you don't include 'patient' in the full model (mod) it will be captured by SVA (assuming that it has a significant contribution to the overall variation observed). However, there are a few points to consider:

1- You have already included sex, age, and city in the null model during estimation of surrogate variables, therefore including them in the design formula alongside surrogate variables is redundant. What you want is:

mod1 <- model.matrix(~group+sva_obj$sv) 2- The goal of SVA is to estimate unknown batch that you have no knowledge of, but have reasons to believe they exist, for example after exploratory analysis using PCA or hierarchical clustering. If your only goal is to adjust for known factors in differential expression analysis, you can directly include them in your design formula as covariates. 3- The magic of voom lies in precision weights that are used to model the mean-variance trend, which you are not taking any advantage of in your estimation of surrogate variables. By using v$E you're only taking log-transformed counts. Whether a simple log-transformation is sufficient to adjust for the mean-variance trend is another point that you can check. If the answer is no, you may want to give sva-seq a shot, which directly takes RNAseq counts as input.

Hope these help.

1) so then what is better:

mod <- model.matrix(~sex+age+city+group)
mod0 <- model.matrix(~sex+age+city)
​mod1 <- model.matrix(~group+sva_obj$sv) or mod <- model.matrix(~group) mod0 <- model.matrix(~1) mod1 <- model.matrix(~group+sva_obj$sv)

2) I think it's compatible mixing known batch and unknown batch effects using SVA. The different cell type composition would be unknown unwanted variation for example.

3) It's a pity SVA can not deal with weights. I thought that sva-seq is similar to v$E. ADD REPLYlink modified 24 months ago • written 24 months ago by aec40 1) I'd go with the former, because in the latter you're throwing out potentially valuable information about study design. 2) Sure; I missed cell type composition. 3) You are right about svaseq being similar to using sva with v$E; I just double checked the source code. My bad!

ADD REPLYlink modified 24 months ago • written 24 months ago by sina.nassiri70

Do you think, then, this option is invalid?

y=DGEList(counts=counts)
A<-rowSums(y$counts) isexpr<-A>500 y=y[isexpr,keep.lib.size=FALSE] y=calcNormFactors(y) group=factor(info_new$group)
patient=factor(info_new$patient) mod <- model.matrix(~group) mod0 <- model.matrix(~1) v=voom(y,mod) n.sv = num.sv(v$E,mod,method="leek")
sva_obj <- sva(v$E, mod, mod0, n.sv=n.sv) mod1 <- model.matrix(~group+sva_obj$sv)
v <- voom(counts=y, design = mod1)
cor=duplicateCorrelation(v,mod1,block=patient)
fit=lmFit(v,mod1,block=patient, correlation=cor$consensus) fit2<- eBayes(fit) summary(decideTests(fit2)) ADD REPLYlink written 23 months ago by aec40 Hi sina, I'm not sure your first point is correct. The SVs represent the component of gene expression that is neither attributable to phenotype of interest or to the known batch / adjustment variables. So for the design formula, I would think that you would need to use the full model + the SVs : mod1 = cbind(mod,svobj$sv)

I completely agree with you on using full model + SVs, but in this case neither one of "sex","age", and "city" is a variable of interest. Am I missing something here?

Those are known adjustment variables which are being specified in the SVA model alongside the variable of interest:

mod <- model.matrix(~sex+age+city+group)

SVA will then construct surrogate variables to account for unknown sources of variation, i,e. those not accounted for in the model. For subsequent DE analysis, the model should then include the phenotype of interest + the known adjustment variables + the derived SVs (representing unknown batch effects):

mod1 <- model.matrix(~sex+age+city+group+sva_obj\$sv)

In your original answer you suggest leaving out sex+age+city on the basis that they are already modelled by the SVs, but that shouldn't be the case if SVA is being run with the full model as given above since they are left out of the SV estimation.

ADD REPLYlink modified 23 months ago • written 23 months ago by asaffa010