Question: Applying sva is compatible with limma duplicateCorrelation?
0
gravatar for aec
2.3 years ago by
aec50
aec50 wrote:

Dear all,

I have a complex bulk RNA-seq dataset (from a human tissue) with multiple covariates (sex, city, age), different cell type composition, repeated measures (same patient sequenced 4 times --> 4 different locations of the tissue). I am applying sva to correct for different cell type compositions and other unwanted variation due to other factors. limma-voom for normalization and differential expression across groups.This is the code:

y=DGEList(counts=counts)
A<-rowSums(y$counts)
isexpr<-A>500
y=y[isexpr,keep.lib.size=FALSE]
dim(y)
y=calcNormFactors(y)
group=factor(info_new$group)
sex=factor(info_new$sex)
city=factor(info_new$city)
age=cut(info_new$age,4)
levels(age)=c(age1,age2,age3,age4)
mod <- model.matrix(~sex+age+city+group)
mod0 <- model.matrix(~sex+age+city)
v=voom(y,mod)
n.sv = num.sv(v$E,mod,method="leek")
sva_obj <- sva(v$E, mod, mod0, n.sv=n.sv)
mod1 <- model.matrix(~sex+age+city+group+sva_obj$sv)
v <- voom(counts=y, design = mod1)
fit <- lmFit(v, mod1) 
fit2<- eBayes(fit)
summary(decideTests(fit2))

My question is, does sva remove the subject effect as well and it is not necessary to apply duplicateCorrelation?

Here a simple version of the sample information (I have 92 RNA-seq samples, 63 patients, 7 cities, 28 groups, control group only from loc4):

patient
city
sex
age
group
1
city1
M
52
disease1_tissue_loc1
1
city1
M
52
disease1_tissue_loc2
1
city1
M
52
disease1_tissue_loc3
1
city1
M
52
disease1_tissue_loc4
2
city1
F
28
disease2_tissue_loc1
2
city1
F
28
disease2_tissue_loc2
2
city1
F
28
disease2_tissue_loc3
2
city1
F
28
disease2_tissue_loc4
3
city1
M
38
disease2_tissue_loc1
3
city1
M
38
disease2_tissue_loc2
3
city1
M
38
disease2_tissue_loc3
3
city1
M
38
disease2_tissue_loc4
4
city1
F
61
disease1_tissue_loc1
4
city2
F
61
disease1_tissue_loc2
4
city2
F
61
disease1_tissue_loc3
4
city2
F
61
disease1_tissue_loc4
5
city2
F
29
control_tissue_loc4
6
city2
F
53
control_tissue_loc4
7
city2
M
43
control_tissue_loc4
ADD COMMENTlink modified 2.3 years ago by sina.nassiri90 • written 2.3 years ago by aec50
Answer: Applying sva is compatible with limma duplicateCorrelation?
0
gravatar for sina.nassiri
2.3 years ago by
sina.nassiri90
Switzerland/Lausanne
sina.nassiri90 wrote:

In principle, if you don't include 'patient' in the full model (mod) it will be captured by SVA (assuming that it has a significant contribution to the overall variation observed). However, there are a few points to consider:

1- You have already included sex, age, and city in the null model during estimation of surrogate variables, therefore including them in the design formula alongside surrogate variables is redundant. What you want is:

mod1 <- model.matrix(~group+sva_obj$sv)

2- The goal of SVA is to estimate unknown batch that you have no knowledge of, but have reasons to believe they exist, for example after exploratory analysis using PCA or hierarchical clustering. If your only goal is to adjust for known factors in differential expression analysis, you can directly include them in your design formula as covariates.

3- The magic of voom lies in precision weights that are used to model the mean-variance trend, which you are not taking any advantage of in your estimation of surrogate variables. By using v$E you're only taking log-transformed counts. Whether a simple log-transformation is sufficient to adjust for the mean-variance trend is another point that you can check. If the answer is no, you may want to give sva-seq a shot, which directly takes RNAseq counts as input.

Hope these help.

ADD COMMENTlink written 2.3 years ago by sina.nassiri90

Thanks for your comments sina.nassari. 

1) so then what is better:

mod <- model.matrix(~sex+age+city+group)
mod0 <- model.matrix(~sex+age+city)
​mod1 <- model.matrix(~group+sva_obj$sv)

or 

mod <- model.matrix(~group)
mod0 <- model.matrix(~1)
mod1 <- model.matrix(~group+sva_obj$sv)

2) I think it's compatible mixing known batch and unknown batch effects using SVA. The different cell type composition would be unknown unwanted variation for example.

3) It's a pity SVA can not deal with weights. I thought that sva-seq is similar to v$E.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by aec50

1) I'd go with the former, because in the latter you're throwing out potentially valuable information about study design.

2) Sure; I missed cell type composition.

3) You are right about svaseq being similar to using sva with v$E; I just double checked the source code. My bad!

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by sina.nassiri90

Do you think, then, this option is invalid?

y=DGEList(counts=counts)
A<-rowSums(y$counts)
isexpr<-A>500
y=y[isexpr,keep.lib.size=FALSE]
y=calcNormFactors(y)
group=factor(info_new$group)
patient=factor(info_new$patient)
mod <- model.matrix(~group)
mod0 <- model.matrix(~1)
v=voom(y,mod)
n.sv = num.sv(v$E,mod,method="leek")
sva_obj <- sva(v$E, mod, mod0, n.sv=n.sv)
mod1 <- model.matrix(~group+sva_obj$sv)
v <- voom(counts=y, design = mod1)
cor=duplicateCorrelation(v,mod1,block=patient)
fit=lmFit(v,mod1,block=patient, correlation=cor$consensus)
fit2<- eBayes(fit)
summary(decideTests(fit2))
ADD REPLYlink written 2.3 years ago by aec50

Hi sina, I'm not sure your first point is correct. The SVs represent the component of gene expression that is neither attributable to phenotype of interest or to the known batch / adjustment variables. So for the design formula, I would think that you would need to use the full model + the SVs :

mod1 = cbind(mod,svobj$sv)

ADD REPLYlink written 2.3 years ago by asaffa010

I completely agree with you on using full model + SVs, but in this case neither one of "sex","age", and "city" is a variable of interest. Am I missing something here?

ADD REPLYlink written 2.2 years ago by sina.nassiri90

Those are known adjustment variables which are being specified in the SVA model alongside the variable of interest:

mod <- model.matrix(~sex+age+city+group)

SVA will then construct surrogate variables to account for unknown sources of variation, i,e. those not accounted for in the model. For subsequent DE analysis, the model should then include the phenotype of interest + the known adjustment variables + the derived SVs (representing unknown batch effects):

mod1 <- model.matrix(~sex+age+city+group+sva_obj$sv)

In your original answer you suggest leaving out sex+age+city on the basis that they are already modelled by the SVs, but that shouldn't be the case if SVA is being run with the full model as given above since they are left out of the SV estimation.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by asaffa010
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 203 users visited in the last hour