Question

Applying sva is compatible with limma duplicateCorrelation?

0

Entering edit mode

aec ▴ 90

@aec-9409

Last seen 5.6 years ago

Dear all,

I have a complex bulk RNA-seq dataset (from a human tissue) with multiple covariates (sex, city, age), different cell type composition, repeated measures (same patient sequenced 4 times --> 4 different locations of the tissue). I am applying sva to correct for different cell type compositions and other unwanted variation due to other factors. limma-voom for normalization and differential expression across groups.This is the code:

y=DGEList(counts=counts)
A<-rowSums(y$counts)
isexpr<-A>500
y=y[isexpr,keep.lib.size=FALSE]
dim(y)
y=calcNormFactors(y)
group=factor(info_new$group)
sex=factor(info_new$sex)
city=factor(info_new$city)
age=cut(info_new$age,4)
levels(age)=c(age1,age2,age3,age4)
mod <- model.matrix(~sex+age+city+group)
mod0 <- model.matrix(~sex+age+city)
v=voom(y,mod)
n.sv = num.sv(v$E,mod,method="leek")
sva_obj <- sva(v$E, mod, mod0, n.sv=n.sv)
mod1 <- model.matrix(~sex+age+city+group+sva_obj$sv)
v <- voom(counts=y, design = mod1)
fit <- lmFit(v, mod1) 
fit2<- eBayes(fit)
summary(decideTests(fit2))

My question is, does sva remove the subject effect as well and it is not necessary to apply duplicateCorrelation?

Here a simple version of the sample information (I have 92 RNA-seq samples, 63 patients, 7 cities, 28 groups, control group only from loc4):

patient	city	sex	age	group
1	city1	M	52	disease1_tissue_loc1
1	city1	M	52	disease1_tissue_loc2
1	city1	M	52	disease1_tissue_loc3
1	city1	M	52	disease1_tissue_loc4
2	city1	F	28	disease2_tissue_loc1
2	city1	F	28	disease2_tissue_loc2
2	city1	F	28	disease2_tissue_loc3
2	city1	F	28	disease2_tissue_loc4
3	city1	M	38	disease2_tissue_loc1
3	city1	M	38	disease2_tissue_loc2
3	city1	M	38	disease2_tissue_loc3
3	city1	M	38	disease2_tissue_loc4
4	city1	F	61	disease1_tissue_loc1
4	city2	F	61	disease1_tissue_loc2
4	city2	F	61	disease1_tissue_loc3
4	city2	F	61	disease1_tissue_loc4
5	city2	F	29	control_tissue_loc4
6	city2	F	53	control_tissue_loc4
7	city2	M	43	control_tissue_loc4

limma-voom sva duplicatecorrelation • 3.3k views

ADD COMMENT • link updated 8.5 years ago by sina.nassiri ▴ 130 • written 8.5 years ago by aec ▴ 90

score 0 · Answer 1 · 2017-07-25

0

Entering edit mode

sina.nassiri ▴ 130

@sinanassiri-10062

Last seen 6 weeks ago

Switzerland/Lausanne

In principle, if you don't include 'patient' in the full model (mod) it will be captured by SVA (assuming that it has a significant contribution to the overall variation observed). However, there are a few points to consider:

1- You have already included sex, age, and city in the null model during estimation of surrogate variables, therefore including them in the design formula alongside surrogate variables is redundant. What you want is:

mod1 <- model.matrix(~group+sva_obj$sv)

2- The goal of SVA is to estimate unknown batch that you have no knowledge of, but have reasons to believe they exist, for example after exploratory analysis using PCA or hierarchical clustering. If your only goal is to adjust for known factors in differential expression analysis, you can directly include them in your design formula as covariates.

3- The magic of voom lies in precision weights that are used to model the mean-variance trend, which you are not taking any advantage of in your estimation of surrogate variables. By using v$E you're only taking log-transformed counts. Whether a simple log-transformation is sufficient to adjust for the mean-variance trend is another point that you can check. If the answer is no, you may want to give sva-seq a shot, which directly takes RNAseq counts as input.

Hope these help.

ADD COMMENT • link 8.5 years ago sina.nassiri ▴ 130

0

Entering edit mode

Thanks for your comments sina.nassari.

1) so then what is better:

mod <- model.matrix(~sex+age+city+group)
mod0 <- model.matrix(~sex+age+city)
mod1 <- model.matrix(~group+sva_obj$sv)

or

mod <- model.matrix(~group)
mod0 <- model.matrix(~1)
mod1 <- model.matrix(~group+sva_obj$sv)

2) I think it's compatible mixing known batch and unknown batch effects using SVA. The different cell type composition would be unknown unwanted variation for example.

3) It's a pity SVA can not deal with weights. I thought that sva-seq is similar to v$E.

ADD REPLY • link 8.5 years ago aec ▴ 90

0

Entering edit mode

1) I'd go with the former, because in the latter you're throwing out potentially valuable information about study design.

2) Sure; I missed cell type composition.

3) You are right about svaseq being similar to using sva with v$E; I just double checked the source code. My bad!

ADD REPLY • link 8.5 years ago sina.nassiri ▴ 130

0

Entering edit mode

Do you think, then, this option is invalid?

y=DGEList(counts=counts)
A<-rowSums(y$counts)
isexpr<-A>500
y=y[isexpr,keep.lib.size=FALSE]
y=calcNormFactors(y)
group=factor(info_new$group)
patient=factor(info_new$patient)
mod <- model.matrix(~group)
mod0 <- model.matrix(~1)
v=voom(y,mod)
n.sv = num.sv(v$E,mod,method="leek")
sva_obj <- sva(v$E, mod, mod0, n.sv=n.sv)
mod1 <- model.matrix(~group+sva_obj$sv)
v <- voom(counts=y, design = mod1)
cor=duplicateCorrelation(v,mod1,block=patient)
fit=lmFit(v,mod1,block=patient, correlation=cor$consensus)
fit2<- eBayes(fit)
summary(decideTests(fit2))

ADD REPLY • link 8.5 years ago aec ▴ 90

0

Entering edit mode

Hi sina, I'm not sure your first point is correct. The SVs represent the component of gene expression that is neither attributable to phenotype of interest or to the known batch / adjustment variables. So for the design formula, I would think that you would need to use the full model + the SVs :

mod1 = cbind(mod,svobj$sv)

ADD REPLY • link 8.4 years ago asaffa01 • 0

0

Entering edit mode

I completely agree with you on using full model + SVs, but in this case neither one of "sex","age", and "city" is a variable of interest. Am I missing something here?

ADD REPLY • link 8.4 years ago sina.nassiri ▴ 130

0

Entering edit mode

Those are known adjustment variables which are being specified in the SVA model alongside the variable of interest:

mod <- model.matrix(~sex+age+city+group)

SVA will then construct surrogate variables to account for unknown sources of variation, i,e. those not accounted for in the model. For subsequent DE analysis, the model should then include the phenotype of interest + the known adjustment variables + the derived SVs (representing unknown batch effects):

mod1 <- model.matrix(~sex+age+city+group+sva_obj$sv)

In your original answer you suggest leaving out sex+age+city on the basis that they are already modelled by the SVs, but that shouldn't be the case if SVA is being run with the full model as given above since they are left out of the SV estimation.

ADD REPLY • link 8.4 years ago asaffa01 • 0