Dear all,
I have a complex bulk RNA-seq dataset (from a human tissue) with multiple covariates (sex, city, age), different cell type composition, repeated measures (same patient sequenced 4 times --> 4 different locations of the tissue). I am applying sva to correct for different cell type compositions and other unwanted variation due to other factors. limma-voom for normalization and differential expression across groups.This is the code:
y=DGEList(counts=counts) A<-rowSums(y$counts) isexpr<-A>500 y=y[isexpr,keep.lib.size=FALSE] dim(y) y=calcNormFactors(y) group=factor(info_new$group) sex=factor(info_new$sex) city=factor(info_new$city) age=cut(info_new$age,4) levels(age)=c(age1,age2,age3,age4) mod <- model.matrix(~sex+age+city+group) mod0 <- model.matrix(~sex+age+city) v=voom(y,mod) n.sv = num.sv(v$E,mod,method="leek") sva_obj <- sva(v$E, mod, mod0, n.sv=n.sv) mod1 <- model.matrix(~sex+age+city+group+sva_obj$sv) v <- voom(counts=y, design = mod1) fit <- lmFit(v, mod1) fit2<- eBayes(fit) summary(decideTests(fit2))
My question is, does sva remove the subject effect as well and it is not necessary to apply duplicateCorrelation?
Here a simple version of the sample information (I have 92 RNA-seq samples, 63 patients, 7 cities, 28 groups, control group only from loc4):
patient |
city |
sex |
age |
group |
1 |
city1 |
M |
52 |
disease1_tissue_loc1 |
1 |
city1 |
M |
52 |
disease1_tissue_loc2 |
1 |
city1 |
M |
52 |
disease1_tissue_loc3 |
1 |
city1 |
M |
52 |
disease1_tissue_loc4 |
2 |
city1 |
F |
28 |
disease2_tissue_loc1 |
2 |
city1 |
F |
28 |
disease2_tissue_loc2 |
2 |
city1 |
F |
28 |
disease2_tissue_loc3 |
2 |
city1 |
F |
28 |
disease2_tissue_loc4 |
3 |
city1 |
M |
38 |
disease2_tissue_loc1 |
3 |
city1 |
M |
38 |
disease2_tissue_loc2 |
3 |
city1 |
M |
38 |
disease2_tissue_loc3 |
3 |
city1 |
M |
38 |
disease2_tissue_loc4 |
4 |
city1 |
F |
61 |
disease1_tissue_loc1 |
4 |
city2 |
F |
61 |
disease1_tissue_loc2 |
4 |
city2 |
F |
61 |
disease1_tissue_loc3 |
4 |
city2 |
F |
61 |
disease1_tissue_loc4 |
5 |
city2 |
F |
29 |
control_tissue_loc4 |
6 |
city2 |
F |
53 |
control_tissue_loc4 |
7 |
city2 |
M |
43 |
control_tissue_loc4 |
Thanks for your comments sina.nassari.
1) so then what is better:
or
2) I think it's compatible mixing known batch and unknown batch effects using SVA. The different cell type composition would be unknown unwanted variation for example.
3) It's a pity SVA can not deal with weights. I thought that sva-seq is similar to v$E.
1) I'd go with the former, because in the latter you're throwing out potentially valuable information about study design.
2) Sure; I missed cell type composition.
3) You are right about svaseq being similar to using sva with v$E; I just double checked the source code. My bad!
Do you think, then, this option is invalid?
Hi sina, I'm not sure your first point is correct. The SVs represent the component of gene expression that is neither attributable to phenotype of interest or to the known batch / adjustment variables. So for the design formula, I would think that you would need to use the full model + the SVs :
mod1 = cbind(mod,svobj$sv)
I completely agree with you on using full model + SVs, but in this case neither one of "sex","age", and "city" is a variable of interest. Am I missing something here?
Those are known adjustment variables which are being specified in the SVA model alongside the variable of interest:
mod <- model.matrix(~sex+age+city+group)
SVA will then construct surrogate variables to account for unknown sources of variation, i,e. those not accounted for in the model. For subsequent DE analysis, the model should then include the phenotype of interest + the known adjustment variables + the derived SVs (representing unknown batch effects):
mod1 <- model.matrix(~sex+age+city+group+sva_obj$sv)
In your original answer you suggest leaving out sex+age+city on the basis that they are already modelled by the SVs, but that shouldn't be the case if SVA is being run with the full model as given above since they are left out of the SV estimation.