removing batch effect using DESeq and sva
Entering edit mode
bioyas ▴ 10
Last seen 2.8 years ago


I am working on RNA seq data with 39(13 samples with 3 replicates) samples. I would like to do a PCA analysis. Apparently, there is batch effect in data. I would like to remove these batch effects and see how my samples would cluster after removing batches. I am following the instruction from this paper.

here is my experiment design.

   Bio_rep  cond passage Clone  treatment            Names
1        2 extra    P148    B2 extra.P148 P148.extra.rep.2
2        3 extra    P148    B2 extra.P148 P148.extra.rep.3
3        2 intra      P7     B   intra.P7   P7.intra.rep.2
4        3 intra      P7     B   intra.P7   P7.intra.rep.3
5        2 extra      P7     B   extra.P7   P7.extra.rep.2
6        3 extra      P7     B   extra.P7   P7.extra.rep.3
7        1 intra      RH    RH   intra.RH   RH.intra.rep.1
8        2 intra      RH    RH   intra.RH   RH.intra.rep.2
9        3 intra      RH    RH   intra.RH   RH.intra.rep.3
10       1 extra      RH    RH   extra.RH   RH.extra.rep.1
11       2 extra      RH    RH   extra.RH   RH.extra.rep.2
12       3 extra      RH    RH   extra.RH   RH.extra.rep.3
13       2 intra     P11    B2  intra.P11  P11.intra.rep.2
14       3 intra     P11    B2  intra.P11  P11.intra.rep.3
15       1 extra     P85    B2  extra.P85  P85.extra.rep.1
16       1 extra    P148    B2 extra.P148 P148.extra.rep.1
17       1 intra      P7     B   intra.P7   P7.intra.rep.1
18       1 extra      P7     B   extra.P7   P7.extra.rep.1
19       1 intra    P148    B2 intra.P148 P148.intra.rep.1
20       1 intra     P85    B2  intra.P85  P85.intra.rep.1
21       1 intra     P11    B2  intra.P11  P11.intra.rep.1
22       1 extra     P11    B2  extra.P11  P11.extra.rep.1
23       2 extra     P11    B2  extra.P11  P11.extra.rep.2
24       3 extra     P11    B2  extra.P11  P11.extra.rep.3
25       2 intra     P85    B2  intra.P85  P85.intra.rep.2
26       3 intra     P85    B2  intra.P85  P85.intra.rep.3
27       2 extra     P85    B2  extra.P85  P85.extra.rep.2
28       3 extra     P85    B2  extra.P85  P85.extra.rep.3
29       2 intra    P148    B2 intra.P148 P148.intra.rep.2
30       3 intra    P148    B2 intra.P148 P148.intra.rep.3
31       1 extra     P35    B2  extra.P35  P35.extra.rep.1
32       2 extra     P35    B2  extra.P35  P35.extra.rep.2
33       3 extra     P35    B2  extra.P35  P35.extra.rep.3
34       1 extra     P55    B2  extra.P55  P55.extra.rep.1
35       2 extra     P55    B2  extra.P55  P55.extra.rep.2
36       3 extra     P55    B2  extra.P55  P55.extra.rep.3
37       1 extra    P210    B2 extra.P210 P210.extra.rep.1
38       2 extra    P210    B2 extra.P210 P210.extra.rep.2
39       3 extra    P210    B2 extra.P210 P210.extra.rep.3

I have created a DESeqDataSet and used rld() to stabilize the variation then plotted the heat maps and pca. the heat map clearly shows batch effects.

Then I used sva to remove those hidden effects and add the 3 surrogate variables to the design of DESeq object. and then re-plot the PCA and heatmap. But still I get the same plot. It seems like sva has not corrected the effects and unwanted variation.

# DESeq object 
dds <- DESeqDataSetFromMatrix(countData = countDataMatrix,
                              colData = coldata,
                              design = ~treatment)
dds <- estimateSizeFactors(dds)

# filtering the genes that have an average of greater than 10 across all samples
dds <- dds[rowSums(counts(dds)) > 10, ]

#transforming  dds object
rld <- rlog(dds, blind=FALSE)

sampleDists <- dist(t(assay(rld)))

sampleDistMatrix <- as.matrix(sampleDists)
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")))(255)
         main = "euclidean dist from DESeq bject before removing bad samples")

pcadata <- plotPCA(rld, intgroup = c("cond", "passage"), returnData=TRUE)
percentVar <- round(100 * attr(pcadata, "percentVar"))

ggplot(pcadata, aes(PC1, PC2, color= treatment, shape=cond)) + geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  labs(title="PCA before removing batch effect")

#removing batch effects
dat <- counts(dds, normalized = TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ treatment, colData(dds))
mod0 <- model.matrix(~1, colData(dds))
svseq <- svaseq(dat, mod = mod, mod0 = mod0, = 3)

stripchart(svseq$sv[,1] ~ dds$treatment,vertical=TRUE,main="SV1")
stripchart(svseq$sv[,2] ~ dds$treatment,vertical=TRUE,main="SV2")
stripchart(svseq$sv[,3] ~ dds$treatment,vertical=TRUE,main="SV3")

# DESeq object after removing batch
ddssva <- dds 
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
ddssva$SV3 <- svseq$sv[,3]
## add 3 surrogate variable to the dds design 
design(ddssva) <- ~ SV1 + SV2 + SV3 + treatment

# transformation of counts after sva
rldsva <- rlog(ddssva, blind=FALSE)
# sample dist after sva

sampleDists.sva <- dist( t( assay(rldsva) ) )
sampleDistMatrix.sva <- as.matrix( sampleDists.sva )
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)

DO you have any idea where an I doing wrong.

sva deseq2 batch effect PCA heatmap • 1.2k views
Entering edit mode
Last seen 1 day ago
United States

Check the FAQ in the vignette, there is one particularly relevant to this question.

Entering edit mode

Thanks for your message. I am looking at F1000 research GateWays to follow the paper. Unfortunately, I was not able to find the FAQ section. Could you please direct me to where I need to the FAQ section so that I can find the answer to my question. I appreciate your help.

Entering edit mode

You asked

Then I ... add the 3 surrogate variables to the design of DESeq object. and then re-plot the PCA and heatmap. But still I get the same plot

I’ll just link directly to the question itself:

Entering edit mode

Thanks for your reply. I look at the link and I am able to apply it to my data. But still I see some batches even after removing batches with the removeBatchEffect from Lima package. I have another question regarding SVA. I tries to define batches manually by looking at SV1 and SV2 (surrogate variables) and define batches based on negative and positive values of SVs. I am not sure if this is a good approach to use SVA coordinates to define batch manually. Do you have any ideas how can I use SVA coordinates to define batch? Thanks

Entering edit mode

That's more of an SVA question, so i'll hold back from answering

Entering edit mode


you are definitely right that's more about sva. I might need to post it with different topic and tags.


Entering edit mode

I may be missing something, but I don't see why you would want to dichotomize the SVs. You can use limma's removeBatchEffect to remove continuous covariates as well, just put them in the 'covariates' argument.

Entering edit mode


I have already used 2 SVs as a covariate matrix and passed it to removeBatchEffect() function. After doing so the samples clustering would be disrupted even worse than than before batch correction. I was wondering if this is due to the extra batch effect (still remained in data) or created bias by using SVA. I thought defining batches manually using the SVs could be an approach that removes batches without introducing bias to data. I was wondering if batch correction is working fine in my case.



Login before adding your answer.

Traffic: 718 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6