Question: removing batch effect using DESeq and sva
0
gravatar for bioyas
13 days ago by
bioyas0
USA
bioyas0 wrote:

Hello,

I am working on RNA seq data with 39(13 samples with 3 replicates) samples. I would like to do a PCA analysis. Apparently, there is batch effect in data. I would like to remove these batch effects and see how my samples would cluster after removing batches. I am following the instruction from this paper.

here is my experiment design.

coldata[,-c(1,2,7)]
   Bio_rep  cond passage Clone  treatment            Names
1        2 extra    P148    B2 extra.P148 P148.extra.rep.2
2        3 extra    P148    B2 extra.P148 P148.extra.rep.3
3        2 intra      P7     B   intra.P7   P7.intra.rep.2
4        3 intra      P7     B   intra.P7   P7.intra.rep.3
5        2 extra      P7     B   extra.P7   P7.extra.rep.2
6        3 extra      P7     B   extra.P7   P7.extra.rep.3
7        1 intra      RH    RH   intra.RH   RH.intra.rep.1
8        2 intra      RH    RH   intra.RH   RH.intra.rep.2
9        3 intra      RH    RH   intra.RH   RH.intra.rep.3
10       1 extra      RH    RH   extra.RH   RH.extra.rep.1
11       2 extra      RH    RH   extra.RH   RH.extra.rep.2
12       3 extra      RH    RH   extra.RH   RH.extra.rep.3
13       2 intra     P11    B2  intra.P11  P11.intra.rep.2
14       3 intra     P11    B2  intra.P11  P11.intra.rep.3
15       1 extra     P85    B2  extra.P85  P85.extra.rep.1
16       1 extra    P148    B2 extra.P148 P148.extra.rep.1
17       1 intra      P7     B   intra.P7   P7.intra.rep.1
18       1 extra      P7     B   extra.P7   P7.extra.rep.1
19       1 intra    P148    B2 intra.P148 P148.intra.rep.1
20       1 intra     P85    B2  intra.P85  P85.intra.rep.1
21       1 intra     P11    B2  intra.P11  P11.intra.rep.1
22       1 extra     P11    B2  extra.P11  P11.extra.rep.1
23       2 extra     P11    B2  extra.P11  P11.extra.rep.2
24       3 extra     P11    B2  extra.P11  P11.extra.rep.3
25       2 intra     P85    B2  intra.P85  P85.intra.rep.2
26       3 intra     P85    B2  intra.P85  P85.intra.rep.3
27       2 extra     P85    B2  extra.P85  P85.extra.rep.2
28       3 extra     P85    B2  extra.P85  P85.extra.rep.3
29       2 intra    P148    B2 intra.P148 P148.intra.rep.2
30       3 intra    P148    B2 intra.P148 P148.intra.rep.3
31       1 extra     P35    B2  extra.P35  P35.extra.rep.1
32       2 extra     P35    B2  extra.P35  P35.extra.rep.2
33       3 extra     P35    B2  extra.P35  P35.extra.rep.3
34       1 extra     P55    B2  extra.P55  P55.extra.rep.1
35       2 extra     P55    B2  extra.P55  P55.extra.rep.2
36       3 extra     P55    B2  extra.P55  P55.extra.rep.3
37       1 extra    P210    B2 extra.P210 P210.extra.rep.1
38       2 extra    P210    B2 extra.P210 P210.extra.rep.2
39       3 extra    P210    B2 extra.P210 P210.extra.rep.3

I have created a DESeqDataSet and used rld() to stabilize the variation then plotted the heat maps and pca. the heat map clearly shows batch effects.

Then I used sva to remove those hidden effects and add the 3 surrogate variables to the design of DESeq object. and then re-plot the PCA and heatmap. But still I get the same plot. It seems like sva has not corrected the effects and unwanted variation.

# DESeq object 
dds <- DESeqDataSetFromMatrix(countData = countDataMatrix,
                              colData = coldata,
                              design = ~treatment)
dds <- estimateSizeFactors(dds)
dds

# filtering the genes that have an average of greater than 10 across all samples
dds <- dds[rowSums(counts(dds)) > 10, ]
nrow(dds)

#transforming  dds object
rld <- rlog(dds, blind=FALSE)

sampleDists <- dist(t(assay(rld)))

sampleDistMatrix <- as.matrix(sampleDists)
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")))(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         col=colors,
         main = "euclidean dist from DESeq bject before removing bad samples")

#PCA 
pcadata <- plotPCA(rld, intgroup = c("cond", "passage"), returnData=TRUE)
percentVar <- round(100 * attr(pcadata, "percentVar"))

ggplot(pcadata, aes(PC1, PC2, color= treatment, shape=cond)) + geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  labs(title="PCA before removing batch effect")

#removing batch effects
library("sva")
dat <- counts(dds, normalized = TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ treatment, colData(dds))
mod0 <- model.matrix(~1, colData(dds))
svseq <- svaseq(dat, mod = mod, mod0 = mod0, n.sv = 3)


par(mfrow=c(3,1),mar=c(3,5,3,1))
stripchart(svseq$sv[,1] ~ dds$treatment,vertical=TRUE,main="SV1")
abline(h=0)
stripchart(svseq$sv[,2] ~ dds$treatment,vertical=TRUE,main="SV2")
abline(h=0)
stripchart(svseq$sv[,3] ~ dds$treatment,vertical=TRUE,main="SV3")
abline(h=0)

# DESeq object after removing batch
ddssva <- dds 
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
ddssva$SV3 <- svseq$sv[,3]
## add 3 surrogate variable to the dds design 
design(ddssva) <- ~ SV1 + SV2 + SV3 + treatment

# transformation of counts after sva
rldsva <- rlog(ddssva, blind=FALSE)
# sample dist after sva

sampleDists.sva <- dist( t( assay(rldsva) ) )
sampleDistMatrix.sva <- as.matrix( sampleDists.sva )
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix.sva,
         clustering_distance_rows=sampleDists.sva,
         clustering_distance_cols=sampleDists.sva,
         col=colors)

DO you have any idea where an I doing wrong.

ADD COMMENTlink modified 13 days ago by Michael Love24k • written 13 days ago by bioyas0
Answer: removing batch effect using DESeq and sva
0
gravatar for Michael Love
13 days ago by
Michael Love24k
United States
Michael Love24k wrote:

Check the FAQ in the vignette, there is one particularly relevant to this question.

ADD COMMENTlink written 13 days ago by Michael Love24k

Thanks for your message. I am looking at F1000 research GateWays to follow the paper. Unfortunately, I was not able to find the FAQ section. Could you please direct me to where I need to the FAQ section so that I can find the answer to my question. I appreciate your help.

ADD REPLYlink written 12 days ago by bioyas0

You asked

Then I ... add the 3 surrogate variables to the design of DESeq object. and then re-plot the PCA and heatmap. But still I get the same plot

I’ll just link directly to the question itself:

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-after-vst-are-there-still-batches-in-the-pca-plot

ADD REPLYlink written 12 days ago by Michael Love24k

Thanks for your reply. I look at the link and I am able to apply it to my data. But still I see some batches even after removing batches with the removeBatchEffect from Lima package. I have another question regarding SVA. I tries to define batches manually by looking at SV1 and SV2 (surrogate variables) and define batches based on negative and positive values of SVs. I am not sure if this is a good approach to use SVA coordinates to define batch manually. Do you have any ideas how can I use SVA coordinates to define batch? Thanks

ADD REPLYlink written 10 days ago by bioyas0

That's more of an SVA question, so i'll hold back from answering

ADD REPLYlink written 7 days ago by Michael Love24k

I may be missing something, but I don't see why you would want to dichotomize the SVs. You can use limma's removeBatchEffect to remove continuous covariates as well, just put them in the 'covariates' argument.

ADD REPLYlink written 7 days ago by Peter Langfelder2.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 272 users visited in the last hour