Hello,

I am working on RNA seq data with 39(13 samples with 3 replicates) samples. I would like to do a PCA analysis. Apparently, there is batch effect in data. I would like to remove these batch effects and see how my samples would cluster after removing batches. I am following the instruction from this paper.

here is my experiment design.

```
coldata[,-c(1,2,7)]
Bio_rep cond passage Clone treatment Names
1 2 extra P148 B2 extra.P148 P148.extra.rep.2
2 3 extra P148 B2 extra.P148 P148.extra.rep.3
3 2 intra P7 B intra.P7 P7.intra.rep.2
4 3 intra P7 B intra.P7 P7.intra.rep.3
5 2 extra P7 B extra.P7 P7.extra.rep.2
6 3 extra P7 B extra.P7 P7.extra.rep.3
7 1 intra RH RH intra.RH RH.intra.rep.1
8 2 intra RH RH intra.RH RH.intra.rep.2
9 3 intra RH RH intra.RH RH.intra.rep.3
10 1 extra RH RH extra.RH RH.extra.rep.1
11 2 extra RH RH extra.RH RH.extra.rep.2
12 3 extra RH RH extra.RH RH.extra.rep.3
13 2 intra P11 B2 intra.P11 P11.intra.rep.2
14 3 intra P11 B2 intra.P11 P11.intra.rep.3
15 1 extra P85 B2 extra.P85 P85.extra.rep.1
16 1 extra P148 B2 extra.P148 P148.extra.rep.1
17 1 intra P7 B intra.P7 P7.intra.rep.1
18 1 extra P7 B extra.P7 P7.extra.rep.1
19 1 intra P148 B2 intra.P148 P148.intra.rep.1
20 1 intra P85 B2 intra.P85 P85.intra.rep.1
21 1 intra P11 B2 intra.P11 P11.intra.rep.1
22 1 extra P11 B2 extra.P11 P11.extra.rep.1
23 2 extra P11 B2 extra.P11 P11.extra.rep.2
24 3 extra P11 B2 extra.P11 P11.extra.rep.3
25 2 intra P85 B2 intra.P85 P85.intra.rep.2
26 3 intra P85 B2 intra.P85 P85.intra.rep.3
27 2 extra P85 B2 extra.P85 P85.extra.rep.2
28 3 extra P85 B2 extra.P85 P85.extra.rep.3
29 2 intra P148 B2 intra.P148 P148.intra.rep.2
30 3 intra P148 B2 intra.P148 P148.intra.rep.3
31 1 extra P35 B2 extra.P35 P35.extra.rep.1
32 2 extra P35 B2 extra.P35 P35.extra.rep.2
33 3 extra P35 B2 extra.P35 P35.extra.rep.3
34 1 extra P55 B2 extra.P55 P55.extra.rep.1
35 2 extra P55 B2 extra.P55 P55.extra.rep.2
36 3 extra P55 B2 extra.P55 P55.extra.rep.3
37 1 extra P210 B2 extra.P210 P210.extra.rep.1
38 2 extra P210 B2 extra.P210 P210.extra.rep.2
39 3 extra P210 B2 extra.P210 P210.extra.rep.3
```

I have created a DESeqDataSet and used rld() to stabilize the variation then plotted the heat maps and pca. the heat map clearly shows batch effects.

Then I used sva to remove those hidden effects and add the 3 surrogate variables to the design of DESeq object. and then re-plot the PCA and heatmap. But still I get the same plot. It seems like sva has not corrected the effects and unwanted variation.

```
# DESeq object
dds <- DESeqDataSetFromMatrix(countData = countDataMatrix,
colData = coldata,
design = ~treatment)
dds <- estimateSizeFactors(dds)
dds
# filtering the genes that have an average of greater than 10 across all samples
dds <- dds[rowSums(counts(dds)) > 10, ]
nrow(dds)
#transforming dds object
rld <- rlog(dds, blind=FALSE)
sampleDists <- dist(t(assay(rld)))
sampleDistMatrix <- as.matrix(sampleDists)
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")))(255)
pheatmap(sampleDistMatrix,
clustering_distance_rows=sampleDists,
clustering_distance_cols=sampleDists,
col=colors,
main = "euclidean dist from DESeq bject before removing bad samples")
#PCA
pcadata <- plotPCA(rld, intgroup = c("cond", "passage"), returnData=TRUE)
percentVar <- round(100 * attr(pcadata, "percentVar"))
ggplot(pcadata, aes(PC1, PC2, color= treatment, shape=cond)) + geom_point(size=3) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
labs(title="PCA before removing batch effect")
#removing batch effects
library("sva")
dat <- counts(dds, normalized = TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ treatment, colData(dds))
mod0 <- model.matrix(~1, colData(dds))
svseq <- svaseq(dat, mod = mod, mod0 = mod0, n.sv = 3)
par(mfrow=c(3,1),mar=c(3,5,3,1))
stripchart(svseq$sv[,1] ~ dds$treatment,vertical=TRUE,main="SV1")
abline(h=0)
stripchart(svseq$sv[,2] ~ dds$treatment,vertical=TRUE,main="SV2")
abline(h=0)
stripchart(svseq$sv[,3] ~ dds$treatment,vertical=TRUE,main="SV3")
abline(h=0)
# DESeq object after removing batch
ddssva <- dds
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
ddssva$SV3 <- svseq$sv[,3]
## add 3 surrogate variable to the dds design
design(ddssva) <- ~ SV1 + SV2 + SV3 + treatment
# transformation of counts after sva
rldsva <- rlog(ddssva, blind=FALSE)
# sample dist after sva
sampleDists.sva <- dist( t( assay(rldsva) ) )
sampleDistMatrix.sva <- as.matrix( sampleDists.sva )
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix.sva,
clustering_distance_rows=sampleDists.sva,
clustering_distance_cols=sampleDists.sva,
col=colors)
```

DO you have any idea where an I doing wrong.

Thanks for your message. I am looking at F1000 research GateWays to follow the paper. Unfortunately, I was not able to find the FAQ section. Could you please direct me to where I need to the FAQ section so that I can find the answer to my question. I appreciate your help.

You asked

I’ll just link directly to the question itself:

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-after-vst-are-there-still-batches-in-the-pca-plot

Thanks for your reply. I look at the link and I am able to apply it to my data. But still I see some batches even after removing batches with the removeBatchEffect from Lima package. I have another question regarding SVA. I tries to define batches manually by looking at SV1 and SV2 (surrogate variables) and define batches based on negative and positive values of SVs. I am not sure if this is a good approach to use SVA coordinates to define batch manually. Do you have any ideas how can I use SVA coordinates to define batch? Thanks

That's more of an SVA question, so i'll hold back from answering

Hi,

you are definitely right that's more about sva. I might need to post it with different topic and tags.

Thanks

I may be missing something, but I don't see why you would want to dichotomize the SVs. You can use limma's removeBatchEffect to remove continuous covariates as well, just put them in the 'covariates' argument.

Hi,

I have already used 2 SVs as a covariate matrix and passed it to removeBatchEffect() function. After doing so the samples clustering would be disrupted even worse than than before batch correction. I was wondering if this is due to the extra batch effect (still remained in data) or created bias by using SVA. I thought defining batches manually using the SVs could be an approach that removes batches without introducing bias to data. I was wondering if batch correction is working fine in my case.

Thanks,