Question

Deseq2 - removing batch effect

0

Entering edit mode

vinisha • 0

@1e997722

Last seen 3.6 years ago

United States

Hi,

I have read blogs on how to remove batch effects in deseq2 and how to visualize it using limma voom remove batch effect PCA. I'm not sure if the batch effect was removed in my analysis. Please find the attached output, I'm skeptical because in heat map they seem to cluster separately (with respect to batches).

Analysis: 51 samples with two batches (Batch1 includes NYGC in the label and Batch2 doesn't include NYGC in the label). The two batches are rnaseq data from different sequencers.

PCA before removing batch effect

enter image description here

PCA after removing batch effect

PCAafter

Heat map with batch factor included in design formula design =~ Gender+ageRange+Batch+Cond

withBatch

Heat map without batch factor in design formula design =~ Gender+ageRange+Cond

withoutBatch

I'd be really great if anyone could validate my results. Thank you!

Best, Vinisha

BatchEffect DESeq2 • 3.3k views

ADD COMMENT • link updated 3.6 years ago by Kevin Blighe ★ 4.0k • written 3.6 years ago by vinisha • 0

score 0 · Answer 1 · 2021-04-22

0

Entering edit mode

Kevin Blighe ★ 4.0k

@kevin

Last seen 24 days ago

Republic of Ireland

Hi Vinisha, please always show the code that you used, in this case, the code that you used to input your raw counts, normalise these, generate the first PCA bi-plot, batch correct, and then generate the second PCA bi-plot.

Regarding the heatmap, I would not expect any difference based on inclusion or exclusion of batch from the design, depending on how you are generating these heatmaps.

Kevin

ADD COMMENT • link 3.6 years ago Kevin Blighe ★ 4.0k

0

Entering edit mode

Hi Kevin,

I've edited my post as it had an incorrect image. Kindly take a look again. Attaching the code below,

total_new$Cond <- factor(total_new$Cond, levels = c("ALS", "HC"))
total_new$ageRange <- factor(total_new$ageRange, levels = c("TRUE", "FALSE"))
total_new$Gender <- factor(total_new$Gender, levels = c("M", "F"))
total_new$c9 <- factor(total_new$c9, levels = c("Yes","No"))
total_new$Batch <- factor(total_new$Batch,levels= c("Batch2","Batch1"))

dds<-DESeqDataSetFromMatrix(countData = data_new, colData=total_new, design =~ Gender+ageRange+Cond)
dds <- dds[ rowMeans(counts(dds)) > 20, ]
dds <- estimateSizeFactors(dds)
dds <- DESeq(dds)
res <- results(dds, contrast=c("Cond","ALS","HC"))
res.out <- as.data.frame(res)

#using it only for PCA
vsd <- vst(dds, blind=FALSE)
pcaData <- plotPCA(vsd, intgroup= c("Batch"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color= Batch, label=name)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  coord_fixed()

res <- res[!is.na(res$pvalue),]
resOrdered <- res[(res$pvalue < 0.05),]
ressig <- resOrdered[(resOrdered$log2FoldChange < -1.5) | (resOrdered$log2FoldChange > 1.5) ,]
res_sort <- ressig[order(-abs(ressig$log2FoldChange)),]
mat <- assay(dds_rlog[rownames(ressig),])
pheatmap(mat, cluster_rows = TRUE, cluster_cols= TRUE, scale="row", show_rownames = FALSE)

For batch effect removal I included batch in the design formula.

dds<-DESeqDataSetFromMatrix(countData = data_new, colData=total_new, design =~ Gender+ageRange+Batch+Cond)
.........
assay(vsd)<- limma::removeBatchEffect(assay(vsd), vsd$Batch)
plotPCA(vsd, "Batch")

ADD REPLY • link 3.6 years ago vinisha • 0

0

Entering edit mode

Ah! - now I see. Originally, it was as if literally nothing changed.

Although PC2 only accounts for 3% variation, it would be interesting to repeat everything after having remove the 'outlier' at the top of the original bi-plot. Also, is there anything else that could be confounding the differences between batch1 and batch2? Do you have statistical support for including Gender and ageRange in the design? If these are not contributing to any variation, then, by including them, you may inadvertently introduce bias that did not previously exist.

ADD REPLY • link 3.6 years ago Kevin Blighe ★ 4.0k

0

Entering edit mode

I did try removing the outliers,still no luck! So I ran the analysis separately(with respect to batches) and found Gender and ageRange to be a factor for clustering. That's the reason I included in the combined(Batch1 and Batch2) analysis. I also tried with only Batch in the design formula, they still clustered as per batches. Should I try a different way? Thank you for helping me out on this, Kevin!

Batch1 heatmap

enter image description here

Batch2 heatmap

enter image description here

ADD REPLY • link 3.6 years ago vinisha • 0

1

Entering edit mode

If the batch effect is inconsistent across genes, for whatever reason, then removeBatchEffect() will struggle to adjust for this. You could instead try ComBat-Seq to remove the batch effect at the raw count stage (prior to DESeq2), if you wish; otherwise, try to explore other sources of variation via my package, PCAtools::eigencorplot(): https://github.com/kevinblighe/PCAtools#an-eigencor-plot (I have a section on how to use this with DESeq2).

If all else fails, I would proceed with the analysis, possibly stratifying by batch, i.e., doing 2 separate analyses and then corroborating results across both.

ADD REPLY • link 3.6 years ago Kevin Blighe ★ 4.0k