Question

correction for batch effects and paired-design of RNA-seq samples

1

Entering edit mode

Bogdan ▴ 670

@bogdan-2367

Last seen 14 months ago

Palo Alto, CA, USA

Dear all,

please could you advise me on the following analysis, whether it is correct or optimal:

briefly, speaking, I have 3 pairs of RNA-seq samples (WT vs KO), that have been sequenced at different time points.

More precisely, let's name the samples : WT1; WT2 ; WT3; KO1 ; KO2 ; KO3,

where the pairs are (WT1, KO1), (WT2, KO2), (WT3, KO3) that have been sequenced at different times by different technologies (there is a batch effect, as shown also by PCA or MDS display).

When i do the differential analysis of gene expression, I take into account the pairs WT-KO, that in this case coincide with the batches. The question would be : is the R code below correct in using paired design and doing the batch correction, or shall you have any suggestions, please let me know ;)

library("edgeR") eset <- read.delim("RNAseq",row.names="Geneid")

group <- factor(c("wt","wt","wt","ko","ko","ko")) group <- relevel(group,ref="wt")

subject <- as.factor(c(1,2,3,1,2,3)) batch <- as.factor(c(1,2,3,1,2,3))

design <- model.matrix(~group+subject)

y <- DGEList(counts=eset, group=group) keep <- rowSums(cpm(y) > 0.5) >= 6 y <- y[keep,,keep.lib.sizes=FALSE]

y <- calcNormFactors(y)

### MDS before batch correction

logCPM <- cpm(y, log=TRUE, prior.count=5) plotMDS(logCPM, col=as.numeric(batch))

### MDS after batch correction

logCPM <- removeBatchEffect(logCPM, batch=batch) plotMDS(logCPM, col=as.numeric(batch))

#############

fit <- lmFit(logCPM, design) fit <- eBayes(fit,trend=TRUE, robust=TRUE)

pdf("after.batch.correction.plotSA") plotSA(fit) dev.off()

results <- topTable(fit, coef=2, adjust="fdr", number=Inf) write.table(results, file="results.edgeR.txt", sep="\t", eol="\n", row.names=TRUE, col.names=TRUE)

limma edgeR • 2.5k views

ADD COMMENT • link updated 7.8 years ago by Gordon Smyth 52k • written 7.8 years ago by Bogdan ▴ 670

score 3 · Accepted Answer · 2017-02-19

3

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 3 hours ago

WEHI, Melbourne, Australia

There is no need to do any batch correction. The baseline differences between subjects, including any batch effect, is already accounted for in the paired analysis.

ADD COMMENT • link 7.8 years ago Gordon Smyth 52k

0

Entering edit mode

Dear Gordon, thank you very much for a quick and helpful reply ;) A question though please, although I see of course that the paired analysis accounts for the batch effects

would it be fine to keep these lines of code below (before lmFit (logCPMc, design)):

logCPMc <- removeBatchEffect(logCPM, batch=batch) plotMDS(logCPM, col=as.numeric(batch))

as the results seem to look better on the MDS plot after removeBatchEffect. thank you !

ADD REPLY • link 7.8 years ago Bogdan ▴ 670

2

Entering edit mode

You can use removeBatchEffect() for the MDS plot. That's the main purpose of the function, for plotting purposes.

ADD REPLY • link 7.8 years ago Gordon Smyth 52k

0

Entering edit mode

Dear Gordon, thanks a lot for your help ;) Have a good and fruitful week ;)

ADD REPLY • link 7.8 years ago Bogdan ▴ 670

0

Entering edit mode

In this case, subject and batch are identical:

subject <- as.factor(c(1,2,3,1,2,3))
batch <- as.factor(c(1,2,3,1,2,3))

But how about the following example?

subject <- as.factor(c(1,2,3,1,2,3))
batch <- as.factor(c(1,1,1,2,2,1))

What is the best model design to correct batch effect and to predict DEG between WT and KO groups?