Question: Remove batch effect in small RNASeq study (SVA or others?)
1
5.2 years ago by
shirley zhang1.0k
shirley zhang1.0k wrote:

I have a RNASeq paired-end data from two batches (8 samples from batch1, and 7 samples from batch2). After alignment using TopHat2, then I got count using HTseq-count, and FPKM value via Cufflinks. A big batch effect can be viewed in PCA using both log10(raw count) and log10(FPKM) value.

I can NOT use the block factor in edgeR to remove batch effect since I need to first obtain residuals after adjusting for batch effect, then test the residuals for hundreds of thousands of SNPs (eQTL analysis).

My question is how to remove batch effect without using edgeR:

1. is SVA ok for such a small sample size (N=15)?
2. If SVA does not work, any other suggestions?

Many thanks,
Shirley

rnaseq alignment edger sva • 4.9k views
modified 4.5 years ago by Gordon Smyth37k • written 5.2 years ago by shirley zhang1.0k
Answer: Remove batch effect in small RNASeq study (SVA or others?)
3
5.2 years ago by
Gordon Smyth37k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth37k wrote:

Please give the code sequence you used to remove the batch effect and to make the PCA plot.

You should normalize the data before using removeBatchEffect(), and quantile is one possibility.

Gordon

ADD COMMENTlink modified 4.5 years ago • written 5.2 years ago by Gordon Smyth37k

Dear Dr. Smyth,

Thanks for your reply. Here is the code sequence I used to remove the batch effect and to make the PCA plot.

First,  getting raw counts for each feature per sample using htseq-count (~64K features by using Ensemble.gtf). Then, getting a count data.matrix by combing all samples together (64K
rows, and 14 columns). 8 samples from batch1, and 6 samples from batch2.

>count = cbind(count.s1, count.s2, ...., count.s14)

#remove the batch effect
library(edgeR)
batch = as.factor(c(rep("b1", 8), rep("b2", 6)))

logCPM <- cpm(count,log=TRUE,prior.count=5)
logCPM <- removeBatchEffect(logCPM, batch=batch)

#PCA
B.res = prcomp(logCPM, scale = TRUE)
pc.s = summary(.Bres)$importance[2,1:2] pc1.var = round(pc.s[["PC1"]],2) pc2.var = round(pc.s[["PC2"]],2) pdf(file = "all.count.logCPM.rmBatch.pca") plot(B.res$rotation[,1:2], main = maintitle,  xlab = paste("PC1(variance:",
pc1.var*100, "%)", sep =""), ylab = paste("PC2 (variance:",
pc2.var*100,"%)", sep =""))
dev.off()

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] edgeR_3.4.0  limma_3.18.7

Many thanks.
Shirley

ADD REPLYlink modified 4.5 years ago by Gordon Smyth37k • written 5.2 years ago by shirley zhang1.0k

Dear Shirley,

I don't think that scaling in prcomp() is appropriate here.

Better would be:

  library(edgeR)
dge <- DGEList(counts=count)
dge <- calcNormFactors(dge)
logCPM <- cpm(dge,log=TRUE,prior.count=5)
plotMDS(logCPM)
logCPM <- removeBatchEffect(logCPM,batch=batch)
plotMDS(logCPM)

Best wishes
Gordon

ADD REPLYlink modified 4.5 years ago • written 5.2 years ago by Gordon Smyth37k

To add colours to the MDS plot:

  plotMDS(logCPM, col=as.numeric(batch))

Gordon

ADD REPLYlink modified 4.5 years ago • written 5.2 years ago by Gordon Smyth37k

Dear Dr. Smyth,

Thank you very much for taking your time to look through my codes, and also provided an more approciate code sequence. Thank you!

By following your codes, I think the batch effect was removed efficiently as shown in the attached figures. Do you agree?

Furthermore, I found a very useful paper you published in Nature Protocol 2013.
http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html

In the paper, you wrote:

d = DGEList(counts = count, group = samples\$condition)
d = calcNormFactors(d)
nc = cpm(d, normalized.lib.sizes = TRUE)

My question is:
In my case, should I add option "normalized.lib.sizes = TRUE" in cpm() after calling calcNormFactors() like the following:

dge <- DGEList(counts=count)
dge <- calcNormFactors(dge)
logCPM <- cpm(dge,log=TRUE,prior.count=5, normalized.lib.sizes = TRUE)

Many thanks for your help,
Shirley

-------------- next part --------------
A non-text attachment was scrubbed...
Name: all.count.logCPM.plotMDS.pdf
Type: application/pdf
Size: 4598 bytes
Desc: not available
-------------- next part --------------
A non-text attachment was scrubbed...
Name: all.count.logCPM.rmBatch.plotMDS.pdf
Type: application/pdf
Size: 4584 bytes
Desc: not available

ADD REPLYlink modified 4.5 years ago by Gordon Smyth37k • written 5.2 years ago by shirley zhang1.0k

Hi Shirley,

I beleive that "normalized.lib.sizes = TRUE" has become the default when calling the cpm function on a DGEList, so you should not need to specify it.

-Ryan

ADD REPLYlink modified 4.5 years ago by Gordon Smyth37k • written 5.2 years ago by Ryan C. Thompson7.3k

Dear Ryan,

I think you are right. As shown by ?cpm

## S3 method for class 'DGEList'
cpm(x, normalized.lib.sizes=TRUE, log=FALSE, prior.count=0.25,
...)

BTW, do you have experience about how to set "prior.count", e.g. 0.25 vs. 5?

Many thanks,
Shirley

ADD REPLYlink modified 4.5 years ago by Gordon Smyth37k • written 5.2 years ago by shirley zhang1.0k

Dear Shirley,

I have only used fold changes based on prior counts for descriptive purposes. My understanding is that edgeR uses the counts as is to do the statistics (i.e. p-values), but adds in the prior count when computing the logFC/logCPM values to avoid taking the log of zero and to avoid excessively large fold changes in low-abundance genes. However, I do not have a sense of what values are appropriate and I usually stick with the default.

-Ryan

ADD REPLYlink modified 4.5 years ago by Gordon Smyth37k • written 5.2 years ago by Ryan C. Thompson7.3k

Shirley,

I specifically recommended to you a particular choice for prior.count that I thought would be appropriate for your purpose.

Gordon

ADD REPLYlink modified 4.5 years ago • written 5.2 years ago by Gordon Smyth37k

Thank you, Dr. Smyth. I really appreciate your constructive and thoughtful reply.

Ryan, thank you too for your comments.

Best,
Shirley

ADD REPLYlink modified 4.5 years ago by Gordon Smyth37k • written 5.2 years ago by shirley zhang1.0k
Answer: Remove batch effect in small RNASeq study (SVA or others?)
0
5.2 years ago by
Gordon Smyth37k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth37k wrote:

Dear Shirley,

I would probably do it like this:

   library(edgeR)
logCPM <- cpm(y,log=TRUE,prior.count=5)
logCPM <- removeBatchEffect(logCPM, batch=batch)

Best wishes
Gordon

ADD COMMENTlink modified 4.5 years ago • written 5.2 years ago by Gordon Smyth37k

Dear Dr. Smyth,

Thank you very much for your quick reply. I did as you suggested by first getting log CPM value, then call removeBatchEffect(). I found the PCA figure looks better than before, but there is still a batch effect.

I attached two PCA figures. One is based on log10(raw count) which is before calling cpm() and removeBatchEffect(). Another one is after.

Could you look at them and give me more suggestions. Will a quantile normalization across samples be a good idea since CPM() is still a normalization only within each sample??

Thanks again for your help,
Shirley

-------------- next part --------------
A non-text attachment was scrubbed...
Name: all.count.log10.pca.pdf
Type: application/pdf
Size: 5034 bytes
Desc: not available
-------------- next part --------------
A non-text attachment was scrubbed...
Name: all.count.logCPM.rmBatch.pca.pdf
Type: application/pdf
Size: 5120 bytes
Desc: not available

ADD REPLYlink modified 4.5 years ago by Gordon Smyth37k • written 5.2 years ago by shirley zhang1.0k