Question

Subject: Strange glitch in Voom curve resulting from batch effects

0

Entering edit mode

raf4 ▴ 30

@raf4-8249

Last seen 2.9 years ago

United States

Dear Bioconductor List,

I am analyzing an RNAseq experiment taken from two different batches. I have encountered
a strange voom plot and cannot get rid of it by taking batch effects into account as a co-variate. I will present the relevant information, and then ask some questions at the end.

All images can be found at:

http://imgur.com/a/4lsBa

Only 1 conditon in the first batch is included in the comparison. The target file (abridged) is:

FileName   Name   Target   Batch
JS006   JS006   b.dor.controlchow.pac   1
...
JS220   JS220   h.dor.acychow.pac   2

There was ome difference in the genes expressed between batch
1 and 2, so I merged the raw count matrices in R and changed
an NAs to zeros.

When I ran Limma-Voom with weights and without batch as a factor I got this strange
non-monotonic voom curve which is entitled:

barplot.acy_rnaseq.voomplot.bothbatches_nobatchfactor.png

When I ran just batch 2 I got an ordinary voom curve
which is entitled barplot.acy_rnaseq.voomplot.newbatchonly.png.

I therefore concluded that the glitch in the first graph is
due to batch effects.

I then tried batch adjustment in Limma, using batch as a
covariate.

MDSplot_bybatch_log_no_correction.pdf
shows that there is a pronounced batch effect.

I appleid a batch correction with the removeBatchEffect
command.
MDSplot_bybatch_log_correction.pdf shows that there is a pronounced batch effect.

I then tried Limma-Voom using Batch as a covariate and I got the same kind
of strange Voom plot (barplot.acy_rnaseq.voomplot.batchcorrect.png). Plus an error message:

> v<-voomWithQualityWeights(y,design=des,normalization="none",plot=TRUE)
Coefficients not estimable: h.dor.acychow.pac
Coefficients not estimable: h.dor.acychow.pac
Warning messages:
1: Partial NA coefficients for 15380 probe(s)
2: Partial NA coefficients for 15380 probe(s)

Here is a record of my session (abridged for charatcer limit)
then my questions:

R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin13.4.0 (64-bit)

> library(limma)
> library(edgeR)
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X Mavericks 10.9.5

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] edgeR_3.16.5 limma_3.30.7

loaded via a namespace (and not attached):
[1] grid_3.3.2 locfit_1.5-9.1 lattice_0.20-34
> targets<-readTargets("targetsbatch.txt")

> Counts<-read.table("counts.txt",header=T,sep="\t")

> png("barplot.acy.png")
> barplot(colSums(Counts)*1e-6,names=1:25,ylab="library size (millions)")
> dev.off()
null device
1
> y<-DGEList(counts=Counts, genes=rownames(Counts))
> isexpr<-rowSums(cpm(y)>1) >= 5
> hasannot<-rowSumsis.na(y$genes))==0
> y<-y[isexpr & hasannot,]
> y<-calcNormFactors(y)
> # Now predict result with batch effects removed
> logCPM<-cpm(y,log=TRUE,prior.count=5)
> plotMDS(logCPM,labels=targets$Batch,col=c(rep("red",5),rep("blue",20)))
> logCPMadj<-removeBatchEffect(logCPM,batch=targets$Batch)
> plotMDS(logCPMadj,labels=targets$Batch,col=c(rep("red",5),rep("blue",20))
> f<-factor(targets$Target,levels=c("b.dor.controlchow.pac","e.dor.controlchow.noinjec","f.dor.acychow.noinjec","g.dor.acychow.veh","h.dor.acychow.pac"))
> batch<-factor(targets$Batch)
> des<-model.matrix(~batch+f)

> colnames(des)<- c("(Intercept)","batch2","e.dor.controlchow.noinjec","f.dor.acychow.noinjec","g.dor.acychow.veh","h.dor.acychow.pac")
> des
   (Intercept) batch2 e.dor.controlchow.noinjec f.dor.acychow.noinjec g.dor.acychow.veh h.dor.acychow.pac
1            1      0                         0                     0                 0                 0
...
25           1      1                         0                     0                 0                 1

> png("barplot.acy_rnaseq.voomplotnew.png")
> v<-voomWithQualityWeights(y,design=des,normalization="none",plot=TRUE)
Coefficients not estimable: h.dor.acychow.pac
Coefficients not estimable: h.dor.acychow.pac
Warning messages:
1: Partial NA coefficients for 15380 probe(s)
2: Partial NA coefficients for 15380 probe(s)
> dev.off()
null device
1
>

Questions:

1. Did I make a mistake in the script (especially the model) that led to the error message,if so, what, and how may I correct it
2. If not, is it possible to correct these samples using batch as a
co-variate or is doing so prevented by teh unbalanced design?
3. Should I simply filter out enough genes withg low counts so the curve fits?
4. Do you have any other suggestions (I understand from previous
posts that use of Combat for normalization is counterindicated with Voom).

Thank you as always,
Rich
Richard A. Friedman, PhD

limma voom batcheffect covariate • 2.8k views

ADD COMMENT • link updated 8.6 years ago by Gordon Smyth 53k • written 8.6 years ago by raf4 ▴ 30

0

Entering edit mode

Were your two batches of data quantitated using the same pipeline?

ADD REPLY • link 8.6 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Steve,

Thank you for our immediate reply. As far as I know, "Yes",

because it was done with the same protocol at the same core.

However, I will query the experimentalist directly.

Best wishes,

Rich

ADD REPLY • link 8.6 years ago raf4 ▴ 30

0

Entering edit mode

The suspicious thing for me is that you needed to do some manual munging/merging of the two datasets and account for NAs after that, which tells me that there were some genes that were quantitated on one side that weren't in the other (or vice versa).

All things being equal, if you're working with raw data coming out of the same pipeline, I would expect you to have a count matrix with the same rows (with the same identify), even if there were 0 counts for some genes, they should probably still be 0.

So, I'd double check that.

While I can't be sure, I imagine the cloud on the left might be those same genes that you had to fiddle 0's to NAs -- maybe.

In any event. Once you double check that, note that you must always filter out lowly expressed genes anyway when using voom.

ADD REPLY • link 8.6 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Steve,

Most of the genes were the same. I forget the number of NAs but

I can check. fewer than 1,000 I think. There are 5 replicates of one condition in the old batch

and 5 replicates each for 4 conditions in the new batch. The filter I used for low expressing genes is:

isexpr<-rowSums(cpm(y)>1) >= 5

What filter would you suggest?

But your points are well taken.

If no differnences show in the protocol, I will try limiting genes studied to those which

appear in both batches, and see if I still have a batch effect.

Does that sound reasonable?

Also, if you (or other readers) have any insight into how I got the error message in

with the voom command, apparently due to an error in the linear model,

I would greatly appreciate it.

Thanks and bets wishes,

Rich

ADD REPLY • link updated 8.6 years ago by Gordon Smyth 53k • written 8.6 years ago by raf4 ▴ 30

Gordon Smyth · Answer 1 · 2017-05-05

1

Entering edit mode

Gordon Smyth 53k

@gordon-smyth

Last seen 2 hours ago

WEHI, Melbourne, Australia

I suspect that there is a major difference in protocols or processing between the batches. A first step to exploring this would be to identify which genes are most DE between the batches (for the same group).

However you cannot use the first batch anyway. Batches always need to contain samples for more than one group in order to contribute any new information to an analysis after batch correction. Your first batch contains samples for one group only, so it cannot contribute any information. Any batch correction method will have the same problem.

Bottom line: just analyse the second batch by itself.

ADD COMMENT • link 8.6 years ago Gordon Smyth 53k

0

Entering edit mode

Steve and Gordon,

Thank you for your answers. I just remembered that the first batch was mapped to ENSEMBL and then translated to NCBI gene symbols. The second was mapped NCBI genes. That must be the problem. I will analyze them in a compatible way henceforth. If batch corrections are necessary, I will also include more than one condition from the first batch. I have to do another task today, but I iwll let you know how it comes out.

Bets wishes,

Rich

ADD REPLY • link updated 8.6 years ago by Gordon Smyth 53k • written 8.6 years ago by raf4 ▴ 30

1

Entering edit mode

OK, good luck. Personally I would realign all the FastQ files from scratch to make them compatible. Or, at very least, quantify both sets of BAM files to the same gene annotation.

ADD REPLY • link 8.6 years ago • updated 8.5 years ago Gordon Smyth 53k

1

Entering edit mode

Steve and Gordon,

I have since found out that the count files were indeed produced by the sequencing core here using different alignment algorithms and different assemblies (gtf files)! I have suggested that the investigator whom I am helping ask the core to realign the older read data with the newer method. If this is not done, I will do this myself.

Thank you both again,

Rich

ADD REPLY • link 8.5 years ago raf4 ▴ 30

1

Entering edit mode

Steve and Gordon,

When the reads were aligned with the same algorithm, the glitch disappeared.

Thanks and best wishes,

Rich

ADD REPLY • link 8.5 years ago raf4 ▴ 30

0

Entering edit mode

Great, glad to hear that and thank you for taking the time to follow up and close the loop on this.

ADD REPLY • link 8.5 years ago Steve Lianoglou ★ 13k