Question

removeBatchEffect -- warnings and results

1

Entering edit mode

pavlidisp ▴ 10

@pavlidisp-8874

Last seen 10.3 years ago

Greece

Dear Bioconductor list,

I have a couple of questions regarding batch effect removal:

Question 1

I have 8 RNA-seq samples to analyze and detect genes that respond to some treatment in different genotypes (wt vs ko). The genotype is described by the vector

a <- c(1,1,1,1,2,2,2,2), the treatment by the vector: b <- c(1,2,1,2,1,2,1,2)

and the batch effect (which is coming from the fact that the same animal has been used in 2 samples) by the vector

batch<- c(1,1,2,2,3,3,4,4)

The effect of the batch is quite strong as it is shown in the mds plot:

http://139.91.162.50/problems/fig1.png

Thus, it is clear that the genotype effect is confounded by the batch effect, and therefore I cannot remove it. I guess this is the meaning of the warning:

Coefficients not estimable: batch3
Warning message:
Partial NA coefficients for 13533 probe(s)

However, it seems that

logCPMc <- removeBatchEffect(logCPM, batch=batch, design=design.nobatch)

, where design.nobatch <- model.matrix(~a*b) can remove the batch effect despite this warning. The MDS plot is as expected:

http://139.91.162.50/problems/fig2.png

Should I trust this batch effect removal process despite the warning message? Why did it produce some result (and how is this result produced), since it cannot estimate the batch effect due to confounding?

Question2

I run batchEffectRemoval on logCPM values: i.e.

y <- DGEList(counts=d1, group = groups)
y <- calcNormFactors( y )
logCPM <- cpm(y, log=TRUE, prior.count=5)

logCPMc <- removeBatchEffect(logCPM, batch=batchVector2, design=design.nobatch)

Is this correct, or should I use the y data structure instead? BTW when I use the y instead of logCPM, I get the following error message:

Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), :
'data' must be of a vector type, was 'NULL'

Thanks a lot in advance,

pavlos

batch effect edgeR • 4.8k views

ADD COMMENT • link 10.3 years ago pavlidisp ▴ 10

score 2 · Answer 1 · 2015-09-25

2

Entering edit mode

Aaron Lun ★ 29k

@alun

Last seen 2 hours ago

The city by the bay

The removeBatchEffect function is only intended to remove the batch effect for purposes of visualization. You shouldn't use it for your model fits or significance analyses. Instead, I would suggest fitting a model that includes the batch effect, while still considering genotype-specific effects of treatment:

genotype <- factor(c(1,1,1,1,2,2,2,2))
treatment <- factor(c(1,2,1,2,1,2,1,2))
batch <- factor(c(1,1,2,2,3,3,4,4))
design <- model.matrix(~0 + batch + genotype:treatment)
design <- design[,-(5:6)] # To get to full column rank

The first four coefficients represent the mouse-specific blocking factors, and aren't really interesting to us. The second-last coefficient represents the effect of treatment 2 over treatment 1 in genotype 1, while the last coefficient represents the corresponding treatment effect in genotype 2. You can drop these coefficients individually to get the effect of treatment in each genotype, or you can use a contrast vector like c(0, 0, 0, 0, 1, -1) to test for differences in treatment effects between genotypes.

ADD COMMENT • link 10.3 years ago Aaron Lun ★ 29k

0

Entering edit mode

Thanks Aaron,

I hadn't thought about the design matrix you suggest. Thanks a lot. This will work. Can this design matrix however capture effects of the genotype and treatment independent of each other? But perhaps of the confounding between genotype and batch it could be the best that I can do

ADD REPLY • link 10.3 years ago pavlidisp ▴ 10

0

Entering edit mode

Another option would be to explore the duplicateCorrelation approach with voom. This will allow you to fit a model where the genotype and treatment effects are represented by separate coefficients, i.e., ~genotype + treatment. However, it has its own drawbacks - namely, it assumes that all genes are correlated to the same consensus correlation (which probably isn't true, each gene would have its own correlation depending on its variance); and it will absorb mouse-to-mouse variability into the variance estimate, which will result in conservativeness. In contrast, putting the batch effect in the design matrix will absorb this variability into the coefficients rather than the sample variance.

ADD REPLY • link 10.3 years ago Aaron Lun ★ 29k

0

Entering edit mode

Hi Aaron,

I have similar problem as main post. Instead of genotype and treatment, I have cell type and time point, both of which are part of experimental design. The replicate prep batch are the batch here.

Just for visualization purpose, I used normalizeVSN() result as input for removeBatchEffect

design = model.matrix(~reps+cell*time, data=y$samples) 
cpmVSN_batchCorrected <- removeBatchEffect(cpmVSN, batch=reps, design=design)

I got similar warning as

Coefficients not estimable: batch1 batch2  
Warning message: Partial NA coefficients for 115622 probe(s)

I compared cpmVSN_batchCorrected and cpmVSN value, but no value was change, thus basically batchEffect removal failed. Do you know what caused the batch correction failure? I just want to remove the batch co-efficients effect for PCA plot (visualization).

ADD REPLY • link 6.9 years ago sckinta • 0

0

Entering edit mode

Please post a new question rather than resurrecting old threads.

ADD REPLY • link 6.9 years ago Aaron Lun ★ 29k

score 1 · Answer 2 · 2015-09-25

If you're doing an RNA Seq I'd expect you're going to do one of two analyses; differential transcript expression, or differential gene expression. If you have a batch effect present (although it sounds more like a 'donor' effect), you can use an additive model to account for the effect in fold change / P value calculations.

Trying to remove the batch effect probably isn't the best way to tackle the problem, it's more the done thing to account for it in your model designs.

Differential Gene Expression : DESeq2 - Additive Model ~Treatment + Batch

Differential Transcript Expression: Sleuth - Similar model as above.

score 0 · Answer 3 · 2015-09-25

Hi Andrew,

thanks a lot for your fast reply and for advices. I will try both deseq2 and sleuth. For differential expression analysis I'm modeling the response variable using the edgeR function

fit <- glmFit( d1, design)

where design <- model.matrix(~batch + genotype*treatment)

Again, the result is the same because the genotype and the batch is confounded. This I can understand.

Then, I thought to do an MDS and/or PCA plot just to check how strong the batch effect is and if I can remove it or take it into account. What I cannot understand is how removeBatchEffect "seems" to remove the effect (i.e. fig2.png in the original post) despite the confounding effect and the warning messages. And also, should I trust these results?