Check removeBatchEffect effectiveness
1
0
Entering edit mode
massimo • 0
@35c0d58e
Last seen 38 minutes ago
Denmark

Hello,

I am analyzing a longitudinal RNA-seq dataset with a large number of subjects (~ 2000) with several batch effects (technical covariates) I would like to adjust for in the context of an exploratory analysis of normalized log (pseudo)counts. The batch effects are both categorical, such as sequencing center, operator, library preparation date, and continuous (insert size, mean GC, etc...). I have run removeBatchEffect from limma accounting for all these covariates and including timepoint in the design matrix.

adj_tmm <- removeBatchEffect(log1p(tmm), batch=batch$a, batch2=batch$b, batch3=batch$c, covariates=batch[,4:6], design)

I am reproducing a previous analysis not performed by me to test the correlation between the covariates and the principal components (PCs) calculated on the normalized values, in my case before and after batch correction, since the association was reported to be significant with all the covariates taken into account. My doubt arises from the fact that after performing removeBatchEffect there is not a clear trend of loss of significance and/or correlation between the covariates and post-adjustment PCs. However, there seems to be a centering of the boxplots as should it be expected from the subtraction of batch effects, supported also from comparing the heatmaps. Is this behaviour expected and can proceed with the analysis or further investigation is required about the batch correction process?

I am not a biostatistician but for comparison, by fitting a basic lm() function modelling the same count matrix and accounting for all the covariates and then comparing their correlation with the residuals of the model, there is no more a significant correlation (at least with the continous covariates).

Thank you for the help Massimo

RNAseq limma removeBatchEffect • 558 views
ADD COMMENT
0
Entering edit mode

insert size, mean GC, etc...

This is not a classical "batch" in the common sense, at least I've never seen this to be corrected for explicitely. In my head this is nested with library prep batches. Maybe estimate surrogate variables with something like sva to see if an early SV captures the relevant unwanted variation rather than trying to correct for this individually. The advantage of this is that early SVs might capture both known batches and other hidden unwanted variation that cannot be assigned to a variable a priori. Especially with human data this is often necessary. Longitudinal data are unfortunately riddled by inevitable batches, so SVs might help dampening this a bit. The downside is that the number of SVs is arbitrary and needs to be defined by the user, but I would at least explore this option to see how observed variation compares to known covariates.

ADD REPLY
0
Entering edit mode
@gordon-smyth
Last seen 6 hours ago
WEHI, Melbourne, Australia

Your call to removeBatchEffect() is not correct because that function does not have an argument called batch3, meaning that the batch factor `batch$c$ has been ignored. If you have many batch effects, then the correct way to proceed is

removeBatchEffect(y, covariates=design.batch, design=design.timepoint)

where design.batch is the columns of the full design matrix corresponding to batch effects or covariates and design.timepoint is the design matrix for the conditions of interest, that you would have if there were no batch effects or covariates. With this approach, you can adjust for any number of batch factors. For the categorical batch effects, design.batch should contain indicator variables.

In any case, removeBatchEffect() is designed to remove batch effects while preserving the treatment differences of interest. If the covariates are correlated with the treatment conditions (timepoint in this case), then it is possible for the adjusted log-expression values to still have some correlation with the covariates. This is intended behaviour.

BTW, we do not recommend the use of log1p to compute logCPM values. Better to use cpm with log=TRUE, see edgeR::cpm() results in negative values from raw counts of 0 for a recent discussion.

ADD COMMENT
0
Entering edit mode

Thank you for the response. Unfortunately, if I create a design.batch object with the categorical batch effects and rerun removeBatchEffect, I see a warning: Partial NA coefficients for all probes and several coefficients are not estimable. It seems, from the linear model fitting, that the same coefficients are linearly dependent. Removing them from design.batch solves the issue.

ADD REPLY
0
Entering edit mode

Such a warning message normally does not cause any problems because removeBatchEffect will remove any redundant batch variables automatically.

However, your comments suggest to me that you are doing something wrong. I can't help you or diagnose the problem unless you show by way of code exactly what you did. Note in particular that design.batch is not a complete design matrix but should only be "the columns of the full design matrix corresponding to the batch effects".

ADD REPLY
0
Entering edit mode

Sorry for the delay, here it is my code which displays the warning message:

Normalization:

countsdge <- DGEList(counts, timepoint, genes)
keep <- filterByExpr(countsdge)
countsdge <- countsdge[keep,, keep.lib.size = FALSE]
countsdge <- normLibSizes(countsdge, method = "TMM")
TMM <- cpm(countsdge, log = TRUE, prior.count = 3)

Batch matrix (which transforms categorical batch effects in dummy variables and mantains numerical covariates):

design_batcheffects <- model.matrix(~0 + numeric(cov1) + numeric(cov2) + numeric(cov3) + factor(batch1) + factor(batch2) + factor(batch3), batch_df)

Design:

design_timepoint <-  model.matrix(~ factor(timepoint), batch_df)

Remove batch effect:

adj_TMM <- limma::removebatchEffect(TMM, covariates= design_batch, design=design_timepoint)
ADD REPLY

Login before adding your answer.

Traffic: 1201 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6