Batch Effect in DESeq2, how to control, how to remove it, how to address biological question
1
1
Entering edit mode
Mozart ▴ 20
@mozart-20625
Last seen 19 months ago

Hello there, let's start from scratch: In my experience (about 2 years of bioinformatics), I've always had to work on samples coming that, unfortunately, come from different batches (i.e.experiments performed in different days and/or samples coming from different hospitals). In my lab, a postdoc showed me how to deal with batch effect by using SVA package if the source of bias is unknown. I guess it is a brilliant idea to work with that, but if I don't go wrong this is not the best tool to cope with that and plus I may know the source of confounding (in other words, I know that samples are generated in different days/come from different hospitals).

My question is more technical and very likely linked to my inexperience. By slavishly following postdoc recommendations, I have always used this SVA package even though - as previously mentioned - it doesn't seem to be so convincing for my purposes. In other words, I would like to deal with this problem in a definite way; I have read a lot of threads opened both in Bioconductor and Biostars but there are a lot of question marks in my head.

1. Is designing of a model matrix a good way to start, by any chance? I have zero ideas on where to start and if you could help me with some advice, that would be grand! I really need your help guys, cause I am absolutely alone now and don't know who to ask. Before this post was created I have also started a discussion on Biostars and the conclusion was to ask some of the smart guys here.

2. My understanding is that in a linear model, gene expression is given by the sum of modelled variables (i.e. biological/batch factors) you have to take into account. So, in the simplest scenario:

gene expression ~ factor_1 + batch


Now, let's say I have the following condition (where batches I, II and III are different days of the year):

              factor_1                         batch
sample1       neutrophils_cancer_type1         I
sample2       neutrophils_cancer_type1         II
sample3       neutrophils_cancer_type2         I
sample4       neutrophils_cancer_type2         III
sample5       neutrophils_cancer_type1         I
sample6       healthy                          II
sample7       healthy                          II
sample8       healthy                          III
sample9       neutrophils_cancer_type2         II


Important note: I am using Kallisto to perform pseudoaligment and then DESeq2 via tximport (i.e. to generate a txi.kallisto.tsv count matrix that will be used as input for DESeq2). So, once you've generated your SampleTable, if your samples come from the same batch I know that you are ready to go with the following:

sampleTable$batch = factor(c("I","II","I","III","I","II","II","III","II")) dds = DESeqDataSetFromTximport(txi.kallisto.tsv, sampleTable, ~batch+condition)  1. One should have taken into account the fact that in your experiment there's a batch effect and thus genes that are differentially expressed are truly dependent by biological effect. Correct? Someone in Biostars told me I cannot use limma:removeBatchEffect() for differential expression analysis, neither in Limma nor in DESeq2. In fact according to Limma guys: This function is not intended to be used prior to linear modelling. For linear modelling, it is better to include the batch factors in the linear model 1. batch factors? what are they sorry? probably because, according to the guy who helped me in Biostars, doing differential expression analysis on batch-effect corrected data may lead to higher false-positive and false-negative results (but I don't have a background to say whether or not this speculation is correct). Thus, this allow me to conclude that the only thing I can do is removing batch effect in rlog- or vst-transformed dataset as a visualisation tool in my experiment. 2. Concluding, it's not clear to me if controlling for a batch effect means that I am telling DESeq2 to model gene expression considering that there are different batches in my analysis. Is the same if one says that I have removed the batch effect? Otherwise, in terms of biological results (low false positives/negatives and likelihood to truly address biological question), what’s the difference between controlling for batch effect and removing it from the analysis? What’s the usefulness of controlling for a bias in your experiment rather than removing the source of its bias? I think, at this stage, is just my misunderstanding! I know it's a long thread but I am really trying to understand a pivotal thing here. Thanks, deseq2 batch effect batch • 8.3k views ADD COMMENT 0 Entering edit mode This is really just a DESeq2 question. When you tag limma, you sent an email to the limma developers as well about your post. I’d like to encourage users to not send out emails to numerous package authors if their question really targets one of the packages specifically. ADD REPLY 0 Entering edit mode Sorry, I was not aware of this. I have just updated tags here. Hope this won't compromise the help from DESeq2 dev. ADD REPLY 0 Entering edit mode I'll be able to read and provide feedback tomorrow. ADD REPLY 0 Entering edit mode Thank you very much. I'm in your hands. ADD REPLY 3 Entering edit mode @mikelove Last seen 1 day ago United States Beginning from the top, you should generally use ~batch + condition as we do in the vignette, because the default for most functions is to look at the last variable in the design (although you can override by specifying which coefficient). And yes, just putting the batch in the design helps to control for changes in the counts due to batch, while you test for associations due to condition. You do not adjust the counts with removeBatchEffect but just put batch in the design. This person is using the term "batch factors" to mean a factor variable named batch. I don't follow any of the other points being carried over from Biostars. Putting batch in the design formula is our recommendation when batches are known and when they are not confounded with the condition. For more intuition on what ~batch + condition is doing in a linear model (or generalized linear model), I'd recommend checking out Rafa and my material on linear modeling: http://rafalab.github.io/pages/harvardx.html ADD COMMENT 0 Entering edit mode Hi Michael, and thanks for the useful hints. I will definitely check the link provided. And yes, just putting the batch in the design helps to control for changes in the counts due to batch, while you test for associations due to condition. I still don't understand if, by including batch covariate in the design matrix (sorry if I am writing this sentence wrong), I am 'changing' my results (i.e my list of differentially expressed genes) versus not including this variable in my model. The likelihood of getting more true positives is higher because I am telling DESeq2 to take into account for this supplementary source of variation. Correct? ADD REPLY 0 Entering edit mode This gets into how linear models work. I’ve posted a link to some online course material, beyond that it’s beyond the scope of this software support site. Another option would be for you to contact a local statistician who could help work through the mechanics of linear models with you. It’s very valuable to have this insight but the best is to either take some courses or read applied books on linear models. ADD REPLY 0 Entering edit mode Yes, I do agree with you Michael. Surely, the best thing to do is taking some courses for this, at the moment, I only have a smattering of stats not being my background. I hope this time my question will sound more DESeq2-related as I am still unsure whether or not putting batch in the design formula (that is, your recommendation) I can control for batch effect meaning that I can work around this issue (high likelihood to get true positives) or it is something that should be assessed case by case (especially by doing a downstream wet lab validation such as, RTqPCR). It may be a naive perspective, please don't blame myself, but it seems strange to me believing that miracle can be made by simply adding this batch variable in the design formula. On the other hand, if I have a clear, strong phenotype compared to a control one, I should be able to pick up some of the genes that truly change regardless of any batch effect. If my questions do nothing but confirming my inexperience in the field, please excuse me. I will try and seek around some local statistician as suggested, in that case. regards. ADD REPLY 0 Entering edit mode Hi Michael, sorry for re-upping this post but I may want to ask you further clarification about how to deal with this batch effect in DESeq2. My understanding tells me that when we want to control the batch effect in differential expression analysis with just need to include batch factor in the design matrix; on the contrary, in order to visualise our experiment we can use limma's remove batch effect function. first question, looking at previous cognate posts i think one should use the following string of code:  sampleTable <- data.frame(condition = factor(c("A","B","A","B","A","B"))) sampleTable$batch = factor(c('1','1','2','2','3','3'))

rownames(sampleTable) <- colnames(txi.kallisto.tsv$counts) dds <- DESeqDataSetFromTximport(txi.kallisto.tsv, sampleTable, ~condition) dds <- DESeq(dds) res = results(dds) vsd <- vst(dds) plotPCA(vsd, "batch") assay(vsd) <- limma::removeBatchEffect(assay(vsd), vsd$batch)
plotPCA(vsd, "batch")


then, after using this string of code I end up with a table like this. As you can appraise, values are pretty strange...I would say 'specular' for samples per batch. Se how, for sample1, PC1 is exactly opposite to PC1 for sample2. Similar story for PC2 and so on.

                     PC1        PC2       batch
sample1  -3.5344562  2.9484894 1
sample2   3.5344562 -2.9484894 1
sample3   4.5476438  5.4235637 2
sample4  -4.5476438 -5.4235637 2


second question, should we use a previously batch effect-controlled dds object (i.e. one with ~batch+condition) or should we use a dds object not corrected for batch effect (i.e. just ~condition)? I am experiencing the aforementione issue, either way.

0
Entering edit mode

when we want to control the batch effect in differential expression analysis with just need to include batch factor in the design matrix; on the contrary, in order to visualise our experiment we can use limma's remove batch effect function.

yes. Here's the code we provide:

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-after-vst-are-there-still-batches-in-the-pca-plot

2) In order to calculate the VST, you provide a design, which is used to estimate dispersion. I would use the full design ~batch + condition. The confusing part is that this does not remove the batch effects, it just uses that information to properly estimate the dispersion. So I would use the full design, and use blind=FALSE when running vst, but then follow the link above for removing the batch variation for plotting.

0
Entering edit mode

oh, sorry Michael, I have just noticed that I got my first question truncated for some reasons. Would you mind to re-read the above post, please? thanks

0
Entering edit mode

I don't know why you're getting the strange effect with sample 1 and 2. If these were typical RNA-seq data you're importing, I have no reason to think why this would ever happen. Can you explore a bit more? Does it look like this before you use removeBatchEffect? What is the dimensionality of dds?

0
Entering edit mode

dim (dds)
returns  ~30000 rows and 12 columns


by not including batch variable in the design (i.e. just ~condition in the dds object) and doing the following

vsd <- vst(dds)
x = assay(vsd)


I have some funny results, for example very similar number for some genes across the whole experiments or even identical ones. (e.g. gene A gets value '3.567840' in all the samples!). So there's something strange even before the correction. Any ideas about the problem, here?

0
Entering edit mode

That value is probably what a zero count is mapped to, not an issue for any downstream analysis.

What’s the PCA plot look like before and after remove batch effect? Do you see the mirroring in both?

1
Entering edit mode

Sorry for resurrecting this old post but I may still have the same problem here.

Briefly, by doing this:

sampleTable <- data.frame(condition = factor(c("A","B","A","B","A","B")))
sampleTable$batch = factor(c('1','1','2','2','3','3')) rownames(sampleTable) <- colnames(txi.kallisto.tsv$counts)

dds <- DESeqDataSetFromTximport(txi.kallisto.tsv, sampleTable, ~condition)
dds <- DESeq(dds)
res = results(dds)

vsd <- vst(dds)
plotPCA(vsd, "batch")
assay(vsd) <- limma::removeBatchEffect(assay(vsd), vsd\$batch)
plotPCA(vsd, "batch")


I get a strange 'mirror effect' when visualising PCA (dots from different groups - but from the same batch - are exactly in the same position but inverted, as looking one another at the mirror).

                 PC1        PC2       batch
sample1  -3.5344562  2.9484894 1
sample2   3.5344562 -2.9484894 1
sample3   4.5476438  5.4235637 2
sample4  -4.5476438 -5.4235637 2
...


Not sure why but it looks really artificial to me.

Finally, is SVA package recommended just for visualisation purposes? Sorry for naive questions, not an expert.

1
Entering edit mode

Not sure about the PCA artifact you see in your data.

Re: SVA take a look at their documentation, and maybe post a new thread about the purpose of that package.

0
Entering edit mode

Thanks again. Hope to fix that problem soon.

0
Entering edit mode

Sorry, I only see this mirroring effect only after using Limma::removeBatchEffect function.