Batch correction by using ComBat function in sva package
2
0
Entering edit mode
@rajeshparmar4-22077
Last seen 17 months ago
UCLA

I am working on RNA-seq data analysis, I have 2 known batches (two different days for RNA sequencing) for my analysis. I run the ComBat function to remove the batch effect on normalized dds from DESeq2 or unnormalized dds. Now, I am unable to do create the DESeqDataSet from combatedata for further analysis. When I am using DESeqDataSetFromMatrix for creating DESeq2 data object by using combatedata, I am getting error: Error in DESeqDataSet(se, design = design, ignoreRank) : some values in assay are negative

deseq2 combat Knownbatcheffect Tutorial • 4.0k views
2
Entering edit mode
Yuqing Zhang ▴ 20
@yuqing-zhang-6999
Last seen 5 months ago
China

Hi,

I work with Dr. Evan Johnson who developed ComBat. We are currently working on a new version of the function, called ComBat-Seq, which addresses this issue. ComBat-Seq is able to remove batch effects while maintaining the integer counts, so that the cleaned data can be used as inputs to DESeq2. ComBat-Seq functions are available at the GitHub below:

https://github.com/zhangyuqing/ComBat-seq

Please feel free to try it out, and let me know if you have any questions (at yuqingz@bu.edu).

Thanks, Yuqing

0
Entering edit mode
@kevin
Last seen 2 hours ago
Republic of Ireland

The error is thrown because a negative count value makes no sense in RNA-sequencing data analysis. DESeq2 expects positive integer count values.

The negative values are produced by Combat as it attempts to correct for batch. Remember, though, that Combat was developed for microarray data, which is typically measured on the log [base 2] scale.

Your best option is to avoid the use of Combat and to instead include batch as a covariate in your design formula. By doing this, the effect of batch will be taken into account when your test statistics are derived. The regression model essentially treats it as a covariate and makes adjustment thus. This may not work, however, if the batch effect is large and inconsistent.

You will find many questions on this topic on both Biostars and Bioconductor, if you search using your search engine of choice. If you need to, later, directly adjust your expression levels for batch in your workflow, then may I suggest the use of limma::removeBatchEffect(), as exemplified here: Why after VST are there still batches in the PCA plot?.

Kevin

0
Entering edit mode

Thank you so much Kevin for your quick response. I had already used the limma::removeBatchEffect() for removing batch effect in two groups.

## Adjusting for batch effects with VST

vsd <- vst(dds)

plotPCA(vsd, "Batch")

assay(vsd) <- limma::removeBatchEffect(assay(vsd), vsd$Batch) plotPCA(vsd, "Batch") Now, in PCA plot I am able to see the batch correction in my two batches. Can you suggest this is the right strategy? ADD REPLY 0 Entering edit mode Hey, that seems fine - do not worry. Just to check: you are no longer using ComBat anywhere in your workflow? ADD REPLY 0 Entering edit mode Thank you so much. Right now I am not using the ComBat function for my analysis. I am also using the sva package for correcting the batch. The script I used is: # in order to use SVA to remove any effect on the counts from our surrogate variables ddssva <- dds ddssva$SV1 <- svseq$sv[,1] ddssva$SV2 <- svseq$sv[,2] design(ddssva) <- ~ SV1 + SV2 + Batch + outcome_txt ## surrogate variables by running DESeq with the new design ddssva <- DESeq(ddssva) ## to see results resultsNames(ddssva) Can you give your suggestions regarding this?? ADD REPLY 0 Entering edit mode Do you suspect that there are surrogate variables in your data, though?; how much variance is explained by SV1 and SV2; is modeling batch alone not sufficient (and then removing the effect via limma::removeBatchEffect())? ADD REPLY 0 Entering edit mode Hello Kevin, After adjusting the batch effect by using VST. I am not getting differential gene expression. vsd <- vst(dds) plotPCA(vsd, "Batch") assay(vsd) <- limma::removeBatchEffect(assay(vsd), vsd$Batch)

plotPCA(vsd, "Batch")

after this script. How I will get differential gene expression? Again I have to run DESeq2 or not? If, yes then how I will create the dds object by using batch corrected matrix?

0
Entering edit mode

You should have conducted a differential expression analysis before you run those commands that you have pasted. You do not have to re-create the dds object with the batch-corrected matrix.

To recap:

1. create the dds object with the original raw count matrix and include batch in your design formula. Your design formula may look like, for example, ~ condition + batch
2. conduct a differential expression analysis in the normal way for your factor of interest (e.g., condition) - the effect of batch will be 'adjusted' when the statistical inferences are made due to the fact that batch is included in the design formula)
3. run the commands above (the ones with removeBatchEffect()) if you want to use your data matrix downstream for other purposes.
0
Entering edit mode

Thank you so much Kevin.

0
Entering edit mode

Hello, Whenever I am using the SVA to remove the batch effect, I am getting a higher number of the differential genes as compared to without the removal of batch effects. By using this script on same data:

ddssva <- dds
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + Batch + outcome_txt
ddssva <- DESeq(ddssva)


Do you think this approach is right for getting my differential gene expression?

0
Entering edit mode

Can you please reformat your code to make it more presentable? Edit your post, highlight the code, and then click on the 101 010 button. Also, be sure that you have searched for the answer to your question before re-posting here. Thank you!

0
Entering edit mode

Thank you for editing. What evidence have you that there are 2 extra statistically significant surrogate variables in your data? You would not want to 'over-adjust' when making statistical inferences, which is what you may be doing.

To re-capitulate my point: doing the following should be sufficient to account for batch:

~ outcome_txt + Batch


By including extra surrogate variables without major justification, you may actually be eliminating the very effect that you want to measure.

0
Entering edit mode

Thank you so much for your response Kevin, OK, then this is not the right way to do the same. Whenever I am using this script and I am getting a higher number of differential gene expression and log2Foldchange is quite high as compare to without sva.

The following script is the right way or not??

    ddssva <- dds
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + Batch
ddssva <- DESeq(ddssva)


Thanks, Raj

1
Entering edit mode

...but how are you going to perform a differential expression analysis to test for outcome_txt with a formula like this: ~ SV1 + SV2 + Batch? You have to include, in your design formula, the condition / trait, the sub-groups of which, you want to test for differential expression.

At this point, I can only ask that you read over my original answer and all of my subsequent comments. It is apparent that you are not fully understanding the process of batch adjustment / correction - you are not even answering the questions that I am putting back to you.

0
Entering edit mode

Yes, I am not able to fully understand the process of batch corrections. I don't have so much experience regarding this, I just started working on RNA seq data. I am trying to analyze my RNA -seq data in which two known batches are there for two dates of experiments.

In which, my condition is outcome_txt, whenever I am using SVA, I am getting a high number of differential expression of genes between my groups.

I am just asking, whether by using sva in my analysis, I am doing right analysis or not?? my script for analysis condition:outcome_txt, batches:Batch(1&2) is:

ddssva <- dds
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + Batch + outcome_txt
ddssva <- DESeq(ddssva)

1
Entering edit mode

Oh, I know, but, my question to you regarding the inclusion of the surrogate variables (SV1 and SV2) was this: what evidence do you have that justifies the inclusion of these surrogate variables? You already have your batch variable included.

My worry was, that, the unnecessary inclusion of variables in the design formula can result in an 'over-adjustment' when deriving the statistical inferences for your main factor / condition of interest, i.e., outcome_txt.

Over the years, I have seen many researchers becoming somewhat 'obsessed' about batch-correction and adjusting for unknown artifacts in their data (I was like this, too).

The best approach to mitigate these extraneous effects is, of course, a good experimental design, something which appears to be lacking in many studies.

0
Entering edit mode

Hello, Can you suggest the right approach for this kind of analysis? Experimental plan:

1. Two groups in which I want differential gene expression: outcome_txt

2. Two batches for the two different experimental dates: Batch

It will be very helpful for me If you provide an exact pipeline for the analysis of RNA seq data.

Thanks Rajesh