Removing batch effects with sva/ComBat and subsequent correlation analysis with WGCNA

This is a relatively theoretical question, given my naivety about the mathematical details in the papers for both SVA and WGCNA:

My main goal is to do a correlation analysis between genes via WGCNA (not necessarily limited to this R package) on 5 existing RNA-seq datasets from different groups, so this is also a meta-analysis. I found that the batch effect is obvious (shown by a simple PCA plot), and I am thinking of removing it with SVA (ComBat-seq).
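For reference, the PCA check was along these lines (a minimal sketch; the simulated `counts` and `batch` objects stand in for the real combined expression matrix and per-sample batch labels):

```r
# Minimal sketch: PCA on log-transformed expression, coloured by batch.
# `counts` and `batch` are simulated stand-ins for the real datasets.
set.seed(1)
counts <- matrix(rpois(2000, lambda = 50), nrow = 200)  # 200 genes x 10 samples
counts[, 6:10] <- counts[, 6:10] + 20                   # crude simulated batch shift
batch <- factor(rep(c("batch1", "batch2"), each = 5))

logexpr <- log2(counts + 1)
pca <- prcomp(t(logexpr))                               # samples in rows

plot(pca$x[, 1:2], col = batch, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "Samples coloured by batch")
```

With a real batch effect, samples cluster by batch along the first components, as in my case.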

My question is whether removing the batch effects with ComBat-seq (or any other means) will artificially introduce correlation between genes. In other words, do the algorithms used by these batch-effect removal tools adjust the gene expression values in a "correlation-aware" way, thus making every gene appear correlated with every other?

I do anticipate that removing batch effects will strengthen the correlation p-values for some true-positive genes, but I am also afraid that batch-effect removal will generate many artifacts, because my application is not to find differentially expressed genes but to do a correlation-based analysis.

Obviously, I may simply be conflating missing-value imputation with batch-effect correction. An explanation of the differences between these two "data cleaning" tools would be greatly appreciated. Thanks!


This is a relatively practical answer to your relatively theoretical question, and unfortunately, in my experience the answer is "it depends". The outcome mainly depends on how different these five datasets are to begin with. If they are substantially different (likely, imo, when dealing with data from different groups using different technologies and sample origins), my bet would be not on the introduction of too many false positives, but rather on a substantial portion of the batch effects remaining after correction.

There is, afaik, no imputation involved in what you are trying to do. Imputation is the process of generating artificial data in the case of missingness, such that the missing values do not "get in the way" of the planned analyses of the real data. A simple example of imputation is using the median of a feature for the NA values of that feature. Usually that is not the problem in sequencing experiments.

Batch correction, on the other hand, is the process of making the expression landscapes of multiple experiments (or chips, etc.), which may differ for technical reasons, more similar. An elementary prerequisite of batch correction is some sort of continuity between the batches. For instance, when planning a multi-chip sequencing of treatment vs control, one should balance the chips such that each chip contains samples from both groups. In the extreme case, imagine you had two chips, of which one has only controls and the other only treatment, and there is a technical batch effect because you sequenced them a couple of days apart. If you give these expression data to combat_seq and expect it to correct the batch effect (but not the treatment effect) between the two chips, how is the algorithm supposed to "know" which differences between the chips are due to batch effects, and which are due to treatment? Short answer: it can't. This is the sort of problem you may want to look into specifically for your analyses.
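To make the distinction concrete, here is a minimal sketch on toy data (`ComBat_seq` is the actual function name in the sva package; all object names are placeholders for your own data):

```r
library(sva)
set.seed(1)

## Imputation: fill in *missing* values, e.g. with the per-gene median.
expr <- matrix(rnorm(50), nrow = 5)          # 5 genes x 10 samples
expr[2, 3] <- NA                             # simulate a missing value
expr[2, is.na(expr[2, ])] <- median(expr[2, ], na.rm = TRUE)

## Batch correction: make *observed* values comparable across batches.
counts <- matrix(rpois(2000, lambda = 50), nrow = 200)  # 200 genes x 10 samples
batch  <- rep(c("A", "B"), each = 5)         # technical variable to remove
group  <- rep(c("ctrl", "trt"), times = 5)   # biological variable to protect
adjusted <- ComBat_seq(counts, batch = batch, group = group)
# `adjusted` is again an integer count matrix with the same dimensions.
```

Note the `group` argument: passing the biological condition is precisely what lets ComBat-seq preserve the treatment effect while removing the batch effect — which is only possible when the design is not completely confounded, as in the two-chip example above.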

Have you already attempted the batch correction on the five datasets and visualised the results, to see whether the batches are acceptably similar after correction?

