Question

What does evidence of a batch effect in bulk RNA-seq look like?

BioinfGuru ▴ 70

@yagalbi-11519

Last seen 16 months ago

Ireland

Hi all,

I've also posted this on Biostars, as I'm not sure it is appropriate here. Let me know if this question needs removing.

About the data: I have 5 tissues, over 100 samples, and 2 variables of interest: RFI (High, Low) and Trial (1, 2). The trial variable is basically a surrogate for genotype, as the main difference between trials is the genotype of the animals. All samples were collected, and then processed in the lab, by the same person. I don't know the sex of each animal (but that can be obtained from the data with a bit of work). I have no other batch information.

My question: I don't want to apply sva to model a hidden batch until I am confident there actually is a hidden batch. The problem is, I need guidance to know what evidence of a hidden batch looks like. I have read that hidden batches should be evident after exploratory data analysis. For clarification, I'm showing plenty of EDA images here to help my own understanding of replies.

Thank you all in advance, Kenneth

Exploratory Data Analysis results

PCA separates the intestinal tissues from liver and muscle very well (1 outlier has now been removed), but there is no clear separation between ileum and jejunum, even when the intestinal tissues are plotted without liver and muscle:

[Figures: PCA of all 5 tissues; PCA of the 3 intestinal tissues only]

Within individual tissues, the PCAs show some clustering by the variables of interest, but I don't see any extra groups, or groups of samples sitting way off by themselves (which I think would be evidence of a hidden batch effect):

[Figure: PCA of duodenum]

The heatmaps, however, are where I need a bit of guidance. Duodenal tissue clusters weakly by trial, but ileum, jejunum, and muscle show strong clusters not attributable to the variables of interest. Can I consider this evidence of a hidden batch in those tissues, or could it just be biological signal that is stronger than the variables of interest? Should I use sva on these tissues or not?

[Figures: heatmaps for duodenum, ileum, jejunum, liver, and muscle]

surrogatevariables DESeq2 BatchEffect bulkrna-seq sva-seq • 2.5k views

ADD COMMENT • link updated 17 months ago by jessica.anderson ▴ 10 • written 19 months ago by BioinfGuru ▴ 70

score 1 · Answer 1 · 2024-07-29

Michael Love 43k

@mikelove

Last seen 2 days ago

United States

Check out variancePartition:

https://www.bioconductor.org/packages/release/bioc/html/variancePartition.html

This in combination with known or inferred batch can be informative about the extent of batch effects.
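To make the idea concrete, here is a minimal sketch of the kind of question such an analysis answers: for one gene, what fraction of the expression variance is associated with each known or inferred variable? This is not the variancePartition API (that is an R package and models all variables jointly); it is a simplified one-variable-at-a-time stand-in written in Python. The function name `variance_fraction`, the toy values, and the `inferred_sex` labels are all hypothetical.

```python
from statistics import mean

def variance_fraction(values, labels):
    """Fraction of the total variance in `values` associated with the grouping
    in `labels` (between-group sum of squares / total sum of squares)."""
    grand_mean = mean(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    if total_ss == 0:
        return 0.0
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    between_ss = sum(
        len(members) * (mean(members) - grand_mean) ** 2
        for members in groups.values()
    )
    return between_ss / total_ss

# Hypothetical log-expression of a single gene across 8 samples.
expr = [5.0, 5.6, 7.0, 7.6, 5.1, 5.5, 7.1, 7.5]
trial = [1, 1, 2, 2, 1, 1, 2, 2]
rfi = ["High", "High", "High", "High", "Low", "Low", "Low", "Low"]
# Assumption: sex inferred from the data and treated as a candidate batch.
inferred_sex = ["F", "M", "F", "M", "F", "M", "F", "M"]

fractions = {
    "trial": variance_fraction(expr, trial),
    "rfi": variance_fraction(expr, rfi),
    "inferred_sex": variance_fraction(expr, inferred_sex),
}
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.3f}")
```

Repeating this over all genes and looking at the distribution per variable gives a rough picture of how much a candidate batch matters relative to the variables of interest. A large unexplained remainder across many genes would be one reason to look further.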

ADD COMMENT • link 18 months ago Michael Love 43k

score 1 · Answer 2 · 2024-07-30

jessica.anderson ▴ 10

@61775469

Last seen 20 days ago

United States

You should generally expect clustering based on biological differences (like tissue type), so from the PCA plots there doesn't seem to be a strong batch effect. The heatmaps do seem to show some clustering by trial (which would be the easiest batch effect to explain, given what you have said about the experiment), but they also seem to show some clustering by low/high RFI. Remember that on a heatmap the position of the branches is arbitrary: the top and bottom branches can pivot at the point where they diverge. Doing this, you can start to see how more of the low/high samples would cluster next to each other. The clustering may still not be as strong as would be nice to rule out a batch effect, but it seems to me that any batch effect you have may be pretty weak.
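The point about branch positions being arbitrary can be illustrated with a small sketch. It represents a dendrogram as nested pairs and shows that pivoting a branch changes the left-to-right order of the samples without changing which samples cluster together. The tree and the sample names are hypothetical, and this is plain Python rather than the output of any particular heatmap function.

```python
def leaf_order(tree):
    """Left-to-right order of the leaves of a dendrogram given as nested pairs."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaf_order(left) + leaf_order(right)

def clusters(tree):
    """Every cluster implied by the tree, as a set of frozensets of leaf names."""
    if isinstance(tree, str):
        return {frozenset([tree])}
    left, right = tree
    return clusters(left) | clusters(right) | {frozenset(leaf_order(tree))}

def flip(tree):
    """Pivot the two branches at the top of `tree`."""
    left, right = tree
    return (right, left)

# Hypothetical dendrogram of four samples.
original = (("Low_1", "High_1"), ("Low_2", "High_2"))
left_branch, right_branch = original
rotated = (left_branch, flip(right_branch))  # pivot only the right-hand branch

print(leaf_order(original))
print(leaf_order(rotated))
print(clusters(original) == clusters(rotated))
```

In the drawn order of `original` the High and Low samples alternate, while in `rotated` the two High samples sit next to each other, even though the clustering itself is identical.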

What you can do is use a batch exploration tool, like BatchQC, to compare exploratory analyses on your original data set with those on batch-corrected data sets. If batch correction is appropriate, you should see stronger correlations with your variables of interest after applying a batch correction model than in your raw data. BatchQC is an R Shiny package designed specifically so that you can compare results when deciding how best to move forward with downstream analysis, and it includes many exploratory tools (including PCA and heatmaps).
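A toy version of that before/after comparison might look like the following. It is only a sketch under strong assumptions: the batch is known and balanced across conditions, the "correction" is simple per-batch mean-centering (not the correction models BatchQC offers), and the association measure is a plain between-group variance fraction. The names `explained_fraction` and `center_within_batch` and all values are hypothetical.

```python
from statistics import mean

def explained_fraction(values, labels):
    """Between-group sum of squares divided by total sum of squares."""
    grand_mean = mean(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    if total_ss == 0:
        return 0.0
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    between_ss = sum(
        len(members) * (mean(members) - grand_mean) ** 2
        for members in groups.values()
    )
    return between_ss / total_ss

def center_within_batch(values, batches):
    """Crude batch correction: subtract each batch's mean from its samples."""
    groups = {}
    for value, batch in zip(values, batches):
        groups.setdefault(batch, []).append(value)
    batch_means = {batch: mean(members) for batch, members in groups.items()}
    return [value - batch_means[batch] for value, batch in zip(values, batches)]

# Hypothetical gene: condition shifts expression by about 1, batch B by about 3.
expr = [6.0, 6.2, 5.0, 5.1, 9.1, 8.9, 8.0, 8.1]
condition = ["High", "High", "Low", "Low", "High", "High", "Low", "Low"]
batch = ["A", "A", "A", "A", "B", "B", "B", "B"]

before = explained_fraction(expr, condition)
after = explained_fraction(center_within_batch(expr, batch), condition)
print(f"condition signal before correction: {before:.3f}")
print(f"condition signal after correction:  {after:.3f}")
```

Here the condition signal is much clearer after correction because a real batch shift was simulated. If the two numbers barely differ on real data, that is consistent with a weak or absent batch effect.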

ADD COMMENT • link 18 months ago jessica.anderson ▴ 10

any batch effect you have may be pretty weak

Thank you. I actually ended up running SVA (be and leek methods) on the data and reproduced the plots. In the end, there was negligible difference regardless of the number of SVs included. So yes, I think if any batch effect is there, it is weak. (I had the impression that running SVA would brute-force the data to cluster by my variables of interest... clearly not.) The DESeq2 results also reflect the same pattern: almost no DEGs in muscle, plenty of DEGs due to trial in the intestinal tissues, and plenty of DEGs in liver due to RFI.

So, considering the negligible effect of SVA (as evidence of a weak or absent batch effect) and the DESeq2 results, I think the original plot for muscle, for example, is just showing that RFI/trial do not have an important effect: normal biological processes are influencing the clustering far more strongly.

I now think the patterns in the original plots above are not due to technical artifacts, but rather to RFI/trial having less influence than normal biological processes, e.g. in muscle.

ADD REPLY • link 18 months ago BioinfGuru ▴ 70

No, that's not what you should expect. SVA is meant to remove excess variability that is not explained by the expected groups, not force data to fulfill your preconceived notions of what the groups should be.
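A rough sketch of that idea, under stated assumptions: remove the known group means from each gene, then look for the dominant pattern across samples in what is left. This is only the core intuition, not the sva algorithm itself (which does considerably more, including estimating how many surrogate variables to use). The simulated data, the hidden batch, and all function names are hypothetical.

```python
import random
from statistics import mean

def residualize(row, labels):
    """Subtract each known group's mean, leaving variation the groups do not explain."""
    groups = {}
    for value, label in zip(row, labels):
        groups.setdefault(label, []).append(value)
    group_means = {label: mean(members) for label, members in groups.items()}
    return [value - group_means[label] for value, label in zip(row, labels)]

def leading_sample_pattern(matrix, iterations=100):
    """Leading right singular vector of a genes x samples matrix, by power iteration."""
    n_samples = len(matrix[0])
    vec = [1.0 / (j + 1) for j in range(n_samples)]  # arbitrary starting vector
    for _ in range(iterations):
        scores = [sum(row[j] * vec[j] for j in range(n_samples)) for row in matrix]
        new = [sum(s * row[j] for s, row in zip(scores, matrix)) for j in range(n_samples)]
        norm = sum(x * x for x in new) ** 0.5
        if norm == 0:
            break
        vec = [x / norm for x in new]
    return vec

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(1)
rfi = ["High"] * 4 + ["Low"] * 4
hidden_batch = [0, 1, 0, 1, 0, 1, 0, 1]  # simulated; unknown to the analyst

# Simulate 50 genes with an RFI effect, a hidden batch effect, and noise.
expression = []
for _ in range(50):
    rfi_effect = random.uniform(-1, 1)
    batch_effect = random.uniform(1, 3)
    expression.append([
        8.0 + (rfi_effect if group == "High" else 0.0)
        + batch_effect * in_batch + random.gauss(0, 0.2)
        for group, in_batch in zip(rfi, hidden_batch)
    ])

residuals = [residualize(row, rfi) for row in expression]
surrogate = leading_sample_pattern(residuals)
batch_correlation = abs(pearson(surrogate, hidden_batch))
rfi_correlation = abs(pearson(surrogate, [1 if g == "High" else 0 for g in rfi]))
print(f"|correlation| with hidden batch: {batch_correlation:.3f}")
print(f"|correlation| with RFI:          {rfi_correlation:.3f}")
```

Because the known group means are removed first, the recovered pattern cannot track RFI; it can only describe leftover variation. In a real analysis such a variable would be added to the model as a covariate, which adjusts for the extra variability rather than pushing samples toward the expected groups.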

As Jessica already noted, there isn't much evidence for batch effects. But I do see some evidence for possible sample mis-labeling. You have some jejunum and ileum samples partying with the duodenum samples, which seems a bit suspect, given how cleanly separated the groups are. If this were my analysis I would be wondering about those.

ADD REPLY • link 18 months ago James W. MacDonald 68k

Thanks James. The ileum sample was removed, there was other evidence of an issue with that sample (it isn't in the heatmaps), but I left in the 2 Jejunum samples. You might notice on that second PCA with the 3 intestinal tissues, even though it is a wide spread, they still cluster by trial with the other Trial 1 jejunum samples. So there is some ambiguity there. It is tempting to remove those 2 samples, but I feel like I would be cherry-picking.

ADD REPLY • link 18 months ago BioinfGuru ▴ 70

I am not suggesting you should remove them, necessarily. But it's common enough in my line of work to see what look like out of place samples that end up being samples that got mis-labeled somewhere along the line. Having samples that appear to be that far out of place just makes me think sample swap, but sometimes it's just weird samples that you might have to down-weight if you were using limma-voom. I don't think that's a thing with DESeq2 though.

ADD REPLY • link 18 months ago James W. MacDonald 68k
