Question

Batch effects removal across mRNA seq studies addressing different biological question: Best approach?

Kuldeep • 0

@e74a635e

Last seen 3 months ago

United States

Hi! I have a question that I have been asking myself for a long time. After reading different forums on the best ways to account for batch effects in an analysis workflow, I am still perplexed.

I have two different datasets from my own lab: 1) mRNA-seq data obtained by purifying mRNA from captured polysome fractions and then sequencing it. I did polysome fractionation on Conditions: "Control" and "Treated".

2) The second dataset is typical mRNA-seq data measuring cellular abundance of mRNA in Conditions: "Control" and "Treated" (a different group of animals than dataset 1).

Samples in both datasets were collected across a 24 hr cycle at 6 fixed time points, at intervals of 4 hrs.

My goal is to identify genes that are differentially expressed, or not, in both datasets, so that I can understand at which post-transcriptional level gene expression regulation is affected (cellular abundance or polysome loading) for the identified genes.

DATASET1 has two batches, A and B. Each batch has 6 control samples and 6 corresponding treated samples (all control and matched samples for each time of tissue collection were sent together to make it a balanced batch + sample study).

DATASET2 has one batch comprising 18 paired control and treated samples.

The PCA plot for DATASET1 shows 42% of the variance separating Control and Treated samples (the two groups cluster distinctly) after using design = ~Batch + Condition. For DATASET2 I used design = ~Condition, and there too the Control and Treated samples cluster separately.

My confusion is whether I need to perform batch correction/removal across the datasets first, using a package like ComBat-seq, to address the question I am asking. In my limited understanding of statistical modeling, batch effects are estimated as the deviation of each gene's expression from its mean across batches. But since these samples come from animals, how can we distinguish a true batch effect from true biological signal? It is not uncommon to find biological variation between animals of the same treatment group.

After giving it a lot of thought, I see two ways of analyzing this:

The first approach, which in my head is the right way, is to analyze the two studies separately (modeling the batch effect in DATASET1), run the differential analysis on each, then sort genes by adjusted p-value in both datasets and find the gene identifiers that are common to both or unique to each dataset.
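To make the last step of this first approach concrete, here is a minimal sketch in plain Python. It assumes the adjusted p-values from the two separate DESeq2 runs have already been exported as gene-to-padj mappings; the gene names, the values, the 0.05 cutoff and the helper `significant` are all made up for illustration and are not part of DESeq2.

```python
# Hypothetical adjusted p-values per gene from two separate DESeq2 runs.
# Gene IDs and values are invented for illustration only.
polysome_padj = {"GeneA": 0.001, "GeneB": 0.20, "GeneC": 0.03, "GeneD": None}
total_padj = {"GeneA": 0.004, "GeneB": 0.01, "GeneC": 0.60, "GeneE": 0.02}

ALPHA = 0.05  # assumed significance cutoff


def significant(padj_by_gene, alpha=ALPHA):
    """Return the set of genes whose adjusted p-value is below alpha.

    A missing padj (None here, standing in for NA) never counts as significant.
    """
    return {g for g, p in padj_by_gene.items() if p is not None and p < alpha}


sig_polysome = significant(polysome_padj)
sig_total = significant(total_padj)

# Restrict the comparison to genes that were reported in both datasets.
tested_in_both = set(polysome_padj) & set(total_padj)

shared = sig_polysome & sig_total & tested_in_both
polysome_only = (sig_polysome - sig_total) & tested_in_both
total_only = (sig_total - sig_polysome) & tested_in_both

print("shared:", sorted(shared))
print("polysome only:", sorted(polysome_only))
print("total mRNA only:", sorted(total_only))
```

One caveat with this kind of overlap: a gene that falls above the cutoff in one dataset has not been shown to be unchanged there, so the "unique" sets are best read as candidates, not as conclusions.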

The second approach is to assemble a single count matrix from both datasets and use design = ~Batch + Condition (I would have 2 batches in total with this approach), obtain normalized reads, then use them to run DESeq2 on each dataset separately and continue as in my first approach from there on.

Any advice and insights would really be much appreciated.

DESeq2 • 282 views


score 0 · Answer 1 · 2024-01-31

Michael Love 42k

@mikelove

Last seen 1 day ago

United States

The simplest way I would approach this is to analyze the two datasets separately and then compare the log fold changes, preferably using robust (shrunken) log fold changes from lfcShrink.
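As a rough illustration of what comparing the log fold changes could look like once each dataset has been analyzed on its own, here is a small plain-Python sketch. It assumes the shrunken log2 fold changes (Treated vs Control) have been exported from each analysis as gene-to-value mappings; the gene names, the numbers and the helper `pearson` are hypothetical, and the difference computed here is only descriptive, not a statistical test.

```python
import math

# Hypothetical shrunken log2 fold changes (Treated vs Control) per gene,
# exported from two separate analyses. Values are invented for illustration.
polysome_lfc = {"GeneA": 1.8, "GeneB": -0.1, "GeneC": 1.2, "GeneD": -0.9}
total_lfc = {"GeneA": 1.6, "GeneB": -1.4, "GeneC": -0.2, "GeneD": -1.1}


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Compare only genes present in both result tables.
common = sorted(set(polysome_lfc) & set(total_lfc))

# Overall agreement between the two sets of fold changes.
r = pearson([total_lfc[g] for g in common], [polysome_lfc[g] for g in common])

# Per-gene difference: polysome LFC minus total-mRNA LFC.
delta = {g: polysome_lfc[g] - total_lfc[g] for g in common}

# Genes whose change points the same way in both datasets.
same_direction = [g for g in common if polysome_lfc[g] * total_lfc[g] > 0]

# Genes ranked by how much the two fold changes disagree.
ranked = sorted(common, key=lambda g: abs(delta[g]), reverse=True)

print(f"correlation of LFCs: {r:.2f}")
print("same direction:", same_direction)
print("largest disagreement first:", ranked)
```

Genes near the top of `ranked` are those where the polysome and total-mRNA fold changes differ most, which is the kind of gene the question is after; whether such a difference is real would still need to be judged against the uncertainty of each estimate.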

Check PCA of each dataset separately to assess whether you need to control for low rank technical variation affecting many genes.