Hi All,
This question may already have been asked. If so, please point me to the link and close this thread. If not, here is my question.

I have performed two different RNA-seq experiments in the same cell lines. The two experiments were done at different times and under different conditions (treatment, concentration, etc.). In the second experiment, which was done later, I included several samples identical to those in the first experiment as batch controls. The problem is that when I perform PCA on these identical samples, I can clearly see a batch effect: two separate clouds, one per batch. This batch effect appears with both of the methods I used to merge the datasets. In the first method, I combined the count data from the two experiments and then normalized the combined data (CPM). In the second, I normalized (CPM) each dataset separately and then combined the log-normalized values. I haven't calculated log2FC yet.

My question is: what is the most appropriate strategy for combining the two datasets so as to diminish batch effects? Should I combine the raw data and then perform normalization and log2FC calculation on everything together (which is what I plan to do), or process the datasets separately and combine them at the log2FC level? I use DESeq2 to calculate log2FC. Is it possible to add, for instance, "batch" to the design, even though I already have "meanID" (a specific ID for each sample made up of the sample condition and the batch ID)?

Thanks in advance for the answers.
Lukas
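A note on the two merge methods described above: plain CPM is computed within each sample (count divided by that sample's library size, times one million), so the order of merging and normalizing should not change the values. The Python sketch below illustrates this with made-up counts. The sample names, the counts, and the pseudocount of 1 are assumptions for illustration, and it assumes both experiments use the same gene list in the same order. The equivalence is not expected to hold for normalizations that estimate scaling factors across samples (for example TMM or DESeq2 size factors).

```python
import math

def cpm(counts):
    """Counts per million for one sample: count / library size * 1e6."""
    library_size = sum(counts)
    return [c / library_size * 1e6 for c in counts]

def log2_cpm(counts, prior=1.0):
    """log2(CPM + prior); the prior of 1 is an arbitrary choice for this sketch."""
    return [math.log2(v + prior) for v in cpm(counts)]

# Toy counts (hypothetical): one list per sample, same gene order everywhere.
experiment_1 = {"control_b1": [100, 300, 600], "ActA_b1": [50, 450, 500]}
experiment_2 = {"control_b2": [220, 580, 1200], "ActA_b2": [90, 1010, 900]}

# Method 1: merge the raw counts first, then normalize.
merged_counts = {**experiment_1, **experiment_2}
merge_then_normalize = {s: log2_cpm(c) for s, c in merged_counts.items()}

# Method 2: normalize each experiment separately, then merge.
normalize_then_merge = {
    **{s: log2_cpm(c) for s, c in experiment_1.items()},
    **{s: log2_cpm(c) for s, c in experiment_2.items()},
}

# Each sample is scaled only by its own library size, so both routes agree.
same = merge_then_normalize == normalize_then_merge
print(same)
```

If both routes give the same values, the batch separation in the PCA cannot come from the merge order, which is consistent with seeing it under both methods.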
Dear Dr. Michael Love,
I tried to add batch to my design. However, I got this error: “the model matrix is not full rank, so the model cannot be fit as specified.” I think this is because the condition variable (meanID) already contains the batch name. My meanID has the form: compoundconcentrationbatch. Another possibility, based on your tutorial, is that the samples tested in the 2nd batch are not the same set as in the first batch. I only added 10 identical samples to the 2nd batch; the rest of the samples are different. Do you think it's still possible to check the batch effect in this situation? I wasn't planning to calculate the average. I was planning to calculate the log2FC separately and then check the batch effect with PCA, etc. At the normalized count level, I see the batch effect. I am currently working at the log2FC level to see whether the batch effect is visible there too. Thanks for the answers.
Can you write out the column data? I'm having a hard time parsing your text. E.g.:
I apologize if my explanation is confusing. I often have a hard time writing things down. In my metadata file, I made a new column called meanID. The meanID is the name of the condition followed by the batch. For instance, for the condition control in batch 1, I write control1 in the meanID column. I then use this column in the design: design = ~meanID. Maybe if I split meanID into 2 columns, I can calculate a single log2FC value across the 2 batches by changing my design to design = ~batch + condition. BTW, in a PCA at the log2FC level, the batch effect seems to be gone, since I don't see any clear separation between identical samples from the 2 batches. However, I am not sure if my approach is correct.
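The rank error and the proposed split can be illustrated outside of R. The Python sketch below builds treatment-coded model matrices for a made-up sample table and computes their rank. The sample table (including the hypothetical drugX condition that appears only in batch 2), the helper names, and the coding scheme are assumptions for illustration; it mimics the idea of a model matrix and is not DESeq2's own code.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank via Gaussian elimination with exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column: it depends on earlier columns
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def design(samples, factors):
    """Intercept plus one dummy column per non-reference level of each factor."""
    columns = []
    for f in factors:
        levels = sorted({s[f] for s in samples})
        columns += [(f, lvl) for lvl in levels[1:]]  # first level is the reference
    return [[1] + [int(s[f] == lvl) for f, lvl in columns] for s in samples]

def is_full_rank(matrix):
    return matrix_rank(matrix) == len(matrix[0])

# Hypothetical sample table: control and ActA in both batches, drugX only in batch 2.
groups = [("control", "1"), ("ActA", "1"), ("control", "2"), ("ActA", "2"), ("drugX", "2")]
samples = [
    {"condition": c, "batch": b, "meanID": c + b}
    for c, b in groups
    for _ in range(2)  # two replicates per group
]

# meanID already encodes the batch, so the batch column is a sum of meanID columns.
batch_plus_meanid = is_full_rank(design(samples, ["batch", "meanID"]))
# With meanID split into batch and condition, the columns are independent here
# because at least one condition is present in both batches.
batch_plus_condition = is_full_rank(design(samples, ["batch", "condition"]))

print(batch_plus_meanid, batch_plus_condition)
```

In this toy table the split design is estimable only because control and ActA occur in both batches. If no condition were shared between batches, batch and condition would be confounded again.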
I’m sure I’m going to have to just ask again for a sample table. This is also in the posting guide as a helpful way to efficiently communicate your experimental design.
Here is an example of the table:
Here, I use meanID as the design, so for the treatment ActA I get 2 log2FC values, one per batch. The purpose is to see whether ActA shows inter-batch variation between those 2 batches.
Sorry, I can’t help you here. I need full information, and you’re just giving me snippets. I’m also pretty busy right now with teaching. I’d recommend consulting a statistician at your institute.
I understand. Thanks for your time.