I am doing RNA-seq analysis using limma and edgeR, I have 48 samples and 2 batches. I built MDS plots (see below), and now my question is do I have a batch effect I need to correct for or not? My understanding is if batches cluster together on MDS/PCA plot, it's an evidence that batch effects are present, am I right?
There is clear separation by batch in your MDS plots, so yes, you should include a batch term in your model. In fact, I would generally default to including the batch term anyway, rather than waiting until I did a MDS plot to make the decision. It costs little to have an additional batch term but can protect you from misleading results, especially if the conditions are not evenly distributed across batches. This is most relevant for subtle batch effects that might not have shown up in the first few dimensions of a MDS plot.
On that note, you didn't mention what your other experimental factors are. I can only hope that they are not confounded with the batch, otherwise you're stuffed.
P.S. Modelling the batch effect is not the same as correcting it. The former involves estimating the batch effect as part of the linear model and is preferred for any statistically rigorous DE analysis, the latter should only be used (e.g., with removeBatchEffect) for visualization and clustering.
Thank you so much for your answer! I have 16 different cell conditions, and 3 replicates for each. 3 replicates are performed in different cell donors (3 donors; 1 replicate for each condition in each donor). My goal is to compare conditions between each other using DGE. Here how the MDS plot looks like (3,4 dimensions, first one colored by batch, second - by donor):
So I even have a clear separation by donor. So should I include donor factor to linear model as well?
P.S. So I don't need to correct for batches prior the DGE analysis with limma right? I just include batch factor to linear model?
Well, if you get a different number of DE genes, then clearly it had an effect.
is there any way to visualize the data again after voom or lmFit/eBayes?
For any given gene, you can plot the distribution of residuals for each donor. You should see that they are more-or-less normally distributed around zero if the model properly accounts for the donor effect.
Thank you so much for your answer! I have 16 different cell conditions, and 3 replicates for each. 3 replicates are performed in different cell donors (3 donors; 1 replicate for each condition in each donor). My goal is to compare conditions between each other using DGE. Here how the MDS plot looks like (3,4 dimensions, first one colored by batch, second - by donor):
So I even have a clear separation by donor. So should I include donor factor to linear model as well?
P.S. So I don't need to correct for batches prior the DGE analysis with limma right? I just include batch factor to linear model?
But how do I know, if it worked? is there any way to visualize the data again after voom or lmFit/eBayes?
Well, if you get a different number of DE genes, then clearly it had an effect.
For any given gene, you can plot the distribution of residuals for each donor. You should see that they are more-or-less normally distributed around zero if the model properly accounts for the donor effect.