A PCA plot simply shows you the largest differences between samples, so 'not aligning well' can mean more than one thing. For example, it may be that there is lots of technical variability that is obscuring the biological differences between your samples. But this is a matter of degree!
If you have really large changes between samples for a lot of genes, but even larger technical variability due to batches or some other source, then the technical variability will dominate the first principal components and the biological variability will only show up in higher ones. In this case, using something like the RUVSeq package or svaseq from the sva package can help control for the unwanted technical variability.
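To make that concrete, here is a minimal simulated sketch (in Python with NumPy, purely for illustration; this is not RUVSeq or svaseq). All the numbers are assumptions I made up: 8 samples, 200 genes, a batch shift on 100 genes and a smaller condition effect on 30 other genes. With that setup, PC1 should track batch and the condition only appears on PC2.

```python
import numpy as np

# Hypothetical design: 8 samples, balanced across batch and condition.
rng = np.random.default_rng(0)
n_genes = 200
batch = np.array([0, 0, 0, 0, 1, 1, 1, 1])
condition = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Samples in rows, genes in columns; values stand in for log-scale expression.
expr = rng.normal(0.0, 0.3, size=(len(batch), n_genes))
expr[:, :100] += 3.0 * batch[:, None]         # large technical (batch) shift
expr[:, 100:130] += 2.0 * condition[:, None]  # smaller biological effect

# PCA via SVD of the gene-centered matrix.
centered = expr - expr.mean(axis=0)
u, s, _ = np.linalg.svd(centered, full_matrices=False)
scores = u * s                       # sample coordinates on each PC
var_explained = s**2 / np.sum(s**2)  # fraction of variance per PC


def abs_corr(x, y):
    """Absolute Pearson correlation between two 1-D arrays."""
    return abs(np.corrcoef(x, y)[0, 1])


pc1_batch = abs_corr(scores[:, 0], batch)
pc2_condition = abs_corr(scores[:, 1], condition)

print("variance explained by PC1, PC2:", var_explained[:2])
print("|cor(PC1, batch)|     =", pc1_batch)
print("|cor(PC2, condition)| =", pc2_condition)
```

A plot of PC1 against PC2 for data like this would show two clusters split by batch, with the condition separating samples only along the second axis.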
However, if you have consistent but real differences between samples in just a few genes, then the 'normal' variability that one might expect is often predominant in a PCA plot. This (IMO) doesn't necessarily mean you have to do something to 'fix' the data. With any adjustment to the data you always run the risk of capturing some of your real biological variability with a surrogate variable, and thereby reducing your ability to see the real changes that exist.
My point is that there is no free lunch here. Any adjustment you make to fix perceived faults in your data may well erase real signal. So I usually try to figure out if I really do have a problem, and if I can identify the source of the problem first.
As to correcting for single-end (SE) and paired-end (PE) data: if they were run in separate batches, then you would simply fit a batch effect in your model. (You seem to imply that these data were all run together, although I am inferring that from your saying 'the same stage, library preparation, and species', which may not mean what I think.) But it is pretty uncommon in my experience for samples to be run using the same library preparation but sequenced differently.
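As a rough illustration of what fitting a batch effect buys you, here is a small one-gene sketch using ordinary least squares in Python with NumPy. The design, effect sizes and noise level are all invented for the example, and the batches are deliberately unbalanced so that ignoring batch biases the condition estimate. In the usual R packages the equivalent is a design formula along the lines of `~ batch + condition`.

```python
import numpy as np

# Hypothetical single-gene example: 12 samples, two batches (say SE vs PE),
# with unequal numbers of treated samples per batch.
rng = np.random.default_rng(1)
batch = np.array([0] * 6 + [1] * 6)
condition = np.array([0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1])

# Assumed truth: baseline 5, batch shift 2, condition effect 1, small noise.
y = 5.0 + 2.0 * batch + 1.0 * condition + rng.normal(0.0, 0.1, size=12)

intercept = np.ones_like(y)
design_naive = np.column_stack([intercept, condition])            # ~ condition
design_batch = np.column_stack([intercept, condition, batch])     # ~ condition + batch

beta_naive, *_ = np.linalg.lstsq(design_naive, y, rcond=None)
beta_batch, *_ = np.linalg.lstsq(design_batch, y, rcond=None)

naive_effect = beta_naive[1]     # condition estimate ignoring batch
adjusted_effect = beta_batch[1]  # condition estimate with batch in the model

print("condition effect, batch ignored:", naive_effect)
print("condition effect, batch fitted: ", adjusted_effect)
```

The batch-adjusted estimate should land near the simulated effect of 1, while the naive one absorbs part of the batch shift. Note that this only works when batch and condition are not completely confounded; if every SE sample were one condition and every PE sample the other, no model could separate the two.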
Perhaps this is just a compilation of a bunch of different samples from different labs? If that is the case, you really shouldn't just be piling them all into one analysis. You would be better off doing separate analyses and then using something like the GeneMeta package to do a meta-analysis.
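To give a flavour of what 'separate analyses, then combine' means, here is a bare-bones fixed-effect, inverse-variance combination for a single gene in plain Python. This is not the GeneMeta API and is simpler than what that package offers; the per-study log2 fold changes and standard errors are made-up numbers standing in for the output of three independent analyses.

```python
import math

# Hypothetical per-study results for one gene (one entry per lab/dataset).
log2_fold_changes = [1.2, 0.8, 1.0]
standard_errors = [0.4, 0.3, 0.5]

# Inverse-variance weights: more precise studies count for more.
weights = [1.0 / se**2 for se in standard_errors]
total_weight = sum(weights)

pooled_effect = sum(w * b for w, b in zip(weights, log2_fold_changes)) / total_weight
pooled_se = math.sqrt(1.0 / total_weight)

# Two-sided p-value from a normal approximation.
z_score = pooled_effect / pooled_se
p_value = math.erfc(abs(z_score) / math.sqrt(2.0))

print("pooled log2 fold change:", pooled_effect)
print("pooled standard error:  ", pooled_se)
print("z =", z_score, " p =", p_value)
```

Each study is analysed on its own terms, so lab-specific technical differences never enter a single model; only the effect estimates and their uncertainties are combined. A random-effects model would additionally allow the true effect to differ between studies.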