MOHAMMAD • 0@MOHAMMAD-24781
I have two RNA-seq count datasets, as follows:
Dataset A contains 3 samples and 3 controls.
Dataset B contains 81 samples with no controls.
What is the best workflow for preprocessing in this case?
A- Batch-effect removal (on the merged dataset), then quantile normalization.
B- Quantile normalization (on the merged dataset), then batch-effect removal.
C- Quantile normalization (on each dataset separately), then batch-effect removal.
Thank you in advance.
No controls in B means that condition is nested within dataset, and therefore you cannot correct anything. Also, since you have many more samples in B than in A, the B samples would probably dominate whatever effect the A samples have, so it essentially comes down to samplesB vs controlsA, which, as said above, is confounded by study. Summary: this comparison is probably not meaningful, as any DEGs you see could be entirely due to the batch effect, which you cannot remove with this setup. What you can do is compare samplesA vs controlsA, or try to define some kind of subtypes in samplesB and see whether you can find differences between them. Whether this makes sense depends on your project.
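To make the confounding concrete, here is a minimal Python sketch (NumPy assumed; the 0/1 encoding, variable names, and the helper `design_rank` are illustrative and not from the thread). It tabulates the design described in the question and checks the rank of an intercept + condition + batch design matrix, once for all samples and once for the samplesB vs controlsA subset.

```python
import numpy as np

# Layout from the question: dataset A has 3 samples + 3 controls,
# dataset B has 81 samples and no controls.
batch = np.array([0] * 6 + [1] * 81)                 # 0 = dataset A, 1 = dataset B
condition = np.array([1] * 3 + [0] * 3 + [1] * 81)   # 1 = sample, 0 = control

# Cross-tabulate batch (rows) against condition (columns).
table = np.zeros((2, 2), dtype=int)
np.add.at(table, (batch, condition), 1)
print(table)  # [[ 3  3]
              #  [ 0 81]]  -> no controls in dataset B


def design_rank(cond, bat):
    """Rank of the design matrix with intercept, condition, and batch columns."""
    X = np.column_stack([np.ones(len(cond)), cond, bat])
    return np.linalg.matrix_rank(X)


# All 87 samples together.
rank_all = design_rank(condition, batch)

# Only samplesB vs controlsA: here condition and batch are the same column.
keep = (batch == 1) | (condition == 0)
rank_subset = design_rank(condition[keep], batch[keep])

print(rank_all, rank_subset)  # 3 2
```

In the samplesB vs controlsA subset the condition and batch columns are identical, so the design matrix loses a rank and the two effects cannot be estimated separately. With all samples included the matrix has full rank, but only because dataset A contains both groups.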
Edit: Can you clarify how dataset A differs from B? Are A and B from the same lab, with the same kits and only the time of the experiment differing, or are they from completely independent studies?
I'm not quite as pessimistic here. You do have some controls in one dataset, which will at least let you parse the variation a bit. That said, you will want to be very careful about over-interpreting any results you get.
My normal workflow is to do quantile normalization across the merged dataset and then batch effect removal. Note that batch effect removal techniques typically assume specific models of the systematic error, and quantile normalization may violate those assumptions. This is a good paper on when to use quantile normalization: https://www.biorxiv.org/content/10.1101/012203v1.full.pdf
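For reference, quantile normalization across a merged matrix can be sketched in a few lines of Python. This is a minimal illustration, not the workflow actually used above: it assumes NumPy, a genes x samples matrix, synthetic Poisson counts standing in for the two datasets, and a log2(count + 1) transform. The function name `quantile_normalize` is hypothetical, and ties are broken by position rather than averaged, which dedicated implementations handle more carefully.

```python
import numpy as np


def quantile_normalize(matrix):
    """Force every column (sample) to share the same empirical distribution.

    Rows are genes, columns are samples. Ties are broken by row order.
    """
    x = np.asarray(matrix, dtype=float)
    # Rank of each value within its own column (0 = smallest).
    order = np.argsort(x, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0, kind="stable")
    # Reference distribution: mean of the sorted columns, rank by rank.
    reference = np.sort(x, axis=0).mean(axis=1)
    # Replace each value by the reference value at its rank.
    return reference[ranks]


# Synthetic stand-ins for the two datasets (not real data).
rng = np.random.default_rng(0)
dataset_a = rng.poisson(50, size=(100, 6))    # 3 samples + 3 controls
dataset_b = rng.poisson(80, size=(100, 81))   # 81 samples

merged = np.hstack([dataset_a, dataset_b])    # normalize the merged matrix
normalized = quantile_normalize(np.log2(merged + 1))
```

After this step every column contains the same set of values, so any global shift between datasets A and B is removed by construction. That is the kind of assumption the linked paper discusses.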