Dear all,
I plan to use TCGA RNA-seq data for my analysis. Since there are two datasets (legacy and harmonized), I am deciding which one to use. To me, the harmonized one seems more standardized (please correct me if I am wrong). As we all know, batch effects are a big issue. My questions are:
1) Has the harmonized data already been corrected for batch effects? I plotted the samples by PCA and did not find any confounding pattern by either sequencing center or platform (HiSeq or GA; for colon cancer, some of the samples were sequenced on the GA platform). Before I go ahead with the analysis, I want to make sure that the data have already been corrected.
2) If not, is correction really necessary, and which method would be a suitable way to do it?
Thank you very much,
The data processing steps are here: https://docs.gdc.cancer.gov/Data/BioinformaticsPipelines/ExpressionmRNA_Pipeline/
In short, if you obtain the HTSeq count files from the GDC, then you can assume that nothing has been done to account for batch. On the other hand, if you obtain data via some third-party source, like cBioPortal, TCGAbiolinks, etc., then check with those individual sources to see what extra processing (if any) they performed.
Edit: if you want to check for sources of bias in the data, then aim to perform surrogate variable analysis. You can then either account for these via regression modeling, or directly adjust your expression data to eliminate these effects. There are many questions both here and on Biostars about this particular topic.
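To make the idea concrete, here is a rough Python sketch of the two options on simulated data. It is a simplification and not the algorithm implemented in the Bioconductor sva package: it removes the per-group means (so the variable of interest is protected), takes the leading singular vector of the residuals as a single surrogate-like factor, and then regresses that factor out of each gene. All function names, the simulated data, and the choice of a single factor are my own assumptions for illustration; it uses only the standard library so that it runs anywhere.

```python
import random

def center_within_groups(expr, groups):
    """Subtract each gene's per-group mean so the variable of interest is protected."""
    resid = []
    for row in expr:
        means = {}
        for g in set(groups):
            vals = [x for x, gg in zip(row, groups) if gg == g]
            means[g] = sum(vals) / len(vals)
        resid.append([x - means[g] for x, g in zip(row, groups)])
    return resid

def top_factor(resid, n_iter=50, seed=0):
    """Leading sample-space singular vector of the residuals (power iteration)."""
    n = len(resid[0])
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(n)]  # random start vector
    for _ in range(n_iter):
        loadings = [sum(r * x for r, x in zip(row, v)) for row in resid]
        v = [sum(row[j] * w for row, w in zip(resid, loadings)) for j in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v

def regress_out(expr, factor):
    """Remove each gene's least-squares slope on the (centered) factor."""
    mean_f = sum(factor) / len(factor)
    f = [x - mean_f for x in factor]
    ss = sum(x * x for x in f)
    adjusted = []
    for row in expr:
        slope = sum(x * y for x, y in zip(f, row)) / ss
        adjusted.append([y - slope * x for x, y in zip(f, row)])
    return adjusted

def mean_abs_gap(expr, labels):
    """Average over genes of |mean(label 1) - mean(label 0)|."""
    gaps = []
    for row in expr:
        a = [x for x, b in zip(row, labels) if b == 0]
        c = [x for x, b in zip(row, labels) if b == 1]
        gaps.append(abs(sum(c) / len(c) - sum(a) / len(a)))
    return sum(gaps) / len(gaps)

# Simulated log-expression: 200 genes x 40 samples (hypothetical data).
rng = random.Random(0)
n_genes, n_samples = 200, 40
groups = [j % 2 for j in range(n_samples)]  # biological variable of interest
batch = [0] * 20 + [1] * 20                 # hidden batch, balanced across groups
expr = []
for _ in range(n_genes):
    base, b, g = rng.gauss(8, 1), rng.gauss(0, 2), rng.gauss(0, 1)
    expr.append([base + b * batch[j] + g * groups[j] + rng.gauss(0, 0.5)
                 for j in range(n_samples)])

factor = top_factor(center_within_groups(expr, groups))
adjusted = regress_out(expr, factor)

gap_before = mean_abs_gap(expr, batch)
gap_after = mean_abs_gap(adjusted, batch)
print("mean |batch gap| before:", round(gap_before, 3))
print("mean |batch gap| after: ", round(gap_after, 3))
```

The estimated factor can be used in two ways, matching the two options above: include it as a covariate in your differential expression model, or subtract its fitted effect from the expression matrix as `regress_out` does here. For count data you would normally work on a log or variance-stabilised scale first, and for formal testing the covariate route is usually the safer one.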