I am finding a significant discrepancy in RNA-seq expression between the TCGA harmonized and legacy datasets for SKCM for one specific gene (SOX10).
I downloaded the TCGA SKCM data with the TCGAbiolinks package using the code below.
When I looked at SOX10, it is highly expressed in most samples in the legacy dataset, but it has 0 counts in every sample in the harmonized dataset.
Would anyone be able to help me understand why this is the case for SOX10 in particular? For the other genes I have looked at, there is a very strong correlation (Pearson correlation > 0.999) between the two datasets.
library(TCGAbiolinks)

# Harmonized data query
query.counts.SKCM <- GDCquery(project = "TCGA-SKCM",
                              data.category = "Transcriptome Profiling",
                              data.type = "Gene Expression Quantification",
                              workflow.type = "HTSeq - Counts",
                              legacy = FALSE)

# Legacy data query
query.counts.SKCM.legacy <- GDCquery(project = "TCGA-SKCM",
                                     platform = "Illumina HiSeq",
                                     data.category = "Gene expression",
                                     data.type = "Gene expression quantification",
                                     file.type = "results",
                                     legacy = TRUE)

# Data download
GDCdownload(query.counts.SKCM)
GDCdownload(query.counts.SKCM.legacy)

# Prepare the downloaded data as SummarizedExperiment objects
SKCM.count.prep <- GDCprepare(query = query.counts.SKCM,
                              save = TRUE,
                              save.filename = "SKCM.rda",
                              summarizedExperiment = TRUE)
SKCM.count.prep.legacy <- GDCprepare(query = query.counts.SKCM.legacy,
                                     save = TRUE,
                                     save.filename = "SKCM_legacy.rda",
                                     summarizedExperiment = TRUE)
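
A per-gene comparison like the one described above could be sketched roughly as follows. This is only an illustration on toy matrices, not the exact code I ran: the gene names, barcodes and counts are made up, `compare_gene` is a hypothetical helper, and matching samples on the first 16 characters of the barcode is an assumption. With the real data, the matrices would come from `assay()` on the two prepared objects, and the row identifiers may differ between them, so the gene may need to be looked up under a different ID in each.

```r
# Toy stand-ins for the count matrices of the two datasets.
# Gene names, barcodes and counts are invented for illustration only.
harmonized <- matrix(
  c( 0,  0,  0,  0,
    10, 20, 30, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("GENE_A", "GENE_B"),
    c("TCGA-XX-0001-06A-11R", "TCGA-XX-0002-06A-11R",
      "TCGA-XX-0003-06A-11R", "TCGA-XX-0004-06A-11R"))
)

# Same samples, but in a different column order and with a different suffix.
legacy <- matrix(
  c(650, 500, 900, 800,
     31,  11,  42,  19),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("GENE_A", "GENE_B"),
    c("TCGA-XX-0003-06A-21R-A18T-07", "TCGA-XX-0001-06A-21R-A18T-07",
      "TCGA-XX-0004-06A-21R-A18T-07", "TCGA-XX-0002-06A-21R-A18T-07"))
)

# Hypothetical helper: compare one gene between two matrices after
# aligning samples on the first `key_len` characters of the column names.
compare_gene <- function(gene, a, b, key_len = 16) {
  key_a <- substr(colnames(a), 1, key_len)
  key_b <- substr(colnames(b), 1, key_len)
  shared <- intersect(key_a, key_b)

  x <- a[gene, match(shared, key_a)]
  y <- b[gene, match(shared, key_b)]

  list(
    n_shared   = length(shared),
    all_zero_a = all(x == 0),
    all_zero_b = all(y == 0),
    # Pearson correlation is undefined when one vector is constant.
    pearson    = if (sd(x) > 0 && sd(y) > 0) cor(x, y) else NA_real_
  )
}

res_a <- compare_gene("GENE_A", harmonized, legacy)  # mimics the SOX10 pattern
res_b <- compare_gene("GENE_B", harmonized, legacy)  # mimics a well-behaved gene
```

A gene that is all zeros in one matrix and clearly non-zero in the other, as `GENE_A` is here, has no defined correlation, so it helps to check the all-zero flags before looking at the correlation itself.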