Difference in TCGA RNAseq expression values between GDC harmonized and legacy data
Antonio Ahn ▴ 10
University of Otago


I am finding a significant discrepancy in RNA-seq expression between the TCGA harmonized and legacy datasets for SKCM for a specific gene (SOX10).

I downloaded the TCGA SKCM data with the TCGAbiolinks package using the code below.

When I looked at SOX10, I found that it is highly expressed in most samples in the legacy dataset, but it has 0 counts in every sample in the harmonized dataset.

Would anyone be able to help me understand why this may be the case for SOX10 in particular? For the other genes I have looked at, there is a very strong correlation (Pearson correlation > 0.999) between the two datasets.

library(TCGAbiolinks)

# harmonized data download
query.counts.SKCM <- GDCquery(project = "TCGA-SKCM",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - Counts",
                  legacy = FALSE)

# legacy data download
query.counts.SKCM.legacy <- GDCquery(project = "TCGA-SKCM",
                  platform = "Illumina HiSeq",
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification", 
                  file.type = "results",
                  legacy = TRUE)

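# Data download
GDCdownload(query.counts.SKCM)
GDCdownload(query.counts.SKCM.legacy)

# GDCprepare: read the downloaded files into SummarizedExperiment objects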

SKCM.count.prep  <- GDCprepare(query = query.counts.SKCM, save = TRUE, save.filename = "SKCM.rda", summarizedExperiment = TRUE)

SKCM.count.prep.legacy  <- GDCprepare(query = query.counts.SKCM.legacy, save = TRUE, save.filename = "SKCM_legacy.rda", summarizedExperiment = TRUE)



GDC TCGA tcgabiolinks • 616 views
Dario Strbenac ★ 1.5k

The two data sets differ in both the reference genome and the data processing methods. In the older TCGA data sets, the reads are mapped to hg19, whereas in the Genomic Data Commons the reads are mapped to hg38. Also, RSEM was previously used to estimate gene abundances, which includes estimating the origin of a read that maps to multiple locations. Now, a read can map to up to 20 locations in the genome and be counted multiple times. There is no gold-standard measurement available for these samples, so it is impossible to know which processing pipeline is better overall, or whether results published in journals can be reproduced with the data from the newer pipeline. SOX10 should have high read counts in many melanoma samples, though, so you should write to the data coordination centre and tell them about it.
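
Before reporting it, it may help to put numbers on the discrepancy. The following is only a rough sketch in base R, not part of TCGAbiolinks: `compare_gene` is a hypothetical helper, the two matrices are made-up stand-ins for `assay(SKCM.count.prep)` and `assay(SKCM.count.prep.legacy)`, and matching samples on the first 15 barcode characters is an assumption that breaks down if a sample has more than one aliquot. The gene identifier is passed separately for each matrix because the two data sets may label their rows differently.

```r
# Toy stand-ins for the two count matrices; all barcodes and values are made up.
harmonized <- matrix(
  c(0,   0,   0,
    120, 340, 95),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("geneA", "geneB"),
    c("TCGA-AA-0001-06A-11R", "TCGA-AA-0002-06A-11R", "TCGA-AA-0003-01A-11R")
  )
)
legacy <- matrix(
  c(5100, 8700, 2300,
    90,   118,  352),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("geneA", "geneB"),
    c("TCGA-AA-0003-01A-21R", "TCGA-AA-0001-06A-21R", "TCGA-AA-0002-06A-21R")
  )
)

# Hypothetical helper: pair samples on the first 15 barcode characters
# (patient plus sample type) and summarise one gene in both matrices.
compare_gene <- function(x, y, gene_x, gene_y = gene_x) {
  colnames(x) <- substr(colnames(x), 1, 15)
  colnames(y) <- substr(colnames(y), 1, 15)
  shared <- intersect(colnames(x), colnames(y))
  vx <- x[gene_x, shared]
  vy <- y[gene_y, shared]
  list(
    n_samples = length(shared),
    n_zero_x  = sum(vx == 0),
    n_zero_y  = sum(vy == 0),
    # Pearson correlation is undefined when either vector is constant
    pearson   = if (sd(vx) > 0 && sd(vy) > 0) cor(vx, vy) else NA_real_
  )
}

res_a <- compare_gene(harmonized, legacy, "geneA")  # all zero in one matrix
res_b <- compare_gene(harmonized, legacy, "geneB")  # concordant gene
```

Running the same helper over every shared gene would show whether SOX10 is an isolated case or whether other genes are also zero in only one of the two data sets, which is useful detail to include in a report.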

