Difference in TCGA RNAseq expression values between GDC harmonized and legacy data
Antonio Ahn ▴ 10
@antonio-ahn-10629
University of Otago

Hello,

I am finding a significant discrepancy in RNA-seq expression between the TCGA harmonized and legacy datasets for SKCM for one specific gene, SOX10.

I downloaded the TCGA SKCM data with the TCGAbiolinks package using the code below.

SOX10 is highly expressed in most samples in the legacy dataset, but it has 0 counts in every sample in the harmonized dataset.

Would anyone be able to help me understand why this is the case for SOX10 in particular? For the other genes I have looked at, there is a very strong correlation (Pearson correlation > 0.999) between the two datasets.

library(TCGAbiolinks)

# Query the harmonized data
query.counts.SKCM <- GDCquery(project = "TCGA-SKCM",
                              data.category = "Transcriptome Profiling",
                              data.type = "Gene Expression Quantification",
                              workflow.type = "HTSeq - Counts",
                              legacy = FALSE)

# Query the legacy data
query.counts.SKCM.legacy <- GDCquery(project = "TCGA-SKCM",
                                     platform = "Illumina HiSeq",
                                     data.category = "Gene expression",
                                     data.type = "Gene expression quantification",
                                     file.type = "results",
                                     legacy = TRUE)

# Download the files for both queries
GDCdownload(query.counts.SKCM)
GDCdownload(query.counts.SKCM.legacy)

# Read the downloaded files into SummarizedExperiment objects and save them
SKCM.count.prep <- GDCprepare(query = query.counts.SKCM,
                              save = TRUE,
                              save.filename = "SKCM.rda",
                              summarizedExperiment = TRUE)

SKCM.count.prep.legacy <- GDCprepare(query = query.counts.SKCM.legacy,
                                     save = TRUE,
                                     save.filename = "SKCM_legacy.rda",
                                     summarizedExperiment = TRUE)
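
For anyone wanting to reproduce the comparison, here is a rough sketch of how one gene could be checked across the two objects. It uses only base R and small made-up matrices standing in for the count matrices of the two prepared objects. The helper name `compare_gene`, the toy barcodes and values, and the assumptions that both matrices already use gene symbols as row names and that the first 16 characters of a barcode identify a sample are all illustrative, not something TCGAbiolinks provides.

```r
# Hypothetical helper: compare one gene between two count matrices
# (genes in rows, samples in columns) after matching samples by barcode prefix.
compare_gene <- function(harmonized, legacy, gene, barcode_chars = 16) {
  # Truncate barcodes so the same sample gets the same key in both matrices
  colnames(harmonized) <- substr(colnames(harmonized), 1, barcode_chars)
  colnames(legacy) <- substr(colnames(legacy), 1, barcode_chars)
  shared <- intersect(colnames(harmonized), colnames(legacy))

  h <- harmonized[gene, shared]
  l <- legacy[gene, shared]

  # Pearson correlation is undefined when one vector is constant (e.g. all zeros)
  r <- if (sd(h) == 0 || sd(l) == 0) NA_real_ else cor(h, l, method = "pearson")

  list(n_samples = length(shared),
       all_zero_harmonized = all(h == 0),
       pearson = r)
}

# Toy data only: these are NOT real TCGA barcodes or counts
samples_h <- c("TCGA-AA-0001-06A-11R-A18T-07",
               "TCGA-AA-0002-06A-11R-A18T-07",
               "TCGA-AA-0003-06A-11R-A18T-07")
samples_l <- c("TCGA-AA-0001-06A-11R-A18S-07",
               "TCGA-AA-0002-06A-11R-A18S-07",
               "TCGA-AA-0003-06A-11R-A18S-07")

harm <- matrix(c(100, 200, 300,
                 0, 0, 0),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("GENE_A", "SOX10"), samples_h))
leg <- matrix(c(110, 210, 310,
                5000, 7000, 6500),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("GENE_A", "SOX10"), samples_l))

print(compare_gene(harm, leg, "GENE_A"))
print(compare_gene(harm, leg, "SOX10"))
```

With real data the two matrices would first need their row names mapped to a common gene identifier, since the two pipelines do not necessarily label genes the same way.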

 

 

GDC TCGA tcgabiolinks • 695 views
Dario Strbenac ★ 1.5k
@dario-strbenac-5916
Australia

Because the reference genome and the data processing methods are different. In the older TCGA datasets, the reads were mapped to hg19, whereas in the Genomic Data Commons they are mapped to hg38. Also, RSEM was previously used to estimate gene abundances, which includes estimating the origin of reads that map to multiple locations. Now, a read can map to up to 20 locations in the genome and be counted multiple times.

There is no gold-standard measurement available for these samples, so it is impossible to know which processing approach is better overall, or whether results published in journals can be reproduced with the data from the newer processing.

SOX10 should have high read counts in many melanoma samples, though. You should write to the data coordination centre and tell them about it.
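
Before writing to them, it might be worth checking whether SOX10 is the only gene affected. The following is a minimal base-R sketch of such a check, assuming two count matrices with matching gene identifiers as row names. The function name `find_dropped_genes`, the `min_median` threshold, and the toy values are made up for illustration and are not real TCGA data.

```r
# Hypothetical helper: list genes that are zero in every harmonized sample
# but clearly expressed in the legacy data (genes in rows, samples in columns).
find_dropped_genes <- function(harmonized, legacy, min_median = 10) {
  # Only genes present in both matrices can be compared
  shared <- intersect(rownames(harmonized), rownames(legacy))

  # TRUE for genes with no non-zero count in any harmonized sample
  zero_in_harmonized <- rowSums(harmonized[shared, , drop = FALSE] != 0) == 0

  # Median legacy value per gene, used as a crude "is expressed" filter
  legacy_median <- apply(legacy[shared, , drop = FALSE], 1, median)

  shared[zero_in_harmonized & legacy_median >= min_median]
}

# Toy matrices only: gene names and values are placeholders
harm_toy <- matrix(c(100, 200, 300,
                     0, 0, 0,
                     0, 0, 0),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("GENE_A", "GENE_B", "GENE_C"),
                                   c("S1", "S2", "S3")))
leg_toy <- matrix(c(110, 210, 310,
                    5000, 7000, 6500,
                    1, 2, 3),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("GENE_A", "GENE_B", "GENE_D"),
                                  c("S1", "S2", "S3")))

dropped <- find_dropped_genes(harm_toy, leg_toy)
print(dropped)
```

A list of several such genes would make for a more useful report than a single example.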
