Question: Difference in TCGA RNAseq expression values between GDC harmonized and legacy data
Asked 4 months ago by Antonio Ahn (University of Otago):

Hello,

I am finding a significant discrepancy in RNA-seq expression between the TCGA harmonized and legacy datasets for SKCM for one specific gene, SOX10.

I downloaded the TCGA SKCM data with the TCGAbiolinks package, using the code below.

In the legacy dataset SOX10 is highly expressed in most samples, but in the harmonized dataset it has 0 counts in every sample.

Would anyone be able to explain why this happens for SOX10 in particular? For the other genes I have looked at, the two datasets are very strongly correlated (Pearson correlation > 0.999).

library(TCGAbiolinks)

# harmonized data download
query.counts.SKCM <- GDCquery(project = "TCGA-SKCM",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - Counts",
                  legacy = FALSE)

# legacy data download
query.counts.SKCM.legacy <- GDCquery(project = "TCGA-SKCM",
                  platform = "Illumina HiSeq",
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification", 
                  file.type = "results",
                  legacy = TRUE)

# Data download
GDCdownload(query.counts.SKCM)
GDCdownload(query.counts.SKCM.legacy)

# Build SummarizedExperiment objects from the downloaded files

SKCM.count.prep  <- GDCprepare(query = query.counts.SKCM, save = TRUE, save.filename = "SKCM.rda", summarizedExperiment = TRUE)

SKCM.count.prep.legacy  <- GDCprepare(query = query.counts.SKCM.legacy, save = TRUE, save.filename = "SKCM_legacy.rda", summarizedExperiment = TRUE)
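For anyone wanting to reproduce the comparison, here is a minimal sketch of how a per-gene check between the two datasets could look. It uses small made-up matrices rather than the real data: the sample barcodes, the count values, and the helper `compare_gene` are all hypothetical, and it assumes both count matrices have already been extracted with gene symbols as row names and matching barcodes as column names.

```r
# Toy stand-ins for the two count matrices (genes x samples).
# All barcodes and values are made up for illustration.
harmonized <- matrix(
  c(0,   0,   0,
    120, 340, 95),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("SOX10", "MITF"),
                  c("TCGA-AA-0001-01A", "TCGA-AA-0002-01A", "TCGA-AA-0003-01A"))
)
legacy <- matrix(
  c(4300, 5100, 8800,
    90,   118,  352),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("SOX10", "MITF"),
                  c("TCGA-AA-0003-01A", "TCGA-AA-0001-01A", "TCGA-AA-0002-01A"))
)

# Hypothetical helper: compare one gene across two matrices.
compare_gene <- function(gene, a, b) {
  # Align samples by barcode, since column order may differ
  shared <- intersect(colnames(a), colnames(b))
  x <- a[gene, shared]
  y <- b[gene, shared]
  list(
    all_zero_a = all(x == 0),
    all_zero_b = all(y == 0),
    # Pearson correlation is undefined when either vector is constant
    pearson = if (sd(x) == 0 || sd(y) == 0) NA_real_
              else cor(log2(x + 1), log2(y + 1))
  )
}

sox10 <- compare_gene("SOX10", harmonized, legacy)
mitf  <- compare_gene("MITF", harmonized, legacy)
```

In this toy setup `sox10$all_zero_a` is `TRUE` and its correlation is `NA`, which mirrors the pattern described above, while `mitf$pearson` is close to 1.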

 

 

Answer: Difference in TCGA RNAseq expression values between GDC harmonized and legacy da
Answered 4 months ago by Dario Strbenac (Australia):

The reference genome and the data processing methods are different. In the older TCGA data sets, the reads were mapped to hg19, whereas in the Genomic Data Commons they are mapped to hg38. In addition, RSEM was previously used to estimate gene abundances, which includes estimating the origin of reads that map to multiple locations. Now, a read can map to up to 20 locations in the genome and be counted multiple times.

There is no gold-standard measurement available for these samples, so it is impossible to know which processing approach is better overall, or whether published results can be reproduced with data from the newer pipeline. SOX10 should have high read counts in many melanoma samples, though, so zero counts in every sample looks like a problem with the harmonized data. You should write to the data coordination centre and tell them about it.
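To make the multi-mapping point concrete, here is a toy sketch contrasting two ways of handling a read that aligns to several genes. It is a deliberate simplification: the reads and gene names are made up, and the even split in the second scheme is only a stand-in for probabilistic assignment, not RSEM's actual estimation procedure.

```r
# Each element lists the genes one (made-up) read aligns to.
read_hits <- list(
  c("geneA"),
  c("geneA", "geneB"),
  c("geneB"),
  c("geneA", "geneB", "geneC")
)
genes <- c("geneA", "geneB", "geneC")

# Scheme 1: a read adds one full count at every location it maps to.
count_every_hit <- table(factor(unlist(read_hits), levels = genes))

# Scheme 2: a read contributes a total weight of one, split evenly
# across its locations (simplified stand-in for probabilistic assignment).
fractional <- setNames(numeric(length(genes)), genes)
for (hits in read_hits) {
  fractional[hits] <- fractional[hits] + 1 / length(hits)
}

total_every_hit  <- sum(count_every_hit)  # exceeds the number of reads
total_fractional <- sum(fractional)       # equals the number of reads
```

With four reads, the first scheme gives a total of 7 counts and the second a total of 4, so the two approaches can produce quite different values for genes with many multi-mapped reads.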
