Question

Difference in TCGA RNAseq expression values between GDC harmonized and legacy data

1

Entering edit mode

Antonio Ahn ▴ 10

@antonio-ahn-10629

Last seen 3.2 years ago

University of Otago

Hello,

I am finding a significant discrepnacy with RNA-seq expression between TCGA harmonized and legacy datasets for SKCM for a specific gene (SOX10),

I have downloaded the TCGA SKCM using the TCGAbiolinks package using the codes below.

Following investigation of the SOX10 gene, it is highly expressed in most samples in the legacy dataset however it is not expressed (0 counts) for every sample in the harmonized dataset.

Would anyone be able to help me on why this may be the case for this particular SOX10 gene? For other genes i have looked at, there is a very strong correlation (pearsons correlation of >0.999) between the 2 datasets.

# harmonized data download
query.counts.SKCM <- GDCquery(project = "TCGA-SKCM",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - Counts",
                  legacy = FALSE)

# legacy data download
query.counts.SKCM.legacy <- GDCquery(project = "TCGA-SKCM",
                  platform = "Illumina HiSeq",
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification", 
                  file.type = "results",
                  legacy = TRUE)

# Data download
GDCdownload(query.counts.SKCM)
GDCdownload(query.counts.SKCM.legacy)

# GDCprepare download

SKCM.count.prep  <- GDCprepare(query = query.counts.SKCM, save = TRUE, save.filename = "SKCM.rda", summarizedExperiment = TRUE)

SKCM.count.prep.legacy  <- GDCprepare(query = query.counts.SKCM.legacy, save = TRUE, save.filename = "SKCM_legacy.rda", summarizedExperiment = TRUE)

GDC TCGA tcgabiolinks • 2.7k views

ADD COMMENT • link updated 2.9 years ago by cute scientist • 0 • written 6.3 years ago by Antonio Ahn ▴ 10

0

Entering edit mode

cute scientist • 0

@a525628a

Last seen 2.1 years ago

United States

Glad to see the question about SOX10 gene expression by legacy (hg19) and harmonized (hg38) in TCGA SKCM datasets. I adapted the code and did a test in April, 2022. Yes, a lot of things has changed, including the TCGAbiolinks package，TCGA GDC data, and wrinkles on my face. Thanks a lot for looking at other genes and ensuring there is a very strong correlation (pearsons correlation of >0.999) between the 2 datasets. I also noticed that there is another description about the legacy and harmonized TCGA data.

The expression values used here are both some kinds of raw counts:

in legacy (hg19): "Estimated count by RSEM"
in harmonized (hg38): "STAR - Counts"

Here is my code:

rm(list = ls())
ptm <- proc.time()
# proc.time() - ptm
options(stringsAsFactors = F)

# tidyverse tools
library("tidyverse")
tidyverse_packages()
library("magrittr")

# BiocManager::install("BioinformaticsFMRP/TCGAbiolinksGUI.data")
# BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")

library("TCGAbiolinks")
library("SummarizedExperiment")

# Download and get the expression matrix data.

# harmonized data download
query.counts.SKCM.harmonized <- GDCquery(
    project = "TCGA-SKCM",
    data.category = "Transcriptome Profiling",
    data.type = "Gene Expression Quantification",
    workflow.type = "STAR - Counts",
    # barcode = c("TCGA-EB-A5VV-06A-11R-A32P-07", "TCGA-EE-A3AD-06A-11R-A18S-07"),
    legacy = FALSE
)

# legacy data download
query.counts.SKCM.legacy <- GDCquery(
    project = "TCGA-SKCM",
    platform = "Illumina HiSeq",
    experimental.strategy = "RNA-Seq",
    data.category = "Gene expression",
    data.type = "Gene expression quantification",
    file.type = "results",
    # barcode = c("TCGA-EB-A5VV-06A-11R-A32P-07", "TCGA-EE-A3AD-06A-11R-A18S-07"),
    legacy = TRUE
)

# Data download
GDCdownload(query.counts.SKCM.harmonized, method = "client", directory = file.path(getwd(), "GDCdata"))
GDCdownload(query.counts.SKCM.legacy, method = "client", directory = file.path(getwd(), "GDCdata"))

# GDCprepare download
SKCM.count.prep.harmonized  <- GDCprepare(query = query.counts.SKCM.harmonized, summarizedExperiment = TRUE)
SKCM.count.prep.legacy  <- GDCprepare(query = query.counts.SKCM.legacy, summarizedExperiment = TRUE)

# Get the expression matrix, gene x sample.
harmonized_matrix <- assay(SKCM.count.prep.harmonized)
legacy_matrix <- assay(SKCM.count.prep.legacy)

# Ensure the sample names are coherent between the hg19 and h38
stopifnot(identical(sort(colnames(harmonized_matrix)), sort(colnames(legacy_matrix))))

# Get the SOX10 expression vector in hg19 and hg38, respectively.

get_SOX10_gene_expression_harmonized <- function(the_matrix) {

    gene_name_vector <- the_matrix %>% rownames %>% str_replace(., pattern = ".[0-9]+$", replacement = "")
    sample_name_vector <- colnames(the_matrix)
    subset_matrix <- the_matrix[gene_name_vector == "ENSG00000100146", sort(sample_name_vector), drop = F]
    gene_expresion_vector <- subset_matrix[1, , drop = T]
}

get_SOX10_gene_expression_legacy <- function(the_matrix) {

    gene_name_vector <- the_matrix %>% rownames
    sample_name_vector <- colnames(the_matrix)
    subset_matrix <- the_matrix[grepl(pattern = "SOX10|6663", gene_name_vector), sort(sample_name_vector), drop = F]
    gene_expresion_vector <- subset_matrix[1, , drop = T]
}

SOX10_expression_harmonized_vector <- harmonized_matrix %>% get_SOX10_gene_expression_harmonized
SOX10_expression_legacy_vector <- legacy_matrix %>% get_SOX10_gene_expression_legacy

# Visualize the SOX10 expression.

pdf(file = "SOX10_expression.pdf")

par(mfrow = c(2, 2))
options(scipen = 999)

# histogram
hist(SOX10_expression_legacy_vector, main = "SOX10 legacy (hg19)", xlab = "Estimated count by RSEM")

# histogram
hist(SOX10_expression_harmonized_vector, main = "SOX10 harmonized (hg38)", xlab = "STAR - Counts")

# scatter plot
plot(x = SOX10_expression_legacy_vector,
     y = SOX10_expression_harmonized_vector,
     cex = 0.5,
     frame.plot = F,
     xlab = "SOX10 expression legacy (hg19)",
     ylab = "SOX10 expression harmonized (hg38)",
     main = "SOX10 legacy vs harmonized")
# Add fit lines
# regression line (y ~ x)
abline(lm(SOX10_expression_harmonized_vector ~ SOX10_expression_legacy_vector), col = "green")
# lowess line (x, y)
lines(lowess(SOX10_expression_legacy_vector, SOX10_expression_harmonized_vector), col = "violet")

# multiple-line graph
plot(sort(SOX10_expression_legacy_vector),
     type = "o",
     cex = 0.5,
     pch = 1,
     col = "red",
     lty = 1,
     frame.plot = F,
     xlab = "samples",
     ylab = "expression",
     main = "SOX10 legacy and harmonized")
lines(SOX10_expression_harmonized_vector[names(sort(SOX10_expression_legacy_vector))],
      type = "o",
      lty = 1,
      cex = 0.5,
      pch = 2,
      col = "blue")
# add legend
legend("bottomright",
       col = c("red", "blue"),
       pch = c(1, 2),
       lty = 1,
       legend = c("legacy (hg19)", "harmonized (hg38)"),
       bty = "n"
       )

dev.off()

proc.time() - ptm

Here is the plot of the result:

SOX10 expression visualization

From the graph, we can see that the expression values of SOX gene in legacy (hg19) and harmonized (hg38) look very normal and consistent. The expression values of SOX gene in harmonized (hg38) seem not zero count for all samples in my test results.

ADD COMMENT • link 2.9 years ago cute scientist • 0

score 2 · Accepted Answer · 2018-11-21

Because the reference genome and data processing methods are different. In older TCGA data sets, the reads are mapped to hg19. However, in Genomic Data Commons, the reads are mapped to hg38. Also, RSEM was previously used to estimate the abundances of genes, which includes the estimation of the location of a read mapping to multiple locations. Now, a read can map up to 20 locations in the genome and be counted multiple times. There is no gold-standard measurement available for these samples, so it's impossible to know which data processing variety is better overall or if the published results in journals can be reproduced with the data produced by the newer data processing. SOX10 should have high read counts in many melanoma samples, though. You should write to the data coordination centre and tell them about it.