Glad to see the question about SOX10 gene expression by legacy (hg19) and harmonized (hg38) in TCGA SKCM datasets. I adapted the code and did a test in April, 2022. Yes, a lot of things has changed, including the TCGAbiolinks package,TCGA GDC data, and wrinkles on my face. Thanks a lot for looking at other genes and ensuring there is a very strong correlation (pearsons correlation of >0.999) between the 2 datasets. I also noticed that there is another description about the legacy and harmonized TCGA data.
The expression values used here are both some kinds of raw counts:
- in legacy (hg19): "Estimated count by RSEM"
- in harmonized (hg38): "STAR - Counts"
Here is my code:
rm(list = ls())
ptm <- proc.time()
options(stringsAsFactors = F)
library("tidyverse")
tidyverse_packages()
library("magrittr")
library("TCGAbiolinks")
library("SummarizedExperiment")
query.counts.SKCM.harmonized <- GDCquery(
project = "TCGA-SKCM",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "STAR - Counts",
legacy = FALSE
)
query.counts.SKCM.legacy <- GDCquery(
project = "TCGA-SKCM",
platform = "Illumina HiSeq",
experimental.strategy = "RNA-Seq",
data.category = "Gene expression",
data.type = "Gene expression quantification",
file.type = "results",
legacy = TRUE
)
GDCdownload(query.counts.SKCM.harmonized, method = "client", directory = file.path(getwd(), "GDCdata"))
GDCdownload(query.counts.SKCM.legacy, method = "client", directory = file.path(getwd(), "GDCdata"))
SKCM.count.prep.harmonized <- GDCprepare(query = query.counts.SKCM.harmonized, summarizedExperiment = TRUE)
SKCM.count.prep.legacy <- GDCprepare(query = query.counts.SKCM.legacy, summarizedExperiment = TRUE)
harmonized_matrix <- assay(SKCM.count.prep.harmonized)
legacy_matrix <- assay(SKCM.count.prep.legacy)
stopifnot(identical(sort(colnames(harmonized_matrix)), sort(colnames(legacy_matrix))))
get_SOX10_gene_expression_harmonized <- function(the_matrix) {
gene_name_vector <- the_matrix %>% rownames %>% str_replace(., pattern = ".[0-9]+$", replacement = "")
sample_name_vector <- colnames(the_matrix)
subset_matrix <- the_matrix[gene_name_vector == "ENSG00000100146", sort(sample_name_vector), drop = F]
gene_expresion_vector <- subset_matrix[1, , drop = T]
}
get_SOX10_gene_expression_legacy <- function(the_matrix) {
gene_name_vector <- the_matrix %>% rownames
sample_name_vector <- colnames(the_matrix)
subset_matrix <- the_matrix[grepl(pattern = "SOX10|6663", gene_name_vector), sort(sample_name_vector), drop = F]
gene_expresion_vector <- subset_matrix[1, , drop = T]
}
SOX10_expression_harmonized_vector <- harmonized_matrix %>% get_SOX10_gene_expression_harmonized
SOX10_expression_legacy_vector <- legacy_matrix %>% get_SOX10_gene_expression_legacy
pdf(file = "SOX10_expression.pdf")
par(mfrow = c(2, 2))
options(scipen = 999)
hist(SOX10_expression_legacy_vector, main = "SOX10 legacy (hg19)", xlab = "Estimated count by RSEM")
hist(SOX10_expression_harmonized_vector, main = "SOX10 harmonized (hg38)", xlab = "STAR - Counts")
plot(x = SOX10_expression_legacy_vector,
y = SOX10_expression_harmonized_vector,
cex = 0.5,
frame.plot = F,
xlab = "SOX10 expression legacy (hg19)",
ylab = "SOX10 expression harmonized (hg38)",
main = "SOX10 legacy vs harmonized")
abline(lm(SOX10_expression_harmonized_vector ~ SOX10_expression_legacy_vector), col = "green")
lines(lowess(SOX10_expression_legacy_vector, SOX10_expression_harmonized_vector), col = "violet")
plot(sort(SOX10_expression_legacy_vector),
type = "o",
cex = 0.5,
pch = 1,
col = "red",
lty = 1,
frame.plot = F,
xlab = "samples",
ylab = "expression",
main = "SOX10 legacy and harmonized")
lines(SOX10_expression_harmonized_vector[names(sort(SOX10_expression_legacy_vector))],
type = "o",
lty = 1,
cex = 0.5,
pch = 2,
col = "blue")
legend("bottomright",
col = c("red", "blue"),
pch = c(1, 2),
lty = 1,
legend = c("legacy (hg19)", "harmonized (hg38)"),
bty = "n"
)
dev.off()
proc.time() - ptm
Here is the plot of the result:
![SOX10 expression visualization](/media/images/a362de9a-8d11-4d97-a249-005a4a49)
From the graph, we can see that the expression values of SOX gene in legacy (hg19) and harmonized (hg38) look very normal and consistent. The expression values of SOX gene in harmonized (hg38) seem not zero count for all samples in my test results.