Hi. I am doing a fairly simple promoter analysis using the genome assembly hg38. The problem is that the gene PCK2 (Entrez ID 5106) seems to be missing from the TxDb.Hsapiens.UCSC.hg38.knownGene annotation. Here is that gene at NCBI: https://www.ncbi.nlm.nih.gov/gene/?term=5106
For comparison, the code snippet below also looks up MYC (Entrez ID 4609). Here is its NCBI page: https://www.ncbi.nlm.nih.gov/gene/4609
This R code illustrates that the gene is missing:
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(AnnotationDbi)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
txdb_old <- TxDb.Hsapiens.UCSC.hg19.knownGene
myc <- 4609
pck2 <- 5106
# Both genes are present in the old (hg19) annotation
genes_old <- genes(txdb_old)
g_ranges_myc_old <- genes_old[genes_old$gene_id %in% myc]
print(g_ranges_myc_old)
g_ranges_pck2_old <- genes_old[genes_old$gene_id %in% pck2]
print(g_ranges_pck2_old)
# PCK2 is missing from the newer (hg38) annotation
genes_new <- genes(txdb)
g_ranges_myc <- genes_new[genes_new$gene_id %in% myc]
print(g_ranges_myc)
g_ranges_pck2 <- genes_new[genes_new$gene_id %in% pck2]
print(g_ranges_pck2)
If the gene is in fact missing, why would that be? Could the problem be the naive way that I'm accessing the GRanges?
Thank you so much! Do you know where I can learn more about the sequence "chr14_KZ208920v1_fix"? Which of these gene locations should I trust more? Do these two locations imply that humans have two copies of this gene on chromosome 14? Or is the one with 'fix' in its name better because someone 'fixed' the old one?
There is information at UCSC, for example; see under "Assembly details".
Hi,
chr14_KZ208920v1_fix is a "patch" sequence. It's called HG1_PATCH at NCBI:
Right now hg38 matches GRCh38.p12. Patch sequences get added to every new revision of the original GRCh38 assembly. They describe corrections to the original sequences.
So no, the fact that a gene is mapped to a "patch" sequence in addition to a chromosome doesn't mean that humans have 2 copies of the gene. It only means that the gene happens to be located in the patched region. UCSC provides its location with respect to the original (unmodified) chromosome sequence and also with respect to the patch sequence. This is confirmed by the fact that the range on chr14 and the range on chr14_KZ208920v1_fix have the same width (16545).
So what we see is just an artifact of data representation.
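To see both locations for yourself, something along these lines should work. It is only a sketch: it assumes your installed GenomicFeatures version supports the single.strand.genes.only argument of genes(), and the exact ranges you get back depend on the version of the TxDb package you have installed.

```r
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(GenomicFeatures)

txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

# With single.strand.genes.only = FALSE, genes() returns a GRangesList
# (one list element per gene) instead of a GRanges, and it keeps genes
# that cannot be represented by a single range, e.g. genes mapped to
# more than one reference sequence.
all_genes <- genes(txdb, single.strand.genes.only = FALSE)

# The list elements are named by Entrez gene ID (as character strings)
pck2_ranges <- all_genes[["5106"]]
print(pck2_ranges)

# Compare the widths of the ranges reported for PCK2
print(width(pck2_ranges))
```

With the default (single.strand.genes.only = TRUE), a gene like this one is dropped from the result, which is why the subsetting in the original post comes back empty.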
That being said, the default behavior of the genes() extractor is admittedly confusing and we are considering changing it. See https://github.com/Bioconductor/GenomicFeatures/pull/20
Hope this helps,
H.