Help with Missing Gene from hg38
1
0
Entering edit mode
jam526 • 0
@jam526-22332
Last seen 2.3 years ago

Hi. I am doing a fairly simple promoter analysis using the genome assembly hg38. The problem that it seems that the gene PCK2 with Entrez ID 5106 is missing from the TxDb.Hsapiens.UCSC.hg38.knownGene annotation. Here that gene in NCBI: https://www.ncbi.nlm.nih.gov/gene/?term=5106 For verification in the below code snippet here is the NCBI page for Myc: https://www.ncbi.nlm.nih.gov/gene/4609

This R code illustrates that the gene is missing:

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(AnnotationDbi)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
txdb_old <- TxDb.Hsapiens.UCSC.hg19.knownGene

myc <- 4609
pck2 <- 5106

# both genes are here in the old annotation
g_ranges_myc_old <- genes(txdb_old)[genes(txdb_old)$gene_id %in% myc] print(g_ranges_myc_old) g_ranges_pck2_old<- genes(txdb_old)[genes(txdb_old)$gene_id  %in%  pck2]
print(g_ranges_pck2_old)

#pck2 is missing in the newer annotation
g_ranges_myc <- genes(txdb)[genes(txdb)$gene_id %in% myc] print(g_ranges_myc) g_ranges_pck2 <- genes(txdb)[genes(txdb)$gene_id  %in%  pck2]
print(g_ranges_pck2)


If the gene is in fact missing why would that be? Could the problem be the naive way that I'm accessing the Granges?

2
Entering edit mode
@kaylainterdonato-17327
Last seen 3 months ago
United States

I think there are some similar questions posted here and here.

Try setting single.strand.genes.only = FALSE, the structure will be a bit different but you will see PCK2.

> regions <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene, single.strand.genes.only = FALSE)
> names(regions)[1:5]
[1] "1"     "10"    "100"   "1000"  "10000"
> ids <- c("MYC" = "4609", "PCK2" = "5106")
> ids %in% names(regions)
[1] TRUE TRUE
> which(names(regions) == "4609")
[1] 16632
> which(names(regions) == "5106")
[1] 17228
> regions[16632]
GRangesList object of length 1:
$4609 GRanges object with 1 range and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> [1] chr8 127735434-127742951 + ------- seqinfo: 595 sequences (1 circular) from hg38 genome > regions[17228] GRangesList object of length 1:$5106
GRanges object with 2 ranges and 0 metadata columns:
seqnames            ranges strand
<Rle>         <IRanges>  <Rle>
[1]                chr14 24094053-24110598      +
[2] chr14_KZ208920v1_fix     395031-411576      +
-------
seqinfo: 595 sequences (1 circular) from hg38 genome


Hope this helps!

0
Entering edit mode

Thank you so much! Do you know where I can learn more about the sequence "chr14KZ208920v1fix"? Which of these gene locations should I trust more? Do these two locations imply that humans have two copies of this gene on chromosome 14? Or is the one with 'fix' in it better because someone 'fixed' the old one?

0
Entering edit mode

There is information at UCSC, for example. See under Assembly details

0
Entering edit mode

Hi,

chr14KZ208920v1fix is a "patch" sequence. It's called HG1_PATCH at NCBI:

library(GenomeInfoDb)

## getChromInfoFromUCSC() is only available in BioC devel for now.
hg38_chrom_info <- getChromInfoFromUCSC("hg38", map.NCBI=TRUE)
hg38_chrom_info <- hg38_chrom_info[c("chrom", "NCBI.SequenceName", "NCBI.SequenceRole", "NCBI.GenBankAccn", "NCBI.RefSeqAccn")]

hg38_chrom_info[1:4, ]
#   chrom NCBI.SequenceName  NCBI.SequenceRole NCBI.GenBankAccn NCBI.RefSeqAccn
# 1  chr1                 1 assembled-molecule       CM000663.2    NC_000001.11
# 2  chr2                 2 assembled-molecule       CM000664.2    NC_000002.12
# 3  chr3                 3 assembled-molecule       CM000665.2    NC_000003.12
# 4  chr4                 4 assembled-molecule       CM000666.2    NC_000004.12

table(hg38_chrom_info$NCBI.SequenceRole) # assembled-molecule alt-scaffold unlocalized-scaffold # 25 261 42 # unplaced-scaffold pseudo-scaffold fix-patch # 127 0 70 # novel-patch # 70 as.list(hg38_chrom_info[hg38_chrom_info$chrom == "chr14_KZ208920v1_fix", ])
# $chrom # [1] "chr14_KZ208920v1_fix" # #$NCBI.SequenceName
# [1] "HG1_PATCH"
#
# $NCBI.SequenceRole # [1] fix-patch # 7 Levels: assembled-molecule alt-scaffold ... novel-patch # #$NCBI.GenBankAccn
# [1] "KZ208920.1"
#
# \$NCBI.RefSeqAccn
# [1] "NW_018654722.1"


Right now hg38 matches GRCh38.p12. Patch sequences get added to every new revision of the original GRCh38 assembly. They describe corrections to the original sequences.

So no, the fact that a gene is mapped to a "patch" sequence in addition to a chromosome doesn't mean that humans have 2 copies of the gene. It only means that the gene happens to be located in the patched region. And UCSC is providing its location with respect to the original (unmodified) chromosome sequence and also with respect to the patched region. This is confirmed by the fact that the range on chr14 and the range on chr14KZ208920v1fix have the same width (16545).

So what we see is just an artifact of data representation.

That being said the default behavior of the genes() extractor is admittedly confusing and we are considering changing it. See https://github.com/Bioconductor/GenomicFeatures/pull/20

Hope this helps,

H.