Question

Help with Missing Gene from hg38

0

Entering edit mode

jam526 • 0

@jam526-22332

Last seen 4.6 years ago

Hi. I am doing a fairly simple promoter analysis using the genome assembly hg38. The problem that it seems that the gene PCK2 with Entrez ID 5106 is missing from the TxDb.Hsapiens.UCSC.hg38.knownGene annotation. Here that gene in NCBI: https://www.ncbi.nlm.nih.gov/gene/?term=5106 For verification in the below code snippet here is the NCBI page for Myc: https://www.ncbi.nlm.nih.gov/gene/4609

This R code illustrates that the gene is missing:

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(AnnotationDbi)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
txdb_old <- TxDb.Hsapiens.UCSC.hg19.knownGene

myc <- 4609
pck2 <- 5106

# both genes are here in the old annotation
g_ranges_myc_old <- genes(txdb_old)[genes(txdb_old)$gene_id  %in%  myc]
print(g_ranges_myc_old)
g_ranges_pck2_old<- genes(txdb_old)[genes(txdb_old)$gene_id  %in%  pck2] 
print(g_ranges_pck2_old)


#pck2 is missing in the newer annotation
g_ranges_myc <- genes(txdb)[genes(txdb)$gene_id  %in%  myc]
print(g_ranges_myc)
g_ranges_pck2 <- genes(txdb)[genes(txdb)$gene_id  %in%  pck2] 
print(g_ranges_pck2)

If the gene is in fact missing why would that be? Could the problem be the naive way that I'm accessing the Granges?

annotation genome hg38 TxDb.Hsapiens.UCSC.hg38.knownGene BSgenome.Hsapiens.UCSC.hg38 • 1.2k views

ADD COMMENT • link updated 4.8 years ago by kayla.interdonato ▴ 160 • written 4.8 years ago by jam526 • 0

score 2 · Accepted Answer · 2020-02-12

2

Entering edit mode

kayla.interdonato ▴ 160

@kaylainterdonato-17327

Last seen 10 months ago

United States

I think there are some similar questions posted here and here.

Try setting single.strand.genes.only = FALSE, the structure will be a bit different but you will see PCK2.

> regions <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene, single.strand.genes.only = FALSE)
> names(regions)[1:5]
[1] "1"     "10"    "100"   "1000"  "10000"
> ids <- c("MYC" = "4609", "PCK2" = "5106")
> ids %in% names(regions)
[1] TRUE TRUE
> which(names(regions) == "4609")
[1] 16632
> which(names(regions) == "5106")
[1] 17228
> regions[16632]
GRangesList object of length 1:
$`4609`
GRanges object with 1 range and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]     chr8 127735434-127742951      +
  -------
  seqinfo: 595 sequences (1 circular) from hg38 genome

> regions[17228]
GRangesList object of length 1:
$`5106`
GRanges object with 2 ranges and 0 metadata columns:
                  seqnames            ranges strand
                     <Rle>         <IRanges>  <Rle>
  [1]                chr14 24094053-24110598      +
  [2] chr14_KZ208920v1_fix     395031-411576      +
  -------
  seqinfo: 595 sequences (1 circular) from hg38 genome

Hope this helps!

ADD COMMENT • link 4.8 years ago kayla.interdonato ▴ 160

0

Entering edit mode

Thank you so much! Do you know where I can learn more about the sequence "chr14KZ208920v1fix"? Which of these gene locations should I trust more? Do these two locations imply that humans have two copies of this gene on chromosome 14? Or is the one with 'fix' in it better because someone 'fixed' the old one?

ADD REPLY • link 4.8 years ago jam526 • 0

0

Entering edit mode

There is information at UCSC, for example. See under Assembly details

ADD REPLY • link 4.8 years ago James W. MacDonald 67k

0

Entering edit mode

Hi,

chr14KZ208920v1fix is a "patch" sequence. It's called HG1_PATCH at NCBI:

library(GenomeInfoDb)

## getChromInfoFromUCSC() is only available in BioC devel for now.
hg38_chrom_info <- getChromInfoFromUCSC("hg38", map.NCBI=TRUE)
hg38_chrom_info <- hg38_chrom_info[c("chrom", "NCBI.SequenceName", "NCBI.SequenceRole", "NCBI.GenBankAccn", "NCBI.RefSeqAccn")]

hg38_chrom_info[1:4, ]
#   chrom NCBI.SequenceName  NCBI.SequenceRole NCBI.GenBankAccn NCBI.RefSeqAccn
# 1  chr1                 1 assembled-molecule       CM000663.2    NC_000001.11
# 2  chr2                 2 assembled-molecule       CM000664.2    NC_000002.12
# 3  chr3                 3 assembled-molecule       CM000665.2    NC_000003.12
# 4  chr4                 4 assembled-molecule       CM000666.2    NC_000004.12

table(hg38_chrom_info$NCBI.SequenceRole)
#   assembled-molecule         alt-scaffold unlocalized-scaffold 
#                   25                  261                   42 
#    unplaced-scaffold      pseudo-scaffold            fix-patch 
#                  127                    0                   70 
#          novel-patch 
#                   70 

as.list(hg38_chrom_info[hg38_chrom_info$chrom == "chr14_KZ208920v1_fix", ])
# $chrom
# [1] "chr14_KZ208920v1_fix"
#
# $NCBI.SequenceName
# [1] "HG1_PATCH"
#
# $NCBI.SequenceRole
# [1] fix-patch
# 7 Levels: assembled-molecule alt-scaffold ... novel-patch
#
# $NCBI.GenBankAccn
# [1] "KZ208920.1"
#
# $NCBI.RefSeqAccn
# [1] "NW_018654722.1"

Right now hg38 matches GRCh38.p12. Patch sequences get added to every new revision of the original GRCh38 assembly. They describe corrections to the original sequences.

So no, the fact that a gene is mapped to a "patch" sequence in addition to a chromosome doesn't mean that humans have 2 copies of the gene. It only means that the gene happens to be located in the patched region. And UCSC is providing its location with respect to the original (unmodified) chromosome sequence and also with respect to the patched region. This is confirmed by the fact that the range on chr14 and the range on chr14KZ208920v1fix have the same width (16545).

So what we see is just an artifact of data representation.

That being said the default behavior of the genes() extractor is admittedly confusing and we are considering changing it. See https://github.com/Bioconductor/GenomicFeatures/pull/20

Hope this helps,

H.

ADD REPLY • link 4.8 years ago Hervé Pagès 16k