PDCD1 missing from TxDb.Hsapiens.UCSC.hg38.knownGene?
2
0
Entering edit mode
enricoferrero ▴ 660
@enricoferrero-6037
Last seen 3.1 years ago
Switzerland

Can somebody help me understand why PDCD1 is missing from TxDb.Hsapiens.UCSC.hg38.knownGene (but not from TxDb.Hsapiens.UCSC.hg19.knownGene)?

library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

geneid <- mapIds(org.Hs.eg.db, "PDCD1", "ENTREZID", "SYMBOL")

hg19.genes <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
geneid %in% hg19.genes$gene_id
# TRUE

hg38.genes <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
geneid %in% hg38.genes$gene_id
# FALSE

The gene is clearly part of the GRCh38/hg38 genome assembly:

Thank you.

txdb txdb.hsapiens.ucsc.hg19.knowngene txdb.hsapiens.ucsc.hg38.knowngene org.hs.eg.db annotation • 3.8k views
ADD COMMENT
3
Entering edit mode
@james-w-macdonald-5106
Last seen 2 days ago
United States

It may have to do with the alternate location that exists in GRCh38

> txs <- transcriptsBy(TxDb.Hsapiens.UCSC.hg38.knownGene)
> txs["5133"]
GRangesList object of length 1:
$5133
GRanges object with 6 ranges and 2 metadata columns:
                 seqnames                 ranges strand |     tx_id     tx_name
                    <Rle>              <IRanges>  <Rle> | <integer> <character>
  [1]                chr2 [241849881, 241852980]      - |     15141  uc010fzs.3
  [2]                chr2 [241849881, 241858906]      - |     15142  uc002wcq.4
  [3]                chr2 [241849881, 241858906]      - |     15143  uc010fzt.3
  [4] chr2_KI270776v1_alt [    61979,     65078]      - |     92941  uc032rbb.1
  [5] chr2_KI270776v1_alt [    61979,     71004]      - |     92942  uc032rbc.1
  [6] chr2_KI270776v1_alt [    61979,     71004]      - |     92943  uc032rbd.1

-------
seqinfo: 455 sequences (1 circular) from hg38 genome

> txs2 <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene)
> txs2["5133"]
GRangesList object of length 1:
$5133
GRanges object with 3 ranges and 2 metadata columns:
      seqnames                 ranges strand |     tx_id     tx_name
         <Rle>              <IRanges>  <Rle> | <integer> <character>
  [1]     chr2 [242792033, 242795132]      - |     13057  uc010fzs.3
  [2]     chr2 [242792033, 242801058]      - |     13058  uc002wcq.4
  [3]     chr2 [242792033, 242801058]      - |     13059  uc010fzt.3

-------
seqinfo: 93 sequences (1 circular) from hg19 genome

I'm not sure you can give a single gene location for something that may occur in more than one place.

ADD COMMENT
1
Entering edit mode

Yes, that might be the reason, thanks. 

Based on your answer, I was able to rectify this in this way:

library(org.Hs.eg.db)

library(TxDb.Hsapiens.UCSC.hg38.knownGene)

geneid <- mapIds(org.Hs.eg.db, "PDCD1", "ENTREZID", "SYMBOL")

hg38.genes <- genes(keepStandardChromosomes(TxDb.Hsapiens.UCSC.hg38.knownGene))
geneid %in% hg38.genes$gene_id
# TRUE

Btw, this doesn't strike me as a sane default behaviour.

ADD REPLY
1
Entering edit mode

This is one of those things that reasonable people can argue about for, like, forever, and still not come up with a wholly acceptable compromise.

We sort of did something similar with the annotation packages in the past, where if there were a one-to-many match, we returned an NA, because we figured it wasn't our decision to say which returned value was 'the right one'. These days the select function returns everything, and we are all like 'you are on your own, dude', but mapIds by default just returns the first of any multiple match, hiding from those allergic to help pages the fact that there may be more matches. There is a cogent argument for either default, and I tend to want to err on the side of more information rather than less, but then you get people wanting to know what they should do with all this unwanted extra complexity.

ADD REPLY
2
Entering edit mode
enricoferrero ▴ 660
@enricoferrero-6037
Last seen 3.1 years ago
Switzerland

Use keepStandardChromosomes():

library(org.Hs.eg.db)

library(TxDb.Hsapiens.UCSC.hg38.knownGene)

geneid <- mapIds(org.Hs.eg.db, "PDCD1", "ENTREZID", "SYMBOL")

hg38.genes <- genes(keepStandardChromosomes(TxDb.Hsapiens.UCSC.hg38.knownGene))
geneid %in% hg38.genes$gene_id
# TRUE
ADD COMMENT

Login before adding your answer.

Traffic: 622 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6