Question: POU5F1(OCT4) /ENTREZID: 5460 is not in the TxDb.Hsapiens.UCSC.hg19.knownGene
gravatar for tangming2005
5 months ago by
United States
tangming200590 wrote:


I was trying to get the coordinates of some genes, and was surprised that POU5F1 gene is not in the txdb.

Thank you for looking into this.


txdb<- TxDb.Hsapiens.UCSC.hg19.knownGene
hg19.genes<- genes(txdb)


## note that dplyr and AnnotationDbi both have a function called select
## use dplyr::select when use dplyr

gene_symbol<- AnnotationDbi::select(, keys=hg19.genes$gene_id, 
                                    columns="SYMBOL", keytype="ENTREZID")

"POU5F1" %in% gene_symbol$SYMBOL


AnnotationDbi::select(, keys="POU5F1", 
                                    columns= "ENTREZID", keytype= "SYMBOL")



#Error: subscript contains invalid names

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] dplyr_0.7.0                             AnnotationHub_2.6.0                                
 [4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.26.0                  AnnotationDbi_1.36.0                   
 [7] Biobase_2.34.0                          GenomicRanges_1.26.1                    GenomeInfoDb_1.10.1                    
[10] IRanges_2.8.1                           S4Vectors_0.12.1                        BiocGenerics_0.20.0                    
[13] here_0.0-6                             

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8                   BiocInstaller_1.24.0          XVector_0.14.0                bitops_1.0-6                 
 [5] tools_3.3.1                   zlibbioc_1.20.0               biomaRt_2.30.0                digest_0.6.10                
 [9] tibble_1.3.3                  RSQLite_1.0.0                 lattice_0.20-34               rlang_0.1.1                  
[13] Matrix_1.2-7.1                shiny_0.14.1                  DBI_0.5-1                     curl_2.6                     
[17] rtracklayer_1.34.1            httr_1.2.1                    knitr_1.14                    Biostrings_2.42.1            
[21] rprojroot_1.2                 grid_3.3.1                    glue_1.1.1                    R6_2.2.0                     
[25] XML_3.98-1.5                  readxl_0.1.1                  BiocParallel_1.8.0            magrittr_1.5                 
[29] Rsamtools_1.26.1              backports_1.0.5               htmltools_0.3.5               GenomicAlignments_1.10.0     
[33] assertthat_0.1                SummarizedExperiment_1.4.0    xtable_1.8-2                  mime_0.5                     
[37] interactiveDisplayBase_1.12.0 httpuv_1.3.3                  RCurl_1.95-4.8          
ADD COMMENTlink modified 5 months ago by jma199130 • written 5 months ago by tangming200590
gravatar for jma1991
5 months ago by
jma199130 wrote:

It's called POU5F1B instead of POU5F1:


txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
hg19.genes <- genes(txdb)


gene_symbol <- AnnotationDbi::select(, keys=hg19.genes$gene_id, columns="SYMBOL", keytype="ENTREZID")

"POU5F1B" %in% gene_symbol$SYMBOL
ADD COMMENTlink written 5 months ago by jma199130

Thanks! I have a list of genes which I can not find in the txdb.


where can I find the alternative names?



ADD REPLYlink written 5 months ago by tangming200590
> select(, c("DAXX","HLA-A","SHOX","TRIM27"), "ENTREZID", "ALIAS")
'select()' returned 1:1 mapping between keys and columns
1   DAXX     1616
2  HLA-A     3105
3   SHOX     6473
4 TRIM27     5987
ADD REPLYlink written 5 months ago by James W. MacDonald45k

thanks. looks like I can not find DAXX, HLA-A, SHOX and TRIM27 in TxDb.Hsapiens.UCSC.hg19.knownGene even using the other names. but POU5F1B is not shown either

AnnotationDbi::select(, c("POU5F1", "DAXX", "HLA-A","SHOX","TRIM27"), keytype = "SYMBOL", columns = "ALIAS")
'select()' returned 1:many mapping between keys and columns
1  POU5F1   OCT3
2  POU5F1   OCT4
3  POU5F1  OTF-3
4  POU5F1   OTF3
5  POU5F1   OTF4
6  POU5F1  Oct-3
7  POU5F1  Oct-4
8  POU5F1 POU5F1
9    DAXX  BING2
10   DAXX   DAP6
11   DAXX   EAP1
12   DAXX   DAXX
13  HLA-A   HLAA
14  HLA-A  HLA-A
15   SHOX   GCFX
16   SHOX   PHOG
18   SHOX     SS
19   SHOX   SHOX
20 TRIM27    RFP
21 TRIM27  RNF76
22 TRIM27 TRIM27

using the ENTREZID.

ADD REPLYlink modified 5 months ago • written 5 months ago by tangming200590

What you show and what you say are sort of orthogonal things. Do note that using select on the package isn't the same thing as doing (like, anything) with the TxDb.Hsapiens.UCSC.hg19.knownGene package. So I am unsure what the code you show is supposed to illustrate, exactly.

But this brings up the question of "what is a gene, and how should one define it". Currently, we use a relatively simplistic definition of 'the region between the start of the transcripts from a gene to the end of the transcripts for that gene'. Which is sort of problematic for some genes. For example, some of the lincRNA genes can be found in two places on a given chromosome, so you end up with the (unlikely) gene region that encompasses some huge proportion of the chromosome.

This is also a problematic definition for - as it so happens - all of the genes you are talking about. So let's see:

> txs <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene)
> gns <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
> library(

> toget <- mapIds(, c("POU5F1", "DAXX", "HLA-A","SHOX","TRIM27"), "ENTREZID", "ALIAS")
'select()' returned 1:1 mapping between keys and columns
> toget
"5460" "1616" "3105" "6473" "5987"

So there are the Entrez Gene IDs for those genes. Now let's subset the GRanges of our genes:

> gns[toget]
Error: subscript contains invalid names
> toget %in% names(gns)

How can this be? Well, let's look at the transcripts:

> toget %in% names(txs)

So we have transcripts for all those genes, but no genes? That seems...wrong? This next bit of code is a bit complex, primarily to clean it up to more easily show what is happening:

> lapply(txs[toget], function(x) as.character(unique(seqnames(x))))
[1] "chr6"           "chr6_cox_hap2"  "chr6_dbb_hap3"  "chr6_mann_hap4"
[5] "chr6_mcf_hap5"  "chr6_qbl_hap6"  "chr6_ssto_hap7"

[1] "chr6"          "chr6_cox_hap2" "chr6_dbb_hap3" "chr6_mcf_hap5"
[5] "chr6_qbl_hap6"

[1] "chr6"           "chr6_apd_hap1"  "chr6_cox_hap2"  "chr6_dbb_hap3"
[5] "chr6_mann_hap4" "chr6_mcf_hap5"  "chr6_qbl_hap6"  "chr6_ssto_hap7"

[1] "chrX" "chrY"

[1] "chr6"           "chr6_apd_hap1"  "chr6_cox_hap2"  "chr6_mann_hap4"
[5] "chr6_mcf_hap5"  "chr6_qbl_hap6"  "chr6_ssto_hap7"

Every one of those genes is found on more than one chromosome (or unplaced scaffold), in which case our definition of a 'gene' fails, because you can't define a region between the start and stop of the transcripts if those transcripts are on different chromosomes!

ADD REPLYlink written 5 months ago by James W. MacDonald45k

Hi James, thanks very much for this detailed answer. now I remember a related question you answered me some time ago C: TxDb.Hsapiens.UCSC.hg19.knownGene misannotation of a gene

again, your help is much appreciated. learned a lot!


ADD REPLYlink written 5 months ago by tangming200590

Actually I'm a bit confused now.. Neither the gene symbol or alias returns the canonical OCT4/POU5F1 gene:


genes <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
symbols <- mapIds(, keys = genes$gene_id, "SYMBOL", keytype = "ENTREZID")
alias <- mapIds(, keys = genes$gene_id, "ALIAS", keytype = "ENTREZID")

grep("POU", symbols, value = TRUE)
grep("POU", alias, value = TRUE)

grep("OCT", symbols, value = TRUE)
grep("OCT", alias, value = TRUE)


ADD REPLYlink written 5 months ago by jma199130

Why are you confused? I just explained in some detail why you won't get that gene doing what you have done.


ADD REPLYlink written 5 months ago by James W. MacDonald45k

Sorry, so are you saying OCT4/POU5F1 has transcripts on different chromosomes?

ADD REPLYlink written 5 months ago by jma199130

No, that's not what I said. Here are the locations for the gene you care about:

[1] "chr6"           "chr6_cox_hap2"  "chr6_dbb_hap3"  "chr6_mann_hap4"
[5] "chr6_mcf_hap5"  "chr6_qbl_hap6"  "chr6_ssto_hap7"

And all of those things are 1) Chromosome 6, and 2) a bunch of unplaced or haplotype scaffolds. If you don't know what that means, you should listen to what Dan has to say.

ADD REPLYlink written 5 months ago by James W. MacDonald45k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 352 users visited in the last hour