Search
Question: POU5F1(OCT4) /ENTREZID: 5460 is not in the TxDb.Hsapiens.UCSC.hg19.knownGene
0
gravatar for tangming2005
7 weeks ago by
tangming200590
United States
tangming200590 wrote:

Hi,

I was trying to get the coordinates of some genes, and was surprised that POU5F1 gene is not in the txdb.

Thank you for looking into this.

Tommy

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb<- TxDb.Hsapiens.UCSC.hg19.knownGene
hg19.genes<- genes(txdb)

library(AnnotationDbi)
library("org.Hs.eg.db")

## note that dplyr and AnnotationDbi both have a function called select
## use dplyr::select when use dplyr

gene_symbol<- AnnotationDbi::select(org.Hs.eg.db, keys=hg19.genes$gene_id, 
                                    columns="SYMBOL", keytype="ENTREZID")

"POU5F1" %in% gene_symbol$SYMBOL

#FALSE

AnnotationDbi::select(org.Hs.eg.db, keys="POU5F1", 
                                    columns= "ENTREZID", keytype= "SYMBOL")

#5460

hg19.genes["5460",]

#Error: subscript contains invalid names


> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] dplyr_0.7.0                             AnnotationHub_2.6.0                     org.Hs.eg.db_3.4.0                     
 [4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.26.0                  AnnotationDbi_1.36.0                   
 [7] Biobase_2.34.0                          GenomicRanges_1.26.1                    GenomeInfoDb_1.10.1                    
[10] IRanges_2.8.1                           S4Vectors_0.12.1                        BiocGenerics_0.20.0                    
[13] here_0.0-6                             

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8                   BiocInstaller_1.24.0          XVector_0.14.0                bitops_1.0-6                 
 [5] tools_3.3.1                   zlibbioc_1.20.0               biomaRt_2.30.0                digest_0.6.10                
 [9] tibble_1.3.3                  RSQLite_1.0.0                 lattice_0.20-34               rlang_0.1.1                  
[13] Matrix_1.2-7.1                shiny_0.14.1                  DBI_0.5-1                     curl_2.6                     
[17] rtracklayer_1.34.1            httr_1.2.1                    knitr_1.14                    Biostrings_2.42.1            
[21] rprojroot_1.2                 grid_3.3.1                    glue_1.1.1                    R6_2.2.0                     
[25] XML_3.98-1.5                  readxl_0.1.1                  BiocParallel_1.8.0            magrittr_1.5                 
[29] Rsamtools_1.26.1              backports_1.0.5               htmltools_0.3.5               GenomicAlignments_1.10.0     
[33] assertthat_0.1                SummarizedExperiment_1.4.0    xtable_1.8-2                  mime_0.5                     
[37] interactiveDisplayBase_1.12.0 httpuv_1.3.3                  RCurl_1.95-4.8          
ADD COMMENTlink modified 7 weeks ago by jma199130 • written 7 weeks ago by tangming200590
1
gravatar for jma1991
7 weeks ago by
jma199130
jma199130 wrote:

It's called POU5F1B instead of POU5F1:

library("TxDb.Hsapiens.UCSC.hg19.knownGene")

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
hg19.genes <- genes(txdb)

library("AnnotationDbi")
library("org.Hs.eg.db")

gene_symbol <- AnnotationDbi::select(org.Hs.eg.db, keys=hg19.genes$gene_id, columns="SYMBOL", keytype="ENTREZID")

"POU5F1B" %in% gene_symbol$SYMBOL
ADD COMMENTlink written 7 weeks ago by jma199130

Thanks! I have a list of genes which I can not find in the txdb.

DAXX, HLA-A, SHOX and TRIM27.

where can I find the alternative names?

Tommy

 

ADD REPLYlink written 7 weeks ago by tangming200590
1
> select(org.Hs.eg.db, c("DAXX","HLA-A","SHOX","TRIM27"), "ENTREZID", "ALIAS")
'select()' returned 1:1 mapping between keys and columns
   ALIAS ENTREZID
1   DAXX     1616
2  HLA-A     3105
3   SHOX     6473
4 TRIM27     5987
ADD REPLYlink written 7 weeks ago by James W. MacDonald45k

thanks. looks like I can not find DAXX, HLA-A, SHOX and TRIM27 in TxDb.Hsapiens.UCSC.hg19.knownGene even using the other names. but POU5F1B is not shown either

AnnotationDbi::select(org.Hs.eg.db, c("POU5F1", "DAXX", "HLA-A","SHOX","TRIM27"), keytype = "SYMBOL", columns = "ALIAS")
'select()' returned 1:many mapping between keys and columns
   SYMBOL  ALIAS
1  POU5F1   OCT3
2  POU5F1   OCT4
3  POU5F1  OTF-3
4  POU5F1   OTF3
5  POU5F1   OTF4
6  POU5F1  Oct-3
7  POU5F1  Oct-4
8  POU5F1 POU5F1
9    DAXX  BING2
10   DAXX   DAP6
11   DAXX   EAP1
12   DAXX   DAXX
13  HLA-A   HLAA
14  HLA-A  HLA-A
15   SHOX   GCFX
16   SHOX   PHOG
17   SHOX  SHOXY
18   SHOX     SS
19   SHOX   SHOX
20 TRIM27    RFP
21 TRIM27  RNF76
22 TRIM27 TRIM27

using the ENTREZID.

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by tangming200590

What you show and what you say are sort of orthogonal things. Do note that using select on the org.Hs.eg.db package isn't the same thing as doing (like, anything) with the TxDb.Hsapiens.UCSC.hg19.knownGene package. So I am unsure what the code you show is supposed to illustrate, exactly.

But this brings up the question of "what is a gene, and how should one define it". Currently, we use a relatively simplistic definition of 'the region between the start of the transcripts from a gene to the end of the transcripts for that gene'. Which is sort of problematic for some genes. For example, some of the lincRNA genes can be found in two places on a given chromosome, so you end up with the (unlikely) gene region that encompasses some huge proportion of the chromosome.

This is also a problematic definition for - as it so happens - all of the genes you are talking about. So let's see:

> txs <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene)
> gns <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
> library(org.Hs.eg.db)

> toget <- mapIds(org.Hs.eg.db, c("POU5F1", "DAXX", "HLA-A","SHOX","TRIM27"), "ENTREZID", "ALIAS")
'select()' returned 1:1 mapping between keys and columns
> toget
POU5F1   DAXX  HLA-A   SHOX TRIM27
"5460" "1616" "3105" "6473" "5987"

So there are the Entrez Gene IDs for those genes. Now let's subset the GRanges of our genes:

> gns[toget]
Error: subscript contains invalid names
> toget %in% names(gns)
[1] FALSE FALSE FALSE FALSE FALSE

How can this be? Well, let's look at the transcripts:

> toget %in% names(txs)
[1] TRUE TRUE TRUE TRUE TRUE

So we have transcripts for all those genes, but no genes? That seems...wrong? This next bit of code is a bit complex, primarily to clean it up to more easily show what is happening:

> lapply(txs[toget], function(x) as.character(unique(seqnames(x))))
$`5460`
[1] "chr6"           "chr6_cox_hap2"  "chr6_dbb_hap3"  "chr6_mann_hap4"
[5] "chr6_mcf_hap5"  "chr6_qbl_hap6"  "chr6_ssto_hap7"

$`1616`
[1] "chr6"          "chr6_cox_hap2" "chr6_dbb_hap3" "chr6_mcf_hap5"
[5] "chr6_qbl_hap6"

$`3105`
[1] "chr6"           "chr6_apd_hap1"  "chr6_cox_hap2"  "chr6_dbb_hap3"
[5] "chr6_mann_hap4" "chr6_mcf_hap5"  "chr6_qbl_hap6"  "chr6_ssto_hap7"

$`6473`
[1] "chrX" "chrY"

$`5987`
[1] "chr6"           "chr6_apd_hap1"  "chr6_cox_hap2"  "chr6_mann_hap4"
[5] "chr6_mcf_hap5"  "chr6_qbl_hap6"  "chr6_ssto_hap7"


Every one of those genes is found on more than one chromosome (or unplaced scaffold), in which case our definition of a 'gene' fails, because you can't define a region between the start and stop of the transcripts if those transcripts are on different chromosomes!

ADD REPLYlink written 7 weeks ago by James W. MacDonald45k

Hi James, thanks very much for this detailed answer. now I remember a related question you answered me some time ago C: TxDb.Hsapiens.UCSC.hg19.knownGene misannotation of a gene

again, your help is much appreciated. learned a lot!

Tommy

ADD REPLYlink written 7 weeks ago by tangming200590

Actually I'm a bit confused now.. Neither the gene symbol or alias returns the canonical OCT4/POU5F1 gene:

library("org.Hs.eg.db")
library("TxDb.Hsapiens.UCSC.hg19.knownGene")

genes <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
symbols <- mapIds(org.Hs.eg.db, keys = genes$gene_id, "SYMBOL", keytype = "ENTREZID")
alias <- mapIds(org.Hs.eg.db, keys = genes$gene_id, "ALIAS", keytype = "ENTREZID")

grep("POU", symbols, value = TRUE)
grep("POU", alias, value = TRUE)

grep("OCT", symbols, value = TRUE)
grep("OCT", alias, value = TRUE)

 

ADD REPLYlink written 7 weeks ago by jma199130

Why are you confused? I just explained in some detail why you won't get that gene doing what you have done.


 

ADD REPLYlink written 7 weeks ago by James W. MacDonald45k

Sorry, so are you saying OCT4/POU5F1 has transcripts on different chromosomes?

ADD REPLYlink written 7 weeks ago by jma199130

No, that's not what I said. Here are the locations for the gene you care about:

$`5460`
[1] "chr6"           "chr6_cox_hap2"  "chr6_dbb_hap3"  "chr6_mann_hap4"
[5] "chr6_mcf_hap5"  "chr6_qbl_hap6"  "chr6_ssto_hap7"

And all of those things are 1) Chromosome 6, and 2) a bunch of unplaced or haplotype scaffolds. If you don't know what that means, you should listen to what Dan has to say.

ADD REPLYlink written 7 weeks ago by James W. MacDonald45k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 151 users visited in the last hour