Question

Revisiting gene name inclusion in TxDB

k.vitting.seerup ▴ 120

@kvittingseerup-7956

Last seen 8 months ago

European Union

According to the GenomicFeatures annotation and this post, it was a conscious decision not to include gene_name in the TxDb objects from the GenomicFeatures package.

Is the decision of not including gene names something that could be brought up again?

I think we can all agree that, in the end, the majority of end users would like gene names associated with their analysis, as these names are what supply the link to biological knowledge for most people. Given this, and the fact that one of the main advantages of Bioconductor is to make data integration easy and seamless, the decision to omit gene names seems a bit out of character.

Apart from this, I also think it is a bit of a shame given the recent push towards automating away many of the low-level data handling hurdles associated with bioinformatics via packages such as tximeta.

Lastly, it seems that the TxDb-like object from the ensembldb package (EnsDb) does contain gene names, so adding them to the regular TxDb would further streamline BioC.

Looking forward to hearing your thoughts.

Cheers

Kristoffer

Bioconductor GenomicFeatures TxDB • 1.0k views

ADD COMMENT • link updated 2.5 years ago by fabian.grammes • 0 • written 3.5 years ago by k.vitting.seerup ▴ 120

score 0 · Answer 1 · 2020-11-06

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 1 hour ago

United States

If you want to do mappings that involve a gene ID you can use an OrganismDb package.


> library(Homo.sapiens)
Loading required package: OrganismDbi
Loading required package: GO.db

Loading required package: org.Hs.eg.db

Loading required package: TxDb.Hsapiens.UCSC.hg19.knownGene
Warning message:
package 'OrganismDbi' was built under R version 4.0.3 
> columns(Homo.sapiens)
 [1] "ACCNUM"       "ALIAS"        "CDSCHROM"     "CDSEND"       "CDSID"       
 [6] "CDSNAME"      "CDSSTART"     "CDSSTRAND"    "DEFINITION"   "ENSEMBL"     
[11] "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
[16] "EVIDENCEALL"  "EXONCHROM"    "EXONEND"      "EXONID"       "EXONNAME"    
[21] "EXONRANK"     "EXONSTART"    "EXONSTRAND"   "GENEID"       "GENENAME"    
[26] "GO"           "GOALL"        "GOID"         "IPI"          "MAP"         
[31] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
[36] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "TERM"        
[41] "TXCHROM"      "TXEND"        "TXID"         "TXNAME"       "TXSTART"     
[46] "TXSTRAND"     "TXTYPE"       "UCSCKG"       "UNIGENE"      "UNIPROT"
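
To make the suggestion above concrete, here is a minimal sketch of a lookup that goes from gene IDs to gene symbols and transcript names through the OrganismDb object. It assumes the Homo.sapiens package is installed (the guard skips the example otherwise); the choice of the first three Entrez IDs as keys is arbitrary and purely for illustration.

```r
# Sketch only: run the lookup if Homo.sapiens is installed, otherwise skip.
if (requireNamespace("Homo.sapiens", quietly = TRUE)) {
  suppressPackageStartupMessages(library(Homo.sapiens))

  # A few Entrez gene IDs to use as lookup keys (arbitrary choice).
  gene_ids <- head(keys(Homo.sapiens, keytype = "ENTREZID"), 3)

  # Combine gene-level (SYMBOL) and transcript-level (TXNAME) columns
  # in a single query keyed on the gene ID.
  res <- select(Homo.sapiens,
                keys    = gene_ids,
                columns = c("SYMBOL", "TXNAME"),
                keytype = "ENTREZID")
  print(res)
}
```

The result is a plain data.frame with one row per gene/transcript combination, so it can be merged onto other per-transcript results.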

ADD COMMENT • link 3.5 years ago James W. MacDonald 65k

I do know about this approach (it is also mentioned in the post I link to above). Unfortunately it will not generalize: 1) OrgDb packages do not exist for all species, 2) it requires user input, which is what I (and others) would like to automate away, and 3) you will often run into problems with differing transcriptome assembly versions (e.g. new genes in less well-studied species).

Lastly, it seems strange not to carry over gene names from e.g. GTF files, since the information is there (and is actually imported, just not used).
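
As an illustration of the point that the gene name is already present in the GTF, the following base-R sketch extracts a transcript-to-gene table, including gene_name, straight from the attribute column. The three GTF records and the helper `get_gtf_attr()` are made up for this example; in practice a GTF reader such as `rtracklayer::import()` would handle the parsing.

```r
# Three made-up GTF records (tab-separated, 9 columns); not real annotation.
gtf_lines <- c(
  'chr1\tsketch\tgene\t100\t900\t.\t+\t.\tgene_id "G0001"; gene_name "alpha";',
  'chr1\tsketch\ttranscript\t100\t900\t.\t+\t.\tgene_id "G0001"; transcript_id "T0001"; gene_name "alpha";',
  'chr2\tsketch\ttranscript\t50\t400\t.\t-\t.\tgene_id "G0002"; transcript_id "T0002";'
)

# Hypothetical helper: pull one attribute (e.g. gene_name) out of GTF
# attribute strings; records lacking the attribute yield NA.
get_gtf_attr <- function(attrs, key) {
  pattern <- paste0('(^|; *)', key, ' "([^"]*)"')
  m <- regmatches(attrs, regexec(pattern, attrs))
  vapply(m, function(x) if (length(x) == 3) x[3] else NA_character_,
         character(1))
}

# Split each record into its columns; column 3 is the feature type,
# column 9 holds the key/value attributes.
fields <- strsplit(gtf_lines, "\t", fixed = TRUE)
type   <- vapply(fields, `[`, character(1), 3)
attrs  <- vapply(fields, `[`, character(1), 9)

# Build the transcript-to-gene table from the transcript records only.
tx <- type == "transcript"
tx2gene <- data.frame(
  tx_name   = get_gtf_attr(attrs[tx], "transcript_id"),
  gene_id   = get_gtf_attr(attrs[tx], "gene_id"),
  gene_name = get_gtf_attr(attrs[tx], "gene_name"),
  stringsAsFactors = FALSE
)
print(tx2gene)
```

The second transcript has no gene_name attribute and gets NA, which is the case less well-annotated species would hit most often.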

ADD REPLY • link 3.4 years ago k.vitting.seerup ▴ 120

I strongly support the request by @kvittingseerup. I work a lot with non-model organisms for which OrgDb packages are not available, so including the gene names would make my life much easier.

ADD REPLY • link 2.5 years ago fabian.grammes • 0