Revisiting gene name inclusion in TxDB
@kvittingseerup-7956

According to the GenomicFeatures annotation and this post it was a conscious decision to not include gene_name in the TxDb objects from the GenomicFeatures package.

Could the decision not to include gene names be revisited?

I think we can all agree that most end users would like gene names associated with their analysis, since for most people these identifiers are what supply the link to biological knowledge. Given this, and given that one of the main advantages of Bioconductor is making data integration easy and seamless, the decision to omit gene names seems a bit out of character.

Apart from this, I also think it is a bit of a shame given the recent push towards automating away many of the low-level data handling hurdles in bioinformatics via packages such as tximeta.

Lastly, it does seem that the TxDb-like object from the ensembldb package contains gene names, so adding them to the regular TxDb would further streamline Bioconductor.

Looking forward to hearing your thoughts.

Cheers

Kristoffer

Bioconductor GenomicFeatures TxDB
@james-w-macdonald-5106

If you want to do mappings that involve a gene ID you can use an OrganismDb package.


> library(Homo.sapiens)

Warning message:
package 'OrganismDbi' was built under R version 4.0.3
> columns(Homo.sapiens)
 [1] "ACCNUM"       "ALIAS"        "CDSCHROM"     "CDSEND"       "CDSID"
 [6] "CDSNAME"      "CDSSTART"     "CDSSTRAND"    "DEFINITION"   "ENSEMBL"
[11] "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"
[16] "EVIDENCEALL"  "EXONCHROM"    "EXONEND"      "EXONID"       "EXONNAME"
[21] "EXONRANK"     "EXONSTART"    "EXONSTRAND"   "GENEID"       "GENENAME"
[26] "GO"           "GOALL"        "GOID"         "IPI"          "MAP"
[31] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"
[36] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "TERM"
[41] "TXCHROM"      "TXEND"        "TXID"         "TXNAME"       "TXSTART"
[46] "TXSTRAND"     "TXTYPE"       "UCSCKG"       "UNIGENE"      "UNIPROT"
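
With those columns available, a lookup from gene IDs to symbols and transcript names could look roughly like the sketch below. It assumes the Homo.sapiens package (and its dependencies) is installed; the two keys are simply the first two Entrez IDs the package reports, picked only for illustration.

```r
# Sketch only: needs the Bioconductor package Homo.sapiens installed.
library(Homo.sapiens)

# Take two Entrez gene IDs known to the package as example keys.
gene_ids <- head(keys(Homo.sapiens, keytype = "ENTREZID"), 2)

# Ask for the gene symbol and transcript names that belong to those genes.
tx_anno <- select(Homo.sapiens,
                  keys = gene_ids,
                  keytype = "ENTREZID",
                  columns = c("SYMBOL", "TXNAME"))
head(tx_anno)
```

A gene with several transcripts yields one row per transcript, so the result is usually longer than the key vector.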


I do know about this approach (it is also mentioned in the post I link to above). Unfortunately it will not generalize: 1) org.db packages do not exist for all species, 2) it requires user input, which is what I (and others) would like to automate away, and 3) you will often run into problems with differing transcriptome assembly versions (e.g. new genes in less well-studied species).

Lastly, it seems strange not to import gene names from e.g. GTF files, since the information is there (and is actually imported, just not used).
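
To illustrate that the information really is sitting in the file, here is a minimal base-R sketch that pulls gene_name out of the GTF attribute column and builds a transcript-to-gene table. The two records, their IDs and gene names are made up, and the helper get_attr() is hypothetical; this is not how GenomicFeatures parses GTF files.

```r
# Two made-up GTF records (hypothetical IDs), tab-separated as in a real GTF file.
gtf_lines <- c(
  paste("chr1", "example", "transcript", "100", "900", ".", "+", ".",
        'gene_id "G1"; transcript_id "T1"; gene_name "ABC1";', sep = "\t"),
  paste("chr1", "example", "transcript", "2000", "2500", ".", "-", ".",
        'gene_id "G2"; transcript_id "T2"; gene_name "XYZ2";', sep = "\t")
)

# Read the nine standard GTF columns; quote = "" leaves the quoted
# attribute values untouched.
gtf <- read.delim(text = gtf_lines, header = FALSE, quote = "",
                  stringsAsFactors = FALSE,
                  col.names = c("seqname", "source", "feature", "start", "end",
                                "score", "strand", "frame", "attribute"))

# Hypothetical helper: extract the value of one attribute key
# (e.g. gene_name) from each attribute string; NA where the key is absent.
get_attr <- function(attribute, key) {
  pattern <- paste0('.*\\b', key, ' "([^"]*)".*')
  ifelse(grepl(pattern, attribute),
         sub(pattern, "\\1", attribute),
         NA_character_)
}

# Transcript-to-gene table that carries the gene name along.
tx2gene <- data.frame(
  tx_id     = get_attr(gtf$attribute, "transcript_id"),
  gene_id   = get_attr(gtf$attribute, "gene_id"),
  gene_name = get_attr(gtf$attribute, "gene_name"),
  stringsAsFactors = FALSE
)
tx2gene
```

Returning NA for a missing key matters here because not every GTF record carries a gene_name attribute.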


I strongly support the request by @kvittingseerup. I work a lot with non-model organisms for which OrgDb packages are not available, so including the gene names would make my life much easier.