Question

Setting package name in makeTxDbPackage

Diego Diez ▴ 760

I am creating a TxDb annotation package and have noticed that even when the organism string used to create the TxDb contains more than the genus and species, the extra information is not used in the package name. For example, I create a TxDb object:

url <- "http://plasmodb.org/common/downloads/Current_Release/PbergheiANKA/gff/data/PlasmoDB-24_PbergheiANKA.gff"

txdb <- makeTxDbFromGFF(
    file = url,
    dataSource = "PlasmoDB 24",
    organism = "Plasmodium berghei ANKA",
    chrominfo = <chrominfo>
    )

Then, I try to create a package with e.g.:

makeTxDbPackage(txdb = txdb,
                  version = "1.0.0",
                  maintainer = "Diego Diez <diego10ruiz@gmail.com>",
                  author = "Diego Diez",
                  destDir = "output/sequence")

which works, but the package is called "TxDb.Pberghei". Inside the DESCRIPTION file I can see the whole organism information has been correctly added:

Package: TxDb.Pberghei
Title: Annotation package for TxDb object(s)
Description: Exposes an annotation databases generated from Plasmodium berghei ANKA data from PlasmoDB release 24 by exposing these as TxDb objects
Version: 1.0.0
Author: Diego Diez
Maintainer: Diego Diez 
Depends: GenomicFeatures (>= 1.20.1)
Imports: GenomicFeatures, AnnotationDbi
License: Artistic-2.0
organism: Plasmodium berghei ANKA
species: Plasmodium berghei ANKA
provider: Plasmodium berghei ANKA data from PlasmoDB release 24
provider_version: Plasmodium berghei ANKA data from PlasmoDB release 24
release_date: 2015-05-14 19:05:45 +0900 (Thu, 14 May 2015)
resource_url: Plasmodium berghei ANKA data from PlasmoDB release 24
biocViews: AnnotationData, Genetics, TxDb, Plasmodium_berghei_ANKA

So, is there a way to make the package name be TxDb.PbergheiANKA when creating it with makeTxDbPackage?

PS: My motivation is several organisms with identical species/genus that end up with identical package names. Other examples:

url <- "http://plasmodb.org/common/downloads/Current_Release/Pyoeliiyoelii17XNL/gff/data/PlasmoDB-24_Pyoeliiyoelii17XNL.gff"

PPS: Obviously I could change the names manually, but I am looking for a possibly easier way.

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] grid      stats4    parallel  graphics  grDevices utils     datasets  stats     methods  
[10] base     

other attached packages:
 [1] BSgenome_1.36.0        rtracklayer_1.28.2     GenomicFeatures_1.20.1 AnnotationDbi_1.30.1  
 [5] GenomicRanges_1.20.3   GenomeInfoDb_1.4.0     gtable_0.1.2           motifTools_0.13.0     
 [9] knitr_1.10.5           ape_3.2                ggtree_1.0.7           Biostrings_2.36.1     
[13] XVector_0.8.0          IRanges_2.2.1          S4Vectors_0.6.0        Biobase_2.28.0        
[17] BiocGenerics_0.14.0    XML_3.98-1.1           fortunes_1.5-2         dplyr_0.4.1           
[21] tidyr_0.2.0            reshape2_1.4.1         ggplot2_1.0.1          lattice_0.20-31       
[25] devtools_1.8.0        

loaded via a namespace (and not attached):
 [1] locfit_1.5-9.1          colorspace_1.2-6        DBI_0.3.1              
 [4] BiocParallel_1.2.1      EBImage_4.10.0          lambda.r_1.1.7         
 [7] jpeg_0.1-8              plyr_1.8.2              stringr_1.0.0          
[10] zlibbioc_1.14.0         futile.logger_1.4.1     munsell_0.4.2          
[13] labeling_0.3            biomaRt_2.24.0          BiocInstaller_1.18.2   
[16] proto_0.3-10            Rcpp_0.11.6             scales_0.2.4           
[19] jsonlite_0.9.16         abind_1.4-3             Rsamtools_1.20.1       
[22] gridExtra_0.9.1         png_0.1-7               digest_0.6.8           
[25] stringi_0.4-1           tiff_0.1-5              tools_3.2.0            
[28] bitops_1.0-6            magrittr_1.5            lazyeval_0.1.10        
[31] RCurl_1.95-4.6          RSQLite_1.0.0           futile.options_1.0.0   
[34] MASS_7.3-40             assertthat_0.1          rstudioapi_0.3.1       
[37] fftwtools_0.9-7         rstudio_0.98.1103       GenomicAlignments_1.4.1
[40] nlme_3.1-120

ADD COMMENT • link updated 10.7 years ago by Marc Carlson ★ 7.2k • written 10.7 years ago by Diego Diez ▴ 760

score 2 · Answer 1 · 2015-05-19

Hi Marc,

I think the key problem here may be that the NCBI taxonomy does not appear to have the most up-to-date information for some organisms. This is particularly a problem when it comes to microbes, which often have numerous sub-species, strains, etc.

I think you are absolutely correct that it is important to have a consistent naming scheme, ideally based on some authority. At the same time, however, by doing so you may be preventing a large part of the community from being able to use the Bioconductor organisms packages in their work.

For example, Trypanosoma brucei TREU927 and Trypanosoma brucei Lister 427 are two of the most commonly studied strains of *T. brucei*, each of which is associated with its own very different genome sequence and gene and transcript annotations. Only the TREU927 strain, however, has been assigned its own taxonomy ID on NCBI.

This means that anyone studying the Lister 427 strain is unable to generate a database for their organism of interest and must find alternative methods to work with the data.

I suspect this is representative of the problems people will face when working with many different types of microbes.

score 1 · Answer 2 · 2015-05-14

Marc Carlson ★ 7.2k

There is a function deep down that generates the name from (among other things) the genus and species information provided. Until now, a major goal of that function has been to produce consistent names, but it seems that we should probably nudge it towards greater specificity instead. I will look into this.

Marc

ADD COMMENT • link 10.7 years ago Marc Carlson ★ 7.2k

Great, thanks. Note that for some of the organisms I have in mind there are even further details in their names, like "Plasmodium yoelii yoelii 17X": http://www.ncbi.nlm.nih.gov/bioproject/256942

ADD REPLY • link 10.7 years ago Diego Diez ▴ 760

I also think this would be very useful. At the moment I'm using a combination of `sed` and `sqlite` to modify some things, such as the database name, for which there are currently no parameters.

Perhaps you could create some guidelines for what you think would be the best way to name the databases? I modeled the naming after the most popular organism dbs already in use, but some parts of that don't apply to databases for non-model organisms (e.g. "ensGene" or "knownGene").

ADD REPLY • link 10.7 years ago Keith Hughitt ▴ 180

score 1 · Answer 3 · 2015-05-18

Marc Carlson ★ 7.2k

OK Diego, I have checked in a change to devel that should allow you to proceed. But I could use more information from you, as I can't test your specific case since you didn't give me enough information to fully reproduce it... :/

Also, for some of your other species names like "Plasmodium yoelii yoelii 17X" you will still have a problem, since that name is not an official subspecies designation for an NCBI tax ID. In that case you would have to specify what you mean a little bit better. Here is a link with tax IDs and "real" full subspecies names for that organism in NCBI's taxonomy ID listing.

http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=5861&lvl=3&lin=f&keep=1&srchmode=1&unlock

So as you can see these would both work (for example):

Plasmodium yoelii yoelii

Plasmodium yoelii yoelii 17XNL

But in either case, for Plasmodium yoelii yoelii you will need to specify which one you mean or we won't be able to work out what that taxonomy ID is supposed to be. And actually following the link you gave, it looks like you will want to say this:

Plasmodium yoelii 17X

Marc

ADD COMMENT • link 10.7 years ago Marc Carlson ★ 7.2k

Hi Marc, thank you. I have updated my question to add a specific example. Let me know if you need more details. Regarding the taxonomy, I meant exactly Plasmodium yoelii yoelii 17XNL. I know it does not show in the NCBI taxonomy, but that does not mean it is wrong. Please see here: Organisms. I guess it depends on how important the information in the NCBI taxonomy is for the generation of the package. I would prefer to have the flexibility to call the package whatever I want, rather than have a particular naming scheme enforced depending on whether there is official nomenclature or not. Does that make sense?

ADD REPLY • link 10.7 years ago Diego Diez ▴ 760

Sorry, I meant Plasmodium yoelii yoelii 17X in my comment above. As you can see in the link to PlasmoDB, there are three different species of Plasmodium yoelii used.

ADD REPLY • link 10.7 years ago Diego Diez ▴ 760

Hi Diego,

I would rather that you use only official organism names as it creates confusion if you use ones that are not. I just updated our taxonomy data for this purpose and after I check it in (this morning) you will have this 'official' list of names to choose from:

5861                    Plasmodium yoelii
5862            Plasmodium berghei yoelii
31274        Plasmodium yoelii nigeriensis
31274 Plasmodium yoelii subsp. nigeriensis
73239      Plasmodium yoelii subsp. yoelii
73239             Plasmodium yoelii yoelii
283801           Plasmodium yoelii killicki
352914       Plasmodium yoelii yoelii 17XNL
352914  Plasmodium yoelii yoelii str. 17XNL
1050261                 Plasmodium yoelii YM
1050262                 Plasmodium yoelii 17
1323249                Plasmodium yoelii 17X

I am considering also adding an argument so that you can specify the taxonomy ID separately. That way you could type in the organism as more of a free text field (but ONLY if you also provided a valid tax ID instead of a default argument of NULL). Basically, I agree that it is nice to try and allow people *some* creative license for custom packages. But I don't think it's a good idea for people to not have real taxonomy IDs and I also really don't like the idea of people getting creative with organism names either since that is meddling with data that ought to be guarded rigorously. Right now we use the Organism to determine the correct taxonomy ID, so right now this is a problem (since correct tax IDs are a must in the near future).

In short, I am sympathetic to your desire to customize the name of the package, but I think that messing with the organism name is probably not the right place to do that. :( Perhaps I should provide over-rides for both the taxonomy ID (as mentioned) and also provide an override for the package name too (for those who want that). But I have to think carefully before I start adding tons of new arguments to functions that already have too many arguments. So I will have to get back to you about these two arguments.

Marc

ADD REPLY • link 10.7 years ago Marc Carlson ★ 7.2k

Hi Marc,

Of course, you are right that following taxonomy definitions is the best way to systematically name an organism's package, particularly when making the package accessible to other researchers. But making it mandatory at package construction prevents the generation of packages for organisms not yet included in that taxonomy (of course it is possible to manually modify the package's name). Arguably, naming a package Plasmodium yoelii yoelii 17X would not confuse the Plasmodium research community. Keith's answer below has another good example.

ADD REPLY • link 10.7 years ago Diego Diez ▴ 760

score 1 · Answer 4 · 2015-05-20

Hi Keith,

I recognized the potential for that problem; I just wasn't sure whether it was actually happening to people out in the real world (most people still only work on mouse and human). And this is why I was mentioning the taxonomyId argument: especially for bacteria, the tree at NCBI is incomplete (and likely to always be so).

So in anticipation of this, last night I added a separate argument for the taxonomyId for the makeTxDbFromGFF function. The idea is that if you pass in the taxonomyId manually, then you no longer need to supply a perfectly named organism (that string will just be accepted 'as is' instead of used to look up a taxonomy id). The supplied taxonomy id will still be checked for validity, but only to make sure that it's really an actual taxonomy ID.

IOW I still think you should have a taxonomyId, even if it is only the 'closest' ID you can reasonably choose for your organism. So even if there is no ID for your specific strain, it's still useful to know that it is some kind of Trypanosoma brucei (and there is an ID for that: 5691).

So for your Trypanosoma case, you could now choose to supply the general taxonomy ID and then also supply an organism string that mentions your specific strain (and in this case the organism string would act as a free text field).
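As a rough illustration of that call, here is a minimal sketch. It assumes a GenomicFeatures version whose makeTxDbFromGFF accepts the taxonomyId argument described above (in recent Bioconductor releases the same constructor may be provided by a different package, so check the help page first). The wrapper name `build_strain_txdb`, the file name and the data source label are placeholders, not anything from this thread.

```r
# Sketch only: needs GenomicFeatures installed and a real GFF file to run.
# build_strain_txdb is a hypothetical helper, not part of any package.
build_strain_txdb <- function(gff_file, organism, taxonomy_id,
                              data_source = NA) {
  GenomicFeatures::makeTxDbFromGFF(
    file       = gff_file,
    dataSource = data_source,
    organism   = organism,     # treated as free text when taxonomyId is given
    taxonomyId = taxonomy_id   # closest valid NCBI taxonomy ID
  )
}

# Example call (hypothetical file name and data source label):
# txdb <- build_strain_txdb("TbruceiLister427.gff",
#                           organism    = "Trypanosoma brucei Lister 427",
#                           taxonomy_id = 5691,
#                           data_source = "TriTrypDB")
```

The point of the design is that the strain detail lives in the organism string while the taxonomy ID still ties the object to a recognised node in the NCBI tree.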

As for packages I also agree that you should be able to name those in the way that you want. So I am also looking into adding an argument to let you over-ride the auto-generated name for the package too.
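For reference, later GenomicFeatures releases document a `pkgname` argument on makeTxDbPackage for exactly this purpose. The sketch below assumes your installed version has it (check `?makeTxDbPackage`); the wrapper name and the maintainer and author strings are placeholders.

```r
# Sketch only: needs GenomicFeatures installed and an existing TxDb object.
# build_named_package is a hypothetical helper, not part of any package.
build_named_package <- function(txdb, pkgname, dest_dir = ".") {
  GenomicFeatures::makeTxDbPackage(
    txdb       = txdb,
    version    = "1.0.0",
    maintainer = "Your Name <you@example.org>",  # placeholder
    author     = "Your Name",                    # placeholder
    destDir    = dest_dir,
    pkgname    = pkgname   # overrides the auto-generated package name
  )
}

# Example call, using the name asked for in the question:
# build_named_package(txdb, pkgname = "TxDb.PbergheiANKA",
#                     dest_dir = "output/sequence")
```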

Thanks for the feedback,

Marc

score 1 · Answer 5 · 2015-05-21

Diego,

In case it is helpful to you, I put together some scripts for parsing files from different EuPathDB annotations in order to generate OrgDb, TxDb, and OrganismDb packages:

https://github.com/elsayed-lab/eupathdb-organismdb

At the moment, it is geared towards Bioconductor 3.0 and includes many ugly workarounds to handle renaming, etc. The plan is to update the code to work with Bioconductor 3.1 and remove some of the ugly workarounds once it is possible.

I have tested it on several TriTrypDB species, and one ToxoDB species, but have tried to make the code general enough to work with any of the other EuPathDB organisms.

Feel free to use it if you find it helpful and submit changes to Github that you think would be useful to others.

Best,

Keith

score 1 · Answer 6 · 2015-05-21

Actually Keith, I am trying to move things towards a world where these resources can just come down from the AnnotationHub. So for example you can already do this:

library(AnnotationHub)
ah = AnnotationHub()
ahs = subset(ah, ah$rdataclass=='OrgDb')

Which already has about 1100 OrgDb objects in it, and you can access these objects using the hub, so for example:

obj1 <- ahs[[1]]
obj1

Would download and cache the first of these OrgDb objects for you (based on NCBI resources). So if you have reliable code that parses and builds OrgDb objects out of other good resources, then it may make sense to formalize a simple recipe so that all of those OrgDb objects could also be available in the hub.

Similar things can also be done for TxDb and soon for OrganismDb objects. And I want all of those kinds of things to be available in the hub. If you are interested in contributing a recipe please see our vignettes here:

http://bioconductor.org/packages/devel/bioc/html/AnnotationHub.html

As for your package names, we had a similar conversation about that earlier this week. We chose not to use '.'s to break apart parts of an organism's name, since that implies separate fields, and we also didn't use underscores in the TxDb names since those don't work in package names. BUT: you can use underscores if it's just an object in the AnnotationHub, so which characters are available depends on your use case. Anyhow, what we settled on for TxDbs was to append the subspecies names using camel case (awful for this kind of thing, I know).

So for example: Homo sapiens sapiens

Becomes: HsapiensSapiens

Anyhow, that is what the code in GenomicFeatures currently does with extended species names. IMHO it's not pretty, but at least the information is there and won't break as a package name. As for Org packages, if you look beyond the initial 18 DBs that were offered as packages, you will see that once we went to putting OrgDb objects into the Hub (which meant MANY more names that had to be specified), we basically had to stop abbreviating the genus and species names. And since these are never going to be packaged (just in the hub), you end up with things that look like "org.Pseudomonas_mendocina_NK-01.eg.sqlite".
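To make the camel-case rule for TxDb names concrete, here is a small base-R sketch. It is an illustration of the rule as described, not the actual GenomicFeatures code; the function name `abbreviate_organism` is hypothetical, and stripping non-alphanumeric characters from the trailing words is an extra assumption made so the result stays usable in a package name.

```r
# Hypothetical helper illustrating the naming rule: genus initial,
# lower-case species, then any remaining words appended in camel case.
abbreviate_organism <- function(organism) {
  parts <- strsplit(trimws(organism), "[[:space:]]+")[[1]]
  stopifnot(length(parts) >= 2)

  genus_initial <- toupper(substr(parts[1], 1, 1))
  species       <- tolower(parts[2])

  # Everything after genus and species (subspecies, strain, ...).
  extra <- parts[-(1:2)]
  # Assumption: drop characters that are not letters or digits.
  extra <- gsub("[^[:alnum:]]", "", extra)
  # Capitalise the first character of each word, leave the rest as given.
  extra <- paste0(toupper(substr(extra, 1, 1)), substring(extra, 2))

  paste0(genus_initial, species, paste(extra, collapse = ""))
}

abbreviate_organism("Homo sapiens sapiens")     # "HsapiensSapiens"
abbreviate_organism("Plasmodium berghei ANKA")  # "PbergheiANKA"
```

Prefixing the result with "TxDb." gives the kind of name discussed in the question; whether the real function produces exactly the same string for every input is not something this sketch can guarantee.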

And now you can hopefully see why I am so concerned about getting tax IDs labeled onto things. The older ways of using names are just not going to work out in the long run. There are too many names, they are already too similar to each other, and there are too many ways to creatively abbreviate them. From here on out we are just going to have to be a lot more regimented about metadata. Which brings me back to the AnnotationHub. The AnnotationHub has a metadata database with records that are kept for each of its ~34K (and growing) resources. This is going to be very important for finding things in the future. So if you have done the work to parse some valuable annotation resources into standard OrgDb, TxDb and OrganismDb objects, then by all means please consider giving us a recipe for those.
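As a sketch of what finding things by tax ID could look like, the helper below picks hub records by taxonomy ID and resource class. It assumes the hub metadata exposes `taxonomyid` and `rdataclass` columns through `$`, in the same way `ah$rdataclass` is used above; the function name `hub_index_for_taxid` is hypothetical.

```r
# Hypothetical helper: positions of hub records matching a taxonomy ID
# and a resource class. Works on anything with $taxonomyid and $rdataclass.
hub_index_for_taxid <- function(hub, tax_id, rdataclass = "OrgDb") {
  # which() drops NA comparisons, so records without a tax ID are skipped.
  which(hub$taxonomyid == tax_id & hub$rdataclass == rdataclass)
}

# Usage against the real hub (needs AnnotationHub and network access):
# library(AnnotationHub)
# ah <- AnnotationHub()
# ah[hub_index_for_taxid(ah, 5691)]
```

Matching on the numeric ID sidesteps the spelling and abbreviation variants that make name-based searches unreliable.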

Marc