Could someone please explain to me what the best practice is for working with gene annotation data in R, specifically with respect to the issue of working across datasets of different ages where the set of IDs is different between different versions of the annotations.
Trying to use specific annotation versions in OrgDb does not seem to be possible:
library(AnnotationHub)
hub <- AnnotationHub()
Dmel <- hub[["AH79576"]]
# Error: AH79576 is an OrgDb resource.
# orgDb resources are generated for specific biocversions.
# Requested resource works with biocversion: 3.11
What is the rationale for having only one version of the annotations per biocversion? How am I supposed to use OrgDb then?
For context I am working with Drosophila data and the IDs are quite unstable / change a lot depending on what version of annotations you are using. You discard a fair number of genes in your analysis if your dataset and annotations are not matched properly. I had tried to use OrgDb with clusterProfiler and I noticed this issue.
In Python I just use gffutils to create a database from a gtf file which is derived from a specific version of the annotations that matches my dataset exactly. Is there a way to do something similar in R that most people use?
Could you please explain why they made the decision not to allow you to download the version of annotations that you need with a given version of Bioc? That seems very restrictive to me. Why does the annotation version have anything to do with the version of R/Bioconductor? Is it computationally too difficult for some reason to allow the user to choose among all the possibilities? I can't speak for everyone, but it would be much easier for me if the annotation version and biocversion were separate and you could download whatever annotation version that you need for a project. For example, what if you needed to run an analysis with a package only available in R4.2 and you needed annotations from R4.0? Or what if you are working with multiple datasources in the same analysis which is running the same R session?
Any other advice for how to get around this issue? I haven't tried yet, but it I guess it's possible for me to build my own OrgDb? Do you think that's what people do?
Maybe it's not as big an issue for mammals, but for flies, often a significant amount of the data will get dropped because the IDs don't match when you don't use the appropriate annotation version.
Can you explain further how data gets dropped because IDs don't match? Are you saying that for flies the NCBI Gene IDs are changing all the time? That doesn't jibe with my experience.
I do not know if the RefSeq IDs are changing all the time, but the Flybase/Ensembl IDs are changing all the time! Most fly people I know use Flybase (or Ensembl, which uses specific Flybase versions for their releases). As far as I understand, Flybase changes the gene ID any time there is an update to the gene model, even for something as trivial as adding a new transcript isoform. This means there are several changes each version where a gene now has a different ID.
In my specific example 52/7838 genes were not able to be converted from Ensembl99 to RefSeq ID using a recent version of OrgDb (hub[["AH100407"]]). Let's take a look at the first four: "FBgn0027341" "FBgn0028993" "FBgn0029526" "FBgn0031629". If you look these up in the current version of Flybase with the ID validator tool, you can see that #1, 2, & 4 indeed now have new IDs compared to annotations which are just a couple years old. That's why the newest version of OrgDb doesn't work for them. Looks like #3 is non-coding RNA, so it might not be annotated in RefSeq and that's OK. But for most of the genes, it simply doesn't find them because it's not the right version.
In fact, if I compare the underlying databases between 3.11 and 3.16, there are only a handful of NCBI Gene IDs that differ. There are 20 Gene IDs that existed in the 3.11 db that aren't in the 3.16 version, and 109 IDs that are in the 3.16 version that aren't in the 3.11 version. Out of over 25,000 total IDs. And there are 176 gene symbols that have changed in the intervening period. But I don't see any wholesale changes in the
OrgDb
packages.But maybe I am not understanding the issue correctly?
If I compare the flybase IDs between the two databases, there are only 22 that are different, in addition to the 129 NCBI IDs that have changed. And you are correct about the four you mention:
The only Gene ID -> flybase mapping that remains is for FBgn0029526
But the issue you are highlighting has to do with changes that happen through time, between annotation services. The
OrdDb
packages use the NCBI Gene ID as the central key, and any mappings you might want to do will have to go through that ID. So as an example, if you wanted to convert an Ensembl ID to a Flybase ID, what would happen is the Ensembl ID is mapped to the corresponding NCBI ID, and then that ID is mapped to the Flybase ID. If all you care about are Ensembl IDs and Flybase IDs, you shouldn't be using anOrgDb
!It's taken years for NCBI and Ensembl to come up with a single transcript per human gene that they can agree upon. Just one per gene! Because of these differences, it is exceedingly difficult to map between NCBI and Ensembl, let alone adding in Flybase. So if the IDs you care about are Ensembl and Flybase, you should be using the
biomaRt
package to do the mapping.I understand that it's difficult to keep track of mapping between different databases. However, I do believe that there is a OrgDb database that corresponds (or at least most closely corresponds) to the version that has the mapping to the version of Ensembl you're using. Not all tools in Bioconductor seem to accept a biomaRt annotation package I think(?). Will need to look into that for my specific use cases. Thank you for the suggestion!
In response to above, Ensembl ID are the same as Flybase IDs for the specific version that they state (i.e. Ensembl99= Flybase 6.28). I can't explain why you only see 22 IDs dropped/25k where I see 52/7838. Nevertheless, I still think it's not good that these genes are getting dropped. If you look even just at the 4 I showed, 3/4 are named genes with known functions. They shouldn't get dropped. It's not correct.
Btw, you still haven't answered my question about the design decision not to allow the user to choose their desired annotation versions and locking it to the bioc version. Is there a reason for this?
The genes aren't getting dropped by us. We are simply packaging the available annotation data that we can download from NCBI as of a certain date. If NCBI or Flybase or Ensembl change the underlying data, then the
OrgDb
will reflect that. It so happens that Gene ID 33678 is now annotated to FBgn0287234, so if you ask the current DB for the Flybase ID for 33678, you will get FBgn0287234 and vice versa.Saying that is incorrect is itself incorrect. If you try to map Flybase IDs that are no longer Flybase IDs and you get a result, that is incorrect because the Flybase ID no longer exists. You might personally have data with deprecated IDs, but that has nothing to do with the current state of the annotation.
I have already answered your question about why you can only get one annotation package per species. It's because the annotation package is supposed to be temporally distinct, tied as all Bioconductor package are, to a certain release. And it is supposed to reflect the current state of the annotation data as of the time that the version of Bioconductor was released.
An even better reason is that there have been like 30 Bioconductor releases, and for each release there were dozens to hundreds of
OrgDb
packages that were generated, and over that time the underlying structure of theOrgDb
packages has changed, with consequent changes to theAnnotationDbi
package. Just making theOrgDb
packages for a given release is a non-trivial affair, let alone trying to update old packages and ensuring they work with the current release. There aren't enough people in Bioc-core to do that.And the alternative seems simple enough. All you have to do is install R-4.0.0 and the relevant Bioconductor packages if you want old annotation data.
I'm sorry that I wasn't clear -- I don't mean that anything you're doing is incorrect; I meant that I don't think that me using the wrong version of annotations for my data analysis is correct. Or at least it could be 'correct-er':) Sorry if my wording caused offence.
OK, I see that making sure everything would work across so many releases would be too difficult. I guess I just thought before running into this problem that there would be a way to choose which snapshot date you wanted from the annotation hub, regardless of your R version. I guess it's a lot more complicated than that due to changes in the structure of the packages, as you've explained.
Thank you for suggestions and the discussion!