Why is OrgDb linked to specific biocversion? / How to analyze old datasets?
1
0
Entering edit mode
MK • 0
@user-24137
Last seen 15 months ago
United Kingdom

Could someone please explain to me what the best practice is for working with gene annotation data in R, specifically with respect to the issue of working across datasets of different ages where the set of IDs is different between different versions of the annotations.

Trying to use specific annotation versions in OrgDb does not seem to be possible:

library(AnnotationHub)
hub <- AnnotationHub()
Dmel <- hub[["AH79576"]]
# Error: AH79576 is an OrgDb resource.
# orgDb resources are generated for specific biocversions.
# Requested resource works with biocversion: 3.11

What is the rationale for having only one version of the annotations per biocversion? How am I supposed to use OrgDb then?

For context I am working with Drosophila data and the IDs are quite unstable / change a lot depending on what version of annotations you are using. You discard a fair number of genes in your analysis if your dataset and annotations are not matched properly. I had tried to use OrgDb with clusterProfiler and I noticed this issue.

In Python I just use gffutils to create a database from a gtf file which is derived from a specific version of the annotations that matches my dataset exactly. Is there a way to do something similar in R that most people use?

BiocVersion OrgDb clusterProfiler • 1.5k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States

The goal with all of the OrgDb packages is to split the difference between being as up to date as possible and providing consistent results over a span of time. To do that, we provide annotation packages that are based on the current data from approximately 2-3 weeks prior to the release, and those data remain static for the six months that a given release is current. In your example, an annotation package was built using the current data from NCBI as of April 2021, and that package is available as part of the Bioconductor 3.11 release. In October of that year we built another one that was current as of October 2021, and it persisted for the next six months.

It is not uncommon for people to want to use archival data, and if they want to do so, they can always use an old version of R/Bioconductor. As an example, when I have to re-run some dusty old analysis from 2021 and the PI won't like gene symbols changing or whatever, I will use R-4.0.0 and Bioc 3.11. As somebody who regularly needs to do such things, I have R/Bioc installations that go back many years that I can fire up at will.

ADD COMMENT
0
Entering edit mode

Could you please explain why they made the decision not to allow you to download the version of annotations that you need with a given version of Bioc? That seems very restrictive to me. Why does the annotation version have anything to do with the version of R/Bioconductor? Is it computationally too difficult for some reason to allow the user to choose among all the possibilities? I can't speak for everyone, but it would be much easier for me if the annotation version and biocversion were separate and you could download whatever annotation version that you need for a project. For example, what if you needed to run an analysis with a package only available in R4.2 and you needed annotations from R4.0? Or what if you are working with multiple datasources in the same analysis which is running the same R session?

Any other advice for how to get around this issue? I haven't tried yet, but it I guess it's possible for me to build my own OrgDb? Do you think that's what people do?

Maybe it's not as big an issue for mammals, but for flies, often a significant amount of the data will get dropped because the IDs don't match when you don't use the appropriate annotation version.

ADD REPLY
0
Entering edit mode

Can you explain further how data gets dropped because IDs don't match? Are you saying that for flies the NCBI Gene IDs are changing all the time? That doesn't jibe with my experience.

ADD REPLY
0
Entering edit mode

I do not know if the RefSeq IDs are changing all the time, but the Flybase/Ensembl IDs are changing all the time! Most fly people I know use Flybase (or Ensembl, which uses specific Flybase versions for their releases). As far as I understand, Flybase changes the gene ID any time there is an update to the gene model, even for something as trivial as adding a new transcript isoform. This means there are several changes each version where a gene now has a different ID.

In my specific example 52/7838 genes were not able to be converted from Ensembl99 to RefSeq ID using a recent version of OrgDb (hub[["AH100407"]]). Let's take a look at the first four: "FBgn0027341" "FBgn0028993" "FBgn0029526" "FBgn0031629". If you look these up in the current version of Flybase with the ID validator tool, you can see that #1, 2, & 4 indeed now have new IDs compared to annotations which are just a couple years old. That's why the newest version of OrgDb doesn't work for them. Looks like #3 is non-coding RNA, so it might not be annotated in RefSeq and that's OK. But for most of the genes, it simply doesn't find them because it's not the right version.

ADD REPLY
0
Entering edit mode

In fact, if I compare the underlying databases between 3.11 and 3.16, there are only a handful of NCBI Gene IDs that differ. There are 20 Gene IDs that existed in the 3.11 db that aren't in the 3.16 version, and 109 IDs that are in the 3.16 version that aren't in the 3.11 version. Out of over 25,000 total IDs. And there are 176 gene symbols that have changed in the intervening period. But I don't see any wholesale changes in the OrgDb packages.

But maybe I am not understanding the issue correctly?

ADD REPLY
0
Entering edit mode

If I compare the flybase IDs between the two databases, there are only 22 that are different, in addition to the 129 NCBI IDs that have changed. And you are correct about the four you mention:

> dbGetQuery(old, "select flybase_id, gene_id from flybase inner join genes using(_id) where flybase_id in ('FBgn0027341','FBgn0028993','FBgn0029526','FBgn0031629');")
   flybase_id gene_id
1 FBgn0031629   33678
2 FBgn0027341 3354921
3 FBgn0028993 3355151
4 FBgn0029526 8673968
> dbGetQuery(new, "select flybase_id, gene_id from flybase inner join genes using(_id) where flybase_id in ('FBgn0027341','FBgn0028993','FBgn0029526','FBgn0031629');")
   flybase_id gene_id
1 FBgn0029526 8673968

The only Gene ID -> flybase mapping that remains is for FBgn0029526

But the issue you are highlighting has to do with changes that happen through time, between annotation services. The OrdDb packages use the NCBI Gene ID as the central key, and any mappings you might want to do will have to go through that ID. So as an example, if you wanted to convert an Ensembl ID to a Flybase ID, what would happen is the Ensembl ID is mapped to the corresponding NCBI ID, and then that ID is mapped to the Flybase ID. If all you care about are Ensembl IDs and Flybase IDs, you shouldn't be using an OrgDb!

It's taken years for NCBI and Ensembl to come up with a single transcript per human gene that they can agree upon. Just one per gene! Because of these differences, it is exceedingly difficult to map between NCBI and Ensembl, let alone adding in Flybase. So if the IDs you care about are Ensembl and Flybase, you should be using the biomaRt package to do the mapping.

ADD REPLY
0
Entering edit mode

I understand that it's difficult to keep track of mapping between different databases. However, I do believe that there is a OrgDb database that corresponds (or at least most closely corresponds) to the version that has the mapping to the version of Ensembl you're using. Not all tools in Bioconductor seem to accept a biomaRt annotation package I think(?). Will need to look into that for my specific use cases. Thank you for the suggestion!

In response to above, Ensembl ID are the same as Flybase IDs for the specific version that they state (i.e. Ensembl99= Flybase 6.28). I can't explain why you only see 22 IDs dropped/25k where I see 52/7838. Nevertheless, I still think it's not good that these genes are getting dropped. If you look even just at the 4 I showed, 3/4 are named genes with known functions. They shouldn't get dropped. It's not correct.

Btw, you still haven't answered my question about the design decision not to allow the user to choose their desired annotation versions and locking it to the bioc version. Is there a reason for this?

ADD REPLY
0
Entering edit mode

The genes aren't getting dropped by us. We are simply packaging the available annotation data that we can download from NCBI as of a certain date. If NCBI or Flybase or Ensembl change the underlying data, then the OrgDb will reflect that. It so happens that Gene ID 33678 is now annotated to FBgn0287234, so if you ask the current DB for the Flybase ID for 33678, you will get FBgn0287234 and vice versa.

Saying that is incorrect is itself incorrect. If you try to map Flybase IDs that are no longer Flybase IDs and you get a result, that is incorrect because the Flybase ID no longer exists. You might personally have data with deprecated IDs, but that has nothing to do with the current state of the annotation.

I have already answered your question about why you can only get one annotation package per species. It's because the annotation package is supposed to be temporally distinct, tied as all Bioconductor package are, to a certain release. And it is supposed to reflect the current state of the annotation data as of the time that the version of Bioconductor was released.

An even better reason is that there have been like 30 Bioconductor releases, and for each release there were dozens to hundreds of OrgDb packages that were generated, and over that time the underlying structure of the OrgDb packages has changed, with consequent changes to the AnnotationDbi package. Just making the OrgDb packages for a given release is a non-trivial affair, let alone trying to update old packages and ensuring they work with the current release. There aren't enough people in Bioc-core to do that.

And the alternative seems simple enough. All you have to do is install R-4.0.0 and the relevant Bioconductor packages if you want old annotation data.

ADD REPLY
0
Entering edit mode

I'm sorry that I wasn't clear -- I don't mean that anything you're doing is incorrect; I meant that I don't think that me using the wrong version of annotations for my data analysis is correct. Or at least it could be 'correct-er':) Sorry if my wording caused offence.

OK, I see that making sure everything would work across so many releases would be too difficult. I guess I just thought before running into this problem that there would be a way to choose which snapshot date you wanted from the annotation hub, regardless of your R version. I guess it's a lot more complicated than that due to changes in the structure of the packages, as you've explained.

Thank you for suggestions and the discussion!

ADD REPLY

Login before adding your answer.

Traffic: 639 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6