OrgDB download failed via AnnotationHub
@chiragparsania-13271
Last seen 18 months ago
Australia

Hi,

I tried to download an OrgDb object provided by FungiDB through AnnotationHub, but it failed; the commands and errors are below. Downloading GRanges objects works perfectly fine. Can anyone shed some light on why the OrgDb download fails?

library("AnnotationHub")
hub <- AnnotationHub()

> hub
AnnotationHub with 46259 records
# snapshotDate(): 2019-05-02 
# $dataprovider: BroadInstitute, Ensembl, UCSC, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, Haemcode, FungiDB, Inparanoid8, TriTrypDB, PlasmoDB, AmoebaDB
# $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Bos taurus, Rattus norvegicus, Pan troglodytes, Danio rerio, Gallus gallus, Mono...
# $rdataclass: GRanges, BigWigFile, TwoBitFile, OrgDb, Rle, ChainFile, EnsDb, Inparanoid8Db, TxDb, data.frame
# additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH5012"]]' 

            title                                                     
  AH5012  | Chromosome Band                                           
  AH5013  | STS Markers                                               
  AH5014  | FISH Clones                                               
  AH5015  | Recomb Rate                                               
  AH5016  | ENCODE Pilot                                              
  ...       ...                                                       
  AH73812 | org.Plasmodium_vivax.eg.sqlite                            
  AH73813 | org.Burkholderia_mallei_ATCC_23344.eg.sqlite              
  AH73814 | org.Bacillus_cereus_(strain_ATCC_14579_|_DSM_31).eg.sqlite
  AH73815 | org.Bacillus_cereus_ATCC_14579.eg.sqlite                  
  AH73816 | org.Schizosaccharomyces_cryophilus_OY26.eg.sqlite    

hub_subset <- query(hub , c("fungidb" ,"OrgDb"))

> hub_subset
AnnotationHub with 277 records
# snapshotDate(): 2019-05-02 
# $dataprovider: FungiDB
# $species: Naganishia albida, Albugo candida 2VRR, Albugo laibachii Nc14, Allomyces macrogynus ATCC 38327, Aphanomyces astaci, Aphanomyces invad...
# $rdataclass: OrgDb
# additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH71411"]]' 

            title                                                         
  AH71411 | Transcript information for Albugo candida 2VRR                
  AH71412 | Transcript information for Albugo laibachii Nc14              
  AH71413 | Transcript information for Allomyces macrogynus ATCC 38327    
  AH71414 | Transcript information for Aspergillus aculeatus ATCC 16872   
  AH71415 | Transcript information for Aspergillus brasiliensis CBS 101740
  ...       ...                                                           
  AH71937 | Transcript information for Phytophthora sojae P6497           
  AH71938 | Transcript information for Pythium vexans DAOM BR484          
  AH71939 | Transcript information for Saccharomyces cerevisiae S288c     
  AH71940 | Transcript information for Scedosporium apiospermum IHEM 14462
  AH71941 | Transcript information for Yarrowia lipolytica CLIB89 W29   


> hub_subset[["AH71940"]]
downloading 1 resources
retrieving 1 resource
Downloading: 240 B     
Error: failed to load resource
  name: AH71940
  title: Transcript information for Scedosporium apiospermum IHEM 14462
  reason: 1 resources failed to download
In addition: Warning messages:
1: download failed
  web resource path: ‘https://annotationhub.bioconductor.org/fetch/78686’
  local file path: ‘/Users/chirag/Library/Caches/AnnotationHub/25d6f57295d_78686’
  reason: Forbidden (HTTP 403). 
2: bfcadd() failed; resource removed
  rid: BFC16
  fpath: ‘https://annotationhub.bioconductor.org/fetch/78686’
  reason: download failed 
3: download failed
  hub path: ‘https://annotationhub.bioconductor.org/fetch/78686’
  cache resource: ‘AH71940 : 78686’
  reason: bfcadd() failed; see warnings() 


> hub_subset[["AH71412"]]
downloading 1 resources
retrieving 1 resource
Downloading: 240 B     
Error: failed to load resource
  name: AH71412
  title: Transcript information for Albugo laibachii Nc14
  reason: 1 resources failed to download
In addition: Warning messages:
1: download failed
  web resource path: ‘https://annotationhub.bioconductor.org/fetch/78158’
  local file path: ‘/Users/chirag/Library/Caches/AnnotationHub/25d435cd1c6_78158’
  reason: Forbidden (HTTP 403). 
2: bfcadd() failed; resource removed
  rid: BFC17
  fpath: ‘https://annotationhub.bioconductor.org/fetch/78158’
  reason: download failed 
3: download failed
  hub path: ‘https://annotationhub.bioconductor.org/fetch/78158’
  cache resource: ‘AH71412 : 78158’
  reason: bfcadd() failed; see warnings()
shepherl 3.8k
@lshep
Last seen 15 minutes ago
United States

There is an issue with the files. I have reached out to the maintainer of EuPathDb to hopefully get a resolution quickly.


Thanks for getting back to me. I will wait for your update.


Hi @Shepherl,

I wonder whether you have had any update from the maintainer.

Thanks.


Yes, and I am working with them on the solution. There was a naming mismatch with the files, and we are working on the re-upload.


Thanks! Looking forward to it.


While we are waiting for the re-upload: were you interested in any other AH IDs besides the two above? I might be able to implement a temporary workaround while the rest of the files are being processed.


The two above should now be downloadable. I made some manual changes while we wait for the datasets to be reloaded. If you need any more, please let me know.

> hub = AnnotationHub()
snapshotDate(): 2019-05-20
> hub_subset <- query(hub , c("fungidb" ,"OrgDb"))
> hub_subset[["AH71412"]]
downloading 1 resources
retrieving 1 resource
loading from cache 
    'AH71412 : 78158'


OrgDb object:
| DBSCHEMAVERSION: 2.1
| DBSCHEMA: NOSCHEMA_DB
| ORGANISM: Albugo laibachii Nc14
| SPECIES: Albugo laibachii Nc14
| CENTRALID: GID
| Taxonomy ID: 890382
| Db type: OrgDb
| Supporting package: AnnotationDbi

Please see: help('select') for usage information


I am interested in all the fungal data from FungiDB (both OrgDb and GRanges objects). I can wait until the upload finishes.

Thanks a lot. Cheers

shepherl 3.8k
@lshep
Last seen 15 minutes ago
United States

The maintainer has uploaded the new files. I believe everything should now be correct. If you have any further troubles please notify us here.


Thanks a lot. I will post here if I encounter any difficulty.

~C.


Hi,

I encountered the same error I reported before, but with a different AH ID:

loading from cache 
    ‘AH70681 : 77427’
downloading 1 resources
retrieving 1 resource
Downloading: 240 B     
Error: failed to load resource
  name: AH70682
  title: Transcript information for Coccidioides immitis RMSCC 2394
  reason: 1 resources failed to download
In addition: Warning messages:
1: download failed
  web resource path: ‘https://annotationhub.bioconductor.org/fetch/77428’
  local file path: ‘/Users/chirag/Library/Caches/AnnotationHub/11135888e414_77428’
  reason: Forbidden (HTTP 403). 
2: bfcadd() failed; resource removed
  rid: BFC87
  fpath: ‘https://annotationhub.bioconductor.org/fetch/77428’
  reason: download failed 
3: download failed
  hub path: ‘https://annotationhub.bioconductor.org/fetch/77428’
  cache resource: ‘AH70682 : 77428’
  reason: bfcadd() failed; see warnings()

A different error with a different AH ID:

 hub[["AH71458"]]
downloading 0 resources
loading from cache 
    ‘AH71458 : 78204’
Error: failed to load resource
  name: AH71458
  title: Transcript information for Coprinopsis cinerea okayama7 130
  reason: database disk image is malformed
In addition: Warning messages:
1: Couldn't set cache size: database disk image is malformed
Use `cache_size` = NULL to turn off this warning. 
2: Couldn't set synchronous mode: database disk image is malformed
  Use `synchronous` = NULL to turn off this warning.

===============================================================

Edit:

The problem above is solved once I download again with the force = TRUE argument. The exact command is: hub[["AH71458", force = TRUE]]
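
For anyone hitting the same "database disk image is malformed" error, the fallback above can be wrapped in a small helper. This is only a sketch: load_resource is a made-up name, not an AnnotationHub function, and it assumes that any loading error is worth exactly one forced re-download.

```r
# Hypothetical helper (not part of AnnotationHub): load a resource from the
# cache, and force a fresh download only if that first attempt fails.
load_resource <- function(hub, id) {
  tryCatch(
    hub[[id]],
    error = function(e) {
      message("Loading ", id, " failed: ", conditionMessage(e))
      message("Retrying with force = TRUE")
      # force = TRUE is the argument used in the command above
      hub[[id, force = TRUE]]
    }
  )
}
```

With a real hub this would be called as load_resource(AnnotationHub(), "AH71458"). A forced download only helps when the cached copy is damaged; it will not help with the HTTP 403 errors, which come from the server side.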


For this one, there was probably a disruption during the initial download that left a partial file in the cache. We will look into the other errors.


It seems the files were not uploaded for these 14 resources. I have again reached out to the maintainer to hopefully get the files uploaded. Sorry for the inconvenience.


Greetings, I spent some time hunting down the errors for these resources and found that they fall into two classes.

First. fungidb.org does not have transcript data for: Coccidioides.immitis.RMSCC.2394, Coccidioides.immitis.RMSCC.3703, Cryptococcus.neoformans.var.neoformans.B.3501A, Coccidioides.posadasii.CPA.0001, Coccidioides.posadasii.CPA.0020, Coccidioides.posadasii.CPA.0066, Coccidioides.posadasii.RMSCC.1037, Coccidioides.posadasii.RMSCC.1038, Coccidioides.posadasii.RMSCC.2133, Coccidioides.posadasii.RMSCC.3700, Naganishia.albida.NRRL.Y.1402, and Phytophthora.plurivora.AV1007. These should have been removed from the metadata I uploaded to AnnotationHub, but due to an error my filter failed; this has been corrected and the metadata regenerated.

Second. For a small number of species in the various eupathdb projects, including 3 from fungidb (Cryptococcus.neoformans.var.neoformans.B.3501A, Phanerochaete.chrysosporium.RP.78, and Phytophthora.capsici.LT1534), there are some utterly unexpected things in the data downloaded from the eupathdb, including random EOF entries in the middle of the data. I added logic to check for these strange cases and now have the OrgDb/TxDb/GRanges files for them.

I am committing the relevant changes now. If you wish I can upload the 18 or so new files at your leisure.


Could you please retry this resource? I was able to download it after changing the permissions on the file to public.

> ah[["AH70681"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

loading from cache 
    'AH70681 : 77427'
require("GenomicRanges")
GRanges object with 98134 ranges and 9 metadata columns:
          seqnames        ranges strand |   source            type     score
             <Rle>     <IRanges>  <Rle> | <factor>        <factor> <numeric>
      [1] DS016992     5125-5627      + | EuPathDB            gene      <NA>
      [2] DS016992     5125-5627      + | EuPathDB            mRNA      <NA>
      [3] DS016992     5125-5390      + | EuPathDB            exon      <NA>
      [4] DS016992     5558-5627      + | EuPathDB            exon      <NA>
      [5] DS016992     5125-5390      + | EuPathDB             CDS      <NA>
      ...      ...           ...    ... .      ...             ...       ...
  [98130] DS017001 354109-354201      + | EuPathDB three_prime_UTR      <NA>
  [98131] DS017007   77545-78036      - | EuPathDB            gene      <NA>
  [98132] DS017007   77545-78036      - | EuPathDB            mRNA      <NA>
  [98133] DS017007   77545-78036      - | EuPathDB            exon      <NA>
  [98134] DS017007   77545-78036      - | EuPathDB             CDS      <NA>
              phase                       ID          description
          <integer>              <character>          <character>
      [1]      <NA>               CIHG_04050 hypothetical protein
      [2]      <NA>         CIHG_04050-t26_1 hypothetical protein
      [3]      <NA>       exon_CIHG_04050-E1                 <NA>
      [4]      <NA>       exon_CIHG_04050-E2                 <NA>
      [5]         0 CIHG_04050-t26_1-p1-CDS1                 <NA>
      ...       ...                      ...                  ...
  [98130]      <NA>   utr_CIHG_05753-t26_1_1                 <NA>
  [98131]      <NA>               CIHG_06410 hypothetical protein
  [98132]      <NA>         CIHG_06410-t26_1 hypothetical protein
  [98133]      <NA>       exon_CIHG_06410-E1                 <NA>
  [98134]         0 CIHG_06410-t26_1-p1-CDS1                 <NA>
                    Parent   protein_source_id            Note
           <CharacterList>         <character> <CharacterList>
      [1]             <NA>                <NA>            <NA>
      [2]       CIHG_04050                <NA>            <NA>
      [3] CIHG_04050-t26_1                <NA>            <NA>
      [4] CIHG_04050-t26_1                <NA>            <NA>
      [5] CIHG_04050-t26_1 CIHG_04050-t26_1-p1            <NA>
      ...              ...                 ...             ...
  [98130] CIHG_05753-t26_1                <NA>            <NA>
  [98131]             <NA>                <NA>            <NA>
  [98132]       CIHG_06410                <NA>            <NA>
  [98133] CIHG_06410-t26_1                <NA>            <NA>
  [98134] CIHG_06410-t26_1 CIHG_06410-t26_1-p1            <NA>
  -------

If it still fails, could you please also provide the results of sessionInfo()?


Thanks, shepherl. ah[["AH70681"]] now works perfectly fine. However, the 14 resources mentioned above are still failing to download.

Below is a summary table of the FungiDB-provided OrgDb and GRanges resources that fail to download:

# A tibble: 14 x 7
   genome                         species                                         taxonomyid GRanges OrgDb   orgdb_cols gr_cols
   <chr>                          <chr>                                                <int> <chr>   <chr>   <list>     <list> 
 1 FungiDB-42_CimmitisRMSCC2394   Coccidioides immitis RMSCC 2394                     404692 AH70682 AH71445 <NULL>     <NULL> 
 2 FungiDB-42_CimmitisRMSCC3703   Coccidioides immitis RMSCC 3703                     454286 AH70683 AH71446 <NULL>     <NULL> 
 3 FungiDB-42_CneoformansB-3501A  Cryptococcus neoformans var. neoformans B-3501A     283643 AH70702 AH71465 <NULL>     <NULL> 
 4 FungiDB-42_CposadasiiCPA0001   Coccidioides posadasii CPA 0001                     469472 AH70686 AH71449 <NULL>     <NULL> 
 5 FungiDB-42_CposadasiiCPA0020   Coccidioides posadasii CPA 0020                     490068 AH70687 AH71450 <NULL>     <NULL> 
 6 FungiDB-42_CposadasiiCPA0066   Coccidioides posadasii CPA 0066                     490069 AH70688 AH71451 <NULL>     <NULL> 
 7 FungiDB-42_CposadasiiRMSCC1037 Coccidioides posadasii RMSCC 1037                   490065 AH70689 AH71452 <NULL>     <NULL> 
 8 FungiDB-42_CposadasiiRMSCC1038 Coccidioides posadasii RMSCC 1038                   490066 AH70690 AH71453 <NULL>     <NULL> 
 9 FungiDB-42_CposadasiiRMSCC2133 Coccidioides posadasii RMSCC 2133                   469470 AH70691 AH71454 <NULL>     <NULL> 
10 FungiDB-42_CposadasiiRMSCC3700 Coccidioides posadasii RMSCC 3700                   469471 AH70693 AH71456 <NULL>     <NULL> 
11 FungiDB-42_NalbidaNRRLY1402    Naganishia albida                                   100951 AH70773 AH71536 <NULL>     <NULL> 
12 FungiDB-42_PcapsiciLT1534      Phytophthora capsici LT1534                         763924 AH70734 AH71497 <NULL>     <NULL> 
13 FungiDB-42_PchrysosporiumRP-78 Phanerochaete chrysosporium RP-78                   273507 AH70732 AH71495 <NULL>     <NULL> 
14 FungiDB-42_PplurivoraAV1007    Phytophthora plurivora                              639000 AH70774 AH71537 <NULL>     <NULL>
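
A table like this could be assembled by attempting each download and recording the outcome. The sketch below is not the code that produced the table above; check_downloads is a hypothetical helper, and it assumes that a failed load raises an ordinary R error, as in the transcripts in this thread.

```r
# Hypothetical helper: try to load every ID and record "ok" or the error text.
check_downloads <- function(hub, ids) {
  status <- vapply(ids, function(id) {
    tryCatch({
      hub[[id]]          # triggers the download (or a cache load)
      "ok"
    }, error = function(e) conditionMessage(e))
  }, character(1))
  data.frame(ah_id = ids, status = unname(status), stringsAsFactors = FALSE)
}
```

With a real hub, check_downloads(hub, c("AH70682", "AH71445")) would return one row per ID, which makes it easy to report the complete list of failures in one go.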

Apart from the download problem, one more thing I would like to add: no data are provided for the species Candida glabrata, although it is present in the online version of FungiDB.


Hello,

I have encountered the same error posted here while trying to download data for Trypanosoma brucei brucei TREU927. Kindly check it out too. Below is the code. Please see the session info in the reply to this post as it exceeds 5000 characters with the session info included.

> library('AnnotationHub')
> ah <- AnnotationHub()
> res <- query(ah, c('Trypanosoma brucei brucei TREU927', 'OrgDb', 'EuPathDB'))
> res
AnnotationHub with 2 records
# snapshotDate(): 2019-05-02 
# $dataprovider: TriTrypDB
# $species: Trypanosoma brucei brucei TREU927
# $rdataclass: OrgDb
# additional mcols(): taxonomyid, genome, description, coordinate_1_based,
#   maintainer, rdatadateadded, preparerclass, tags, rdatapath, sourceurl,
#   sourcetype 
# retrieve records with, e.g., 'object[["AH71682"]]' 

            title                                                       
  AH71682 | Transcript information for Trypanosoma brucei brucei TREU927
  AH72083 | Transcript information for Trypanosoma brucei brucei TREU927

> orgdb <- res[['AH71682']]
downloading 1 resources
retrieving 1 resource
Downloading: 240 B     
Error: failed to load resource
  name: AH71682
  title: Transcript information for Trypanosoma brucei brucei TREU927
  reason: 1 resources failed to download
In addition: Warning messages:
1: download failed
  web resource path: ‘https://annotationhub.bioconductor.org/fetch/78428’
  local file path: ‘/home/wanjau/.cache/AnnotationHub/136336f14bc0_78428’
  reason: Forbidden (HTTP 403). 
2: bfcadd() failed; resource removed
  rid: BFC12
  fpath: ‘https://annotationhub.bioconductor.org/fetch/78428’
  reason: download failed 
3: download failed
  hub path: ‘https://annotationhub.bioconductor.org/fetch/78428’
  cache resource: ‘AH71682 : 78428’
  reason: bfcadd() failed; see warnings()

> orgdb <- res[['AH72083']]
downloading 1 resources
retrieving 1 resource
Downloading: 240 B     
Error: failed to load resource
  name: AH72083
  title: Transcript information for Trypanosoma brucei brucei TREU927
  reason: 1 resources failed to download
In addition: Warning messages:
1: download failed
  web resource path: ‘https://annotationhub.bioconductor.org/fetch/78829’
  local file path: ‘/home/wanjau/.cache/AnnotationHub/136325e2c571_78829’
  reason: Forbidden (HTTP 403). 
2: bfcadd() failed; resource removed
  rid: BFC11
  fpath: ‘https://annotationhub.bioconductor.org/fetch/78829’
  reason: download failed 
3: download failed
  hub path: ‘https://annotationhub.bioconductor.org/fetch/78829’
  cache resource: ‘AH72083 : 78829’
  reason: bfcadd() failed; see warnings()

Thanks!


Here is the session info:

> sessionInfo()
    R version 3.6.0 (2019-04-26)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Ubuntu 18.04.2 LTS

    Matrix products: default
    BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
    LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

    Random number generation:
     RNG:     Mersenne-Twister 
     Normal:  Inversion 
     Sample:  Rounding 

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
     [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
    [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

    attached base packages:
    [1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

    other attached packages:
    [1] edgeR_3.26.1         limma_3.40.0         AnnotationHub_2.16.0 BiocFileCache_1.8.0 
    [5] dbplyr_1.4.2         BiocGenerics_0.30.0 

    loaded via a namespace (and not attached):
     [1] Rcpp_1.0.1                    pillar_1.4.0                 
     [3] compiler_3.6.0                BiocManager_1.30.4           
     [5] later_0.8.0                   tools_3.6.0                  
     [7] digest_0.6.18                 bit_1.1-14                   
     [9] lattice_0.20-38               RSQLite_2.1.1                
    [11] memoise_1.1.0                 tibble_2.1.1                 
    [13] pkgconfig_2.0.2               rlang_0.3.4                  
    [15] shiny_1.3.2                   DBI_1.0.0                    
    [17] rstudioapi_0.10               curl_4.0                     
    [19] yaml_2.2.0                    dplyr_0.8.1                  
    [21] httr_1.4.0                    IRanges_2.18.0               
    [23] S4Vectors_0.22.0              rappdirs_0.3.1               
    [25] grid_3.6.0                    locfit_1.5-9.1               
    [27] stats4_3.6.0                  bit64_0.9-7                  
    [29] tidyselect_0.2.5              Biobase_2.44.0               
    [31] glue_1.3.1                    R6_2.4.0                     
    [33] AnnotationDbi_1.46.0          purrr_0.3.2                  
    [35] blob_1.1.1                    magrittr_1.5                 
    [37] promises_1.0.1                htmltools_0.3.6              
    [39] assertthat_0.2.1              xtable_1.8-4                 
    [41] mime_0.6                      interactiveDisplayBase_1.22.0
    [43] httpuv_1.5.1                  crayon_1.3.4

Thank you for pointing this out. We have been experiencing some inconsistency with the annotations added from the contributed package EuPathDB and are working with the maintainer @abelew to get it corrected. Some of these resources were added although no data were actually generated, and they could be removed. I will leave it to @abelew to respond on whether the species reported here will be included or removed. Temporarily we will be removing the resources from the database and re-entering a clean set to hopefully remedy the issue. We are sorry for any confusion or inconvenience this has caused and hope to have the resources you are interested in available in the hub as soon as possible.


Thanks a lot for the update. I look forward to the remedy of the issue.


One more thing I would like to ask about is AnnotationHub::snapshotDate(). The timestamp of the online version of AnnotationHub is 2019-08-02 12:51:11 +0000. However, when I run AnnotationHub::snapshotDate() locally, the timestamp is snapshotDate(): 2019-05-02. Why is there a discrepancy? Is this correct?

P.S. I checked the timestamp after running hub <- AnnotationHub(), so this is not a matter of using a stale local copy.

Thanks.


The 2019-05-02 date is correct for the release (3.9) version of Bioconductor, and 2019-08-02 is correct for the devel (3.10) version. The online version always shows devel, while the R code filters and displays records based on the version you are running. Resources associated with a different version or a new version of a package are kept in sync with the version of Bioconductor in which the new resources are added.
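
As a rough illustration of that filtering, here is a toy sketch in base R. It is a simplification and not the actual AnnotationHub implementation: the record IDs and their dates are invented, visible_records is a hypothetical helper, and the only rule modelled is "a record is visible if it was added on or before the snapshot date".

```r
# Toy metadata table; the IDs and dates are made up for illustration.
records <- data.frame(
  ah_id = c("AH_old", "AH_spring", "AH_summer"),
  rdatadateadded = as.Date(c("2018-10-30", "2019-04-29", "2019-07-15")),
  stringsAsFactors = FALSE
)

# Snapshot dates quoted in this thread for each Bioconductor version.
snapshots <- c("3.9" = "2019-05-02", "3.10" = "2019-08-02")

# Keep only the records added on or before the snapshot date of a version.
visible_records <- function(records, bioc_version) {
  cutoff <- as.Date(snapshots[[bioc_version]])
  records[records$rdatadateadded <= cutoff, ]
}

release_view <- visible_records(records, "3.9")   # what a release user sees
devel_view   <- visible_records(records, "3.10")  # what a devel user sees
```

On a real hub, snapshotDate(hub) reports the date in use and possibleDates(hub) lists the dates that are valid for the running version.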


OK, thanks!

"Temporarily we will be removing the resources from the database and re-entering a clean set to hopefully remedy the issue"

Regarding the comment quoted above, could you please post an update here once the database has been updated? Thanks a lot.


Yes, I will update. I removed the current datasets from the database. The maintainer has regenerated the data and I am performing some checks now. I hope to have them uploaded by the end of the week, and I will post back when they have been added back in.


Hi Shepherl,

Just checking whether there is any update on AnnotationHub regarding the data from EuPathDB.

Chirag.


Yes we are still working on the issue. We hope it will be resolved and the resources will be re-added by the end of the week or early next week.


Thanks for the update. Looking forward to it.


We have added the files back in and they should be available.
There is one problematic file that we are aware of:

  name: AH74302
  title: Transcript information for Entamoeba dispar SAW760
  reason: database disk image is malformed

We are investigating and/or regenerating that file, but the others should be available.


Thanks a lot. I will look into that.


Hi Shepherl,

Sorry for bothering you again. I found a version discrepancy in the data provided by FungiDB; see the example below.

I compared the version given in the sourceurl column with the one in the genome column. They do not agree with each other, so I am unsure which version a given AnnotationHub object (AH ID) belongs to.

As you can see below, although the release is 42, the genome is version 45.

Correct me if I am missing something here.

library(AnnotationHub)
library(tidyverse)
hub <- AnnotationHub()
#> snapshotDate(): 2019-05-02

message(c("Total number of object in the hub : ",   hub@.db_uid %>% length() ))
#> Total number of object in the hub : 45228


hub_master_tbl <- tibble(ah_id = hub$ah_id , 
                         title = hub$title , 
                         dataprovider = hub$dataprovider , 
                         species = hub$species , 
                         taxonomyid = hub$taxonomyid , 
                         genome = hub$genome , 
                         description = hub$description , 
                         coordinate_1_based = hub$coordinate_1_based , 
                         maintainer = hub$maintainer , 
                         rdatadateadded = hub$rdatadateadded , 
                         preparerclass = hub$preparerclass , 
                         tags = hub$tags , 
                         rdataclass = hub$rdataclass , 
                         rdatapath = hub$rdatapath, 
                         sourceurl = hub$sourceurl , 
                         sourcetype = hub$sourcetype) 

## data from fungi DB
hub_fungidb <- hub_master_tbl %>% filter(dataprovider == "FungiDB")

## type of rdataclass objects 
hub_fungidb %>% group_by(rdataclass) %>% tally()
#> # A tibble: 2 x 2
#>   rdataclass     n
#>   <chr>      <int>
#> 1 GRanges      114
#> 2 OrgDb        137


## discrepancy between  column `sourceurl` and column `genome`
hub_fungidb %>% dplyr::select(sourceurl,  genome) %>% 
  mutate(release = map_chr(sourceurl , ~( str_split( .x , "\\/")[[1]] [6])  ) ) %>% 
  mutate(genome_version = map_chr(genome , ~( str_split( .x , "\\_")[[1]] [1])  ) ) %>%
  dplyr::select(release , genome_version) %>% unique()
#> # A tibble: 2 x 2
#>   release         genome_version
#>   <chr>           <chr>         
#> 1 Current_Release FungiDB-39    
#> 2 release-42      FungiDB-45

Created on 2019-09-19 by the reprex package (v0.3.0)


I did not generate the data, so I will ask the maintainer to respond here. @abelew


Hello Chirag. Unfortunately, when the eupathdb moved to the current REST API for downloading annotation data, the ability to specify the version for that information was lost. I wrote to eupathdb.org perhaps two years ago asking how I might specify the version, and at the time it was impossible, so I set that question aside. I will write again and see if that has changed. In contrast, the TxDb and GRanges are generated directly from the downloaded GFF files, so it is trivial to specify their version IDs. I regenerated the OrgDb data on August 7 and therefore it downloaded version 45 data. I am sorry for the confusion.


Hi abelew, thanks for the clarification.

Another question I have concerns the GO data in the OrgDb object. If, as you say, the OrgDb was generated from version 45 (which is also the current online version of FungiDB), why are the genes annotated to GO terms so drastically different?

See the example below.

I filtered three GO terms (GO:0048315, GO:0005840, GO:0019748) for Aspergillus nidulans using AnnotationHub and compared the total number of genes assigned to each with the FungiDB GO data. Surprisingly, only one GO term (GO:0048315) matched, while the other two terms show a large difference in the number of assigned genes.

GO:0019748 has 660 genes assigned by FungiDB, while in the OrgDb object only 125 genes are assigned to it. GO:0005840 has 151 genes assigned by FungiDB, while in the OrgDb object only 125 genes are assigned to it.

Could you please clarify why there is this discrepancy? Or please correct me if I am missing something.

library(AnnotationHub)
library(tidyverse)
hub <- AnnotationHub()
#> snapshotDate(): 2019-05-02


message(c("Total number of object in the hub : ",   hub@.db_uid %>% length() ))
#> Total number of object in the hub : 45228


hub_master_tbl <- tibble(ah_id = hub$ah_id , 
                         title = hub$title , 
                         dataprovider = hub$dataprovider , 
                         species = hub$species , 
                         taxonomyid = hub$taxonomyid , 
                         genome = hub$genome , 
                         description = hub$description , 
                         coordinate_1_based = hub$coordinate_1_based , 
                         maintainer = hub$maintainer , 
                         rdatadateadded = hub$rdatadateadded , 
                         preparerclass = hub$preparerclass , 
                         tags = hub$tags , 
                         rdataclass = hub$rdataclass , 
                         rdatapath = hub$rdatapath, 
                         sourceurl = hub$sourceurl , 
                         sourcetype = hub$sourcetype) 

## data from fungi DB
hub_fungidb <- hub_master_tbl %>% filter(dataprovider == "FungiDB")


## Find A nidulans orgdb  AH ID 
an_ah_id <- hub_fungidb %>% filter(grepl("nidulans" , species) & (rdataclass == "OrgDb")) %>% pull(ah_id)


## Get orgdb
orgdb <- hub[[an_ah_id]]

## columns to extract 
## get GO data 

cols_to_pull <- c("GID" , "GO_ID" , "GO_TERM_NAME" , "GO_EVIDENCE_CODE", "GO_ONTOLOGY")

go_data <-  orgdb %>% 
  AnnotationDbi::select(columns = cols_to_pull , keys = keys(orgdb) , keytype = "GID") %>% 
  as_tibble()
#> 'select()' returned 1:many mapping between keys and columns

## count genes in each go   
gene_count_to_go <- go_data  %>% group_by(GO_ID) %>% 
  tally(sort = T , name = "number_of_genes")  %>% 
  left_join(go_data %>% dplyr::select(GO_ID , GO_TERM_NAME) ) %>% 
  drop_na() %>% 
  unique() 
#> Joining, by = "GO_ID"


## genes assigned to term GO:0048315 (conidium formation)
gene_count_to_go %>% filter(GO_ID == "GO:0048315")
#> # A tibble: 1 x 3
#>   GO_ID      number_of_genes GO_TERM_NAME      
#>   <chr>                <int> <chr>             
#> 1 GO:0048315             130 conidium formation

## genes assigned to term GO:0005840 (ribosome)
gene_count_to_go %>% filter(GO_ID == "GO:0005840")
#> # A tibble: 1 x 3
#>   GO_ID      number_of_genes GO_TERM_NAME
#>   <chr>                <int> <chr>       
#> 1 GO:0005840             125 ribosome

## genes assigned to term  GO:0019748 (secondary metabolic process)
gene_count_to_go %>% filter(GO_ID == "GO:0019748")
#> # A tibble: 1 x 3
#>   GO_ID      number_of_genes GO_TERM_NAME               
#>   <chr>                <int> <chr>                      
#> 1 GO:0019748             125 secondary metabolic process

Created on 2019-09-22 by the reprex package (v0.3.0)


TL;DR: The fungidb website returns to you the union of the GO and GOSLIM tables. As of now, the OrgDb data that I generate only has the GO table. While answering below, I figured out how to get the GOSLIM data and will be adding it in future revisions.

Here is the long version:

Greetings! I am not certain that I have a complete answer, but I can quickly identify a couple of things which are relevant. Focusing on secondary metabolic process (GO:0019748), I wanted first to duplicate your query, which retrieves 660 genes. In an attempt to do so, I sent my web browser to fungidb.org and put the string "GO:0019748" into the 'Gene Text Search' box, which returned 699 genes, of which 98 are in Aspergillus nidulans. I therefore assumed this is not your query. In a separate search, I went to https://fungidb.org/fungidb/showQuestion.do?questionFullName=GeneQuestions.GenesByGoTerm and put in "GO:0019748"; this netted me 660 entries. So I think I can safely assume that is your query.

I then asked the web interface to show me all 660 entries in one table on my left screen. On my right screen, I asked for the first 100 godata entries for all genes along with the first 30 annotation table entries. The first FungiDB entry for GO:0019748 (secondary metabolic process) is AN0014, whose putative annotation is "protein of unknown function". Checking my top-level annotations for A. nidulans, that matches my AN0014. Checking the godata, I see that in my version of the data AN0014 is in groups GO:0008150 (biological process), GO:0044550 (secondary metabolite biosynthetic process), GO:0005575 (cellular component), and GO:0003674 (molecular function). So 3 of the 4 annotations are basically meaningless for our purposes. However, GO:0044550 provides (I think) the answer to your question: in the GO hierarchy, it is a child node of both GO:0019748 (your query) and GO:0009058 (biosynthetic process).

Here is the catch: when I download the GO tables from the various EuPathDB data sources, the table I receive (and therefore dump into the OrgDb) contains the GO annotations at whatever level of the GO tree they were assigned, not the levels above it. For that information, I think you need to cross-reference against GO.db, GOSLIM, or similar.
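To make the cross-referencing idea concrete, here is a minimal sketch in base R using a hand-made toy hierarchy. Only the two parent links of GO:0044550 and the gene AN0014 come from the discussion above; the key column name `GID`, the second gene `gene_hypothetical`, and the helpers `ancestors_of()` and `genes_for_term()` are made up for illustration. In a real analysis the ancestor lookup would more likely come from GO.db (for example its `GOBPANCESTOR` map) than from a hand-written table.

```r
# Toy child -> parent edges; both links are the ones mentioned in the thread.
go_parents <- data.frame(
  child  = c("GO:0044550", "GO:0044550"),
  parent = c("GO:0019748", "GO:0009058"),
  stringsAsFactors = FALSE
)

# Direct annotations as stored in the OrgDb (most specific terms only).
# AN0014 is from the thread; gene_hypothetical and the GID column are invented.
direct <- data.frame(
  GID   = c("AN0014", "gene_hypothetical"),
  GO_ID = c("GO:0044550", "GO:0019748"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collect every ancestor of a term by walking the edges.
ancestors_of <- function(term, edges) {
  found <- character(0)
  todo <- term
  while (length(todo) > 0) {
    parents <- edges$parent[edges$child %in% todo]
    todo <- setdiff(parents, found)
    found <- union(found, parents)
  }
  found
}

# Hypothetical helper: genes annotated to a term directly or via a descendant.
genes_for_term <- function(term, annot, edges) {
  hits <- vapply(
    annot$GO_ID,
    function(id) id == term || term %in% ancestors_of(id, edges),
    logical(1)
  )
  unique(annot$GID[hits])
}

genes_for_term("GO:0019748", direct, go_parents)
```

With only the direct annotations, a filter on GO:0019748 would miss AN0014; after propagating up the tree, both genes are picked up.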

With that in mind, I think we can find a more explicit answer by clicking on the AN0014 entry in the GO table. That returns the full set of annotations for this gene, and clicking on the little section '15' on the left (Function Prediction) shows exactly what is going on. The first table is the set of EC numbers assigned to this gene (there are none). The second is the set of GO SLIM entries (and lo, there is GO:0019748 in the third row, GO Slim ID column!). The third table is exactly the data I download to create the OrgDb, in the columns prefixed GO_ (i.e. the columns you retrieved in your select() to create go_data). I have not yet written a function to download GOSLIM data from the EuPathDB. I am not sure they have a REST query for it yet: it is not listed in the set of available services, and I think I tried a long time ago and failed.

I have a new email chain with the kind folks at the EuPathDB in which I asked (again) about querying for a specific version of the data (they cannot); I will ask about this tonight. Actually, I recently figured out how to get more information about things like the linkout tables -- let me try modifying that query... Oh! It worked! I can now add the GOSLIM information as a separate table; the missing IDs are in the column 'GOSLIMGOID'. Unfortunately, to get the full 660 IDs you will need to query the GO table and the GOSLIM table separately; otherwise I think AnnotationDbi will throw an error because of how it merges rows from separate tables. I will add the GOSLIM data in the next round of OrgDb creations. If you are desperate for it now, I can regenerate the data and send you a link to the sqlite files and/or OrgDb packages, and you can make a local AnnotationHub instance using them.
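The "query the two tables separately, then combine" step could look roughly like the sketch below. The two data frames stand in for two separate select() results; the column names GO_ID and GOSLIMGOID follow the thread, while the key column `GID` and the genes `gene_b` and `gene_c` are invented for illustration. AN0014's two rows mirror what was described above (GO:0044550 in the GO table, GO:0019748 in the GOSLIM table).

```r
# Stand-ins for two separate select() calls against the same OrgDb.
# GID, gene_b and gene_c are hypothetical.
go_tbl <- data.frame(
  GID   = c("AN0014", "gene_b"),
  GO_ID = c("GO:0044550", "GO:0019748"),
  stringsAsFactors = FALSE
)
goslim_tbl <- data.frame(
  GID        = c("AN0014", "gene_c"),
  GOSLIMGOID = c("GO:0019748", "GO:0019748"),
  stringsAsFactors = FALSE
)

term <- "GO:0019748"

# Filter each table on its own, then take the union of the gene IDs.
from_go     <- go_tbl$GID[go_tbl$GO_ID == term]
from_goslim <- goslim_tbl$GID[goslim_tbl$GOSLIMGOID == term]
genes <- union(from_go, from_goslim)
genes
```

Keeping the two lookups separate avoids asking for columns from both tables in a single select() call, which is where the row-merging problem mentioned above would arise.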


Dear abelew,

Thanks so much for clarifying all my doubts. I am not in a hurry to get the GOSLIM data. I can wait until the next round of OrgDb creations.

Thanks, Chirag.


Do you also need to include the goslim information, or just have an easy way to map between GO IDs and GO slim IDs, e.g. left_join(<result of select()>, <goslim representation mapping between go and goslim ids>)? This would be useful in general, not just for these resources.
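As a rough illustration of that left_join() idea, the sketch below joins a stand-in for the select() result against a small GO-to-slim mapping table. The tibbles, the key column `GID`, the gene `gene_x`, and the column name `GOSLIM_ID` are all assumptions made for the example; only the GO:0044550 to GO:0019748 relationship and the AN0014 annotation are taken from the thread.

```r
library(dplyr)

# Stand-in for the result of select() on the OrgDb (GID and gene_x are invented).
go_data <- tibble(
  GID   = c("AN0014", "gene_x"),
  GO_ID = c("GO:0044550", "GO:0005840")
)

# Hypothetical, deliberately incomplete GO -> GO slim mapping table.
go_to_slim <- tibble(
  GO_ID     = "GO:0044550",
  GOSLIM_ID = "GO:0019748"
)

# Keep every annotation row; rows without a slim mapping get NA.
annotated <- left_join(go_data, go_to_slim, by = "GO_ID")
annotated
```

Because it is a left join, annotations whose term has no entry in the mapping are kept with a missing slim ID rather than dropped.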


Tangentially, your initial commands (hub to tibble) can be written as

mcols(hub) %>% as_tibble(rownames="ah_id")

and the discovery phase could be written using AnnotationHub::query() directly.


> query(hub, c("FungiDB", "nidulans", "OrgDb"))
AnnotationHub with 1 record
# snapshotDate(): 2019-09-17
# names(): AH74340
# $dataprovider: FungiDB
# $species: Aspergillus nidulans FGSC A4
# $rdataclass: OrgDb
# $rdatadateadded: 2019-09-17
# $title: Transcript information for Aspergillus nidulans FGSC A4
# $description: FungiDB 42 annotations for Aspergillus nidulans FGSC A4
# $taxonomyid: 227321
# $genome: FungiDB-45_AnidulansFGSCA4
# $sourcetype: GFF
# $sourceurl: http://FungiDB.org/common/downloads/release-42/AnidulansFGSCA4...
# $sourcesize: NA
# $tags: c("Annotation", "Eukaryote", "EuPathDB", "Fungi", "Fungus",
#   "Parasite", "Pathogen")
# retrieve record with 'object[["AH74340"]]'
> query(hub, c("FungiDB", "nidulans", "OrgDb"))[[1]]

I don't know anything about FungiDB, but I'm a little perplexed by the description / genome / sourceurl discrepancy and don't really follow the explanation in a previous question...; maybe more generally I see

> mcols(query(hub, c("FungiDB", "nidulans"))) %>%
       as_tibble(rownames="ahid") %>%
       select(ahid, rdataclass, description, genome)

# A tibble: 4 x 4
  ahid    rdataclass description                               genome
  <chr>   <chr>      <chr>                                     <chr>
1 AH65324 GRanges    FungiDB 39 transcript information for As… FungiDB-39_Anidu…
2 AH74026 GRanges    FungiDB 42 transcript information for As… FungiDB-45_Anidu…
3 AH74340 OrgDb      FungiDB 42 annotations for Aspergillus n… FungiDB-45_Anidu…
4 AH74665 TxDb       FungiDB 42 Transcript information for As… FungiDB-45_Anidu…

This seems like a great resource so I look forward to getting the kinks (in the data or my understanding!) worked out...


Greetings! When I was poking around in my own R session, I primarily did AnnotationHub::query() followed by mcols() just because that is what I first learned; so I found it neat to see a different way of thinking about how to filter the ah instance.

With respect to the version number discrepancy, the full reason is this: I explicitly set the version for the GRanges/TxDb/BSgenome data because I can download any specific revision of those from the EuPathDB. All data structures are created at the same time, so that version number was carried along into the OrgDb data (this was intended to facilitate making an OrganismDbi out of the union of the OrgDb and TxDb data), even though the REST API has no way to ask for a specific version. Because of the various problems I had in getting the version 42 data completed, I regenerated it multiple times in the interval, and so the data passed beyond the original version. For future generations, I removed the version number parameter, so the OrgDb version ID will now just be whatever the eupathdb.org REST API returns.


Thanks for the explanation. I also appreciated seeing an alternative way of working with the hub objects, especially opportunities for using dplyr / tibble in this context.


Thanks @Martin. You reduced a significant chunk of the code.

Cheers, Chirag.
