Whilst running a `RIPSeeker` analysis, I noticed that the dataset `drerio_gene_ensembl`, which used to be available via `biomaRt`, is no longer listed or accessible. To test this, I first upgraded my Bioconductor installation to make sure I was working with the latest version of `biomaRt` (2.34.0).
```r
source("https://bioconductor.org/biocLite.R")  # defines biocLite()
biocLite("BiocUpgrade")
```
I then followed the instructions in the [vignette](https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/biomaRt.html) and connected to `ensembl`:
```r
library("biomaRt")
```
```
## Loading required package: methods
```
```r
ensembl <- useMart("ensembl")
dat <- listDatasets(ensembl)
str(dat)
```
```
## 'data.frame': 33 obs. of 3 variables:
## $ dataset :Class 'AsIs' chr [1:33] "amelanoleuca_gene_ensembl" "dordii_gene_ensembl" "mpahari_gene_ensembl" "trubripes_gene_ensembl" ...
## $ description:Class 'AsIs' chr [1:33] "Panda genes (ailMel1)" "Kangaroo rat genes (Dord_2.0)" "Shrew mouse genes (PAHARI_EIJ_v1.1)" "Fugu genes (FUGU 4.0)" ...
## $ version :Class 'AsIs' chr [1:33] "ailMel1" "Dord_2.0" "PAHARI_EIJ_v1.1" "FUGU 4.0" ...
```
```r
dim(dat)
```
```
## [1] 33 3
```
```r
dat[grepl("hsapiens", dat$dataset),]
```
```
## [1] dataset description version
## <0 rows> (or 0-length row.names)
```
```r
dat[grepl("drerio", dat$dataset),]
```
```
## [1] dataset description version
## <0 rows> (or 0-length row.names)
```
The tutorial lists 85 datasets, whereas I now retrieve only 33 to 50, depending on the run. Weirdly, I noticed that the numbers changed when I repeated this, so I wrapped the call in a loop and repeated the analysis several times:
```r
for (i in 1:10){
print(paste("cycle:",i))
ensembl <- useMart("ensembl")
dat <- listDatasets(ensembl)
print(dim(dat))
print(paste("Is drerio present?", "drerio_gene_ensembl" %in% dat$dataset))
}
```
```
## [1] "cycle: 1"
## [1] 33 3
## [1] "Is drerio present? FALSE"
## [1] "cycle: 2"
## [1] 50 3
## [1] "Is drerio present? FALSE"
## [1] "cycle: 3"
## [1] 50 3
## [1] "Is drerio present? FALSE"
## [1] "cycle: 4"
## [1] 33 3
## [1] "Is drerio present? FALSE"
## [1] "cycle: 5"
## [1] 50 3
## [1] "Is drerio present? FALSE"
## [1] "cycle: 6"
## [1] 33 3
## [1] "Is drerio present? FALSE"
## [1] "cycle: 7"
## [1] 46 3
## [1] "Is drerio present? FALSE"
## [1] "cycle: 8"
## [1] 50 3
## [1] "Is drerio present? FALSE"
## [1] "cycle: 9"
## [1] 45 3
## [1] "Is drerio present? FALSE"
## [1] "cycle: 10"
## [1] 46 3
## [1] "Is drerio present? FALSE"
```
The number of datasets listed varies from run to run. Importantly for me, `drerio_gene_ensembl` was missing in all the tests except one.
This instability leads to:
- errors when using packages which depend on a connection to Ensembl, for instance `RIPSeeker`.
- reproducibility errors for anyone not using one of the stable datasets (I did not test which ones were always present, but hsapiens appears to be always available).
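As a stopgap for the flaky listing, one option is to retry until the dataset shows up. This is only a minimal sketch: `wait_for_dataset()` and its arguments are hypothetical names, not part of `biomaRt`, and the function that fetches the listing is passed in so the retry logic can be tried without a network connection.

```r
# Hypothetical helper: call `fetch()` repeatedly until `dataset` is listed.
# `fetch` must return a character vector of dataset names.
wait_for_dataset <- function(dataset, fetch, max_tries = 10, pause = 1) {
  for (i in seq_len(max_tries)) {
    # Treat a failed connection like an empty listing and try again
    available <- tryCatch(fetch(), error = function(e) character(0))
    if (dataset %in% available) {
      return(i)  # number of attempts that were needed
    }
    Sys.sleep(pause)
  }
  stop("'", dataset, "' not listed after ", max_tries, " attempts")
}

# Usage with biomaRt (needs network access, so left commented out):
# library("biomaRt")
# fetch_ensembl <- function() as.character(listDatasets(useMart("ensembl"))$dataset)
# wait_for_dataset("drerio_gene_ensembl", fetch_ensembl)
```

This does not fix the underlying instability; it only keeps a script from failing on the first incomplete listing.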
I "solved" the issue by using an archive host:
```r
host <- "oct2016.archive.ensembl.org"  # archive host name, without the http:// prefix
ensembl <- useMart("ensembl", host = host)
dat <- listDatasets(ensembl)
str(dat)
```
```
## 'data.frame': 69 obs. of 3 variables:
## $ dataset :Class 'AsIs' chr [1:69] "oanatinus_gene_ensembl" "cporcellus_gene_ensembl" "gaculeatus_gene_ensembl" "itridecemlineatus_gene_ensembl" ...
## $ description:Class 'AsIs' chr [1:69] "Ornithorhynchus anatinus genes (OANA5)" "Cavia porcellus genes (cavPor3)" "Gasterosteus aculeatus genes (BROADS1)" "Ictidomys tridecemlineatus genes (spetri2)" ...
## $ version :Class 'AsIs' chr [1:69] "OANA5" "cavPor3" "BROADS1" "spetri2" ...
```
```r
dim(dat)
```
```
## [1] 69 3
```
```r
dat[grepl("drerio", dat$dataset),]
```
```
## dataset description version
## 40 drerio_gene_ensembl Danio rerio genes (GRCz10) GRCz10
```
However, using older annotations is a bit of a hack. Has anything changed recently in Ensembl or `biomaRt` that explains the missing dataset and this instability?
```r
sessionInfo()
```
```
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] methods stats graphics grDevices utils datasets base
##
## other attached packages:
## [1] biomaRt_2.34.0 knitr_1.17
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.14 AnnotationDbi_1.40.0 magrittr_1.5
## [4] BiocGenerics_0.24.0 progress_1.1.2 IRanges_2.12.0
## [7] bit_1.1-12 R6_2.2.2 rlang_0.1.4
## [10] stringr_1.2.0 blob_1.1.0 tools_3.4.2
## [13] parallel_3.4.2 Biobase_2.38.0 DBI_0.7
## [16] assertthat_0.2.0 bit64_0.9-7 digest_0.6.12
## [19] tibble_1.3.4 S4Vectors_0.16.0 bitops_1.0-6
## [22] RCurl_1.95-4.8 memoise_1.1.0 RSQLite_2.0
## [25] evaluate_0.10.1 stringi_1.1.6 compiler_3.4.2
## [28] prettyunits_1.0.2 stats4_3.4.2 XML_3.98-1.9
```
@moderators: I could not format the post properly due to an error:
Language "fr" is not one of the supported languages ['en']!
The post was copy-pasted from a markdown document generated via `knitr`, so I have no idea what caused it.
Instead of using an old archived version, you could go for version 90 from August (http://Aug2017.archive.ensembl.org), which in many cases will have limited differences from the very latest release.
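Both archive hosts mentioned in this thread follow the same pattern: three-letter month, year, then `.archive.ensembl.org`. A minimal sketch of building such a host name is below; `archive_host()` is a hypothetical helper, and it only formats the name without checking that an archive actually exists for that month.

```r
# Hypothetical helper: format an Ensembl archive host name from month and year.
archive_host <- function(month, year) {
  month <- tolower(substr(month, 1, 3))      # "August" -> "aug"
  stopifnot(month %in% tolower(month.abb))   # reject anything that is not a month
  paste0(month, year, ".archive.ensembl.org")
}

archive_host("Aug", 2017)

# Usage with biomaRt (needs network access, so left commented out):
# library("biomaRt")
# ensembl <- useMart("ensembl", dataset = "drerio_gene_ensembl",
#                    host = archive_host("Aug", 2017))
```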
Good tip. In my case it was a little lazy, because I am also using the script to run some C. elegans data analysis and this will work for both - RIPSeeker needs `biomaRt`/Ensembl, so I need an archive version from before the move to WormBase. Another hack.