why is downloading the .tar.gz file for a given bioconductor package, then installing with R CMD INSTALL so much more effective than BiocManager::install()
Ndimensional ▴ 20
@vlaufer-14169
Last seen 16 months ago
United States
# consider the following:
LibList <- c('this', 'that', 'theother', ... , 'libraryN - 1', 'library N')
# keep only the packages that are not installed yet
LibsToInstall <- LibList[ !(LibList %in% rownames(installed.packages())) ]
BiocManager::install(LibsToInstall, update=TRUE, ask=FALSE)
# attach every package in the list
lapply(LibList, library, character.only = TRUE)

Irrespective of the version of R/Bioconductor, as LibList grows beyond 10, the probability of an installation failure only increases; by the time N=20, in my experience, the probability of at least one failure approaches 1 (and more often you will be looking at 2 to 3 failures).

By contrast, on the command line, irrespective of OS (e.g. Mac, various flavors of Linux)

cd RlibPathDir
wget https://bioconductor.org/packages/release/bioc/src/contrib/this_1.40.2.tar.gz
wget https://bioconductor.org/packages/release/bioc/src/contrib/that_1.40.2.tar.gz
wget https://bioconductor.org/packages/release/bioc/src/contrib/the_other_1.40.2.tar.gz
...
wget https://bioconductor.org/packages/release/bioc/src/contrib/LibraryN.tar.gz
R CMD INSTALL ./this_1.40.2.tar.gz

almost never fails (all I can really say here is that I have not experienced a failure to date, but I have worked with many versions of R, RStudio, Bioconductor, etc.).

What is BiocManager::install doing that creates such a high failure rate? Is there any prebuilt functionality that will take the above approach instead of whatever the default behavior of BiocManager::install() is?


I was shown an example of this recently where more than 100 packages were being installed on a Docker image. One or two packages would fail to download, with the error message being something along the lines of 'unable to find host'. This implies to me that there was a temporary internet connectivity issue on the wireless network, and in general an unreliable internet connection seems like a likely candidate for intermittent failed downloads. What is your internet connection like to bioconductor.org?
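To see how a small per-download failure rate compounds over many packages, here is a back-of-the-envelope sketch. It assumes every download fails independently with the same probability p; the value p = 0.01 and the function name p_any_failure are purely illustrative, not measurements. Note that n counts dependencies as well as the packages named explicitly.

```r
# Probability that at least one of n independent downloads fails,
# given a per-download failure probability p (illustrative assumption).
p_any_failure <- function(n, p) 1 - (1 - p)^n

round(p_any_failure(c(10, 20, 100), p = 0.01), 3)
#> [1] 0.096 0.182 0.634
```

Under these assumptions, a 1% chance of a dropped download already means that an install of 100 packages fails more often than not.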

The way the failure is handled is not useful -- the installation continues to download all packages, then fails to install the failed package and its reverse dependencies. The user then tries again, but the failed package and reverse dependencies need to be downloaded again (though not the packages that were successfully installed).

BiocManager::install() delegates this to R's install.packages(), so the underlying issue / solution might be there. Several ideas are to investigate alternative download methods, and to cache downloaded tarballs. I believe that install.packages() uses download.packages() and in turn download.file(). download.packages() has a destdir argument that might be exploited to avoid re-downloading; download.file() has a method argument that could be explored.

For method, the help page ?download.file says that 'libcurl' is used by default, but on some OSes (including the poster's?) wget is also available. The help page suggests one way of setting this: options(download.file.method = "wget").

I'm not sure how destdir could be exploited easily by the 'user' -- I don't think there's a way to tell install.packages() to only download packages that have not yet been downloaded.
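One way destdir might be used by hand, as a minimal sketch: download tarballs into a persistent directory with download.packages(), skip the ones already present, and install from the local files. The helper names (missing_from_cache, install_from_cache) and the cache path are hypothetical, and the sketch assumes tarballs follow the usual <package>_<version>.tar.gz naming. It does not resolve dependencies or installation order, and it does not notice a truncated tarball left by an interrupted download.

```r
# Hypothetical helper: which of `pkgs` have no source tarball in `cache_dir` yet?
# Assumes tarballs are named <package>_<version>.tar.gz.
missing_from_cache <- function(pkgs, cache_dir) {
  tarballs <- list.files(cache_dir, pattern = "\\.tar\\.gz$")
  cached <- sub("_.*$", "", tarballs)  # package names cannot contain "_"
  setdiff(pkgs, cached)
}

# Hypothetical helper: download only what is missing, then install from the
# local files. Needs network access and BiocManager; nothing runs until called.
install_from_cache <- function(pkgs, cache_dir = "~/bioc-tarballs") {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  todo <- missing_from_cache(pkgs, cache_dir)
  if (length(todo) > 0) {
    utils::download.packages(todo, destdir = cache_dir,
                             repos = BiocManager::repositories(),
                             type = "source")
  }
  tarballs <- list.files(cache_dir, pattern = "\\.tar\\.gz$", full.names = TRUE)
  keep <- sub("_.*$", "", basename(tarballs)) %in% pkgs
  # repos = NULL makes install.packages() treat its input as local files,
  # so dependencies are NOT fetched automatically.
  utils::install.packages(tarballs[keep], repos = NULL, type = "source")
}
```

Calling install_from_cache() a second time after a failed download would then fetch only the tarballs that are still missing.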

I think both method= and destdir= could be used as arguments to BiocManager::install() and would be passed to install.packages().

Update -- yes, adding method = "wget" to BiocManager::install() changes the download method. Also true with destdir =, but still not sure how that might help...

...and after a little more digging R also has options(download.file.extra=...), which passes command-line arguments to the download method. For wget, passing -c might mean that partial downloads are continued, and existing downloads are not re-downloaded; there is also -N and options for retries, but I am not a wget expert. Also, since the download file location is constant per-session, it seems reasonable if a bit hacky to do

options(download.file.method = "wget", download.file.extra = "-c")
pkgs <- c(....)
BiocManager::install(pkgs)

and repeat BiocManager::install(pkgs) as necessary -- successfully downloaded packages won't be re-downloaded; successfully installed packages won't be re-installed.
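The "repeat as necessary" step could be wrapped in a small loop. This is only a sketch: install_with_retry is a hypothetical helper, not part of BiocManager, and the installer and is_installed arguments exist so the loop can be tried out without touching the network.

```r
# Hypothetical helper: call `installer()` on whatever is still missing until
# everything is installed or `max_tries` attempts have been made.
install_with_retry <- function(pkgs,
                               installer = function(p) BiocManager::install(p, ask = FALSE),
                               is_installed = function(p) p %in% rownames(utils::installed.packages()),
                               max_tries = 3) {
  remaining <- pkgs
  for (i in seq_len(max_tries)) {
    remaining <- remaining[!is_installed(remaining)]
    if (length(remaining) == 0) break
    message("Attempt ", i, ": installing ", length(remaining), " package(s)")
    # Failed downloads usually surface as warnings; tolerate errors too.
    try(installer(remaining), silent = TRUE)
  }
  remaining <- remaining[!is_installed(remaining)]
  invisible(remaining)  # packages that are still not installed, if any
}
```

Used after the options() call above, failed <- install_with_retry(pkgs) would return the names of any packages that still could not be installed.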


Can you give some examples of the type of error(s) you're encountering when using BiocManager::install() e.g. is it missing R package dependencies, missing system libraries, corrupted tarball downloads, problems with existing 00LOCK-pkg folders in the library directory, multiple parallel installations that encounter race conditions etc?
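Of the causes listed, leftover lock directories are the easiest to check for. A minimal sketch, where find_stale_locks is a hypothetical helper name and the library path defaults to the first entry of .libPaths():

```r
# Hypothetical helper: list 00LOCK* entries left behind in a library directory
# by an interrupted installation.
find_stale_locks <- function(lib = .libPaths()[1]) {
  list.files(lib, pattern = "^00LOCK", full.names = TRUE)
}

find_stale_locks()
```

An empty result means no lock directories are present in that library. Any that are listed should only be deleted once no other R process is installing into the same library.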

There are lots of ways package installation can fail, but I don't feel the behaviour you describe is my typical experience. Certainly not a 50% failure rate if I try to install N=2 packages. Personally I think I'd find collecting all the dependencies to install manually using R CMD INSTALL considerably more work and more error-prone.

For example, I regularly install hundreds of packages using BiocManager, e.g. for MSMB-Quarto, and it generally works fine.

There are other packages that try to manage package installation (amongst other things), for example pak or renv.


Sorry about the N=2. That was meant to read 20 (following the 10 in the previous sentence), my apologies there.

The provided link to MSMB-Quarto issues a 404 for me.

Re: R CMD INSTALL - I wish that were true, I would sure as hell prefer to be doing this through Bioconductor.

Re: the types of errors, please see reply below.

Thanks for taking the time to reply.


Hi, I don't see information on types of errors. Feel free to attach a report with a reproducible event and full information on BiocManager::version(), BiocManager::valid(), and sessionInfo(). There are many ways for R package installation to trigger adverse events. BiocManager::install was created to help reduce the risk of adverse events, and it introduces certain constraints to help users avoid inconsistencies. Your report is important to us, but without sufficient information we cannot provide more assistance.
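The requested information could be gathered into one text file along these lines. This is a sketch only: collect_install_report, its include_bioc argument, and the default file name are hypothetical, while the BiocManager::version(), BiocManager::valid(), and sessionInfo() calls are the ones named above.

```r
# Hypothetical helper: write the requested diagnostics to a text file that can
# be pasted into a support post.
collect_install_report <- function(file = "bioc-install-report.txt",
                                   include_bioc = requireNamespace("BiocManager", quietly = TRUE)) {
  report <- c("## sessionInfo()",
              utils::capture.output(print(utils::sessionInfo())))
  if (include_bioc) {
    # Both calls may need network access; try() records an error instead of stopping.
    report <- c(report,
                "## BiocManager::version()",
                utils::capture.output(print(try(BiocManager::version(), silent = TRUE))),
                "## BiocManager::valid()",
                utils::capture.output(print(try(suppressWarnings(BiocManager::valid()), silent = TRUE))))
  }
  writeLines(report, file)
  invisible(file)
}
```

The full console output of the failing BiocManager::install() call would still need to be added by hand.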


Sorry, I forgot that MSMB-Quarto was a private repo. It's the workflow that builds www.huber.embl.de/msmb and currently installs 120 named packages + dependencies. Here's the current list:

url("https://www.huber.embl.de/msmb/msmb_packages.rds") |>
  readRDS() |>
  base::`$`("packages") |>
  unique()
#>   [1] "BiocManager"                  "Biostrings"                   "BSgenome.Celegans.UCSC.ce2"   "BSgenome"                     "BSgenome.Ecoli.NCBI.20080805" "BSgenome.Hsapiens.UCSC.hg19" 
#>   [7] "dplyr"                        "ggplot2"                      "Gviz"                         "HardyWeinberg"                "igraph"                       "markovchain"                 
#>  [13] "Renext"                       "seqLogo"                      "vcd"                          "AnnotationDbi"                "Biobase"                      "biovizBase"                  
#>  [19] "colorspace"                   "GenomicRanges"                "ggbeeswarm"                   "ggbio"                        "ggridges"                     "ggthemes"                    
#>  [25] "grid"                         "Hiiragi2013"                  "Hmisc"                        "magrittr"                     "mouse4302.db"                 "pheatmap"                    
#>  [31] "plotly"                       "RColorBrewer"                 "reshape2"                     "rgl"                          "tibble"                       "bootstrap"                   
#>  [37] "flexmix"                      "HistData"                     "mixtools"                     "modeltools"                   "mosaics"                      "mosaicsExample"              
#>  [43] "tidyr"                        "cluster"                      "clusterExperiment"            "dada2"                        "dbscan"                       "flowCore"                    
#>  [49] "flowPeaks"                    "flowViz"                      "fpc"                          "ggcyto"                       "gplots"                       "graphics"                    
#>  [55] "gridExtra"                    "kernlab"                      "labeling"                     "limma"                        "MASS"                         "readr"                       
#>  [61] "scRNAseq"                     "vegan"                        "airway"                       "DESeq2"                       "fdrtool"                      "gganimate"                   
#>  [67] "IHW"                          "magick"                       "transformr"                   "ade4"                         "factoextra"                   "GGally"                      
#>  [73] "phyloseq"                     "SummarizedExperiment"         "xcms"                         "genefilter"                   "matrixStats"                  "pasilla"                     
#>  [79] "shiny"                        "vsn"                          "diffusionMap"                 "ggrepel"                      "LPCM"                         "MSMB"                        
#>  [85] "photobiology"                 "PMA"                          "rnaturalearth"                "rnaturalearthdata"            "Rtsne"                        "scatterplot3d"               
#>  [91] "sf"                           "sva"                          "xtable"                       "ape"                          "BioNet"                       "DECIPHER"                    
#>  [97] "DLBCL"                        "ggnetwork"                    "ggtree"                       "GOplot"                       "GSEABase"                     "network"                     
#> [103] "phangorn"                     "phyloseqGraphTest"            "reshape"                      "rworldmap"                    "stats"                        "structSSI"                   
#> [109] "EBImage"                      "geometry"                     "spatstat"                     "caret"                        "curatedMetagenomicData"       "ExperimentHub"               
#> [115] "glmnet"                       "grDevices"                    "rrcov"                        "pwr"                          "Rcpp"                         "survey"