Question: Dealing With Package Sizes Greater than 5 MB
7 weeks ago, j.flesch wrote:

We are developing a package with a test for detecting differential distributions. It will offer generalized testing functions for any one-dimensional data that exists in two conditions, as well as a specialized statistical test for single-cell RNA-seq data. As development of all features is now finished, we are trying to submit it to Bioconductor. This would be my first submission to Bioconductor, and I would greatly appreciate any help with our problem.

Problem

The package's data/ directory contains an empirical cumulative distribution function built from 1,000,000 values. At 4.8 MB, it takes up most of the space in the package and leads to notes and errors when running R CMD check ... / R CMD BiocCheck ..., telling us that the package is too big.

  1. Is there any tolerance in the package size restrictions on Bioconductor? (See below for the total size.)
  2. Is it justifiable to submit a data package containing this distribution to Bioconductor on the basis that it is required by another Bioconductor package, even though it contains no biological data set? Searching the existing Bioconductor packages, I could only find cases where example data sets (genomic/biological data only) were provided as separate packages.

Again, thanks for any advice you have to offer!

Details

A function included in the package determines p-values from the empirical quantile function of a distribution called the Brownian bridge. The quantile function was calculated beforehand to high precision and is stored as the following function object:

> empcdf.ref
Empirical CDF 
Call: ecdf(value.integral)
 x[1:1000000] = 0.0083841, 0.0088768, 0.0095009,  ..., 2.7204,  3.012
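
For illustration, looking up a p-value from an object like this might look like the sketch below. The simulated values are only a placeholder and not the actual Brownian bridge values, `observed` is a hypothetical test statistic, and the upper-tail convention is an assumption.

```r
# Placeholder data: NOT the real reference values, just something to build
# an ECDF from so the example runs on its own.
set.seed(42)
value.integral <- rexp(1e4)

# ecdf() returns a step function that can be called like any other function.
empcdf.ref <- ecdf(value.integral)

# Hypothetical observed test statistic.
observed <- 2.5

# Assumed upper-tail p-value: share of reference values above `observed`.
p.value <- 1 - empcdf.ref(observed)
print(p.value)
```

Because the object returned by ecdf() carries all reference values, its size on disk grows with the number of points used to build it.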

In the current state of our package, the function is stored as an .RData file in the data/ directory using the following command:

> save(empcdf.ref, file="data/empcdf_ref.RData", compress=TRUE, compression_level=9)

To ensure the best compression, we also tried the following commands:

> tools::checkRdaFiles("empcdf_ref.RData")
                     size ASCII compress version
empcdf_ref.RData 12337274 FALSE     gzip       3
> tools::resaveRdaFiles("empcdf_ref.RData", compress ="auto")
> tools::checkRdaFiles("empcdf_ref.RData")
                    size ASCII compress version
empcdf_ref.RData 5009924 FALSE       xz       3

As a result, the quantile function can be stored at a size of 4.8 MB:

$ du -h data/empcdf_ref.RData 
4,8M    data/empcdf_ref.RData
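
The effect of the compression method can also be compared directly on a small stand-in object. This is only a sketch: the simulated ECDF below is a placeholder for the real reference object, and the resulting sizes will differ from those of the actual data.

```r
# Placeholder ECDF standing in for the real 1,000,000-point object.
set.seed(1)
value.integral <- rexp(1e5)
empcdf.ref <- ecdf(value.integral)

# Save the same object once per compression method and record the file size.
methods <- c("gzip", "bzip2", "xz")
sizes <- vapply(methods, function(m) {
  f <- tempfile(fileext = ".RData")
  on.exit(unlink(f))                      # clean up the temporary file
  save(empcdf.ref, file = f, compress = m)
  file.size(f)                            # size in bytes
}, numeric(1))

print(sizes)
```

tools::resaveRdaFiles() with compress = "auto" performs a similar comparison and keeps the smallest result.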

Upon running R CMD BiocCheck, this shows up as an error:

* Checking package size...
    * ERROR: Package Source tarball exceeds Bioconductor size
      requirement.
        Package Size: 5.0324 MB
        Size Requirement: 5.0000 MB
* Checking individual file sizes...
    * WARNING: The following files are over 5MB in size:
        'data/empcdf_ref.RData'
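
The 4.8M reported by du and the "over 5MB" warning refer to the same file of 5009924 bytes. The arithmetic below suggests, although the output alone does not confirm it, that du -h counts in units of 1024^2 bytes while the check counts in units of 10^6 bytes. Treat the decimal interpretation as an assumption.

```r
# File size in bytes as reported by tools::checkRdaFiles() above.
bytes <- 5009924

# Binary units (1 MiB = 1024^2 bytes), which matches the du -h output.
size.mib <- bytes / 1024^2

# Decimal units (1 MB = 10^6 bytes), assumed here to be what the check uses.
size.mb <- bytes / 1e6

print(round(size.mib, 2))
print(round(size.mb, 2))
```

Under this assumption, the file exceeds the limit by just under 10,000 bytes.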

EDIT: Added the BiocCheck tag and the R CMD BiocCheck output.


You might get a better response by posting this to the Bioc Developers mailing list, which is more focused on issues like this. You can sign up at https://stat.ethz.ch/mailman/listinfo/bioc-devel

There was a post very recently on this same topic which may be useful: https://stat.ethz.ch/pipermail/bioc-devel/2019-July/015311.html

written 6 weeks ago by Mike Smith

Thank you, Mike! I will reframe this question and try the mailing list.

written 6 weeks ago by j.flesch