Dealing With Package Sizes Greater than 5 MB
j.flesch ▴ 10
@jflesch-21490

We are developing a package that tests for differential distributions. It will offer generalized testing functions for any one-dimensional data that exists in two conditions, as well as a specialized statistical test for single-cell RNA-seq data. Now that all features are finished, we are trying to submit it to Bioconductor. This is my first Bioconductor submission, and I would greatly appreciate any help with our problem.

Problem

The package's data/ directory includes an empirical cumulative distribution function built from 1,000,000 values. At 4.8 MB, this file accounts for most of the package size and triggers notes and errors from R CMD check ... / R CMD BiocCheck ... saying that the package is too big.

  1. Is there any tolerance in the package size restrictions on Bioconductor? (See below for the total size.)
  2. Is it justifiable to submit a separate data package containing this distribution to Bioconductor on the grounds that another Bioconductor package requires it, even though it contains no biological data set? Searching the existing Bioconductor packages, I could only find cases where example data sets (genomic/biological data only) were provided as separate packages.

Again, thanks for any advice you have to offer!

Details

A function in the package determines p-values from an empirical reference distribution associated with the Brownian bridge. This distribution was calculated in advance to high precision and is saved as the following empirical CDF object:

> empcdf.ref
Empirical CDF 
Call: ecdf(value.integral)
 x[1:1000000] = 0.0083841, 0.0088768, 0.0095009,  ..., 2.7204,  3.012
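
For readers unfamiliar with ecdf objects, here is a minimal sketch of how a p-value could be read off such a reference distribution. It is not the package's actual code: the simulated sample, the names value.sim, empcdf.sim and p_upper, and the choice of an upper-tail p-value are all assumptions made for illustration.

```r
# Illustrative stand-in for the precomputed reference values; the real
# object holds 1,000,000 values and is NOT reproduced here.
set.seed(1)
value.sim  <- abs(rnorm(10000))   # hypothetical reference sample
empcdf.sim <- ecdf(value.sim)     # step function of the same class as empcdf.ref

# Hypothetical helper: upper-tail p-value of an observed statistic,
# i.e. the fraction of reference values strictly greater than it.
p_upper <- function(stat, cdf) 1 - cdf(stat)

p_upper(c(0.5, 2, 5), empcdf.sim)
```

An ecdf object is a function, so it can be called directly on a vector of observed statistics; the size of the stored object grows with the number of reference values it was built from.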

In the current state of our package, this object is stored as an .RData file in the data/ directory using the following command:

> save(empcdf.ref, file="data/empcdf_ref.RData", compress=TRUE, compression_level=9)

To find the best compression, we also ran the following commands:

> tools::checkRdaFiles("empcdf_ref.RData")
                     size ASCII compress version
empcdf_ref.RData 12337274 FALSE     gzip       3
> tools::resaveRdaFiles("empcdf_ref.RData", compress ="auto")
> tools::checkRdaFiles("empcdf_ref.RData")
                    size ASCII compress version
empcdf_ref.RData 5009924 FALSE       xz       3

After recompression, the file takes up 4.8M on disk:

$ du -h data/empcdf_ref.RData 
4,8M    data/empcdf_ref.RData
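
The effect of the compression method can be checked directly with save(). The sketch below uses a small simulated ecdf (empcdf.sim, a hypothetical stand-in, not the package's data) and compares the file sizes produced by the three methods that save() accepts. As far as I know, tools::resaveRdaFiles(compress = "auto") makes the same kind of comparison and keeps the smallest result.

```r
# Hypothetical stand-in object; much smaller than the real reference ECDF.
set.seed(1)
empcdf.sim <- ecdf(abs(rnorm(100000)))

# Save the same object once per compression method and record the size.
tmp <- tempfile(fileext = ".RData")
sizes <- vapply(c("gzip", "bzip2", "xz"), function(method) {
  save(empcdf.sim, file = tmp, compress = method)
  file.size(tmp)
}, numeric(1))
unlink(tmp)

sizes                     # bytes on disk per compression method
names(which.min(sizes))   # method giving the smallest file for this object
```

Which method wins depends on the data, so the ranking for this simulated object need not match the ranking for the real file.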

Running R CMD BiocCheck nevertheless reports an error:

* Checking package size...
    * ERROR: Package Source tarball exceeds Bioconductor size
      requirement.
        Package Size: 5.0324 MB
        Size Requirement: 5.0000 MB
* Checking individual file sizes...
    * WARNING: The following files are over 5MB in size:
        'data/empcdf_ref.RData'
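
The apparent mismatch between the 4.8M reported by du and the "over 5MB" warning may come down to units. A short sketch of the arithmetic, using the byte count from the checkRdaFiles output above; that the check uses decimal megabytes is an assumption inferred from these numbers, and files_over_limit is a hypothetical helper, not part of BiocCheck.

```r
bytes <- 5009924      # size reported by tools::checkRdaFiles() above

bytes / 1024^2        # binary megabytes (MiB), the unit du -h reports
bytes / 1e6           # decimal megabytes (MB)

# Hypothetical helper: list files under `dir` larger than `limit_mb`
# decimal megabytes (assumed threshold definition).
files_over_limit <- function(dir, limit_mb = 5) {
  paths   <- list.files(dir, recursive = TRUE, full.names = TRUE)
  size_mb <- file.size(paths) / 1e6
  data.frame(file = paths, size_mb = size_mb)[size_mb > limit_mb, , drop = FALSE]
}
```

The file is below 5 when measured in units of 1024^2 bytes but slightly above 5 when measured in units of 10^6 bytes, which would explain why both outputs can be correct at once. Calling files_over_limit("data") from the package root would list any offending files.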

EDIT: Added the BiocCheck tag and the R CMD BiocCheck output.


You might get a better response posting this to the Bioc Developers mailing list, which is more focused on issues like this. You can sign up at https://stat.ethz.ch/mailman/listinfo/bioc-devel

There was a post very recently on this same topic which may be useful: https://stat.ethz.ch/pipermail/bioc-devel/2019-July/015311.html


Thank you, Mike! I will reframe this question and try the mailing list.
