Hi, I am developing an R package, "RNASeqWorkflow". This is my first time developing an R package, and I hope to submit it to Bioconductor before October. I have four questions:
Operating system limit: This package installs the HISAT2 and StringTie binaries that match the operating system of the workstation. Because HISAT2 and StringTie only support Linux and macOS, I want this package to support only Linux and macOS. I have checked previous posts and the package guidelines and found that it is OK to limit the operating system. I have mentioned this limitation in the vignette. When is the appropriate time to mention it after package submission?

Package size limit: Here is the result of R CMD BiocCheck. Because my package produces many plots as it runs, and my vignette includes these plots for a comprehensive explanation, the generated HTML file is larger than 5 MB. I have done my best to reduce the package size (and have separated out the example toy data), but the package source tarball and the HTML file still exceed the Bioconductor size limits. Can this check be skipped during R CMD BiocCheck, or how can I resolve this ERROR? If I can skip the check, when should I mention this problem after submitting my package?

Separate experiment package problem: Here is the result of R CMD check. I have separated my example data into a different experiment package, RNASeqWorkflowData. During R CMD check, I got this NOTE:
* checking for unstated dependencies in vignettes ... NOTE
'library' or 'require' call not declared from: ‘RNASeqWorkflowData’
I have read the package guidelines and am not sure whether I should put my experiment package in Imports or Suggests. I am also unsure about the process for submitting a software package together with an experiment package. Is there an example or suggestion?

My package runs from read alignment and assembly through differential analysis to functional analysis, so the vignette takes a long time to build. I have limited my example data to about 5~6 MB. Should I evaluate this code when the vignette is built, or should I just mark the chunks eval=FALSE?
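To make the last two questions concrete, here is a minimal sketch of one possible arrangement, not a statement of Bioconductor policy. It assumes the data package is listed under Suggests in DESCRIPTION and that the vignette evaluates its chunks only when that package is installed. The helper name `can_run_examples` is hypothetical.

```r
# Hypothetical helper: TRUE only if the (suggested) data package is installed.
# requireNamespace() loads the namespace without attaching it and returns
# FALSE instead of raising an error when the package is missing.
can_run_examples <- function(pkg = "RNASeqWorkflowData") {
  requireNamespace(pkg, quietly = TRUE)
}

# In a vignette setup chunk one might then write (assumes knitr is in use):
# knitr::opts_chunk$set(eval = can_run_examples())
```

With this pattern the vignette still builds on a machine without the data package. The affected chunks are shown but not run.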
Thank you very much in advance.
Here is the R CMD BiocCheck output:
This is BiocCheck version 1.17.23. BiocCheck is a work in progress.
Output and severity of issues may change. Installing package...
* Checking for version number mismatch...
* Checking vignette directory...
This is a software package
# of chunks: 45, # of eval=FALSE: 9 (20%)
* Checking version number...
Checking version number validity...
Package version 0.99.0; pre-release
* Checking R Version dependency...
* Checking package size...
* ERROR: Package Source tarball exceeds Bioconductor size
requirement.
Package Size: 8.6382 MB
Size Requirement: 4.0000 MB
* Checking individual file sizes...
* WARNING: The following files are over 5MB in size:
'inst/doc/RNASeqWorkflow.html'
* Checking biocViews...
* Checking that biocViews are present...
* Checking package type based on biocViews...
Software
* Checking for non-trivial biocViews...
* Checking that biocViews come from the same category...
* Checking biocViews validity...
* Checking for recommended biocViews...
* NOTE: Consider adding these automatically suggested biocViews:
Transcription, Microarray, Metabolomics, Proteomics, Coverage,
Bayesian, Regression, DNAMethylation, ChIPSeq, SystemsBiology,
AlternativeSplicing, DifferentialMethylation,
DifferentialSplicing, BatchEffect, MultipleComparison,
GraphAndNetwork, TimeCourse
See http://bioconductor.org/developers/how-to/biocViews/
* Checking build system compatibility...
* Checking for blank lines in DESCRIPTION...
* Checking for whitespace in DESCRIPTION field names...
* Checking that Package field matches directory/tarball name...
* Checking for Version field...
* Checking for valid maintainer...
* Checking unit tests...
* NOTE: Consider adding unit tests. We strongly encourage them. See
http://bioconductor.org/developers/how-to/unitTesting-guidelines/.
* Checking skip_on_bioc() in tests...
* Checking library calls...
* Checking coding practice...
* Checking native routine registration...
* Checking for deprecated package usage...
* Checking parsed R code in R directory, examples, vignettes...
* Checking for direct slot access...
Found @ in man/CheckToolAll.Rd
Found @ in vignettes/RNASeqWorkflow.Rmd
* NOTE: Use accessors; don't access S4 class slots via '@' in
examples/vignettes.
* Checking for browser()...
* Checking for <<-...
* Checking for library/require of RNASeqWorkflow...
* Checking DESCRIPTION/NAMESPACE consistency...
* Checking function lengths...................................
The longest function is 326 lines long
The longest 5 functions are:
edgeRRawCountAnalysis() (R/DE_edgeR_analysis.R, line 1): 326 lines
BallgownAnalysis() (R/DE_Ballgown_analysis.R, line 2): 283 lines
DESeq2RawCountAnalysis() (R/DE_DESeq2_analysis.R, line 1): 282
lines
ProgressGenesFiles() (R/utility_installtool_rnaseqpipline.R, line
2): 257 lines
TPMNormalizationAnalysis() (R/DE_TPM.R, line 2): 211 lines
* Checking man pages...
* Checking exported objects have runnable examples...
* Checking package NEWS...
* Checking formatting of DESCRIPTION, NAMESPACE, man pages, R source,
and vignette source...
* NOTE: Consider shorter lines; 188 lines (2%) are > 80 characters
long.
First 6 lines:
R/AllClasses.R:95 #' genome.name = ...
R/AllClasses.R:98 #' case.group = ...
R/cmd_batch_rnaseq_differential_analysis.R:189 #' ...
R/yeast-data.R:3 #' @description Small RNASeqWorkflowParam S4 object cr...
man/RNASeqWorkflowParam-class.Rd:28 \item{\code{genome.name}}{Variable ...
man/RNASeqWorkflowParam-class.Rd:31 \item{\code{sample.pattern}}{Regula...
* NOTE: Consider multiples of 4 spaces for line indents, 3820
lines(46%) are not.
First 6 lines:
R/AllClasses.R:47 representation(
R/AllClasses.R:48 os.type = "character",
R/AllClasses.R:49 python.variable = "list",
R/AllClasses.R:50 python.2to3 = "logical",
R/AllClasses.R:51 path.prefix = "character",
R/AllClasses.R:52 input.path.prefix = "character",
See http://bioconductor.org/developers/how-to/coding-style/
* Checking for canned comments in man pages...
* Checking if package already exists in CRAN...
* Checking for bioc-devel mailing list subscription...
* NOTE: Cannot determine whether maintainer is subscribed to the
bioc-devel mailing list (requires admin credentials). Subscribe
here: https://stat.ethz.ch/mailman/listinfo/bioc-devel
* Checking for support site registration...
Maintainer is registered at support site.
Summary:
ERROR count: 1
WARNING count: 1
NOTE count: 6
For detailed information about these checks, see the BiocCheck
vignette, available at
https://bioconductor.org/packages/3.8/bioc/vignettes/BiocCheck/inst/doc/BiocCheck.html#interpreting-bioccheck-output
BiocCheck FAILED.
Sorry, I have some additional questions.
The main reason I created this package is to provide an easier way to do two-group RNA-Seq analysis. Moreover, users can run the whole RNA-Seq analysis entirely within the R environment. I have almost finished the package, and I ran into the problems mentioned above before submitting to Bioconductor. I am wondering whether this package would be appropriate for Bioconductor.
The following are some details about how I deal with third-party software:
I have put a lot of effort into handling dependencies on third-party software. The additional non-R software that my package needs installed is ‘HISAT2’, ‘StringTie’, ‘SAMtools’ and ‘Gffcompare’. I wrote an internal function to check the user’s operating system; any OS other than ‘Linux’ or ‘macOS’ triggers an ERROR.
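A minimal sketch of such an operating-system check is shown below. It is not the package's actual internal function: the helper name `check_supported_os` and the returned labels are assumptions made for illustration. `Sys.info()[["sysname"]]` reports "Linux" on Linux and "Darwin" on macOS.

```r
# Hypothetical helper: stop unless the operating system is Linux or macOS.
# Returns a short label that could be used to pick the matching binary.
check_supported_os <- function(sysname = Sys.info()[["sysname"]]) {
  supported <- c(Linux = "linux", Darwin = "macos")  # labels are illustrative
  if (!sysname %in% names(supported)) {
    stop("Unsupported operating system '", sysname,
         "': only Linux and macOS are supported.", call. = FALSE)
  }
  supported[[sysname]]
}
```

Calling `check_supported_os()` with no argument checks the current machine. Passing a name explicitly, such as `check_supported_os("Windows")`, is convenient for testing the error path.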
Because ‘HISAT2’, ‘StringTie’ and ‘Gffcompare’ provide prebuilt binaries, the binaries (not source code, so nothing needs compiling) are installed automatically based on the OS and exported to the R environment. This eliminates the chance of compile errors.
However, ‘SAMtools’ only provides source code (no binary). By default, the source code is downloaded and compiled. Because compiling may cause unexpected errors, I also provide parameters that let users skip the installation of these four tools. In that case they have to make sure that system2() can find the commands on their system. Command availability is checked after the installation process, and any ERROR terminates the environment setup step. The environment setup step must pass its checks before users can run the subsequent analysis steps. The downstream steps depend only on other Bioconductor packages (no third-party software problem).
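The availability check described here could be sketched roughly as follows, using base R's `Sys.which()`, which returns an empty string for any command it cannot find on the PATH. The helper names and the default executable names (`hisat2`, `stringtie`, `samtools`, `gffcompare`) are assumptions for illustration, not the package's actual code.

```r
# Hypothetical helper: names of the command-line tools NOT found on the PATH.
find_missing_tools <- function(tools = c("hisat2", "stringtie",
                                         "samtools", "gffcompare")) {
  paths <- Sys.which(tools)      # named character vector; "" means not found
  names(paths)[paths == ""]
}

# Hypothetical helper: stop the environment setup if any tool is missing.
check_tools <- function(tools = c("hisat2", "stringtie",
                                  "samtools", "gffcompare")) {
  missing_tools <- find_missing_tools(tools)
  if (length(missing_tools) > 0) {
    stop("Commands not found on PATH: ",
         paste(missing_tools, collapse = ", "), call. = FALSE)
  }
  invisible(TRUE)
}
```

Since `system2()` looks commands up on the PATH (unless a full path is given), a tool reported as missing here would normally also fail when called through `system2()`.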
I have also created a separate experiment package so that the vignette can give a comprehensive explanation of the analysis, and the vignette knits successfully.
Please let me know whether this approach would be acceptable to Bioconductor.
Thank you very much in advance.
Sounds like Martin gave a fairly comprehensive answer about third-party software installation via a Bioconductor package. Are there equivalent tools already available via existing packages? For example, rather than trying to download & compile Samtools, have you looked at using the Rsamtools package? This should provide all the same functionality as the command line version, but the installation & compilation headaches have already been solved.
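As a rough illustration of the Rsamtools route, the sketch below converts an aligner's SAM output to BAM from within R. The wrapper name `sam_to_bam` and its arguments are hypothetical. `Rsamtools::asBam()` is the real function, and with its default `indexDestination = TRUE` it also sorts and indexes the resulting BAM file.

```r
# Hypothetical wrapper around Rsamtools::asBam(); assumes Rsamtools is installed.
sam_to_bam <- function(sam_file, destination = sub("\\.sam$", "", sam_file)) {
  if (!requireNamespace("Rsamtools", quietly = TRUE)) {
    stop("The Rsamtools package is required for this step.", call. = FALSE)
  }
  # asBam() appends ".bam" to `destination` and returns the BAM file name.
  Rsamtools::asBam(sam_file, destination = destination, overwrite = TRUE)
}

# Example call (the file name is illustrative only):
# bam <- sam_to_bam("sample1.sam")
```

Whether this covers every SAMtools step in the workflow would need checking against the package's actual calls.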
I'll also point out that there's an existing workflow package with a very similar title, so you might want to think about how you might differentiate them i.e.
https://www.bioconductor.org/packages/devel/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html
To Mike's comment I'll also mention the GenomicRanges package and rtracklayer::import() for working with gff files.
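A minimal sketch of that suggestion, assuming rtracklayer is installed: `rtracklayer::import()` reads a GFF/GTF file into a GRanges object, inferring the format from the file extension. The wrapper name `read_annotation` and the file name in the usage comment are hypothetical.

```r
# Hypothetical wrapper: read a GFF/GTF annotation file into a GRanges object.
read_annotation <- function(gff_file) {
  if (!requireNamespace("rtracklayer", quietly = TRUE)) {
    stop("The rtracklayer package is required for this step.", call. = FALSE)
  }
  # The file format is inferred from the extension (e.g. .gff, .gff3, .gtf).
  rtracklayer::import(gff_file)
}

# Example call (the file name is illustrative only):
# annotation <- read_annotation("merged.gtf")
```

The resulting object can then be subset and manipulated with the usual GenomicRanges tools instead of parsing the file by hand.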
I'll just repeat that the dependence on external tools (no matter how carefully implemented) and the restricted cross-platform utility of your package make it unsuitable for Bioconductor. If there is significant functionality available after removing these dependencies, then you might consider closely specifying what the appropriate inputs are to the remaining functions, and provide in a vignette the 'pre-processing' steps required to create those inputs. Then in the body of your package remove the tool- and platform-specific code.