Trouble creating an RnBeads annotation for rn6 rat genome build
1
0
Entering edit mode
@richardjacton-19729
Last seen 5.9 years ago

I've been trying to use the RnBeadsAnnotationCreator (https://github.com/epigen/RnBeadsAnnotationCreator) to create an annotation for the rn6 build of the rat genome. I started with the createAnnotationPackage.rn5.R file and updated it for rn6 see: but i've been unable to get it to build the rn6 annotation.

When I call: createAnnotationPackage("rn6", dest = "~/Documents/myCustomRlibs/RnBeads.rn6") The VCF files apprear to be sucessfully downloaded but contain this warning in the output:

2019-02-04 14:57:56     1.8 WARNING             Could not detect MAF data (CAF field)
2019-02-04 14:57:56     1.8 WARNING             Could not detect MAF data (G5 field); allele frequency is not considered

And then I get this error:

2019-02-04 14:59:09     2.3   ERROR         dbSNP table(s) are too large. MapReduce implemented in ~/Documents/myCustomRlibs/RnBeads.rn6/RnBeads.rn6/temp/split.snps.R
Error in logger.error(paste("dbSNP table(s) are too large. MapReduce implemented in",  : 
  dbSNP table(s) are too large. MapReduce implemented in ~/Documents/myCustomRlibs/RnBeads.rn6/RnBeads.rn6/temp/split.snps.R

I took a look at the source of the rnb.update.dbsnp() function (https://rdrr.io/github/epigen/RnBeadsAnnotationCreator/src/R/snps.R), which appears to be the source of the error, but haven't been able to figure out why it's triggering an error.

Any suggestions would be much appreciated.


SessionInfo() :

R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.10 (Santiago)

Matrix products: default
BLAS: /home/local/software/R/3.4.2/lib64/R/lib/libRblas.so
LAPACK: /home/local/software/R/3.4.2/lib64/R/lib/libRlapack.so

locale:
[1] C

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] BSgenome.Rnorvegicus.UCSC.rn6_1.4.1    
 [2] BSgenome_1.46.0                        
 [3] rtracklayer_1.38.3                     
 [4] RnBeadsAnnotationCreator_0.99.0        
 [5] RnBeads_1.10.8                         
 [6] plyr_1.8.4                             
 [7] methylumi_2.24.1                       
 [8] minfi_1.24.0                           
 [9] bumphunter_1.20.0                      
[10] locfit_1.5-9.1                         
[11] Biostrings_2.46.0                      
[12] XVector_0.18.0                         
[13] SummarizedExperiment_1.8.1             
[14] DelayedArray_0.4.1                     
[15] FDb.InfiniumMethylation.hg19_2.2.0     
[16] org.Hs.eg.db_3.5.0                     
[17] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[18] GenomicFeatures_1.30.3                 
[19] AnnotationDbi_1.40.0                   
[20] reshape2_1.4.3                         
[21] scales_0.5.0                           
[22] Biobase_2.38.0                         
[23] illuminaio_0.20.0                      
[24] matrixStats_0.53.1                     
[25] limma_3.34.9                           
[26] gridExtra_2.3                          
[27] gplots_3.0.1                           
[28] ggplot2_3.1.0                          
[29] fields_9.6                             
[30] maps_3.3.0                             
[31] spam_2.2-1                             
[32] dotCall64_1.0-0                        
[33] ff_2.2-14                              
[34] bit_1.1-14                             
[35] cluster_2.0.7-1                        
[36] MASS_7.3-50                            
[37] GenomicRanges_1.30.3                   
[38] GenomeInfoDb_1.14.0                    
[39] IRanges_2.12.0                         
[40] S4Vectors_0.16.0                       
[41] BiocGenerics_0.24.0                    
[42] doParallel_1.0.14                      
[43] iterators_1.0.10                       
[44] foreach_1.4.4                          

loaded via a namespace (and not attached):
 [1] colorspace_1.3-2         siggenes_1.52.0          mclust_5.4              
 [4] base64_2.0               bit64_0.9-7              xml2_1.2.0              
 [7] codetools_0.2-15         splines_3.4.2            Rsamtools_1.30.0        
[10] annotate_1.56.2          readr_1.3.1              compiler_3.4.2          
[13] httr_1.3.1               assertthat_0.2.0         Matrix_1.2-14           
[16] lazyeval_0.2.1           prettyunits_1.0.2        tools_3.4.2             
[19] bindrcpp_0.2.2           gtable_0.2.0             glue_1.3.0              
[22] GenomeInfoDbData_1.0.0   dplyr_0.7.5              doRNG_1.6.6             
[25] Rcpp_1.0.0               multtest_2.34.0          gdata_2.18.0            
[28] preprocessCore_1.40.0    nlme_3.1-137             stringr_1.3.1           
[31] rngtools_1.3.1           gtools_3.5.0             XML_3.98-1.11           
[34] beanplot_1.2             zlibbioc_1.24.0          hms_0.4.2               
[37] GEOquery_2.46.15         RColorBrewer_1.1-2       memoise_1.1.0           
[40] pkgmaker_0.27            biomaRt_2.34.2           reshape_0.8.7           
[43] stringi_1.2.4            RSQLite_2.1.1            genefilter_1.60.0       
[46] RMySQL_0.10.15           caTools_1.17.1           BiocParallel_1.12.0     
[49] bibtex_0.4.2             rlang_0.3.1              pkgconfig_2.0.2         
[52] bitops_1.0-6             nor1mix_1.2-3            lattice_0.20-35         
[55] purrr_0.3.0              bindr_0.1.1              GenomicAlignments_1.14.2
[58] tidyselect_0.2.5         magrittr_1.5             R6_2.2.2                
[61] DBI_1.0.0                pillar_1.3.1             withr_2.1.2             
[64] survival_2.42-3          RCurl_1.95-4.10          tibble_2.0.1            
[67] crayon_1.3.4             KernSmooth_2.23-15       progress_1.2.0          
[70] data.table_1.11.4        blob_1.1.1               digest_0.6.15           
[73] xtable_1.8-2             tidyr_0.8.2              openssl_1.0.1           
[76] munsell_0.5.0            registry_0.5             quadprog_1.5-5          
RnBeads rn6 RnBeadsAnnotationCreator • 1.4k views
ADD COMMENT
0
Entering edit mode
mscherer ▴ 50
@mscherer-19737
Last seen 2.8 years ago
Spain/Barcelona/Centre for Genomic Regu…

Hey Richard,

I just had a look into the code for the annotation creator and found the following function definition:

createAnnotationPackage <- function(assembly,dest=getwd(),maxSNPs=500000L,cleanUp=TRUE)

The parameter maxSNPs accounts for the maximal number of SNPs allowed in any given annotation. If the number is larger than this value, the annotation creator tries to distribute the jobs over a high performance compute cluster using MapReduce (that is the error you get). The error is thrown in lines 304-313 of snps.R

    if (.globals[['SNP_MAX']] != 0 && .globals[['SNP_MAX']] <= max(sapply(snps, nrow))) {
    fname <- normalizePath(fname, '/')
    dname <- normalizePath(file.path(.globals[["DIR.PACKAGE"]], "temp", "qsub"), '/', FALSE)
    txt <- c('suppressPackageStartupMessages(library(RnBeadsAnnotationCreator))',
        paste0('txt <- rnb.split.snps("', fname, '", "', dname, '")'),
        'cat(txt, sep = "\\n", file = "split.snps.log")')
    fname <- file.path(.globals[["DIR.PACKAGE"]], "temp", "split.snps.R")
    cat(paste(txt, collapse = "\n"), file = fname)
    logger.error(paste('dbSNP table(s) are too large. MapReduce implemented in', fname))
}

Further information on this process can be found in the documentation of the createAnnotationPackage function:

#' @param maxSNPs  Maximum number of dbSNP records in preprocessed table for a single chromosome, given as a single
#'                 non-negative \code{integer} value. If a larger table (than this threshold) is found, the annotation
#'                 creator stops and suggests a solution that utilizes the environment of a computational cluster.
#'                 Setting this parameter to \code{0} disables such a test. This parameter is used only when SNP data is
#'                 available for the targeted genome assembly.

So you have two possibilities: Either increase the value for MAX_SNPs, or consider distributing this over a high performance compute cluster, in case so have such available.

Hope that helps, Michael

ADD COMMENT
0
Entering edit mode

Hi Michael,

I'm not quite clear on what exactly the maxSNPs parameter does, would reducing it just change the 'batch' size for processing or will it mean some SNPs might not be available in the annotation?

I was logged into an HPC cluster using the PBS/Torque a job management setup when I tried running this - is there something I have to do to set it up to make use of cluster - there didn't seem to be any options related to setting up the jobs?

Thanks for the help, Richard

ADD REPLY
0
Entering edit mode

Hey Richard,

The maxSNPs parameter is used as the maximum batch size. So if you have more SNPs than this number, RnBeads suggests to use a compute cluster for that. This is the case you are in. RnBeads automatically creates a script that you can use for job scheduling with PBS. In your case, it is stored in:

~/Documents/myCustomRlibs/RnBeads.rn6/RnBeads.rn6/temp/split.snps.R

It uses the function rnb.split.snps to create smaller batches of SNPs to be processed. If I were you, I would first try increasing the maxSNPs parameter and then go for the batch processing.

ADD REPLY

Login before adding your answer.

Traffic: 562 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6