Question: BSmooth error during the smoothing step for a subset of files
0
2.5 years ago by
brandon.le0 wrote:

I have a problem while smoothing using the bsseq package. I imported bismark coverage files that I previously separated based on cytosine context (e.g., CG, CHG, CHH). I was able to processed data for the CG and CHG context  following the manual but when I tried to process the CHH data, I obtained the error below:

> library("bsseq")
> list.files(pattern="*.Gm01.CHH.gz") -> cov.files
> read.bismark(files=cov.files,sampleNames = gsub(".Gm01.CHH.gz","",cov.files), rmZeroCov = TRUE) -> bsseq.data
Assuming file type iscov
Read 13672009 rows and 6 (of 6) columns from 0.373 GB file in 00:00:04
done in 8.2 secs
Read 12227559 rows and 6 (of 6) columns from 0.336 GB file in 00:00:04
done in 7.7 secs
Read 13374971 rows and 6 (of 6) columns from 0.368 GB file in 00:00:04
done in 8.8 secs
Read 13209474 rows and 6 (of 6) columns from 0.365 GB file in 00:00:04
done in 7.9 secs
[read.bismark] Joining samples ... done in 29.9 secs
> bsseq.data
An object of type 'BSseq' with
14696052 methylation loci
4 samples
has not been smoothed
> bsseq.data.smoothed <- BSmooth(bsseq.data, mc.cores=8, parallelBy="chromosome", verbose=TRUE)
[BSmooth] preprocessing ... done in 7.6 sec
[BSmooth] smoothing by 'chromosome' (mc.cores = 8, mc.preschedule = FALSE)
Error in BSmooth(bsseq.data, mc.cores = 8, parallelBy = "chromosome",  :
BSmooth encountered smoothing errors
In mclapply(1:length(clusterIdx), function(ii) { :
1 function calls resulted in an error

Since there are more CHH sites in the genome than the other two contexts, I thought the number of methylation loci was too much so I was going to smooth one chromosome at a time and then combine them afterwards. However, I obtained the same error message. I've tried this on several datasets and it always fail on the CHH separated files. Do you have any ideas?

Thank you.

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BiocInstaller_1.24.0       edgeR_3.16.5               limma_3.30.13              bsseq_1.10.0               systemPipeR_1.8.1
[6] RSQLite_1.1-2              DBI_0.6                    ShortRead_1.32.1           GenomicAlignments_1.10.1   SummarizedExperiment_1.4.0
[11] Biobase_2.34.0             BiocParallel_1.8.1         Rsamtools_1.26.1           Biostrings_2.42.1          XVector_0.14.1
[16] GenomicRanges_1.26.4       GenomeInfoDb_1.10.3        IRanges_2.8.2              S4Vectors_0.12.2           BiocGenerics_0.20.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.10           locfit_1.5-9.1         lattice_0.20-34        GO.db_3.4.0            gtools_3.5.0           digest_0.6.12
[7] plyr_1.8.4             futile.options_1.0.0   BatchJobs_1.6          backports_1.0.5        ggplot2_2.2.1          zlibbioc_1.20.0
[13] GenomicFeatures_1.26.3 data.table_1.10.4      annotate_1.52.1        R.oo_1.21.0            R.utils_2.5.0          Matrix_1.2-8
[19] checkmate_1.8.2        GOstats_2.40.0         splines_3.3.1          stringr_1.2.0          pheatmap_1.0.8         RCurl_1.95-4.8
[25] biomaRt_2.30.0         munsell_0.4.3          sendmailR_1.2-1        rtracklayer_1.34.2     base64enc_0.1-3        BBmisc_1.11
[31] fail_1.3               matrixStats_0.51.0     XML_3.98-1.5           AnnotationForge_1.16.1 R.methodsS3_1.7.1      bitops_1.0-6
[37] grid_3.3.1             RBGL_1.50.0            xtable_1.8-2           GSEABase_1.36.0        gtable_0.2.0           magrittr_1.5
[43] scales_0.4.1           graph_1.52.0           stringi_1.1.3          hwriter_1.3.2          genefilter_1.56.0      latticeExtra_0.6-28
[49] futile.logger_1.4.3    brew_1.0-6             rjson_0.2.15           lambda.r_1.1.9         RColorBrewer_1.1-2     tools_3.3.1
[55] Category_2.40.0        survival_2.40-1        AnnotationDbi_1.36.2   colorspace_1.3-2      
bsseq bsmooth • 632 views
modified 2.5 years ago by Peter Hickey460 • written 2.5 years ago by brandon.le0

Hi Brandon,

Can you produce and share a minimal CHH file that reproduces the error? I've not had issues specific to CHH files.

Cheers,

Pete

Hi Peter,

I attached a link containing the smallest CHH files that generate the error. Interestingly, when I reduced the same data file to contain 1M or 10M loci, the smoothing step worked without any error. When I increased the number of loci to 12M, the error appeared. This is odd since the number of loci for CHG was > 35M and it worked okay.

https://www.dropbox.com/sh/ohxfb4dujqkbuxe/AACjXl6T1lezDjuDVo1y-8rfa?dl=0

Thank you.

Best,

Brandon

Thanks, Brandon. I was able to reproduce the error using the BS1.12M.CHH.gz file. Ultimately, an error is occurring when locfdr::locfit() is called from within bsseq::BSmooth(). Unfortunately, because it's a little buried and calling some fairly complicated code in locfdr, this means it will take a little work to track down and fix. I'll spend some time on it this week and post back when I have more.

Pete

Answer: BSmooth error during the smoothing step for a subset of files
0
2.5 years ago by
Peter Hickey460
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Peter Hickey460 wrote:

Hi Brandon,

This issue is that the default BSmooth() parameters, especially nsh, and maxgap, are tuned for the distribution of CG loci in human/mammalian genomes [1]. You are using these default parameters to analyse CHH loci from what I'm guessing is soybean (based on 'Gm01' in your example), and these CHH loci will have a very different spatial distribution to mammalian CG loci. Basically, what is happening is that locfit errors because it doesn't like this choice of smoothing parameters given the spatial distribution of CHH loci in your data.

But there is good news. I was able to run BSmooth() on the BS1.12M.CHH.gz example by increasing ns to 700 and h to 10000. This was a fairly arbitrary choice of parameters, but demonstrates that by modifying the smoothing parameters we can run BSmooth() on these data.

We do not currently have recommended parameters for BSmooth() for non-mammalian, non-CG loci, so you will need to experiment to see what gives a reasonable looking smooth. I recommend focusing on a single sample and small-ish chromosome (or even a part of a chromosome) and plotting the results of various smooth fits along with the raw data.

Cheers,

Pete

[1] The ns and h parameters ultimately influence how parameters in locfit::lp() and locfit::locfit() which are called by BSmooth().  As a user of bsseq, you cannot set these locfit parameters directly.

Hi Peter,

I will try your suggestion with different ns and h to see what works best for soybean.  Thank you very much for looking into this problem. Are there plans to examine parameters for non-mammalian, non-CG loci in the future? In plants, non-CG loci are highly prevalent.

Best,

Brandon