Question

snapCGH segmentation fault

0

Entering edit mode

Tobias Schmidt • 0

@tobias-schmidt-9247

Last seen 8.4 years ago

Germany

Hi there,

I am currently working on my master's thesis about the influence of copy number changes on the response to a new treatment. My data consists of about 100 samples, measured with Nimblegen 2.1M CGH arrays. Each of those arrays contains about 2.1 million features. After preprocessing my data as described in the snapCGH vignette (http://www.bioconductor.org/packages/3.3/bioc/vignettes/snapCGH/inst/doc/snapCGHguide.pdf), I tried to use the function runBioHMM to segment my data. Unfortunately, it seems like snapCGH can't handle such large datasets, since I always get the following error:

Error: segfault from C stack overflow

I already downloaded the Package Source from https://www.bioconductor.org/packages/release/bioc/html/snapCGH.html and spent quite a few hours in looking through the source code of the package, trying to locate the source of the error and I'm pretty sure it has something to do with the function runNelderMead from snapCGH/src/optimizer.c. Sadly I'm not that familiar with C programming so I decided to ask for help here.

Below you can find a minimal working example that produces the error. The function generateSegList creates an Object similar to MA2 from the vignette mentioned above, just with more rows (corresponding to more features on the microarray). Obviously the function works for 10000 features, but does not for 1000000 features.

library("snapCGH")

generateSegList <- function(nrows){
  state <- rpred <- prob <- M.predicted <- dispersion <- variance <- NULL
  M.observed <- matrix(rnorm(2 * nrows, 0, 1), nrows, 2)
  colnames(M.observed) <- c("sample1", "sample2")
  genes <- data.frame(
    "Block" = rep(1L, nrows), 
    "Row" = rep(1:sqrt(nrows), each = sqrt(nrows)),
    "Col" = rep(1:sqrt(nrows), sqrt(nrows)), 
    "ID" = paste("ID", 1:nrows, sep = ""), 
    "Name" = paste("Name", 1:nrows, sep = ""), 
    "Position" = rep(as.integer(seq(1,2e8, , length.out = nrows/10)), 10) / 1e6,
    "Chr" = rep(1:10, each = nrows / 10),
    "Status" = rep("Probe", nrows),
    stringsAsFactors = FALSE)
  attributes(genes$Status) <- list("values" = "Probe", "col" = "black")
  SegList <- new("SegList")
  SegList$M.observed <- M.observed
  SegList$genes <- genes
  return(SegList)
}

TestSegList1 <- generateSegList(100*100)
TestSegList2 <- generateSegList(1000*1000)
SegInfo.Bio <- runBioHMM(TestSegList1)
# OUTPUT
# sample is  1   Chromosomes: 1  2  3  4  5  6  7  8  9  10  
# sample is  2   Chromosomes: 1  2  3  4  5  6  7  8  9  10
SegInfo.Bio <- runBioHMM(TestSegList2)
# OUTPUT
# sample is  1   Chromosomes: 1  
# Fehler: segmentation fault wegen eines C-Stack Überlaufes

Any suggestions on how to fix this error would be appreciated.

Thanks,

Tobias Schmidt

sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] snapCGH_1.40.0 DNAcopy_1.44.0 limma_3.26.3  

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.2           BiocInstaller_1.20.1  RColorBrewer_1.1-2   
 [4] plyr_1.8.3            tools_3.2.2           zlibbioc_1.16.0      
 [7] digest_0.6.8          GLAD_2.34.0           RSQLite_1.0.0        
[10] annotate_1.48.0       preprocessCore_1.32.0 gtable_0.1.2         
[13] lattice_0.20-33       DBI_0.3.1             parallel_3.2.2       
[16] proto_0.3-10          pixmap_0.4-11         genefilter_1.52.0    
[19] stringr_1.0.0         cluster_2.0.3         IRanges_2.4.3        
[22] S4Vectors_0.8.3       stats4_3.2.2          multtest_2.26.0      
[25] grid_3.2.2            Biobase_2.30.0        AnnotationDbi_1.32.0 
[28] XML_3.98-1.3          survival_2.38-3       ggplot2_1.0.1        
[31] reshape2_1.4.1        magrittr_1.5          scales_0.3.0         
[34] MASS_7.3-45           splines_3.2.2         BiocGenerics_0.16.1  
[37] xtable_1.8-0          strucchange_1.5-1     colorspace_1.2-6     
[40] sandwich_2.3-4        aCGH_1.48.0           stringi_1.0-1        
[43] affy_1.48.0           tilingArray_1.48.0    munsell_0.4.2        
[46] vsn_3.38.0            zoo_1.7-12            affyio_1.40.0

snapcgh software error • 1.3k views

ADD COMMENT • link updated 8.4 years ago by Ramon Diaz-Uriarte ▴ 70 • written 8.4 years ago by Tobias Schmidt • 0

score 2 · Accepted Answer · 2015-11-23

Dear Tobias,

Increasing the stack limit with ulimit might help. For instance, you could do:

ulimit -s unlimited

R --vanilla < run-my-biohmm-code.R &> output-biohmm.Rout

from the shell. As you see, we issue ulimit before running R (and we run R from that very same shell in which we issued the ulimit command). In the past I was able to run fairly large examples (as described in the "benchmarks.pdf" file of the ADaCGH2 package: https://www.bioconductor.org/packages/release/bioc/vignettes/ADaCGH2/inst/doc/benchmarks.pdf, which is also in the supplementary material of the paper --- http://bioinformatics.oxfordjournals.org/content/30/12/1759) that way). Regardless, be aware that BioHMM can be slow.

Best,

R.