Problem running summarizeOverlaps()

0

Entering edit mode

Jessica Perry Hekman ▴ 140

@jessica-perry-hekman-6556

Last seen 9.6 years ago

I am working from http://bioconductor.org/packages/release/bioc/vignettes/gage/inst/doc /RNA-seqWorkflow.pdf Before writing my own script, I attempted to run the exact code from that vignette: library(TxDb.Hsapiens.UCSC.hg19.knownGene) exByGn <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene") # Read counts library(Rsamtools) fls <- list.files("../../bam/", pattern="fox-readgroups.bam$", full.names=T) library(leeBamViews) # I inserted this line bamfls <- BamFileList(fls) flag <- scanBamFlag(isNotPrimaryRead=FALSE, isProperPair=TRUE) param <- ScanBamParam(flag=flag) gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", ignore.strand=TRUE, single.end=TRUE, param=param) hnrnp.cnts=assay(gnCnt) When I executed this in R, I got the error message on the second to last line: Error in validObject(.Object) : invalid class ?SummarizedExperiment? object: 'rowData' length differs from 'assays' nrow In addition: Warning message: In lapply(setNames(seq_along(reads), names(reads)), function(i, : all scheduled cores encountered errors in user code I'm not sure what to make of this error message. Apparently SummarizedExperiment is the output object which I should be getting back from summarizeOverlaps(). Is this a problem with the data I'm putting in? (But it is just the example data used in the vignette, so it should be trustworthy.) Any suggestions welcome! Thanks, Jessica -- Jessica P. Hekman, DVM, MS PhD student, University of Illinois, Urbana-Champaign Animal Sciences / Genetics, Genomics, and Bioinformatics

Genetics Genetics • 3.3k views

ADD COMMENT • link updated 8.2 years ago by shania90.lk ▴ 10 • written 9.9 years ago by Jessica Perry Hekman ▴ 140

1

Entering edit mode

shania90.lk ▴ 10

@shania90lk-9656

Last seen 7.9 years ago

Hi all,

I know this is quite a late reply. But maybe it might help anyone needing help in the future. I realised that the "single.end=TRUE/FALSE" option is now changed to "singleEnd=TRUE/FALSE". The errors I saw in this thread were given to me because of that particular difference.

Cheers,

Shani.

ADD COMMENT • link 8.2 years ago shania90.lk ▴ 10

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 2 days ago

United States

On 05/19/2014 06:55 PM, Jessica Perry Hekman wrote: > I am working from > > http://bioconductor.org/packages/release/bioc/vignettes/gage/inst/doc /RNA-seqWorkflow.pdf > > > Before writing my own script, I attempted to run the exact code from that vignette: > > library(TxDb.Hsapiens.UCSC.hg19.knownGene) > exByGn <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene") > > # Read counts > library(Rsamtools) > fls <- list.files("../../bam/", pattern="fox-readgroups.bam$", full.names=T) > > library(leeBamViews) # I inserted this line > bamfls <- BamFileList(fls) > > flag <- scanBamFlag(isNotPrimaryRead=FALSE, isProperPair=TRUE) > param <- ScanBamParam(flag=flag) > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", > ignore.strand=TRUE, single.end=TRUE, param=param) > hnrnp.cnts=assay(gnCnt) > > > When I executed this in R, I got the error message on the second to last line: > > Error in validObject(.Object) : > invalid class ?SummarizedExperiment? object: 'rowData' length differs from > 'assays' nrow > In addition: Warning message: > In lapply(setNames(seq_along(reads), names(reads)), function(i, : > all scheduled cores encountered errors in user code Hi Jessica -- I think that summarizeOverlaps is trying to evaluate your counting algorithm in on several different cores, but an error occurs. Try running the commands above, and then immediately before summarizeOverlaps evaluate options(mc.cores=1) gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", ignore.strand=TRUE, single.end=TRUE, param=param) Hopefully this will at least make the error apparent, even if it might still be cryptic. Please be sure to include the output of the command 'sessionInfo()' after you have a problem; here's mine > sessionInfo() R version 3.0.2 Patched (2014-01-02 r64626) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] Rsamtools_1.14.3 Biostrings_2.30.1 GenomicRanges_1.14.4 [4] XVector_0.2.0 IRanges_1.20.7 BiocGenerics_0.8.0 [7] BiocInstaller_1.12.1 loaded via a namespace (and not attached): [1] bitops_1.0-6 stats4_3.0.2 zlibbioc_1.8.0 (actually, this is my best guess at the version of R you're using; there's a more recent version, and more recent versions of the packages, available). Hope that helps, Martin > > > I'm not sure what to make of this error message. Apparently SummarizedExperiment > is the output object which I should be getting back from summarizeOverlaps(). Is > this a problem with the data I'm putting in? (But it is just the example data > used in the vignette, so it should be trustworthy.) > > Any suggestions welcome! > > Thanks, > Jessica > > > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD COMMENT • link 9.9 years ago Martin Morgan 25k

0

Entering edit mode

On 05/19/2014 09:32 PM, Martin Morgan wrote: > On 05/19/2014 06:55 PM, Jessica Perry Hekman wrote: >> I am working from >> >> http://bioconductor.org/packages/release/bioc/vignettes/gage/inst/doc /RNA-seqWorkflow.pdf >> gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", >> ignore.strand=TRUE, single.end=TRUE, param=param) > Hi Jessica -- > > I think that summarizeOverlaps is trying to evaluate your counting > algorithm in on several different cores, but an error occurs. Try > running the commands above, and then immediately before > summarizeOverlaps evaluate > > options(mc.cores=1) > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", > ignore.strand=TRUE, single.end=TRUE, param=param) > > Hopefully this will at least make the error apparent, even if it might > still be cryptic. > > Please be sure to include the output of the command 'sessionInfo()' > after you have a problem; here's mine Ah yes! Very helpful! The error message after I added mc.cores=1 to my script is: Error: C stack usage is too close to the limit ...which is indeed much less cryptic. I am still not sure how to fix the problem, though! sessionInfo() output: R version 3.0.2 (2013-09-25) Platform: x86_64-redhat-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] leeBamViews_0.99.24 [2] BSgenome_1.30.0 [3] Rsamtools_1.14.3 [4] Biostrings_2.30.1 [5] TxDb.Hsapiens.UCSC.hg19.knownGene_2.10.1 [6] GenomicFeatures_1.14.5 [7] AnnotationDbi_1.24.0 [8] Biobase_2.22.0 [9] GenomicRanges_1.14.4 [10] XVector_0.2.0 [11] IRanges_1.20.7 [12] BiocGenerics_0.8.0 loaded via a namespace (and not attached): [1] biomaRt_2.18.0 bitops_1.0-6 DBI_0.2-7 RCurl_1.95-4.1 [5] RSQLite_0.11.4 rtracklayer_1.22.7 stats4_3.0.2 tools_3.0.2 [9] XML_3.98-1.1 zlibbioc_1.8.0 ...and I should have remembered that I am using an older version of R. What I am running is the latest version that my package manager has on offer. Last time I installed a more recent version separately from yum, it was a huge annoyance to keep the two separate versions on the system. Do you think updating R and Bioconductor (which appears to depend on the most recent R in order to upgrade) will help? Thanks very much, Jessica

ADD REPLY • link 9.9 years ago Jessica Perry Hekman ▴ 140

0

Entering edit mode

On 05/20/2014 05:34 AM, Jessica Perry Hekman wrote: > On 05/19/2014 09:32 PM, Martin Morgan wrote: >> On 05/19/2014 06:55 PM, Jessica Perry Hekman wrote: >>> I am working from >>> >>> http://bioconductor.org/packages/release/bioc/vignettes/gage/inst/doc /RNA-seqWorkflow.pdf >>> > >>> gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", >>> ignore.strand=TRUE, single.end=TRUE, param=param) > >> Hi Jessica -- >> >> I think that summarizeOverlaps is trying to evaluate your counting >> algorithm in on several different cores, but an error occurs. Try >> running the commands above, and then immediately before >> summarizeOverlaps evaluate >> >> options(mc.cores=1) >> gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", >> ignore.strand=TRUE, single.end=TRUE, param=param) >> >> Hopefully this will at least make the error apparent, even if it might >> still be cryptic. >> >> Please be sure to include the output of the command 'sessionInfo()' >> after you have a problem; here's mine > > Ah yes! Very helpful! The error message after I added mc.cores=1 to my script is: > > Error: C stack usage is too close to the limit > > ...which is indeed much less cryptic. I am still not sure how to fix the > problem, though! I haven't seen this error before in the context of summarizeOverlaps, so it's a bit puzzling. I'd first check that the fls <- list.files("../../bam/", pattern="fox-readgroups.bam$", full.names=T) all point to valid bam files, and the bam files have indexes. You might then try adding a 'yieldSize' argument to the following line, starting small (e.g., 100000) and moving toward the default (1000000) if the small size works when calling summarizeOverlaps, or perhaps smaller if it fails. bamfls <- BamFileList(fls, yieldSize=100000) Can you provide a little information about your system? It sounds like it's your own machine, not a server. How much memory? Probably you'd get a different outcome with a more recent R / Bioconductor, but I'm not sure whether the error would go away! I have a sense that the problem with package manager installation is that they or you end up installing non-default packages into a single system directory, and as a consequence the directory contains a mix of different Bioconductor releases. A 'better practice' is probably to a) remove any existing system-wide R installation and packages b) install R with only base packages as su, or (as I do) install R as a regular user (not su) in version-specific directories in your own user file system, e.g., ~mtmorgan/bin/R-3-1-branch/ c) install any additional packages, via biocLite or otherwise, as a regular user, following R's prompt to create a version-specific directory in your own user hierarchy. Obviously this can be a rats nest of problems, and should only be done immediately before a big deadline or when you are feeling too productive and need to scale back ;) Martin > > sessionInfo() output: > > R version 3.0.2 (2013-09-25) > Platform: x86_64-redhat-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] leeBamViews_0.99.24 > [2] BSgenome_1.30.0 > [3] Rsamtools_1.14.3 > [4] Biostrings_2.30.1 > [5] TxDb.Hsapiens.UCSC.hg19.knownGene_2.10.1 > [6] GenomicFeatures_1.14.5 > [7] AnnotationDbi_1.24.0 > [8] Biobase_2.22.0 > [9] GenomicRanges_1.14.4 > [10] XVector_0.2.0 > [11] IRanges_1.20.7 > [12] BiocGenerics_0.8.0 > > loaded via a namespace (and not attached): > [1] biomaRt_2.18.0 bitops_1.0-6 DBI_0.2-7 RCurl_1.95-4.1 > [5] RSQLite_0.11.4 rtracklayer_1.22.7 stats4_3.0.2 tools_3.0.2 > [9] XML_3.98-1.1 zlibbioc_1.8.0 > > ...and I should have remembered that I am using an older version of R. What I am > running is the latest version that my package manager has on offer. Last time I > installed a more recent version separately from yum, it was a huge annoyance to > keep the two separate versions on the system. Do you think updating R and > Bioconductor (which appears to depend on the most recent R in order to upgrade) > will help? > > Thanks very much, > Jessica > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 9.9 years ago Martin Morgan 25k

0

Entering edit mode

On 05/20/2014 11:05 AM, Martin Morgan wrote: >> Error: C stack usage is too close to the limit >> > I haven't seen this error before in the context of summarizeOverlaps, so > it's a bit puzzling. I'd first check that the > > fls <- list.files("../../bam/", pattern="fox-readgroups.bam$", > full.names=T) > > all point to valid bam files, and the bam files have indexes. * Yes, that directory is full of BAM files with their BAI indexes -- when I first ran the R script, I didn't have BAI files and it complained about that. I indexed the BAMs and then it was happy. Sample BAM filename: ../../bam/52_172-fox-readgroups.bam > You might then try adding a 'yieldSize' argument to the following line, > starting small (e.g., 100000) and moving toward the default (1000000) if > the small size works when calling summarizeOverlaps, or perhaps smaller > if it fails. > > bamfls <- BamFileList(fls, yieldSize=100000) Interesting -- that seems to have worked. The process is running much longer now, anyways, and I don't want to wait for it to finish because I suspect it might take hours. I'll try it again with a larger number once I'm sure it's exited successfully. yieldSize seems to have to do with the number of records it will process. It's only 24 BAM files so I'm not sure whether "records" means "files". Does "records" mean "reads"? > Can you provide a little information about your system? It sounds like > it's your own machine, not a server. How much memory? It's the lab server, and I have sudo access. 120G memory (about 80G free at the moment as other jobs are running). Fedora 20. > Probably you'd get a different outcome with a more recent R / > Bioconductor, but I'm not sure whether the error would go away! I have a > sense that the problem with package manager installation is that they or > you end up installing non-default packages into a single system > directory, and as a consequence the directory contains a mix of > different Bioconductor releases. A 'better practice' is probably to > > a) remove any existing system-wide R installation and packages > > b) install R with only base packages as su, or (as I do) install R as > a regular user (not su) in version-specific directories in your own user > file system, e.g., ~mtmorgan/bin/R-3-1-branch/ > > c) install any additional packages, via biocLite or otherwise, as a > regular user, following R's prompt to create a version-specific > directory in your own user hierarchy. > > Obviously this can be a rats nest of problems, and should only be done > immediately before a big deadline or when you are feeling too productive > and need to scale back ;) Noted! I wonder about just installing the updated version locally so everyone else on the server (2 other people) can use the default version. I might try that. Leaving two versions scares me, of course. Jessica

ADD REPLY • link 9.9 years ago Jessica Perry Hekman ▴ 140

0

Entering edit mode

On 05/20/2014 02:20 PM, Jessica Perry Hekman wrote: >>> Error: C stack usage is too close to the limit >> You might then try adding a 'yieldSize' argument to the following line, >> starting small (e.g., 100000) and moving toward the default (1000000) if >> the small size works when calling summarizeOverlaps, or perhaps smaller >> if it fails. >> >> bamfls <- BamFileList(fls, yieldSize=100000) So, this is perplexing. Is 1000000 really the default? Because I can set yieldSize to much larger OR smaller than that and the command will succeed (or at least complete without errors). But when I do not specify yieldSize at all, there is an error! > bamfls <- BamFileList(fls, yieldSize=100000) > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", + ignore.strand=TRUE, single.end=TRUE, param=param) > bamfls <- BamFileList(fls, yieldSize=500000) > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", + ignore.strand=TRUE, single.end=TRUE, param=param) > bamfls <- BamFileList(fls, yieldSize=1000000) > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", + ignore.strand=TRUE, single.end=TRUE, param=param) > bamfls <- BamFileList(fls, yieldSize=10000000) > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", + ignore.strand=TRUE, single.end=TRUE, param=param) > bamfls <- BamFileList(fls, yieldSize=1000000000) > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", + ignore.strand=TRUE, single.end=TRUE, param=param) BUT: > bamfls <- BamFileList(fls) > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", + ignore.strand=TRUE, single.end=TRUE, param=param) Error: C stack usage is too close to the limit ?! Jessica

ADD REPLY • link 9.9 years ago Jessica Perry Hekman ▴ 140

0

Entering edit mode

On 05/20/2014 01:37 PM, Jessica Perry Hekman wrote: > On 05/20/2014 02:20 PM, Jessica Perry Hekman wrote: >>>> Error: C stack usage is too close to the limit > >>> You might then try adding a 'yieldSize' argument to the following line, >>> starting small (e.g., 100000) and moving toward the default (1000000) if >>> the small size works when calling summarizeOverlaps, or perhaps smaller >>> if it fails. >>> >>> bamfls <- BamFileList(fls, yieldSize=100000) > > So, this is perplexing. Is 1000000 really the default? Because I can set > yieldSize to much larger OR smaller than that and the command will succeed (or > at least complete without errors). But when I do not specify yieldSize at all, > there is an error! Ok, I guess I did not remember correctly. If the function is passed a BamFile / BamFileList, it respects the yieldSize in the File / List. If yieldSize is not specified, then it'll try to read the entire file into memory. And hilarity ensues. If passed a character vector of file paths (I think this is supported in your version) then summarizeOverlaps will set the default yieldSize to 1000000. So yes, create the BamFileList with an appropriate yieldSize. From your earlier email, yieldSize refers to the number of reads read in at one time. In terms of appropriate yieldSize, summarizeOverlaps will iterate through individual bam files using yieldSize, and simultaneously use parallel (hence for you mc.cores, which by default is just 2 but can be set using options(mc.cores=8) or whatever; more recent versions use BiocParallel and register(MulticoreParam())) evaluation to process several bam files at once. So for optimal performance you want to choose a yieldSize such that all cores (or as many as being neighbourly dictates) are in use but not too much memory is being consumed. If you do decide to update your R, summarizeOverlaps has moved to GenomicAlignments. Martin > > > bamfls <- BamFileList(fls, yieldSize=100000) > > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", > + ignore.strand=TRUE, single.end=TRUE, param=param) > > bamfls <- BamFileList(fls, yieldSize=500000) > > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", > + ignore.strand=TRUE, single.end=TRUE, param=param) > > bamfls <- BamFileList(fls, yieldSize=1000000) > > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", > + ignore.strand=TRUE, single.end=TRUE, param=param) > > bamfls <- BamFileList(fls, yieldSize=10000000) > > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", > + ignore.strand=TRUE, single.end=TRUE, param=param) > > bamfls <- BamFileList(fls, yieldSize=1000000000) > > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", > + ignore.strand=TRUE, single.end=TRUE, param=param) > > > BUT: > > > bamfls <- BamFileList(fls) > > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", > + ignore.strand=TRUE, single.end=TRUE, param=param) > Error: C stack usage is too close to the limit > > ?! > > Jessica -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 9.9 years ago Martin Morgan 25k

0

Entering edit mode

Hi Martin, do you think there is a reasonable case for dividing the yield size by the number of simultaneous processes? That way the yield size would represent the total number of reads yielded across all processes at any given time. -Ryan On May 20, 2014 2:26 PM, "Martin Morgan" <mtmorgan@fhcrc.org> wrote: > On 05/20/2014 01:37 PM, Jessica Perry Hekman wrote: > >> On 05/20/2014 02:20 PM, Jessica Perry Hekman wrote: >> >>> Error: C stack usage is too close to the limit >>>>> >>>> >> You might then try adding a 'yieldSize' argument to the following line, >>>> starting small (e.g., 100000) and moving toward the default (1000000) if >>>> the small size works when calling summarizeOverlaps, or perhaps smaller >>>> if it fails. >>>> >>>> bamfls <- BamFileList(fls, yieldSize=100000) >>>> >>> >> So, this is perplexing. Is 1000000 really the default? Because I can set >> yieldSize to much larger OR smaller than that and the command will >> succeed (or >> at least complete without errors). But when I do not specify yieldSize at >> all, >> there is an error! >> > > Ok, I guess I did not remember correctly. If the function is passed a > BamFile / BamFileList, it respects the yieldSize in the File / List. If > yieldSize is not specified, then it'll try to read the entire file into > memory. And hilarity ensues. If passed a character vector of file paths (I > think this is supported in your version) then summarizeOverlaps will set > the default yieldSize to 1000000. > > So yes, create the BamFileList with an appropriate yieldSize. From your > earlier email, yieldSize refers to the number of reads read in at one time. > > In terms of appropriate yieldSize, summarizeOverlaps will iterate through > individual bam files using yieldSize, and simultaneously use parallel > (hence for you mc.cores, which by default is just 2 but can be set using > options(mc.cores=8) or whatever; more recent versions use BiocParallel and > register(MulticoreParam())) evaluation to process several bam files at > once. So for optimal performance you want to choose a yieldSize such that > all cores (or as many as being neighbourly dictates) are in use but not too > much memory is being consumed. > > If you do decide to update your R, summarizeOverlaps has moved to > GenomicAlignments. > > Martin > > >> > bamfls <- BamFileList(fls, yieldSize=100000) >> > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", >> + ignore.strand=TRUE, single.end=TRUE, param=param) >> > bamfls <- BamFileList(fls, yieldSize=500000) >> > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", >> + ignore.strand=TRUE, single.end=TRUE, param=param) >> > bamfls <- BamFileList(fls, yieldSize=1000000) >> > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", >> + ignore.strand=TRUE, single.end=TRUE, param=param) >> > bamfls <- BamFileList(fls, yieldSize=10000000) >> > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", >> + ignore.strand=TRUE, single.end=TRUE, param=param) >> > bamfls <- BamFileList(fls, yieldSize=1000000000) >> > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", >> + ignore.strand=TRUE, single.end=TRUE, param=param) >> >> >> BUT: >> >> > bamfls <- BamFileList(fls) >> > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", >> + ignore.strand=TRUE, single.end=TRUE, param=param) >> Error: C stack usage is too close to the limit >> >> ?! >> >> Jessica >> > > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane. > science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 9.9 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

I used bamfls <- BamFileList(fls, yieldSize=100000) options(mc.cores=8) somewhat arbitrarily. With these settings, on my server, summarizeOverlaps() takes less than ten minutes to run, so it seems efficient enough for my purposes, but I'd be curious to hear Martin's response to Ryan's question. Martin: thanks so much for your help. I'm moving forward again. I imagine I'll be back as I proceed along the process, but your help has been much appreciated, as has your sense of humor. Jessia Jessica P. Hekman, DVM, MS PhD student, University of Illinois, Urbana-Champaign Animal Sciences / Genetics, Genomics, and Bioinformatics On 05/20/2014 05:48 PM, Ryan Thompson wrote: > Hi Martin, do you think there is a reasonable case for dividing the > yield size by the number of simultaneous processes? That way the yield > size would represent the total number of reads yielded across all > processes at any given time. > > -Ryan > > On May 20, 2014 2:26 PM, "Martin Morgan" <mtmorgan at="" fhcrc.org=""> <mailto:mtmorgan at="" fhcrc.org="">> wrote: > > On 05/20/2014 01:37 PM, Jessica Perry Hekman wrote: > > On 05/20/2014 02:20 PM, Jessica Perry Hekman wrote: > > Error: C stack usage is too close to the limit > > > You might then try adding a 'yieldSize' argument to the > following line, > starting small (e.g., 100000) and moving toward the > default (1000000) if > the small size works when calling summarizeOverlaps, or > perhaps smaller > if it fails. > > bamfls <- BamFileList(fls, yieldSize=100000) > > > So, this is perplexing. Is 1000000 really the default? Because I > can set > yieldSize to much larger OR smaller than that and the command > will succeed (or > at least complete without errors). But when I do not specify > yieldSize at all, > there is an error! > > > Ok, I guess I did not remember correctly. If the function is passed > a BamFile / BamFileList, it respects the yieldSize in the File / > List. If yieldSize is not specified, then it'll try to read the > entire file into memory. And hilarity ensues. If passed a character > vector of file paths (I think this is supported in your version) > then summarizeOverlaps will set the default yieldSize to 1000000. > > So yes, create the BamFileList with an appropriate yieldSize. From > your earlier email, yieldSize refers to the number of reads read in > at one time. > > In terms of appropriate yieldSize, summarizeOverlaps will iterate > through individual bam files using yieldSize, and simultaneously use > parallel (hence for you mc.cores, which by default is just 2 but can > be set using options(mc.cores=8) or whatever; more recent versions > use BiocParallel and register(MulticoreParam())) evaluation to > process several bam files at once. So for optimal performance you > want to choose a yieldSize such that all cores (or as many as being > neighbourly dictates) are in use but not too much memory is being > consumed. > > If you do decide to update your R, summarizeOverlaps has moved to > GenomicAlignments. > > Martin > > > > bamfls <- BamFileList(fls, yieldSize=100000) > > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", > + ignore.strand=TRUE, single.end=TRUE, param=param) > > bamfls <- BamFileList(fls, yieldSize=500000) > > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", > + ignore.strand=TRUE, single.end=TRUE, param=param) > > bamfls <- BamFileList(fls, yieldSize=1000000) > > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", > + ignore.strand=TRUE, single.end=TRUE, param=param) > > bamfls <- BamFileList(fls, yieldSize=10000000) > > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", > + ignore.strand=TRUE, single.end=TRUE, param=param) > > bamfls <- BamFileList(fls, yieldSize=1000000000) > > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", > + ignore.strand=TRUE, single.end=TRUE, param=param) > > > BUT: > > > bamfls <- BamFileList(fls) > > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", > + ignore.strand=TRUE, single.end=TRUE, param=param) > Error: C stack usage is too close to the limit > > ?! > > Jessica > > > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 <tel:%28206%29%20667-2793> > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >

ADD REPLY • link 9.9 years ago Jessica Perry Hekman ▴ 140

0

Entering edit mode

On 05/20/2014 04:25 PM, Martin Morgan wrote: > If you do decide to update your R, summarizeOverlaps has moved to > GenomicAlignments. So I figured if I was going to be having this much trouble I might as well be on the latest R in case that helped. So I upgraded to 3.1.0 and the latest Bioconductor. I modified my script a bit as below. And now I have a new error message, at that same line that we had just gotten working in the old version (augh). source("http://bioconductor.org/biocLite.R") biocLite(c("GenomicFeatures", "GenomicAlignments", "pathview", "gage", "gageData", "Rsamtools", "leeBamViews")) # Build data object: Dog txdb.canFam3 <- GenomicFeatures::makeTranscriptDbFromUCSC("canFam3","refGene") exByGn <- GenomicFeatures::exonsBy(txdb.canFam3, "gene") # Read counts library(Rsamtools) fls <- list.files("../../bam/", pattern="-fox-readgroups.bam$", full.names=T) library(leeBamViews) bamfls <- BamFileList(fls, yieldSize=100000) flag <- scanBamFlag(isNotPrimaryRead=FALSE, isProperPair=TRUE) param <- ScanBamParam(flag=flag) # Count only reads which match exactly once ("Union") options(mc.cores=8) gnCnt <- GenomicAlignments::summarizeOverlaps(exByGn, bamfls, mode="Union", ignore.strand=TRUE, single.end=TRUE, param=param) OUTPUT: Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), : 'data' must be of a vector type, was 'NULL' And, because all my packages have changed now: > sessionInfo() R version 3.1.0 (2014-04-10) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] leeBamViews_1.0.0 BSgenome_1.32.0 Biobase_2.24.0 [4] Rsamtools_1.16.0 Biostrings_2.32.0 XVector_0.4.0 [7] GenomicRanges_1.16.3 GenomeInfoDb_1.0.2 IRanges_1.22.6 [10] BiocGenerics_0.10.0 BiocInstaller_1.14.2 loaded via a namespace (and not attached): [1] AnnotationDbi_1.26.0 BatchJobs_1.2 BBmisc_1.6 [4] BiocParallel_0.6.0 biomaRt_2.20.0 bitops_1.0-6 [7] brew_1.0-6 codetools_0.2-8 DBI_0.2-7 [10] digest_0.6.4 fail_1.2 foreach_1.4.2 [13] GenomicAlignments_1.0.1 GenomicFeatures_1.16.0 iterators_1.0.7 [16] plyr_1.8.1 Rcpp_0.11.1 RCurl_1.95-4.1 [19] RSQLite_0.11.4 rtracklayer_1.24.1 sendmailR_1.1-2 [22] stats4_3.1.0 stringr_0.6.2 tools_3.1.0 [25] XML_3.98-1.1 zlibbioc_1.10.0 > I feel badly continuing to ask for help, but these error messages are beyond me. In this case, my best guess was that the message is saying that I got the signature of the method wrong -- but when I tried changing my arguments, I got a message saying "no method matches that signature," so I conclude that that's not the problem. Thanks, Jessica

ADD REPLY • link 9.9 years ago Jessica Perry Hekman ▴ 140

0

Entering edit mode

On 05/21/2014 07:39 PM, Jessica Perry Hekman wrote: > On 05/20/2014 04:25 PM, Martin Morgan wrote: > >> If you do decide to update your R, summarizeOverlaps has moved to >> GenomicAlignments. > > So I figured if I was going to be having this much trouble I might as well be on > the latest R in case that helped. So I upgraded to 3.1.0 and the latest > Bioconductor. I modified my script a bit as below. And now I have a new error > message, at that same line that we had just gotten working in the old version > (augh). > > > source("http://bioconductor.org/biocLite.R") > biocLite(c("GenomicFeatures", "GenomicAlignments", "pathview", "gage", > "gageData", "Rsamtools", "leeBamViews")) > > # Build data object: Dog > txdb.canFam3 <- GenomicFeatures::makeTranscriptDbFromUCSC("canFam3","refGene") > exByGn <- GenomicFeatures::exonsBy(txdb.canFam3, "gene") > > # Read counts > library(Rsamtools) > fls <- list.files("../../bam/", pattern="-fox-readgroups.bam$", full.names=T) > > library(leeBamViews) > bamfls <- BamFileList(fls, yieldSize=100000) > > flag <- scanBamFlag(isNotPrimaryRead=FALSE, isProperPair=TRUE) > param <- ScanBamParam(flag=flag) > > # Count only reads which match exactly once ("Union") > options(mc.cores=8) > gnCnt <- GenomicAlignments::summarizeOverlaps(exByGn, bamfls, mode="Union", > ignore.strand=TRUE, single.end=TRUE, param=param) > > OUTPUT: > > Error in array(x, c(length(x), 1L), if (!is.null(names(x))) > list(names(x), : > 'data' must be of a vector type, was 'NULL' Hi Jessica -- sorry that this is again causing problems. Can you provide the output of traceback() after the error occurs? This will help isolate where in the code things are going wrong. Also on your own end and if you're feeling adventurous you could try to do options(error=recover) and then when the error occurs you'll get a 'backtrace' of the function calls that are in effect when error occurs; you can select the call that seems most helpful (this can be a bit of an art, and requires some overall knowledge of the code). This brings you in to the '?browser' and you can look around to figure out in more detail what the data looks like that causes the error. Here's an artificial example: f <- function(x) log(x) g <- function(x) f(as.character(x)) h <- function(x) g(x) and an error > h(1) Error in log(x) : non-numeric argument to mathematical function that we can understand how we got from where we invoked the call h(1) to where the error occurred > traceback() 3: f(as.character(x)) 2: g(x) 1: h(1) We could then look at f, say > f function (x) log(x) and not really understand what the problem was, so try to look at the value of 'x' when we get to f > options(error=recover) > h("1") Error in log(x) : non-numeric argument to mathematical function Enter a frame number, or 0 to exit 1: h("1") 2: g(x) 3: f(as.character(x)) Selection: 3 Called from: g(x) Browse[1]> ls() [1] "x" Browse[1]> x [1] "1" Browse[1]> ah, the variable 'x' is somehow a character vector, and that's causing problems as we can confirm Browse[1]> log(x) Error during wrapup: non-numeric argument to mathematical function We can get back out of the browser and the recover function, and reset options with Browse[1]> Q > options(error=NULL) Like in this example, the problem is often up-stream of where the error occurred, and I guess that's where the art comes in -- to understand the code enough to see where the problem is, and to trace the issue to the origin. Anyway, if you provide traceback() that'll give me a better start on understanding your problem. Martin > > > And, because all my packages have changed now: > > > sessionInfo() > R version 3.1.0 (2014-04-10) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] leeBamViews_1.0.0 BSgenome_1.32.0 Biobase_2.24.0 > [4] Rsamtools_1.16.0 Biostrings_2.32.0 XVector_0.4.0 > [7] GenomicRanges_1.16.3 GenomeInfoDb_1.0.2 IRanges_1.22.6 > [10] BiocGenerics_0.10.0 BiocInstaller_1.14.2 > > loaded via a namespace (and not attached): > [1] AnnotationDbi_1.26.0 BatchJobs_1.2 BBmisc_1.6 > [4] BiocParallel_0.6.0 biomaRt_2.20.0 bitops_1.0-6 > [7] brew_1.0-6 codetools_0.2-8 DBI_0.2-7 > [10] digest_0.6.4 fail_1.2 foreach_1.4.2 > [13] GenomicAlignments_1.0.1 GenomicFeatures_1.16.0 iterators_1.0.7 > [16] plyr_1.8.1 Rcpp_0.11.1 RCurl_1.95-4.1 > [19] RSQLite_0.11.4 rtracklayer_1.24.1 sendmailR_1.1-2 > [22] stats4_3.1.0 stringr_0.6.2 tools_3.1.0 > [25] XML_3.98-1.1 zlibbioc_1.10.0 > > > > I feel badly continuing to ask for help, but these error messages are beyond me. > In this case, my best guess was that the message is saying that I got the > signature of the method wrong -- but when I tried changing my arguments, I got a > message saying "no method matches that signature," so I conclude that that's not > the problem. > > Thanks, > Jessica > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 9.9 years ago Martin Morgan 25k

0

Entering edit mode

On 05/22/2014 10:10 AM, Martin Morgan wrote: >> gnCnt <- GenomicAlignments::summarizeOverlaps(exByGn, bamfls, >> mode="Union", >> ignore.strand=TRUE, single.end=TRUE, param=param) >> >> OUTPUT: >> >> Error in array(x, c(length(x), 1L), if (!is.null(names(x))) >> list(names(x), : >> 'data' must be of a vector type, was 'NULL' > > Hi Jessica -- sorry that this is again causing problems. Can you provide > the output of > > traceback() Hi, Martin. I really appreciate all the advice about debugging. Perplexingly, the error does not occur today when I go to reproduce it. I have no idea what's changed, but I'm keeping my fingers crossed and proceeding with the RNA-seq workflow vignette and hoping it doesn't return. The chances are that you'll be hearing from me again soon, but at any rate, thanks again for all your help so far. Jessica

ADD REPLY • link 9.9 years ago Jessica Perry Hekman ▴ 140

Login before adding your answer.