Dear Colleagues,
Recently I have been analyzing my CAGE data with the CAGEr package and have run into some difficulties. My project consists of 10 libraries; nine of the .bam files are up to 940 MB, and one .bam is 1.5 GB. When I call getCTSS(cage_bam) on this one huge file, I obtain an error:
> getCTSS(cage_bam)
Reading in file: ./MCF7_xx.bam...
-> Filtering out low quality reads...
Error in .Call2("XStringSet_unlist", x, PACKAGE = "Biostrings") :
  negative length vectors are not allowed
What I have found is that the issue is caused by the size of this huge .bam file; my other .bams work perfectly fine. Moreover, I used SAMtools to extract subsets of this huge .bam, and they worked perfectly up to a size of 1.1 GB. My PC has plenty of computing power, so I suspect the issue lies elsewhere. I am using the latest versions of R, RStudio, CAGEr and the adjacent packages on Ubuntu (Unity).
I would very much appreciate your help in solving this issue.
Thank you in advance and best regards,
Magda
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=pl_PL.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=pl_PL.UTF-8
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=pl_PL.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=pl_PL.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BSgenome.Hsapiens.UCSC.hg38_1.4.1 CAGEr_1.18.0                      BSgenome_1.44.0                   rtracklayer_1.36.3
[5] Biostrings_2.44.1                 XVector_0.16.0                    GenomicRanges_1.28.3              GenomeInfoDb_1.12.2
[9] IRanges_2.10.2                    S4Vectors_0.14.3                  BiocGenerics_0.22.0

loaded via a namespace (and not attached):
 [1] splines_3.4.1              zlibbioc_1.22.0            GenomicAlignments_1.12.1   beanplot_1.2               BiocParallel_1.10.1
 [6] som_0.3-5.1                lattice_0.20-35            tools_3.4.1                SummarizedExperiment_1.6.3 grid_3.4.1
[11] data.table_1.10.4          Biobase_2.36.2             matrixStats_0.52.2         Matrix_1.2-10              GenomeInfoDbData_0.99.0
[16] bitops_1.0-6               RCurl_1.95-4.8             VGAM_1.0-3                 DelayedArray_0.2.7         compiler_3.4.1
[21] Rsamtools_1.28.0           XML_3.98-1.9
Dear Magda, thanks for the report. Could you suggest a pair of publicly available BAM files where one triggers the bug and the other does not?
Dear Charles, unfortunately I could not find a file as big as mine that generates the same difficulties. I have shared with you via e-mail my file, which triggers the bug, and one publicly available file, which does not (http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/releaseLatest/wgEncodeRikenCageGm12878CellPapAlnRep1.bam).
Hi,
The error message is coming from Biostrings and is admittedly not helpful. I just changed this in Biostrings 2.44.2 and now it says:
It looks like the getCTSS() function in the CAGEr package is trying to unlist some XStringSet object containing quality strings that are too big to be concatenated into one single string. Maybe that could be avoided, I don't know, I'm not familiar with the CAGEr package.

H.
PS: Biostrings 2.44.2 should become available via biocLite() in the next 24 hours or so.

Hi,
thank you for your reply.
Maybe I should ask the Biostrings support how to solve this issue, then.
I have an alternative solution: before sorting my .bam, I could randomly pick reads to prepare a slightly smaller file that is digestible for CAGEr, but ideally I wish to avoid that. It is a solution for the worst-case scenario.
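For what it's worth, that kind of random subsampling can also be done from within R via Rsamtools instead of an external tool; a minimal sketch (the keep fraction and file names below are made up for illustration; `samtools view -s` achieves the same from the shell):

```r
## Hypothetical subsampling sketch using Rsamtools::filterBam;
## adjust keep_fraction until the output is below the problematic size.
library(Rsamtools)

keep_fraction <- 0.7  # assumption: enough to get under ~1.1 GB

## The filter function receives a DataFrame of scanned records and
## returns a logical vector saying which records to keep.
subsample_filter <- FilterRules(list(
  random = function(x) runif(nrow(x)) < keep_fraction
))

filterBam("MCF7_xx.bam", "MCF7_xx.subsampled.bam",
          filter = subsample_filter,
          param  = ScanBamParam(what = "qname"))
```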
Best regards!
Hi,
You're using the CAGEr package so I'm not sure exactly how the Biostrings package could help you solve this issue, unless I add support for long XString objects, which is on the radar but unlikely to happen in the near future. In the meantime, one way to solve the issue would be to revisit the implementation of the getCTSS() function from CAGEr. There should be a way for the function to avoid unlisting the XStringSet object of qualities. Hopefully the CAGEr maintainers will be able to help you with this.

Cheers,
H.
Hi Hervé and Magda,
thanks for the report and the guidance.
getCTSS() loads the BAM files with Rsamtools::scanBam and then extracts the qualities from the returned object, in which they are stored as a PhredQuality object. For each read, an average quality is calculated, and that requires the PhredQuality object to be converted, for instance to a matrix or to an IntegerList.

In Magda's BAM file, there are 35,560,858 sequences of 73 nucleotides each, and conversion of the PhredQuality object to a matrix or an IntegerList fails with the same error.

Unless there are better suggestions, I will process the qualities in batches that are short enough to allow the conversion (20,000,000 reads or fewer; smaller still if sequence length also matters?).
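The batching idea could be sketched roughly as follows (the function name and batch handling are my own illustration, not CAGEr's actual code; note that 35,560,858 reads x 73 cycles = 2,595,942,634 characters, which exceeds .Machine$integer.max = 2,147,483,647, the maximum length of a single string in Biostrings, which is presumably why the full conversion fails):

```r
## Illustrative sketch: compute per-read mean Phred qualities in batches
## small enough that the underlying strings can still be handled.
library(Biostrings)

avgQualByBatch <- function(qual, batch_size = 1e6L) {
  ## qual: a PhredQuality object, one element per read
  n <- length(qual)
  starts <- seq(1L, n, by = batch_size)
  unlist(lapply(starts, function(i) {
    batch <- qual[i:min(i + batch_size - 1L, n)]
    ## as(., "IntegerList") succeeds on a small batch; mean() on an
    ## IntegerList returns one mean per list element, i.e. per read
    mean(as(batch, "IntegerList"))
  }))
}
```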
Have a nice week-end,
-- Charles