Question: CAGEr BAM files size issue
1
23 months ago by
orzechowska.mag10 wrote:

Dear Colleagues,

Recently I am analyzing my CAGE data using CAGEr package and I have experienced some difficulties. My project consists of 10 libraries among which 9 .bam files are up to 940 MB and one .bam is 1.5 GB. At the level of getCTSS(cage_bam) for this one huge file I obtain error:

> getCTSS(cage_bam)
Reading in file: ./MCF7_xx.bam...
	-> Filtering out low quality reads...
Error in .Call2("XStringSet_unlist", x, PACKAGE = "Biostrings") :
negative length vectors are not allowed

What I have found is that the issue is caused by size of this huge .bam file. My other .bams work perfectly fine. Moreover, I used SAMtools to extract subset of this huge .bam and it worked perfectly up to size of 1.1 GB. The computing power of my PC is really big, so I guess that the issue is located at different point. In addition, I am using the latest versions of R, RStudio, CAGEr and other adjacent packages @ Ubuntu Unity.

I would very appreciate your help how to solve this issue.

Thank you in advance and best regards,

Magda

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=pl_PL.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=pl_PL.UTF-8
[6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=pl_PL.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=pl_PL.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BSgenome.Hsapiens.UCSC.hg38_1.4.1 CAGEr_1.18.0                      BSgenome_1.44.0                   rtracklayer_1.36.3
[5] Biostrings_2.44.1                 XVector_0.16.0                    GenomicRanges_1.28.3              GenomeInfoDb_1.12.2
[9] IRanges_2.10.2                    S4Vectors_0.14.3                  BiocGenerics_0.22.0

loaded via a namespace (and not attached):
[1] splines_3.4.1              zlibbioc_1.22.0            GenomicAlignments_1.12.1   beanplot_1.2               BiocParallel_1.10.1
[6] som_0.3-5.1                lattice_0.20-35            tools_3.4.1                SummarizedExperiment_1.6.3 grid_3.4.1
[11] data.table_1.10.4          Biobase_2.36.2             matrixStats_0.52.2         Matrix_1.2-10              GenomeInfoDbData_0.99.0
[16] bitops_1.0-6               RCurl_1.95-4.8             VGAM_1.0-3                 DelayedArray_0.2.7         compiler_3.4.1
[21] Rsamtools_1.28.0           XML_3.98-1.9 
modified 22 months ago by Charles Plessy110 • written 23 months ago by orzechowska.mag10

Dear Magda, thanks for the report; could you suggest a pair of publicly available BAM files where one triggers the bug and the other does not ?

Dear Charles, unfortunately I could not find as big file as mine, which generates the same difficulties. I shared via e-mail with you my file, which triggers the bug and one publicly available, which does not (http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/releaseLatest/wgEncodeRikenCageGm12878CellPapAlnRep1.bam).

Hi,

The error message is coming from Biostrings and is admittedly not helpful. I just changed this in Biostrings 2.44.2 and now it says:

> unlist(big_dna_string_set)
Error in .Call2("XStringSet_unlist", x, PACKAGE = "Biostrings") :
XStringSet object is too big to be unlisted (would result in an XString
object of length 2^31 or more)

It looks like the getCTSS() function in the CAGEr package is trying to unlist some XStringSet object containing quality strings that are too big to be concatenated in one single string. Maybe that could be avoided, I don't know, I'm not familiar with the CAGEr package.

H.

PS: Biostrings 2.44.2 should become available via biocLite() in the next 24 hours or so.

Hi,

Maybe I should ask Biostrings support how to solve this issue then.

I have some alternative solution that before sorting my .bam I will randomly pick the reads to prepare a bit smaller, digestible file for CAGEr, but ideally I wish to avoid that. This is solution for the worst case scenario.

Best regards!

Hi,

You're using the CAGEr package so I'm not sure exactly how the Biostrings package could help you solve this issue. Unless I add support for long XString objects, which is on the radar but unlikely to happen in the near future. In the mean time, one way to solve the issue would be to revisit the implementation of the getCTSS() function from CAGEr. There should be a way for the function to avoid unlisting the XStringSet object of qualities. Hopefully the CAGEr maintainers will be able to help you with this.

Cheers,

H.

Hi Hervé and Magda,

thanks for the report and the guidance.

getCTSS() loads the BAM files with Rsamtools::ScanBam, and then extracts the qualities from the returned object, in which they are stored as a PhredQuality object. For each read, an average quality is calculated, and that requires the PhredQuality object to be converted, for instance to a matrix to in a IntegerList. Basically:

bam <- scanBam(inputFiles(object)[i], param = param)
qal.avg <- as.integer(mean(as(bam[[1]]\$qual, "IntegerList")))

In Magda's BAM files, there are 35,560,858 sequences of 73 nucleotides each, and conversion of the PhredQuality object to matrix or IntegerList fail with the same error.

Unless there are better suggestions, I will process the qualities by batches which are short enough to allow for the conversion (20,000,000 or smaller, if sequence length also matters?).

Have a nice week-end,

-- Charles

Answer: CAGEr BAM files size issue
1
22 months ago by
Japan
Charles Plessy110 wrote:

I have pushed update 1.18.1 to the [Bioconductor release 3.5 branch](http://bioconductor.org/packages/3.5/bioc/html/CAGEr.html), where I [implemented the solution discussed above](https://github.com/Bioconductor-mirror/CAGEr/commit/0a1c85d2e8f99b0104446fb301f013ce2b4dfaad) (parse the qualities by chunks of 1,000,000 to avoid hitting limits on object length). The correction is not yet in the devel branch, where I opportunistically wait for the transition to Git to be completed.

The issue is gone now. Thank you very much for your support and solution. I really appreciate it.

All the best!