Question

global alignment in Biostrings , Memory fails when seq. lengths around 50K bp

1

Entering edit mode

branislav misovic ▴ 120

@branislav-misovic-4248

Last seen 5.0 years ago

Netherlands/Amsterdam

Hola Bioconductors , Biotransformers , Bioamplifiers ...

Before the question special thanx to team from UC Riverside who made nice R seq tutorial and of course Biostring developers. Oh and thank you for this Support site lunch ala StackExchange.

Biostrings did great job for my need (align short reads to long gene) but i got Memory issue when aligning 2 big genes using global alignement. There is also similar post regarding Memory issue (post from 2012).

It went ok till 40.000 bp , R was using ~3% of 160GB memory we have, and when running 50K it failed (bellow is the demo code). Please comment why this sudden memory jump saying to 16777216 Tb ?
One more question for the developer, was many local alignments (Waterman-Eggert) not considered for implementation ? (wiki link for aligners directs to tool "matcher")

oh and one note for support website: it is said "To create a new tag just type it in and press ENTER."... if i enter "pairwiseAlignment" and press enter to get new tag, nothing happenes)

Now the the demo code :

library(Biostrings)
set.seed (42)

refS=DNAString (paste( sample(c("A","T","G","C"), 90000, replace=T), collapse=""))
deb.mat <- nucleotideSubstitutionMatrix (match=1, mismatch=0, baseOnly=TRUE)

seqlns=seq(100, nchar(refS), 10000)

runT=rep(0,length(seqlns)); names(runT)= seqlns

for (L in 1:length(seqlns) ) { cat ( "using seq Length:" , seqlns[L] , "\n")
runT[L]= system.time(
glob.pwa <- pairwiseAlignment( substr(refS,1,seqlns[L]) , subject= substr(refS,1,seqlns[L]), type="global",substitutionMatrix=deb.mat, gapOpening=0, gapExtension=-1)
)[3]
}

# using seq Length: 100
# using seq Length: 10100
# using seq Length: 20100
# using seq Length: 30100
# using seq Length: 40100
# using seq Length: 50100
# Error in .Call2("XStringSet_align_pairwiseAlignment", pattern, subject, :
# cannot allocate memory block of size 16777216 Tb
# Timing stopped at: 0.02 0 0.017

plot (names(runT),runT, type="l", ylab="sec", xlab="seq.len")
runT
##100 10100 20100 30100 40100 50100 60100 70100 80100
0.045 2.747 9.323 20.103 35.792 0.000 0.000 0.000 0.000

traceback()
7: .Call(.NAME, ..., PACKAGE = PACKAGE)
6: .Call2("XStringSet_align_pairwiseAlignment", pattern, subject,
          type, typeCode, scoreOnly, gapOpening, gapExtension, useQuality,
          substitutionArray, dim(substitutionArray), substitutionLookupTable,
          fuzzyMatrix, dim(fuzzyMatrix), fuzzyLookupTable, PACKAGE = "Biostrings")
5: XStringSet.pairwiseAlignment(pattern = pattern, subject = subject,
                                type = type, substitutionMatrix = substitutionMatrix, gapOpening = gapOpening,
                                gapExtension = gapExtension, scoreOnly = scoreOnly)
4: mpi.XStringSet.pairwiseAlignment(pattern, subject, type = type,
                                    substitutionMatrix = substitutionMatrix, gapOpening = gapOpening,
                                    gapExtension = gapExtension, scoreOnly = scoreOnly)
3: .local(pattern, subject, ...)
2: pairwiseAlignment(substr(refS, 1, seqlns[L]), subject = substr(refS,
                                                                  1, seqlns[L]), type = "global", substitutionMatrix = deb.mat,
                     gapOpening = 0, gapExtension = -1)
1: pairwiseAlignment(substr(refS, 1, seqlns[L]), subject = substr(refS,
                                                                  1, seqlns[L]), type = "global", substitutionMatrix = deb.mat,
                     gapOpening = 0, gapExtension = -1)

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] Biostrings_2.32.0 XVector_0.4.0 IRanges_1.22.3
[4] BiocGenerics_0.10.0 BiocInstaller_1.14.1

loaded via a namespace (and not attached):
[1] stats4_3.1.1 tools_3.1.1 zlibbioc_1.10.0

Biostrings pairwiseAlignment • 2.6k views

ADD COMMENT • link updated 9.4 years ago by Hervé Pagès 16k • written 9.5 years ago by branislav misovic ▴ 120

1

Entering edit mode

Looks like you (or someone) figured out how to tag your post with "pairwisealignment".

ADD REPLY • link 9.4 years ago Dan Tenenbaum ★ 8.2k

0

Entering edit mode

: ))) I guess it worked at the end but will check when posting new question .

ADD REPLY • link 9.4 years ago branislav misovic ▴ 120

score 2 · Answer 1 · 2014-11-11

2

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 13 hours ago

Seattle, WA, United States

Hi Branko,

Sorry for the delay in answering this.

pairwiseAlignment() cannot be used on such long sequences. The limitation is that nchar(pattern) * nchar(subject) must not exceed .Machine$integer.max (i.e. 2^32-1 on Intel-based platforms). The problem you see (i.e. the crazy amount of memory that pairwiseAlignment() is trying to allocate) was the consequence of an integer overflow when calculating the size of some internal buffers. In Biostrings 2.35.4 (this is in BioC devel) I fixed this so now pairwiseAlignment() catches the integer overflow and raises an error with an explicit message when too much memory is required:

library(Biostrings)
x <- sample(DNAString(paste(DNA_BASES, collapse="")), 50000, replace=TRUE)

Then:

> x
  50000-letter "DNAString" instance
seq: CGACATTTGCCCTGGCCGCTTCGGGGAAAAAACATT...AGTAGCAGTTGTTCTAGACCGTATTTTCCTGCGGTC
> pairwiseAlignment(x, x)
Error in .Call2("XStringSet_align_pairwiseAlignment", pattern, subject,  : 
  max(nchar(pattern) * nchar(subject)) is too big (must be <= 2147483647)

Hope this helps.

H.

PS: Please report the issue about the website as a separate question and with the appropriate tag. That will increase the chances that the people in charge of the website actually see the question.

ADD COMMENT • link 9.4 years ago Hervé Pagès 16k

0

Entering edit mode

Thank you for the clear explanation & message fix, and of course for the Biostrings magic .

I thought this limitation was out from R.3 ( maybe this is one of the reasons there are few biocResistors out there : )
Browsing the web i found related post where hey talk about 64bit integers fix using some libs like 64bit package ... Is that possible to use in your functions ?

To explain why i needed this... I got 2 different gene sequences from bioMart (i asked seq. for 1 gene on two machines linux and windows ) and this got me headache as some post results were not matching so at the end I realized that two sequences were different in sizes, by ~400 NN ! Gene is about 80Kbp . SO i wanted to check where is difference.

btw. what do you use to global align 2 (or few) big sequences ?
oh and i will check issue about tagging again.

ADD REPLY • link 9.4 years ago branislav misovic ▴ 120

0

Entering edit mode

Hi Branko,

Sorry for the late answer on this.

I don't have much to offer here but just wanted to point out that even though it's true that R now has support for long vectors, we're not using them yet in our infrastructure. Supporting long vectors has some pitfalls and would add significant complexity to the code in S4Vectors, IRanges, XVector, GenomicRanges, and Biostrings so we need to think twice before we do it. Also it's hard to test code that involves long vectors, in particular it cannot really be done in unit tests.

In the case of pairwiseAlignment() the temporary buffers used at the C level are not R vectors so the issue is not really related to long vectors. And it would actually not be too hard to modify the C code to use buffers of size > 2^32. However it has to be done carefully because it impacts many aspects of the code. And then the modified code is hard to validate (you have to test it on big input and this cannot be done in unit tests). So if I were to spend some time on improving pairwiseAlignment() implementation, my preference would be to spend it on reducing its memory usage.

I don't know what to use to global align 2 (or few) big sequences at the moment so maybe you have a point that pairwiseAlignment() needs to be improved to support this. I'll see if I can find some time to work on this in January. I'll use bioc-devel for the follow-up on this.

Cheers,

H.

ADD REPLY • link 9.4 years ago Hervé Pagès 16k