global alignment in Biostrings , Memory fails when seq. lengths around 50K bp
1
1
Entering edit mode
@branislav-misovic-4248
Last seen 2.5 years ago
Netherlands/Amsterdam

Hola Bioconductors ,  Biotransformers , Bioamplifiers ...

Before the question special thanx to  team from UC Riverside who made nice  R  seq  tutorial and of course Biostring developers. Oh and thank you for this Support site  lunch ala StackExchange.

Biostrings did great job for my need (align short reads to long gene) but i got Memory issue when aligning 2 big genes using global alignement.  There is also similar post regarding Memory issue (post from 2012).

It went ok till 40.000 bp , R was using ~3% of 160GB memory we have, and  when running 50K it failed (bellow is the demo code). Please comment why this sudden memory jump saying to 16777216 Tb ?
One more question for the  developer, was many local alignments (Waterman-Eggert) not considered for implementation ? (wiki link for aligners  directs to tool  "matcher")

oh and  one note for support website: it is said "To create a new tag just type it in and press ENTER."... if i enter  "pairwiseAlignment" and press enter to get new tag, nothing happenes)

Now the the demo code :

library(Biostrings)
set.seed (42)

refS=DNAString (paste(  sample(c("A","T","G","C"), 90000, replace=T),  collapse=""))
deb.mat <- nucleotideSubstitutionMatrix (match=1, mismatch=0, baseOnly=TRUE)

seqlns=seq(100, nchar(refS), 10000)

runT=rep(0,length(seqlns)); names(runT)= seqlns

for (L in 1:length(seqlns) ) { cat ( "using seq Length:" , seqlns[L] , "\n")
runT[L]= system.time(
glob.pwa  <- pairwiseAlignment( substr(refS,1,seqlns[L]) , subject= substr(refS,1,seqlns[L]), type="global",substitutionMatrix=deb.mat, gapOpening=0, gapExtension=-1)
)[3]
}

# using seq Length: 100
# using seq Length: 10100
# using seq Length: 20100
# using seq Length: 30100
# using seq Length: 40100
# using seq Length: 50100
# Error in .Call2("XStringSet_align_pairwiseAlignment", pattern, subject,  :
# cannot allocate memory block of size 16777216 Tb
# Timing stopped at: 0.02 0 0.017

plot (names(runT),runT, type="l", ylab="sec", xlab="seq.len")
runT
##100  10100  20100  30100  40100  50100  60100  70100  80100
0.045   2.747  9.323  20.103  35.792  0.000  0.000  0.000  0.000

traceback()
7: .Call(.NAME, ..., PACKAGE = PACKAGE)
6: .Call2("XStringSet_align_pairwiseAlignment", pattern, subject,
type, typeCode, scoreOnly, gapOpening, gapExtension, useQuality,
substitutionArray, dim(substitutionArray), substitutionLookupTable,
fuzzyMatrix, dim(fuzzyMatrix), fuzzyLookupTable, PACKAGE = "Biostrings")
5: XStringSet.pairwiseAlignment(pattern = pattern, subject = subject,
type = type, substitutionMatrix = substitutionMatrix, gapOpening = gapOpening,
gapExtension = gapExtension, scoreOnly = scoreOnly)
4: mpi.XStringSet.pairwiseAlignment(pattern, subject, type = type,
substitutionMatrix = substitutionMatrix, gapOpening = gapOpening,
gapExtension = gapExtension, scoreOnly = scoreOnly)
3: .local(pattern, subject, ...)
2: pairwiseAlignment(substr(refS, 1, seqlns[L]), subject = substr(refS,
1, seqlns[L]), type = "global", substitutionMatrix = deb.mat,
gapOpening = 0, gapExtension = -1)
1: pairwiseAlignment(substr(refS, 1, seqlns[L]), subject = substr(refS,
1, seqlns[L]), type = "global", substitutionMatrix = deb.mat,
gapOpening = 0, gapExtension = -1)

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] Biostrings_2.32.0    XVector_0.4.0        IRanges_1.22.3
[4] BiocGenerics_0.10.0  BiocInstaller_1.14.1

loaded via a namespace (and not attached):
[1] stats4_3.1.1    tools_3.1.1     zlibbioc_1.10.0

Biostrings pairwiseAlignment • 1.7k views
1
Entering edit mode

Looks like you (or someone) figured out how to tag your post with "pairwisealignment".

0
Entering edit mode

: )))  I guess it worked at the end but will check when posting new question .

2
Entering edit mode
@herve-pages-1542
Last seen 2 hours ago
Seattle, WA, United States

Hi Branko,

Sorry for the delay in answering this.

pairwiseAlignment() cannot be used on such long sequences. The limitation is that nchar(pattern) * nchar(subject) must not exceed .Machine\$integer.max (i.e. 2^32-1 on Intel-based platforms). The problem you see (i.e. the crazy amount of memory that pairwiseAlignment() is trying to allocate) was the consequence of an integer overflow when calculating the size of some internal buffers. In Biostrings 2.35.4 (this is in BioC devel) I fixed this so now pairwiseAlignment() catches the integer overflow and raises an error with an explicit message when too much memory is required:

library(Biostrings)
x <- sample(DNAString(paste(DNA_BASES, collapse="")), 50000, replace=TRUE)

Then:

> x
50000-letter "DNAString" instance
seq: CGACATTTGCCCTGGCCGCTTCGGGGAAAAAACATT...AGTAGCAGTTGTTCTAGACCGTATTTTCCTGCGGTC
> pairwiseAlignment(x, x)
Error in .Call2("XStringSet_align_pairwiseAlignment", pattern, subject,  :
max(nchar(pattern) * nchar(subject)) is too big (must be <= 2147483647)

Hope this helps.

H.

PS: Please report the issue about the website as a separate question and with the appropriate tag. That will increase the chances that the people in charge of the website actually see the question.

0
Entering edit mode

Thank you for the clear  explanation & message fix,   and of course for the Biostrings magic .

I thought this limitation was out from  R.3  (  maybe this is one of the reasons there are few   biocResistors  out there : )
Browsing the  web i found related  post where hey talk about   64bit integers fix  using some libs like 64bit package  ...   Is that  possible to use in your functions ?

To explain why i needed this...  I got 2 different  gene sequences from bioMart (i asked seq. for 1 gene on two machines  linux and windows ) and this got me headache as some post results were not matching so at the end I realized that   two sequences were different in sizes, by  ~400 NN  ! Gene is about 80Kbp .     SO i wanted to check where is difference.

btw. what do you use to global align 2 (or few)  big sequences ?
oh and i will check issue about tagging again.

0
Entering edit mode

Hi Branko,

Sorry for the late answer on this.

I don't have much to offer here but just wanted to point out that even though it's true that R now has support for long vectors, we're not using them yet in our infrastructure. Supporting long vectors has some pitfalls and would add significant complexity to the code in S4Vectors, IRanges, XVector, GenomicRanges, and Biostrings so we need to think twice before we do it. Also it's hard to test code that involves long vectors, in particular it cannot really be done in unit tests.

In the case of pairwiseAlignment() the temporary buffers used at the C level are not R vectors so the issue is not really related to long vectors. And it would actually not be too hard to modify the C code to use buffers of size > 2^32. However it has to be done carefully because it impacts many aspects of the code. And then the modified code is hard to validate (you have to test it on big input and this cannot be done in unit tests). So if I were to spend some time on improving pairwiseAlignment() implementation, my preference would be to spend it on reducing its memory usage.

I don't know what to use to global align 2 (or few)  big sequences at the moment so maybe you have a point that pairwiseAlignment() needs to be improved to support this. I'll see if I can find some time to work on this in January. I'll use bioc-devel for the follow-up on this.

Cheers,

H.