segfault for pairwiseAlignment
2
0
Entering edit mode
@taub-margaret-5847
Last seen 6.0 years ago
United States

Hi all,

I am trying to do some alignment using the pairwiseAlignment function and I am getting a segfault. I'm not sure what I'm doing that is causing this behavior, but any suggestions would be greatly appreciated! Thanks.

 

Cheers,

Margaret

 

 

# this code reproduces the error for me

library(ShortRead)
library(BSgenome.Hsapiens.UCSC.hg19)
chr1Region<-Hsapiens$chr1

foo<-"TTGGGGGGAGGGGGGGGGTGGGGGGGGGGGGGTTGGGGG"
pairwiseAlignment(foo, chr1Region)

 *** caught segfault ***
address 0x2aab92a5c000, cause 'memory not mapped'

Traceback:
 1: .Call(.NAME, ..., PACKAGE = PACKAGE)
 2: .Call2("XStringSet_align_pairwiseAlignment", pattern, subject,     type, typeCode, scoreOnly, gapOpening, gapExtension, useQuality,     substitutionArray, dim(substitutionArray), substitutionLookupTable,     fuzzyReferenceMatrix, dim(fuzzyReferenceMatrix), fuzzyLookupTable,     PACKAGE = "Biostrings")
 3: QualityScaledXStringSet.pairwiseAlignment(pattern = pattern,     subject = subject, type = type, fuzzyMatrix = fuzzyMatrix,     gapOpening = gapOpening, gapExtension = gapExtension, scoreOnly = scoreOnly)
 4: mpi.QualityScaledXStringSet.pairwiseAlignment(pattern, subject,     type = type, fuzzyMatrix = fuzzyMatrix, gapOpening = gapOpening,     gapExtension = gapExtension, scoreOnly = scoreOnly)
 5: .local(pattern, subject, ...)
 6: pairwiseAlignment(foo, chr1Region)
 7: pairwiseAlignment(foo, chr1Region)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

> sessionInfo()
R version 3.1.0 Patched (2014-04-24 r65479)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices datasets  utils     methods
[8] base

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.3.99 ShortRead_1.22.0
 [3] GenomicAlignments_1.0.1            BSgenome_1.32.0
 [5] Rsamtools_1.16.0                   GenomicRanges_1.16.3
 [7] GenomeInfoDb_1.0.2                 Biostrings_2.32.0
 [9] XVector_0.4.0                      IRanges_1.22.6
[11] BiocParallel_0.6.1                 BiocGenerics_0.10.0
[13] RColorBrewer_1.0-5

loaded via a namespace (and not attached):
 [1] BatchJobs_1.2       BBmisc_1.6          Biobase_2.24.0
 [4] bitops_1.0-6        brew_1.0-6          codetools_0.2-8
 [7] DBI_0.2-7           digest_0.6.4        fail_1.2
[10] foreach_1.4.2       grid_3.1.0          hwriter_1.3
[13] iterators_1.0.7     lattice_0.20-29     latticeExtra_0.6-26
[16] plyr_1.8.1          Rcpp_0.11.1         RSQLite_0.11.4
[19] sendmailR_1.1-2     stats4_3.1.0        stringr_0.6.2
[22] tools_3.1.0         zlibbioc_1.10.0

sequencing software error alignment • 1.7k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 19 hours ago
Seattle, WA, United States

Hi Margaret,

Human chr1 is a pretty long sequence. My 1st guess would be that this segfault is due to the fact that pairwiseAlignment() was not meant (and was not designed for) being used on long sequences. The underlying algos used by the function (Needleman-Wunsch or Smith-Waterman, depending on the type of alignment) consume a lot of memory, something in the order of 3*p*s if I remember correctly, where p and s are the lengths of the pattern and the subject, respectively. We should clarify this in the man page and of course the function should not segfault when too much memory is requested. We need to do a better job at handling this e.g. with a meaningful error message.

Also note that, by default, pairwiseAlignment() will try to perform a global alignment. But since you have a short pattern and a very long subject, you probably want to perform a global-local alignment. So in that case you would need to specify type="global-local". See the difference:

subject <- subseq(chr1Region, end=2000000)

Then:

> pairwiseAlignment(foo, subseq(chr1Region, end=2000000))
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] T----------------------------------...--------------------GGGGGGGTTGGGGG 
subject: [1] NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...GGGGGTTGAGTGAGGGGGGTGGGGGGGTTGGGGG 
score: -7999814 

> pairwiseAlignment(foo, subseq(chr1Region, end=2000000), type="global-local")
Global-Local PairwiseAlignmentsSingleSubject (1 of 1)
pattern:       [1] TTGGGGGGAGGGGG-GGGGTGGGGGGGGGGGGGTTGGGGG 
subject: [1219682] TTGGGGTGGGGGGTAGGGGTGGGGGGGTGGGGCTGGCGGG 
score: 8.121195 

In the first case, the full pattern is aligned to the full subject. In the second case, it's aligned to a subsequence of the subject.

Furthermore, when type="global-local"pairwiseAlignment() will only return "the best" match, i.e. the alignment that maximizes the score. If there is more than one "best match", then only the first one is returned. If you want all the matches, and if you don't need all the fancy scoring schemes supported by pairwiseAlignment(), you can use matchPattern() with with.indels=TRUE and set max.mismatch to the maximum number of mismatches+indels you want to allow:

> matchPattern(foo, subject, with.indels=TRUE, max.mismatch=8)
  Views on a 2000000-letter DNAString subject
subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...CAGTCCAGGACATTCACCTTGAGACCTGGCCTTA
views:
      start     end width
[1] 1219504 1219541    38 [TTGGGTGAGGGGGGTGGGGTCGGGGGGGGGGTTGAGTG]
[2] 1219523 1219563    41 [TCGGGGGGGGGGTTGAGTGAGGGGGGTGGGGGGGTTGGGGG]
[3] 1219535 1219572    38 [TTGAGTGAGGGGGGTGGGGGGGTTGGGGGGGTTGGGTG]
[4] 1219549 1219585    37 [TGGGGGGGTTGGGGGGGTTGGGTGAGGGGGTTGGGGG]
[5] 1219682 1219720    39 [TTGGGGTGGGGGGTAGGGGTGGGGGGGTGGGGCTGGCGG]

matchPattern() uses much less memory than pairwiseAlignment() and can be used on very long sequences like Human chromosome 1:

> matchPattern(foo, Hsapiens$chr1, with.indels=TRUE, max.mismatch=8)
  Views on a 249250621-letter DNAString subject
subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
views:
         start       end width
 [1]   1219504   1219541    38 [TTGGGTGAGGGGGGTGGGGTCGGGGGGGGGGTTGAGTG]
 [2]   1219523   1219563    41 [TCGGGGGGGGGGTTGAGTGAGGGGGGTGGGGGGGTTGGGGG]
 [3]   1219535   1219572    38 [TTGAGTGAGGGGGGTGGGGGGGTTGGGGGGGTTGGGTG]
 [4]   1219549   1219585    37 [TGGGGGGGTTGGGGGGGTTGGGTGAGGGGGTTGGGGG]
 [5]   1219682   1219720    39 [TTGGGGTGGGGGGTAGGGGTGGGGGGGTGGGGCTGGCGG]
 ...       ...       ...   ... ...
[57] 232861092 232861129    38 [GGGGGGAGGGGAGGGGAGGGGGGGAGGGGGGGAGGGGG]
[58] 232861110 232861147    38 [GGGGGGAGGGGGGGAGGGGGGGAGGGGAGGGGAGGGGG]
[59] 232861118 232861155    38 [GGGGGGAGGGGGGGAGGGGAGGGGAGGGGGGGAGGGGG]
[60] 232861126 232861160    35 [GGGGGGAGGGGAGGGGAGGGGGGGAGGGGGGGAGG]
[61] 247651084 247651120    37 [TTGTGGGGGGCGGGGGGCGGGGGGGGGGCGGTGGCCG]

Hope this helps,

H.

 

 

 

ADD COMMENT
0
Entering edit mode

Thanks Herve, that's very helpful. I had used pairwiseAlignment in the past for shorter string matching, so it's what I knew -- I'll check out matchPattern() for the problem I'm looking at here. 

 

Cheers,

Margaret

ADD REPLY
0
Entering edit mode
@kasper-daniel-hansen-2979
Last seen 10 months ago
United States

Well, the issue here is that R should not segfault.  This is clearly a bug.

Best,
Kasper

ADD COMMENT
0
Entering edit mode

Agreed. I thought my answer was pretty clear about that.

H.

ADD REPLY

Login before adding your answer.

Traffic: 1020 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6