Question

segfault for pairwiseAlignment

0

Entering edit mode

Taub, Margaret ▴ 10

@taub-margaret-5847

Last seen 7.8 years ago

United States

Hi all,

I am trying to do some alignment using the pairwiseAlignment function and I am getting a segfault. I'm not sure what I'm doing that is causing this behavior, but any suggestions would be greatly appreciated! Thanks.

Cheers,

Margaret

# this code reproduces the error for me

library(ShortRead)
library(BSgenome.Hsapiens.UCSC.hg19)
chr1Region<-Hsapiens$chr1

foo<-"TTGGGGGGAGGGGGGGGGTGGGGGGGGGGGGGTTGGGGG"
pairwiseAlignment(foo, chr1Region)

*** caught segfault ***
address 0x2aab92a5c000, cause 'memory not mapped'

Traceback:
1: .Call(.NAME, ..., PACKAGE = PACKAGE)
2: .Call2("XStringSet_align_pairwiseAlignment", pattern, subject, type, typeCode, scoreOnly, gapOpening, gapExtension, useQuality, substitutionArray, dim(substitutionArray), substitutionLookupTable, fuzzyReferenceMatrix, dim(fuzzyReferenceMatrix), fuzzyLookupTable, PACKAGE = "Biostrings")
3: QualityScaledXStringSet.pairwiseAlignment(pattern = pattern, subject = subject, type = type, fuzzyMatrix = fuzzyMatrix, gapOpening = gapOpening, gapExtension = gapExtension, scoreOnly = scoreOnly)
4: mpi.QualityScaledXStringSet.pairwiseAlignment(pattern, subject, type = type, fuzzyMatrix = fuzzyMatrix, gapOpening = gapOpening, gapExtension = gapExtension, scoreOnly = scoreOnly)
5: .local(pattern, subject, ...)
6: pairwiseAlignment(foo, chr1Region)
7: pairwiseAlignment(foo, chr1Region)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

> sessionInfo()
R version 3.1.0 Patched (2014-04-24 r65479)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats graphics grDevices datasets utils methods
[8] base

other attached packages:
[1] BSgenome.Hsapiens.UCSC.hg19_1.3.99 ShortRead_1.22.0
[3] GenomicAlignments_1.0.1 BSgenome_1.32.0
[5] Rsamtools_1.16.0 GenomicRanges_1.16.3
[7] GenomeInfoDb_1.0.2 Biostrings_2.32.0
[9] XVector_0.4.0 IRanges_1.22.6
[11] BiocParallel_0.6.1 BiocGenerics_0.10.0
[13] RColorBrewer_1.0-5

loaded via a namespace (and not attached):
[1] BatchJobs_1.2 BBmisc_1.6 Biobase_2.24.0
[4] bitops_1.0-6 brew_1.0-6 codetools_0.2-8
[7] DBI_0.2-7 digest_0.6.4 fail_1.2
[10] foreach_1.4.2 grid_3.1.0 hwriter_1.3
[13] iterators_1.0.7 lattice_0.20-29 latticeExtra_0.6-26
[16] plyr_1.8.1 Rcpp_0.11.1 RSQLite_0.11.4
[19] sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2
[22] tools_3.1.0 zlibbioc_1.10.0

sequencing software error alignment • 2.1k views

ADD COMMENT • link updated 11.4 years ago by Kasper Daniel Hansen ★ 6.5k • written 11.4 years ago by Taub, Margaret ▴ 10

score 0 · Answer 1 · 2014-09-26

Hi Margaret,

Human chr1 is a pretty long sequence. My 1st guess would be that this segfault is due to the fact that pairwiseAlignment() was not meant (and was not designed for) being used on long sequences. The underlying algos used by the function (Needleman-Wunsch or Smith-Waterman, depending on the type of alignment) consume a lot of memory, something in the order of 3*p*s if I remember correctly, where p and s are the lengths of the pattern and the subject, respectively. We should clarify this in the man page and of course the function should not segfault when too much memory is requested. We need to do a better job at handling this e.g. with a meaningful error message.

Also note that, by default, pairwiseAlignment() will try to perform a global alignment. But since you have a short pattern and a very long subject, you probably want to perform a global-local alignment. So in that case you would need to specify type="global-local". See the difference:

subject <- subseq(chr1Region, end=2000000)

Then:

> pairwiseAlignment(foo, subseq(chr1Region, end=2000000))
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] T----------------------------------...--------------------GGGGGGGTTGGGGG 
subject: [1] NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...GGGGGTTGAGTGAGGGGGGTGGGGGGGTTGGGGG 
score: -7999814 

> pairwiseAlignment(foo, subseq(chr1Region, end=2000000), type="global-local")
Global-Local PairwiseAlignmentsSingleSubject (1 of 1)
pattern:       [1] TTGGGGGGAGGGGG-GGGGTGGGGGGGGGGGGGTTGGGGG 
subject: [1219682] TTGGGGTGGGGGGTAGGGGTGGGGGGGTGGGGCTGGCGGG 
score: 8.121195

In the first case, the full pattern is aligned to the full subject. In the second case, it's aligned to a subsequence of the subject.

Furthermore, when type="global-local", pairwiseAlignment() will only return "the best" match, i.e. the alignment that maximizes the score. If there is more than one "best match", then only the first one is returned. If you want all the matches, and if you don't need all the fancy scoring schemes supported by pairwiseAlignment(), you can use matchPattern() with with.indels=TRUE and set max.mismatch to the maximum number of mismatches+indels you want to allow:

> matchPattern(foo, subject, with.indels=TRUE, max.mismatch=8)
  Views on a 2000000-letter DNAString subject
subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...CAGTCCAGGACATTCACCTTGAGACCTGGCCTTA
views:
      start     end width
[1] 1219504 1219541    38 [TTGGGTGAGGGGGGTGGGGTCGGGGGGGGGGTTGAGTG]
[2] 1219523 1219563    41 [TCGGGGGGGGGGTTGAGTGAGGGGGGTGGGGGGGTTGGGGG]
[3] 1219535 1219572    38 [TTGAGTGAGGGGGGTGGGGGGGTTGGGGGGGTTGGGTG]
[4] 1219549 1219585    37 [TGGGGGGGTTGGGGGGGTTGGGTGAGGGGGTTGGGGG]
[5] 1219682 1219720    39 [TTGGGGTGGGGGGTAGGGGTGGGGGGGTGGGGCTGGCGG]

matchPattern() uses much less memory than pairwiseAlignment() and can be used on very long sequences like Human chromosome 1:

> matchPattern(foo, Hsapiens$chr1, with.indels=TRUE, max.mismatch=8)
  Views on a 249250621-letter DNAString subject
subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
views:
         start       end width
 [1]   1219504   1219541    38 [TTGGGTGAGGGGGGTGGGGTCGGGGGGGGGGTTGAGTG]
 [2]   1219523   1219563    41 [TCGGGGGGGGGGTTGAGTGAGGGGGGTGGGGGGGTTGGGGG]
 [3]   1219535   1219572    38 [TTGAGTGAGGGGGGTGGGGGGGTTGGGGGGGTTGGGTG]
 [4]   1219549   1219585    37 [TGGGGGGGTTGGGGGGGTTGGGTGAGGGGGTTGGGGG]
 [5]   1219682   1219720    39 [TTGGGGTGGGGGGTAGGGGTGGGGGGGTGGGGCTGGCGG]
 ...       ...       ...   ... ...
[57] 232861092 232861129    38 [GGGGGGAGGGGAGGGGAGGGGGGGAGGGGGGGAGGGGG]
[58] 232861110 232861147    38 [GGGGGGAGGGGGGGAGGGGGGGAGGGGAGGGGAGGGGG]
[59] 232861118 232861155    38 [GGGGGGAGGGGGGGAGGGGAGGGGAGGGGGGGAGGGGG]
[60] 232861126 232861160    35 [GGGGGGAGGGGAGGGGAGGGGGGGAGGGGGGGAGG]
[61] 247651084 247651120    37 [TTGTGGGGGGCGGGGGGCGGGGGGGGGGCGGTGGCCG]

Hope this helps,

H.

score 0 · Answer 2 · 2014-09-26

0

Entering edit mode

Kasper Daniel Hansen ★ 6.5k

@kasper-daniel-hansen-2979

Last seen 2.6 years ago

United States

Well, the issue here is that R should not segfault. This is clearly a bug.

Best,
Kasper

ADD COMMENT • link 11.4 years ago Kasper Daniel Hansen ★ 6.5k

0

Entering edit mode

Agreed. I thought my answer was pretty clear about that.

H.

ADD REPLY • link 11.4 years ago Hervé Pagès 16k