I would like to identify potential hybridization off-targets for a set of short DNA probes (16-20 nt) by detecting sequence matches across all annotated transcripts in a genome. Off-targets are defined as matches within a maximum edit distance, i.e. allowing both mismatches and indels.
The Biostrings package provides the vmatchPattern method, which works great, but it seems that it doesn't support indels yet. (See an example with BioC 3.17 below.)
Perhaps there are some alternative methods (either within R or implemented in another open source tool) that I could use? Maybe looking into one of the aligners designed for short next-generation reads is the way to go? Or maybe there are algorithms developed to map microarray probes that I could repurpose?
Many thanks for any pointers!
library(Biostrings)
subject <- BStringSet(
  c("ACDEFxxxCDEFxxxABCE", "KLMNOxxxPQRSxxxKLMN")
)
vmatchPattern("ABCDEF", subject, max.mismatch=2) # works
vmatchPattern("ABCDEF", subject, max.mismatch=2, with.indels=TRUE) # not supported
Error in .XStringSet.vmatchPattern(pattern, subject, max.mismatch, min.mismatch, :
  vmatchPattern() does not support indels yet
sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.3.1
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.68.0 GenomeInfoDb_1.36.0 XVector_0.40.0 IRanges_2.34.0
[5] S4Vectors_0.38.0 BiocGenerics_0.46.0
loaded via a namespace (and not attached):
[1] zlibbioc_1.46.0 compiler_4.3.0 tools_4.3.0
[4] GenomeInfoDbData_1.2.10 RCurl_1.98-1.12 crayon_1.5.2
[7] bitops_1.0-7
I just found an older post and a great answer by Herve Pages. He provided a substitute function (vmatchPattern2) that outputs an IRangesList and supports indels.
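For anyone who lands here without the older post at hand, below is a minimal sketch of a slower workaround. It is not Herve's vmatchPattern2: it just loops over the subject and calls matchPattern() on one sequence at a time, since matchPattern() does accept with.indels=TRUE (max.mismatch is then interpreted as the maximum edit distance). The function name vmatch_with_indels is made up for illustration, and I have not benchmarked it on a full transcriptome.

```r
library(Biostrings)

# Hypothetical helper (name invented for this sketch): approximate matching
# with indels over an XStringSet, one sequence at a time.
vmatch_with_indels <- function(pattern, subject, max.mismatch = 0) {
  hits <- lapply(seq_along(subject), function(i) {
    # matchPattern() works on a single XString and supports indels;
    # with.indels=TRUE makes max.mismatch a maximum edit distance.
    m <- matchPattern(pattern, subject[[i]],
                      max.mismatch = max.mismatch, with.indels = TRUE)
    IRanges(start = start(m), end = end(m))
  })
  # Collect the per-sequence hits into an IRangesList, keeping subject names.
  res <- do.call(IRangesList, unname(hits))
  names(res) <- names(subject)
  res
}

subject <- BStringSet(
  c("ACDEFxxxCDEFxxxABCE", "KLMNOxxxPQRSxxxKLMN")
)
res <- vmatch_with_indels("ABCDEF", subject, max.mismatch = 2)
res
```

Each list element holds the hit ranges for the corresponding subject sequence, so elementNROWS(res) > 0 flags the sequences with at least one potential off-target. Because the loop runs in R rather than in C, I would expect it to be noticeably slower than vmatchPattern for large transcript sets.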