Probe off-target prediction: Identifying short sequence matches - potentially with mismatches & indels - in a transcriptome
sandmann.t ▴ 70

I would like to identify potential hybridization off-targets for a set of short DNA probes (16-20 nt) by detecting sequence matches across all annotated transcripts in a genome. Off-targets are defined as matches within a maximum edit distance, i.e. allowing both mismatches and indels.

The Biostrings package provides the vmatchPattern method, which works great for mismatches, but it does not seem to support indels yet. (See the example with BioC 3.17 below.)

Perhaps there are some alternative methods (either within R or implemented in another open source tool) that I could use? Maybe looking into one of the aligners designed for short next-generation reads is the way to go? Or maybe there are algorithms developed to map microarray probes that I could repurpose?

Many thanks for any pointers!

library(Biostrings)

# Toy subject: two arbitrary strings standing in for transcripts
subject <- BStringSet(
  c("ACDEFxxxCDEFxxxABCE", "KLMNOxxxPQRSxxxKLMN")
)
vmatchPattern("ABCDEF", subject, max.mismatch=2)  # works
vmatchPattern("ABCDEF", subject, max.mismatch=2, with.indels=TRUE)  # not supported

Error in .XStringSet.vmatchPattern(pattern, subject, max.mismatch, min.mismatch,  :
  vmatchPattern() does not support indels yet


sessionInfo()

R version 4.3.0 (2023-04-21)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.3.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biostrings_2.68.0   GenomeInfoDb_1.36.0 XVector_0.40.0      IRanges_2.34.0     
[5] S4Vectors_0.38.0    BiocGenerics_0.46.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.46.0         compiler_4.3.0          tools_4.3.0            
[4] GenomeInfoDbData_1.2.10 RCurl_1.98-1.12         crayon_1.5.2           
[7] bitops_1.0-7

I just found an older post with a great answer by Hervé Pagès. He provided a substitute function (vmatchPattern2) that supports indels and returns its matches as an IRangesList.
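For readers who cannot locate that older post, here is a minimal sketch of the same idea. It is not Hervé Pagès' vmatchPattern2; it is a hypothetical helper (the name vmatch_with_indels and its max.edits argument are mine) that simply loops over the subject and calls matchPattern(), which does accept with.indels=TRUE on a single sequence. With with.indels=TRUE, max.mismatch is interpreted as the maximum edit distance. The loop will likely be much slower than a vectorized implementation on a full transcriptome, and the toy sequences below are made up for illustration.

```r
library(Biostrings)

# Hypothetical helper, NOT the vmatchPattern2 from the older post:
# call matchPattern() once per subject sequence, since matchPattern()
# supports with.indels=TRUE while vmatchPattern() does not.
vmatch_with_indels <- function(pattern, subject, max.edits = 2) {
  hits <- lapply(seq_along(subject), function(i) {
    m <- matchPattern(pattern, subject[[i]],
                      max.mismatch = max.edits,  # edit distance when with.indels=TRUE
                      with.indels = TRUE)
    # Keep only the match coordinates
    IRanges(start = start(m), width = width(m))
  })
  names(hits) <- names(subject)
  IRangesList(hits)
}

# Made-up toy transcripts: tx1 contains the probe exactly,
# tx2 contains it with one base deleted.
transcripts <- DNAStringSet(c(
  tx1 = "TTTTACGTACGTACGTTTTT",
  tx2 = "TTTTACGTACTACGTTTTT"
))
probe <- DNAString("ACGTACGTACGT")

res <- vmatch_with_indels(probe, transcripts, max.edits = 1)
print(res)
```

The result has one IRanges element per transcript, so elementNROWS(res) gives the number of candidate off-target sites per transcript. Note that this sketch only searches the strand as given; whether the reverse complement also needs to be searched depends on the probe chemistry.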
