Hi Richard,
On 03/17/2014 12:53 PM, Richard Dannebaum [guest] wrote:
> When will vmatchPattern support indels?
>
>
>
> -- output of sessionInfo():
>
>> reads[1:5]
> A DNAStringSet instance of length 5
> width seq
> [1] 51 NTATTGGTAGCGAATTCCTGTACTGCTTCGACACACTGTACTCTTCAACTT
> [2] 51 NCTTGGGATGCGAATTCCTGTACTGCTTCGCGGTGTCGTACTCTTCAACTT
> [3] 51 NGAGGGTGTTCGAATTCCTGTACTGCTTCGAAATCTGGTACTCTTCAACTT
> [4] 51 NTAGCGTTGACGAATTCCTGTACTGCTTCGCTCCACCGTACTCTTCAACTT
> [5] 51 NCTGTTTGCCCGAATTCCTGTACTGCTTCGTACCGTTGTACTCTTCAACTT
>
>> constant <- vmatchPattern("CGAATTCCTGTACTGCTTCG", reads[1:5],
max.mismatch=1, with.indels=TRUE)
>
> Error in .XStringSet.vmatchPattern(pattern, subject, max.mismatch,
min.mismatch, :
> vmatchPattern() does not support indels yet
Yes that's something I've had on my list for a while. Recently I was
also asked off-list when matchPDict() and family would support indels.
But that's a much harder problem.
For vmatchPattern(), the only show stopper is the unability to store
matches of variable length in an MIndex object, because of the current
design of these objects. But the core algo implemented at the C level
already supports indels.
Here is a version of vmatchPattern() that returns an IRangesList
object instead of an MIndex. It supports indels:
vmatchPattern2 <- function(pattern, subject,
max.mismatch=0, min.mismatch=0,
with.indels=FALSE, fixed=TRUE,
algorithm="auto")
{
if (!is(subject, "XStringSet"))
subject <- Biostrings:::XStringSet(NULL, subject)
algo <- Biostrings:::normargAlgorithm(algorithm)
if (Biostrings:::isCharacterAlgo(algo))
stop("'subject' must be a single (non-empty) string ",
"for this algorithm")
pattern <- Biostrings:::normargPattern(pattern, subject)
max.mismatch <- Biostrings:::normargMaxMismatch(max.mismatch)
min.mismatch <- Biostrings:::normargMinMismatch(min.mismatch,
max.mismatch)
with.indels <- Biostrings:::normargWithIndels(with.indels)
fixed <- Biostrings:::normargFixed(fixed, subject)
algo <- Biostrings:::selectAlgo(algo, pattern,
max.mismatch, min.mismatch,
with.indels, fixed)
C_ans <- .Call2("XStringSet_vmatch_pattern", pattern, subject,
max.mismatch, min.mismatch,
with.indels, fixed, algo,
"MATCHES_AS_RANGES",
PACKAGE="Biostrings")
unlisted_ans <- IRanges(start=unlist(C_ans[[1L]],
use.names=FALSE),
width=unlist(C_ans[[2L]],
use.names=FALSE))
relist(unlisted_ans, C_ans[[1L]])
}
You'll need Biostrings 2.31.15 for this code to work, which should
become available thru biocLite() for BioC devel users in the next 24
hours or so.
A clean fix in Biostrings would be to do this (i.e. change the type
returned by vmatchPattern()) or to modify the MIndex internals to
support matches of variable length. Probably the former. Changing
the type returned by a function requires some precautions though
because it tends to break existing code.
Hopefully I manage to find some time to work on these things after
the BioC 2.14 release.
Cheers,
H.
>
>
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
Have there been any updates yet? I still get "not supported yet" when I try it.
No update on this yet. Never found the time to work on it. Do you want to open an issue on GitHub to request this feature? That way at least it stays on the radar. (Make sure to include a link to this thread in your post on GitHub.) Also at some point someone else will have to take care of the Biostrings package so we should make sure that all requests and issues are recorded in the Biostrings repo on GitHub. Thanks!