substr on XStringSet-class
Guest User ★ 13k
@guest-user-4897
Last seen 9.6 years ago
Hello,

I would like to get all the substrings of a pattern match on an XStringSet-class object. I currently use the following code, but it ignores multiple matches, and I have the feeling there is a better way to do it using Biostrings functions.

I load a FASTA file into an XStringSet-class object and then search for a specific string using the vmatchPattern function:

    genes <- readDNAStringSet(File = "filename", format = "fasta", use.names = T)
    view <- vmatchPattern(pattern = "CCGGA", genes)
    matches <- unlist(view, recursive = T, use.names = T)
    m <- as.matrix(matches)

I then retrieve a substring of width 20 starting at each match:

    subseq(genes[rownames(m),], start = m[rownames(m),1], width = 20)

What is a better way to do this that includes all possible matches?
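One plausible reason the code above drops matches can be shown in base R alone: when a matrix has repeated row names, indexing by name resolves every name to the first row that carries it. The sketch below uses a made-up match table (the names "g1" and "g3" and the values are hypothetical, not taken from the poster's data) and assumes that the matrix built from the matches carries the gene names as row names, as the posted code implies.

```r
# Hypothetical match table: gene "g3" has two matches (starts 2 and 6).
m <- matrix(c(1L, 2L, 6L, 3L, 3L, 3L), ncol = 2,
            dimnames = list(c("g1", "g3", "g3"), c("start", "width")))

# Indexing by row name resolves each name to its FIRST matching row,
# so the second "g3" match (start 6) is never returned.
by_name <- m[rownames(m), "start"]
print(unname(by_name))       # [1] 1 2 2

# Indexing by position keeps every row.
by_position <- m[, "start"]
print(unname(by_position))   # [1] 1 2 6
```

Using positions rather than names therefore keeps all rows, although it still leaves the bookkeeping of which row belongs to which gene to the caller.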
@herve-pages-1542
Last seen 4 days ago
Seattle, WA, United States
Hi Tim,

Here is how you can do this:

    library(Biostrings)
    genes <- DNAStringSet(c("GTTGATTAC", "AGGACCT", "AGTTTGTTCCGTTCACCTACC"))
    m0 <- vmatchPattern("GTT", genes)

Then:

    > m0
    MIndex object of length 3
    [[1]]
    IRanges of length 1
        start end width
    [1]     1   3     3

    [[2]]
    IRanges of length 0

    [[3]]
    IRanges of length 3
        start end width
    [1]     2   4     3
    [2]     6   8     3
    [3]    11  13     3

Number of matches per gene:

    > elementLengths(m0)  # equivalent to vcountPattern("GTT", genes)
    [1] 1 0 3

(In later Bioconductor releases elementLengths() was renamed elementNROWS().)

To extend the ranges to the right, ideally we'd like to be able to do something like:

    resize(m0, width=8)  # doesn't work on an MIndex object

but this is not yet supported on MIndex objects (the type of 'm0'). A workaround for now is to turn 'm0' into a CompressedIRangesList object first:

    m1 <- as(m0, "CompressedIRangesList")

Then:

    m2 <- resize(m1, width=8)

Now we can use extractAt() to extract the corresponding sequences:

    > extractAt(genes, m2)
    DNAStringSetList of length 3
    [[1]] GTTGATTA
    [[2]] A DNAStringSet instance of length 0
    [[3]] GTTTGTTC GTTCCGTT GTTCACCT

See ?extractAt for more information. The man page actually has an example that shows how to use extractAt() for extracting the match sequences corresponding to the result of vmatchPattern().

Cheers,
H.

On 12/20/2013 03:52 PM, Tim Homan [guest] wrote:
> What is a better way to do this that includes all possible matches?

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
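With the width of 20 from the original question, a window that starts at a match close to the end of a gene would run past the last base. The following is a minimal sketch of one way to cap each window at the end of its gene, reusing the toy sequences from the answer. It is not part of the answer itself; the names wanted_width, gene_len, win_end, windows and result are illustrative, and it assumes a Bioconductor release in which elementNROWS() is available.

```r
library(Biostrings)

genes <- DNAStringSet(c("GTTGATTAC", "AGGACCT", "AGTTTGTTCCGTTCACCTACC"))
m1 <- as(vmatchPattern("GTT", genes), "CompressedIRangesList")

wanted_width <- 20L  # window size from the original question

# Flatten to a single IRanges and record, for every match,
# the length of the gene it was found in.
flat <- unlist(m1, use.names = FALSE)
gene_len <- rep(width(genes), elementNROWS(m1))

# Each window starts at the match; its end is capped at the gene's last base.
win_end <- pmin(start(flat) + wanted_width - 1L, gene_len)

# Regroup the capped ranges per gene, using m1 as the skeleton.
windows <- relist(IRanges(start = start(flat), end = win_end), m1)

result <- extractAt(genes, windows)
result
```

Windows that hit the end of a gene come back shorter than wanted_width; if only full-width windows are wanted, the short ones can be filtered out afterwards by their width.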