substr on XStringSet-class

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 9.6 years ago

Hello, I would like to get all the substrings of a patternmatch on a XStringSet-class. I now use the following code, but this ignores multiple matches and I have the feeling there is a better way to do it that uses biostrings fuctions. I load a fastafile into a XStringSet-class object and then search for a specific string using the vmatchPattern function: genes <- readDNAStringSet(File = "filename", format = "fasta", use.names = T) view <- vmatchPattern(pattern = "CCGGA", genes) matches <- unlist(view, recursive = T, use.names = T) m <- as.matrix(matches) I retrieve a substring starting at the match and 20 positions upward: subseq(genes[rownames(m),], start = m[rownames(m),1], width = 20) What is a better way to do this that includes all possible matches? -- output of sessionInfo(): x -- Sent via the guest posting facility at bioconductor.org.

Biostrings Biostrings • 955 views

ADD COMMENT • link updated 10.3 years ago by Hervé Pagès 16k • written 10.3 years ago by Guest User ★ 13k

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 4 days ago

Seattle, WA, United States

Hi Tim, Here is how you can do this: library(Biostrings) genes <- DNAStringSet(c("GTTGATTAC", "AGGACCT", "AGTTTGTTCCGTTCACCTACC")) m0 <- vmatchPattern("GTT", genes) Then: > m0 MIndex object of length 3 [[1]] IRanges of length 1 start end width [1] 1 3 3 [[2]] IRanges of length 0 [[3]] IRanges of length 3 start end width [1] 2 4 3 [2] 6 8 3 [3] 11 13 3 Nb of matches per gene: > elementLengths(m0) # equivalent to vcountPattern("GTT", genes) [1] 1 0 3 To extend the ranges to the right, ideally we'd like to be able to do something like: resize(m0, width=8) # doesn't work on an MIndex object but this is not yet supported on MIndex objects (the type of 'm0'). A workaround for now is to turn 'm0' into a CompressedIRangesList object first: m1 <- as(m0, "CompressedIRangesList") Then: m2 <- resize(m1, width=8) Now we can use extractAt() to extract the corresponding sequences: > extractAt(genes, m2) DNAStringSetList of length 3 [[1]] GTTGATTA [[2]] A DNAStringSet instance of length 0 [[3]] GTTTGTTC GTTCCGTT GTTCACCT See ?extractAt for more information. The man page actually has an example that shows how to use extractAt() for extracting the match sequences corresponding to the result of vmatchPattern(). Cheers, H. On 12/20/2013 03:52 PM, Tim Homan [guest] wrote: > > Hello, > > I would like to get all the substrings of a patternmatch on a XStringSet-class. I now use the following code, but this ignores multiple matches and I have the feeling there is a better way to do it that uses biostrings fuctions. > > I load a fastafile into a XStringSet-class object and then search for a specific string using the vmatchPattern function: > > genes <- readDNAStringSet(File = "filename", format = "fasta", use.names = T) > view <- vmatchPattern(pattern = "CCGGA", genes) > matches <- unlist(view, recursive = T, use.names = T) > m <- as.matrix(matches) > > I retrieve a substring starting at the match and 20 positions upward: > > subseq(genes[rownames(m),], start = m[rownames(m),1], width = 20) > > What is a better way to do this that includes all possible matches? > > -- output of sessionInfo(): > > x > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 10.3 years ago Hervé Pagès 16k

Login before adding your answer.