questions about matchPattern and vmatchPattern
wang peter ★ 2.0k
@wang-peter-4647
Last seen 10.3 years ago
Dear all,

Please consider this example:

    subject = "TGCATTT"
    Rpattern = "TGCAATTT"
    result <- matchPattern(Rpattern, subject, max.mismatch=4, min.mismatch=0)
    result
      Views on a 7-letter BString subject
    subject: TGCATTT
    views:
        start end width
    [1]     0   7     8 [ TGCATTT]
    [2]     1   8     8 [TGCATTT ]

Are the start and end positions on the subject or on the pattern?

And for the vmatchPattern result: if one pattern has many hits on one sequence, does it return only one hit or all of the hits?

Thank you.

--
shan gao
Room 231 (Dr. Fei lab)
Boyce Thompson Institute
Cornell University
Tower Road, Ithaca, NY 14853-1801
Office phone: 1-607-254-1267 (day)
Official email: sg839 at cornell.edu
Facebook: http://www.facebook.com/profile.php?id=100001986532253
@steve-lianoglou-2771
Last seen 22 months ago
United States
Hi,

You'd find it very informative if you did a bit more exploratory analysis (and documentation reading!) ... I think you will find that you can answer most of these questions yourself.

For example, see inline:

On Thu, Nov 1, 2012 at 1:21 PM, wang peter <wng.peter at gmail.com> wrote:
> dear ALL:
> Please this sample
> subject = "TGCATTT"
> Rpattern = "TGCAATTT"
> result <- matchPattern(Rpattern, subject, max.mismatch= 4, min.mismatch=0)
> result
>   Views on a 7-letter BString subject
> subject: TGCATTT
> views:
>     start end width
> [1]     0   7     8 [ TGCATTT]
> [2]     1   8     8 [TGCATTT ]
>
> is the start position and end position on the subject or pattern?

R> matchPattern("GATACA", "GTTGACGATAGATACATTCAAGATACAAA")
  Views on a 29-letter BString subject
subject: GTTGACGATAGATACATTCAAGATACAAA
views:
    start end width
[1]    11  16     6 [GATACA]
[2]    22  27     6 [GATACA]

Given that the pattern is only 6 NT long, do you think the result returned is on the subject or the pattern?

> and for vmatchPattern result
> if one pattern has many hits on one sequence, does it return only
> one hit or all of hits as results?

Technically neither. If you look at the Value section of ?matchPattern, you will see that it returns an MIndex object. "But what's an MIndex object," you ask?

R> ?MIndex

HTH,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
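For readers who want to check the coordinate question directly, here is a minimal sketch. It reuses the toy pattern and subject from the reply above and assumes the Biostrings package is installed; the variable names `subject` and `hits` are only illustrative. Cutting the subject at the reported start and end positions should give back the pattern, which is what you would expect if the coordinates are 1-based positions on the subject.

```r
library(Biostrings)

subject <- DNAString("GTTGACGATAGATACATTCAAGATACAAA")
hits <- matchPattern("GATACA", subject)

# start() and end() give the 1-based coordinates of each hit
start(hits)
end(hits)

# Cut the subject (not the pattern) at those coordinates, one piece per hit
pieces <- substring(as.character(subject), start(hits), end(hits))
pieces
```

If the coordinates referred to the 6-letter pattern, values such as 22 and 27 could not occur at all.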
wang peter ★ 2.0k
@wang-peter-4647
Last seen 10.3 years ago
Thanks for your reply. I do not think the manual can answer my question.

> For example, see inline:
>
>> subject = "TGCATTT"
>> Rpattern = "TGCAATTT"
>> result <- matchPattern(Rpattern, subject, max.mismatch= 4, min.mismatch=0)
>> result
>>   Views on a 7-letter BString subject
>> subject: TGCATTT
>> views:
>>     start end width
>> [1]     0   7     8 [ TGCATTT]
>> [2]     1   8     8 [TGCATTT ]

Using my ass, I think it is the position on the subject. But 0 and 8 are outside the bounds of the subject.

I absolutely know what an MIndex object is, but you never answered me. If I use

    startIndex(result)

will it return all of the hits on the subject, or just the first one?

--
shan gao
Room 231 (Dr. Fei lab)
Boyce Thompson Institute
Cornell University
Tower Road, Ithaca, NY 14853-1801
Office phone: 1-607-254-1267 (day)
Official email: sg839 at cornell.edu
Facebook: http://www.facebook.com/profile.php?id=100001986532253
Hi,

On Thu, Nov 1, 2012 at 2:07 PM, wang peter <wng.peter at gmail.com> wrote:
> thx for your reply
> i donot think the manual can answer my question
>
>> For example, see inline:
>>
>>> subject = "TGCATTT"
>>> Rpattern = "TGCAATTT"
>>> result <- matchPattern(Rpattern, subject, max.mismatch= 4, min.mismatch=0)
>>> result
>>>   Views on a 7-letter BString subject
>>> subject: TGCATTT
>>> views:
>>>     start end width
>>> [1]     0   7     8 [ TGCATTT]
>>> [2]     1   8     8 [TGCATTT ]
>
> using my ass,i think it is position on the subject. but 0 and 8 are
> out of border of subject

This is why, in general, it's a good idea not to think with your ass.

If you read the description of matchPattern in its help file, you see right at the top:

"""
A set of functions for finding all the occurrences (aka "matches" or "hits") of a given pattern (typically short) in a (typically long) reference sequence or set of reference sequences (aka the subject)
"""

In this case, you are doing the reverse -- searching for a pattern that is longer than the subject. This isn't what the function was intended for, but ... fine.

The result is telling you where the theoretical begin and end would be given your constraints (subject, pattern, and max.mismatch values).

The fact that these results seem weird to you -- one starts at 0 (and also has a space in its first position), and the other overhangs the end -- should give you an idea of what to look for if you expect such error conditions.

The fact that you are allowing for (so many) mismatches (half the length of your pattern) I guess also brought you to this place.

If your problem is how this "oddity" is reported, then I grant that this might be something worth talking about, and you are free to raise the issue if you have a better way to handle this.

FWIW, I think the current response is a reasonable result to return, but I'd grant that it's worth adding a note in the docs for this case -- perhaps you would like to provide a patch to the documentation describing this scenario.

Out of curiosity, what should the function do if the pattern is 2x, 5x, or 10x longer than the subject? Anything? Nothing? `stop()`?

But, this wasn't your question. Your (paraphrased) question was whether the coordinates returned by matchPattern are on the pattern or on the subject ... and, as I suggested, by reading the docs and trying some toy examples, the answer is obvious.

> absolutely i know what is MIndex object
> but you never answer me
> if i use
>
> startIndex(result)
>
> it will return all of hits of on your subject or just the first one????

I didn't answer you because I suggested that you should (1) read the docs a bit more carefully; and (more importantly!) (2) do some exploratory analysis for yourself before you bring your question back to the list, but since you couldn't be bothered to do either, allow me to stop what I'm doing so that I can do it for you instead:

R> m <- vmatchPattern("GATACA", DNAStringSet(c("GGATACACCCGATACACC", "CCCCCCCCCGATACA")))
R> startIndex(m)
[[1]]
[1]  2 11

[[2]]
[1] 10

Is that clear now?

Look: please try to read the docs and explore your "problems" a bit more before posting to the list -- everyone is quite busy, but still tries to help. When it's clear that the poster doesn't do "their homework" before posting a question, it can become quite frustrating (for me, at least).

I don't think anybody would mind suggested enhancements to the documentation, so if you have those -- feel free to share. For example, your first question might have been avoided if it was noted more clearly -- but if you read the docs and understand *the intention* of the function, then take a moment to think about the result you got, I think the results can be explained in a rather intuitive/obvious way. But still -- as I said -- I think *well thought out and written* suggestions to fix the documentation will generally be received warmly.

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
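One way to act on the "what to look for" advice above is to filter out hits that fall outside the subject. The following is only a sketch, assuming the Biostrings package is installed and that out-of-bounds hits are reported as in the output quoted in this thread; the names `pattern`, `hits`, and `inside` are illustrative, not part of the package.

```r
library(Biostrings)

subject <- BString("TGCATTT")
pattern <- "TGCAATTT"   # one letter longer than the 7-letter subject

hits <- matchPattern(pattern, subject, max.mismatch = 4)

# A hit lies fully inside the subject only if it starts at or after
# position 1 and ends at or before the last letter of the subject.
inside <- start(hits) >= 1 & end(hits) <= length(subject)

# Keep only the in-bounds hits. Because the pattern is longer than the
# subject, no hit can satisfy both conditions in this example.
hits[inside]
```

Whether dropping such hits is the right policy depends on the analysis; with a pattern shorter than the subject, the same filter keeps every hit that does not overhang either end.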
Hi,

The general comments/recommendations Steve is giving you are worth reading and I hope they will help you improve how you ask questions on this list.

I also wanted to mention that in Biostrings 2.27.6 (BioC devel) I've improved the "show" method for MIndex objects so now they are displayed like other RangesList objects (which makes them more user-friendly). With Steve's example:

> m <- vmatchPattern("GATACA", DNAStringSet(c("GGATACACCCGATACACC", "CCCCCCCCCGATACA")))
> m
MIndex object of length 2
[[1]]
IRanges of length 2
    start end width
[1]     2   7     6
[2]    11  16     6

[[2]]
IRanges of length 1
    start end width
[1]    10  15     6

Note that, with matchPattern/vmatchPattern/matchPDict, overlapping matches are reported (which is not the case with grep and family):

> vmatchPattern("GAGA", DNAStringSet(c("CCGAGAGAT", "GACGATA")))
MIndex object of length 2
[[1]]
IRanges of length 2
    start end width
[1]     3   6     4
[2]     5   8     4

[[2]]
IRanges of length 0

Historically, MIndex objects predate RangesList objects, and that explains the odd interface (startIndex etc.). They are also lagging behind RangesList in terms of functionality. Modernizing them has been on my list for a long time. Hopefully soon.

Cheers,
H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
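The overlap point can be seen side by side with base R. This is a minimal sketch, assuming the Biostrings package is installed; it reuses the "GAGA" example from the post above, and the names `subjects`, `m`, `hits_per_subject`, and `regex_starts` are illustrative. The use of `fixed = TRUE` in `gregexpr` is a choice made here to keep the comparison to plain substring matching.

```r
library(Biostrings)

subjects <- DNAStringSet(c("CCGAGAGAT", "GACGATA"))
m <- vmatchPattern("GAGA", subjects)

# One list element per subject, holding the start of every hit,
# overlapping hits included.
startIndex(m)

# Number of hits per subject (a subject with no hit counts as 0).
hits_per_subject <- unname(sapply(startIndex(m), length))
hits_per_subject

# Base R resumes searching after the end of each match, so the
# overlapping occurrence starting at position 5 is not reported.
regex_starts <- as.integer(gregexpr("GAGA", "CCGAGAGAT", fixed = TRUE)[[1]])
regex_starts
```

So `startIndex()` on a vmatchPattern result answers the earlier question: it holds all hits for each subject, not just the first.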