a problem of trimLRPatterns still confused me
1
0
Entering edit mode
@harris-a-jaffee-3972
Last seen 10.1 years ago
United States
See below. On Nov 30, 2012, at 3:36 PM, Wang Peter wrote: > thank you very much, Harris,you helped me again > > now i understand, see the below > > max.mismatchs <- 0.2*1:nchar(Rpattern) > subject = "GGTAACTTTTCTGACACCTCCTGCTTAAAACCCCAAAGGTCAGAAGGATCGTGAGGC CCCGCTTTCACGGTCTGTATTCGTACTGAAAATCAAGATCAAG" > > Rpattern = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG" > > sapply((nchar(subject)-nchar(Rpattern)+1):nchar(subject), function(j) { > s = substr(subject, j, nchar(subject)) > p = substr(Rpattern, 1, nchar(subject)-j+1) > neditEndingAtending.at=nchar(s), pattern = p, subject = s, > with.indels=TRUE) > }) > > all distance > [1] 32 33 33 32 31 32 31 30 29 28 27 26 27 26 25 25 24 23 22 22 21 20 20 20 > [25] 20 19 18 17 18 17 17 18 17 16 15 16 15 14 13 12 12 11 10 9 8 7 6 6 > [49] 6 6 6 5 4 3 (2) 3 3 3 3 3 2 1 0 1 > > max.mismatchs > [1] 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 (2.0) 2.2 2.4 2.6 > 2.8 3.0 3.2 3.4 3.6 3.8 > [20] 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 > 6.6 6.8 7.0 7.2 7.4 7.6 > [39] 7.8 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8 10.0 10.2 > 10.4 10.6 10.8 11.0 11.2 11.4 > [58] 11.6 11.8 12.0 12.2 12.4 12.6 12.8 > > when the function find a distance < = the corresponding mismatch. see > (2) and (2.0), the function stops. Yes. > but i think the distance between those 10bp kmer should be 4, not 2 > > CAAGATC AAG > AGATCGGAAG You are correct about the edit-distance between the 2 strings of length 10, but that is not relevant here. trimLRPatterns is based on ?`lowlevel-matching`: If 'with.indels' is 'TRUE', then the "edit distance" is used: for each position specified in 'at', P is compared to all the substrings S' of S starting at this position and the smallest distance is returned. This needs to be read as applying, inverted, to 'ending at' situations. In your case, _this position_ is the *end* of the the 10-letter subject S. The 10-letter pattern P is compared to all the substrings S' of S *ending* at the specified position. The smallest distance, 2, is the one between P and the 8-letter suffix S' of S. Now it is possible that longer suffixes S' of S would have the same distance 2 from P, and the *longest* of those would be desirable. I believe the low-level matching code actually finds this longest match (without any mechanism, at present, to report it).
• 960 views
ADD COMMENT
0
Entering edit mode
wang peter ★ 2.0k
@wang-peter-4647
Last seen 10.2 years ago
dear Harris thank you so much for your kindly explanation i am so ashamed to disturb u again. my understanding is when they low-level function to caculate the distance between S and P S= CAAGATC AAG P= AGATCGGAAG it will try CAAGATCAAG AAGATCAAG AGATCAAG GATCAAG ... G and get all the edit distance but 2 is the smallest one so it will take 2 as the distance between S and P S'= AGATC AAG P= AGATCGGAAG and then trim the whole S, rather than S' -- shan gao Room 231(Dr.Fei lab) Boyce Thompson Institute Cornell University Tower Road, Ithaca, NY 14853-1801 Office phone: 1-607-254-1267(day) Official email:sg839 at cornell.edu Facebook:http://www.facebook.com/profile.php?id=100001986532253
ADD COMMENT
0
Entering edit mode
On Dec 1, 2012, at 12:00 PM, Wang Peter wrote: > dear Harris > thank you so much for your kindly explanation > i am so ashamed to disturb u again. > > my understanding is > > when they low-level function to caculate the distance between S and P In this situation, the C function _nedit_for_Proffset() is called, but the purpose is much more than to calculate the edit distance between S and P. As I quoted before from ?`lowlevel-matching`, it is to determine the minimum edit distance between P and all the suffixes S' of S. > S= CAAGATC AAG > P= AGATCGGAAG > > > it will try > CAAGATCAAG > AAGATCAAG > AGATCAAG > GATCAAG > ... > G > > and get all the edit distance > but 2 is the smallest one Yes, 2 is the minimum described above, occurring for the 8-letter suffix S' = AAGATCAAG of S = CAAAGATCAAG. > so it will take 2 as the distance between S and P Not the distance between S and P, which you correctly observed in a previous post was 4, but the distance between the entire pattern P and some suffix S' of S, unknown to trimLRPatterns. > S'= AGATC AAG > P= AGATCGGAAG > > and then trim the whole S, rather than S' The whole S is taken by trimLRPatterns as its best guess at S'. In this case, a little more than necessary is trimmed, perhaps in other cases, a little less than necessary. > -- > shan gao > Room 231(Dr.Fei lab) > Boyce Thompson Institute > Cornell University > Tower Road, Ithaca, NY 14853-1801 > Office phone: 1-607-254-1267(day) > Official email:sg839 at cornell.edu > Facebook:http://www.facebook.com/profile.php?id=100001986532253
ADD REPLY

Login before adding your answer.

Traffic: 740 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6