a problem of trimLRPatterns still confused me
1
0
Entering edit mode
@harris-a-jaffee-3972
Last seen 8.3 years ago
United States
See below. On Nov 30, 2012, at 3:36 PM, Wang Peter wrote: > thank you very much, Harris,you helped me again > > now i understand, see the below > > max.mismatchs <- 0.2*1:nchar(Rpattern) > subject = "GGTAACTTTTCTGACACCTCCTGCTTAAAACCCCAAAGGTCAGAAGGATCGTGAGGC CCCGCTTTCACGGTCTGTATTCGTACTGAAAATCAAGATCAAG" > > Rpattern = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG" > > sapply((nchar(subject)-nchar(Rpattern)+1):nchar(subject), function(j) { > s = substr(subject, j, nchar(subject)) > p = substr(Rpattern, 1, nchar(subject)-j+1) > neditEndingAtending.at=nchar(s), pattern = p, subject = s, > with.indels=TRUE) > }) > > all distance > [1] 32 33 33 32 31 32 31 30 29 28 27 26 27 26 25 25 24 23 22 22 21 20 20 20 > [25] 20 19 18 17 18 17 17 18 17 16 15 16 15 14 13 12 12 11 10 9 8 7 6 6 > [49] 6 6 6 5 4 3 (2) 3 3 3 3 3 2 1 0 1 > > max.mismatchs > [1] 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 (2.0) 2.2 2.4 2.6 > 2.8 3.0 3.2 3.4 3.6 3.8 > [20] 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 > 6.6 6.8 7.0 7.2 7.4 7.6 > [39] 7.8 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8 10.0 10.2 > 10.4 10.6 10.8 11.0 11.2 11.4 > [58] 11.6 11.8 12.0 12.2 12.4 12.6 12.8 > > when the function find a distance < = the corresponding mismatch. see > (2) and (2.0), the function stops. Yes. > but i think the distance between those 10bp kmer should be 4, not 2 > > CAAGATC AAG > AGATCGGAAG You are correct about the edit-distance between the 2 strings of length 10, but that is not relevant here. trimLRPatterns is based on ?lowlevel-matching: If 'with.indels' is 'TRUE', then the "edit distance" is used: for each position specified in 'at', P is compared to all the substrings S' of S starting at this position and the smallest distance is returned. This needs to be read as applying, inverted, to 'ending at' situations. In your case, _this position_ is the *end* of the the 10-letter subject S. The 10-letter pattern P is compared to all the substrings S' of S *ending* at the specified position. The smallest distance, 2, is the one between P and the 8-letter suffix S' of S. Now it is possible that longer suffixes S' of S would have the same distance 2 from P, and the *longest* of those would be desirable. I believe the low-level matching code actually finds this longest match (without any mechanism, at present, to report it).
• 623 views
0
Entering edit mode
wang peter ★ 2.0k
@wang-peter-4647
Last seen 8.4 years ago
dear Harris thank you so much for your kindly explanation i am so ashamed to disturb u again. my understanding is when they low-level function to caculate the distance between S and P S= CAAGATC AAG P= AGATCGGAAG it will try CAAGATCAAG AAGATCAAG AGATCAAG GATCAAG ... G and get all the edit distance but 2 is the smallest one so it will take 2 as the distance between S and P S'= AGATC AAG P= AGATCGGAAG and then trim the whole S, rather than S' -- shan gao Room 231(Dr.Fei lab) Boyce Thompson Institute Cornell University Tower Road, Ithaca, NY 14853-1801 Office phone: 1-607-254-1267(day) Official email:sg839 at cornell.edu Facebook:http://www.facebook.com/profile.php?id=100001986532253
0
Entering edit mode
On Dec 1, 2012, at 12:00 PM, Wang Peter wrote: > dear Harris > thank you so much for your kindly explanation > i am so ashamed to disturb u again. > > my understanding is > > when they low-level function to caculate the distance between S and P In this situation, the C function _nedit_for_Proffset() is called, but the purpose is much more than to calculate the edit distance between S and P. As I quoted before from ?lowlevel-matching, it is to determine the minimum edit distance between P and all the suffixes S' of S. > S= CAAGATC AAG > P= AGATCGGAAG > > > it will try > CAAGATCAAG > AAGATCAAG > AGATCAAG > GATCAAG > ... > G > > and get all the edit distance > but 2 is the smallest one Yes, 2 is the minimum described above, occurring for the 8-letter suffix S' = AAGATCAAG of S = CAAAGATCAAG. > so it will take 2 as the distance between S and P Not the distance between S and P, which you correctly observed in a previous post was 4, but the distance between the entire pattern P and some suffix S' of S, unknown to trimLRPatterns. > S'= AGATC AAG > P= AGATCGGAAG > > and then trim the whole S, rather than S' The whole S is taken by trimLRPatterns as its best guess at S'. In this case, a little more than necessary is trimmed, perhaps in other cases, a little less than necessary. > -- > shan gao > Room 231(Dr.Fei lab) > Boyce Thompson Institute > Cornell University > Tower Road, Ithaca, NY 14853-1801 > Office phone: 1-607-254-1267(day) > Official email:sg839 at cornell.edu > Facebook:http://www.facebook.com/profile.php?id=100001986532253