a problem of trimLRPatterns still confused me

0

Entering edit mode

wang peter ★ 2.0k

@wang-peter-4647

Last seen 9.6 years ago

sorry to disturb you again but i am still feeling confused see this problem subject = "GGTAACTTTTCTGACACCTCCTGCTTAAAACCCCAAAGGTCAGAAGGATCGTGAGGCCC CGCTTTCACGGTCTGTATTCGTACTGAAAATCAAGATCAAG" Rpattern = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG" max.mismatchs <- 0.2*1:nchar(DNAString(Rpattern)) trimLRPatterns(Rpattern = Rpattern, subject = subject, max.Rmismatch=max.mismatchs, with.Rindels=TRUE) [1] "GGTAACTTTTCTGACACCTCCTGCTTAAAACCCCAAAGGTCAGAAGGATCGTGAGGCCCCGCTTT CACGGTCTGTATTCGTACTGAAAAT" CAAGATC AAG AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG overlap length is 10 bp, so the allowed distance is 10*0.2=2 so it should trim the "AGATCAAG", not include the "CA" I am confused why? -- shan gao Room 231(Dr.Fei lab) Boyce Thompson Institute Cornell University Tower Road, Ithaca, NY 14853-1801 Office phone: 1-607-254-1267(day) Official email:sg839 at cornell.edu Facebook:http://www.facebook.com/profile.php?id=100001986532253

• 804 views

ADD COMMENT • link updated 11.4 years ago by Harris A. Jaffee ▴ 590 • written 11.4 years ago by wang peter ★ 2.0k

0

Entering edit mode

Harris A. Jaffee ▴ 590

@harris-a-jaffee-3972

Last seen 9.5 years ago

United States

You are perfectly correct, but it is not terrible that we trimmed a little extra. On the other hand, there is a problem lurking here that, in opposite cases, we would not quite trim enough. To simplify the example, we are only concerned with the last 10 letters of your subject and the first 10 letters of Rpattern: > substr(subject, 91, 100) [1] "CAAGATCAAG" > substr(Rpattern, 1, 10) [1] "AGATCGGAAG" Then note that > neditEndingAt("AGATCGGAAG", "CAAGATCAAG", ending.at=10, with.indels=TRUE) [1] 2 And your result: > trimLRPatterns(Rpattern = "AGATCGGAAG", subject = "CAAGATCAAG", max.Rmismatch=2, with.Rindels=TRUE) [1] "" But you ask why, after deleting GG in the Rpattern, we don't trim only the last 8 letters of the subject rather than all 10. This is a valid question. In general, with max.Rmismatch=2 and with.Rindels=TRUE, the matching suffix of the subject ending.at=10) could be as short as 8 letters or as long as 12. The lower (C) level matching code does not actually report the matching portion of the subject -- at least, it didn't long ago. Thus, trimLRPatterns takes a lazy out and trims the subject suffix of the same length as the matching Rpattern prefix. Up to the length of the pattern prefix, this has the advantage of always removing any possible adapter even if more is removed than necessary, as you describe here. If the match used 2 insertions in the pattern, rather than deletions, so was 12 letters long, then we aren't removing all possible adapter, and this will hurt or prevent later alignment. One might say that we should trim 12 in all cases, even yours, and I have proposed that. Thanks, and let us know if I didn't get your point. On Nov 29, 2012, at 11:17 PM, Wang Peter wrote: > sorry to disturb you again > > but i am still feeling confused > see this problem > > subject = "GGTAACTTTTCTGACACCTCCTGCTTAAAACCCCAAAGGTCAGAAGGATCGTGAGGC CCCGCTTTCACGGTCTGTATTCGTACTGAAAATCAAGATCAAG" > > Rpattern = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG" > > max.mismatchs <- 0.2*1:nchar(DNAString(Rpattern)) > > trimLRPatterns(Rpattern = Rpattern, subject = subject, > max.Rmismatch=max.mismatchs, with.Rindels=TRUE) > [1] "GGTAACTTTTCTGACACCTCCTGCTTAAAACCCCAAAGGTCAGAAGGATCGTGAGGCCCCGCT TTCACGGTCTGTATTCGTACTGAAAAT" > > CAAGATC AAG > AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG > > overlap length is 10 bp, so the allowed distance is 10*0.2=2 > > so it should trim the "AGATCAAG", not include the "CA" > > I am confused why? > > > > -- > shan gao > Room 231(Dr.Fei lab) > Boyce Thompson Institute > Cornell University > Tower Road, Ithaca, NY 14853-1801 > Office phone: 1-607-254-1267(day) > Official email:sg839 at cornell.edu > Facebook:http://www.facebook.com/profile.php?id=100001986532253 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 11.4 years ago Harris A. Jaffee ▴ 590

0

Entering edit mode

thank you very much, Harris,you helped me again now i understand, see the below max.mismatchs <- 0.2*1:nchar(Rpattern) subject = "GGTAACTTTTCTGACACCTCCTGCTTAAAACCCCAAAGGTCAGAAGGATCGTGAGGCCC CGCTTTCACGGTCTGTATTCGTACTGAAAATCAAGATCAAG" Rpattern = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG" sapply((nchar(subject)-nchar(Rpattern)+1):nchar(subject), function(j) { s = substr(subject, j, nchar(subject)) p = substr(Rpattern, 1, nchar(subject)-j+1) neditEndingAtending.at=nchar(s), pattern = p, subject = s, with.indels=TRUE) }) all distance [1] 32 33 33 32 31 32 31 30 29 28 27 26 27 26 25 25 24 23 22 22 21 20 20 20 [25] 20 19 18 17 18 17 17 18 17 16 15 16 15 14 13 12 12 11 10 9 8 7 6 6 [49] 6 6 6 5 4 3 (2) 3 3 3 3 3 2 1 0 1 max.mismatchs [1] 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 (2.0) 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 [20] 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 [39] 7.8 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8 10.0 10.2 10.4 10.6 10.8 11.0 11.2 11.4 [58] 11.6 11.8 12.0 12.2 12.4 12.6 12.8 when the function find a distance < = the corresponding mismatch. see (2) and (2.0), the function stops. but i think the distance between those 10bp kmer should be 4, not 2 CAAGATC AAG AGATCGGAAG

ADD REPLY • link 11.4 years ago wang peter ★ 2.0k

Login before adding your answer.