trim adapter with mismatch tolerance

0

Entering edit mode

Kunbin Qu ▴ 40

@kunbin-qu-5018

Last seen 11.4 years ago

Hi, I have some reads with 25 bp adapter at the five prime end. Some of the reads do not have perfect match with the adapter, even with some indels. I tried to use trimLRPatterns to trim them off, but I got confused with the max.Lmismatch parameter, as it is doing sub-string trimming too. The adapter in my reads are the full length of 25 bp, without shortening to any substring. How should I specify the parameter then? Or maybe I could use some other functions from ShortRead? Please help. Thanks. -Kunbin code I used, but this code cannot accommodate the mismatches. reads<-readFastq("test.fq") adapter<-DNAString("CTGGCCTTTGAGACTGGCCCACCTC") seqs<-sread(reads) trimCoordsAR<-trimLRPatterns(Lpattern=adapter, subject=seqs, max.Lmismatch =1, ranges=T) ______________________________________________________________________ The contents of this electronic message, including any attachments, are intended only for the use of the individual or entity to which they are addressed and may contain confidential information. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this message or any attachment is strictly prohibited. If you have received this transmission in error, please send an e-mail to postmaster@genomichealth.com and delete this message, along with any attachments, from your computer. [[alternative HTML version deleted]]

• 1.7k views

ADD COMMENT • link updated 14.1 years ago by Harris A. Jaffee ▴ 590 • written 14.1 years ago by Kunbin Qu ▴ 40

0

Entering edit mode

Harris A. Jaffee ▴ 590

@harris-a-jaffee-3972

Last seen 11.3 years ago

United States

Please give a read example and exactly what you want to happen and not happen, and, optionally, exactly you tried and what did happen, along with the output of sessionInfo() in case that may be relevant. The spec and examples in the Biostrings::trimLRPatterns help page are not ideal. You are correct that a single integer in max.Lmismatch should prevent matches/trimming by anything but the entire Lpattern. I can't tell whether you are saying that you need with.Lindels=TRUE or Lfixed="pattern". On Dec 18, 2011, at 11:38 PM, Kunbin Qu wrote: > Hi, > > I have some reads with 25 bp adapter at the five prime end. Some of > the reads do not have perfect match with the adapter, even with > some indels. I tried to use trimLRPatterns to trim them off, but I > got confused with the max.Lmismatch parameter, as it is doing sub- > string trimming too. The adapter in my reads are the full length > of 25 bp, without shortening to any substring. How should I specify > the parameter then? Or maybe I could use some other functions from > ShortRead? Please help. Thanks. > > -Kunbin > > code I used, but this code cannot accommodate the mismatches. > > reads<-readFastq("test.fq") > adapter<-DNAString("CTGGCCTTTGAGACTGGCCCACCTC") > seqs<-sread(reads) > trimCoordsAR<-trimLRPatterns(Lpattern=adapter, subject=seqs, > max.Lmismatch =1, ranges=T) > > ______________________________________________________________________ > The contents of this electronic message, including any attachments, > are intended only for the use of the individual or entity to which > they are addressed and may contain confidential information. If you > are not the intended recipient, you are hereby notified that any > use, dissemination, distribution, or copying of this message or any > attachment is strictly prohibited. If you have received this > transmission in error, please send an e-mail to > postmaster at genomichealth.com and delete this message, along with > any attachments, from your computer. > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/ > gmane.science.biology.informatics.conductor

ADD COMMENT • link 14.1 years ago Harris A. Jaffee ▴ 590

0

Entering edit mode

Harris, the followings are my code and intention and thanks a lot for your help! For read 2, it will be trimmed For read 3, there is a deletion at the 4th bp, CTG-CC, it was not trimmed by my code, but I wanted to For read 4, there are two mismatches starting from the 4th bp: CTGCAC, comparing with CTGGCC. I wanted this to be trimmed too, but it did not -Kunbin Reads input: @SEA055486C_0096:7:1101:2898:2204#ACAGTG/1 CTGTGTGCTCAGGGGGCCTGGTGCCACACTCCCCCGCAGAGGGTTGTATTGGTTCGGCACCATGCCGCTC TGCAGCCGGGACAGCCACTCGCA + gggggiiihihiiiiiiiiiiiihiiihiiiiiiigeecW`cc\`aabcdcccccccccccccccccaac cccccccccaaccccccccbcc] @SEA055486C_0096:7:1101:3620:2209#ACAGTG/1 CTGGCCTTTGAGACTGGCCCACCTCGTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGG GCTC + ccccc_ad`[``ed_bRP^aZbccbbc\_cccccchhhdbb`bM\`\``bbc]V^RZ^____\^\^W[YY ^^^^ @SEA055486C_0096:7:1101:3316:2429#ACAGTG/1 CTGCCTTTGAGACTGGCCCACCTCGTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGG CTCCAGCAGCGGAGGGGCCCTG + gggggiiiiiiiiiiiiegiiiiiihiiiih`eghi[ghhiiiiiihigggggfeeeeccddcccccccc c`bcccccccaccT]acaaa]a @SEA055486C_0096:7:1101:3316:2439#ACAGTG/1 CTGCACTTTGAGACTGGCCCACCTCGTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGG GCTCCAGCAGCGGAGGGGCCCTG + gggggiiiiiiiiiiiiegiiiiiihiiiih`eghi[ghhiiiiiihigggggfeeeeccddcccccccc c`bcccccccaccT]acaaa]aa code: adapter<-DNAString("CTGGCCTTTGAGACTGGCCCACCTC") reads<-readFastq("t2.fq") seqs<-sread(reads) trimCoordsAR<-trimLRPatterns(Lpattern=adapter, subject=seqs, max.Lmismatch = 1, ranges=T) seqsAR <- DNAStringSet(seqs, start=start(trimCoordsAR), end=end(trimCoordsAR)) seqsAR A DNAStringSet instance of length 5 width seq [1] 93 CTGTGTGCTCAGGGGGCCTGGTGCCACACTCCCC...CATGCCGCTCTGCAGCCGGGACAGCCACTCGCA [2] 49 GTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTC [3] 92 CTGCCTTTGAGACTGGCCCACCTCGTCTCTCCCA...ACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCTG [4] 93 CTGCACTTTGAGACTGGCCCACCTCGTCTCTCCC...ACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCTG > sessionInfo() R version 2.14.0 (2011-10-31) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ShortRead_1.12.3 latticeExtra_0.6-19 RColorBrewer_1.0-5 [4] Rsamtools_1.6.3 lattice_0.20-0 Biostrings_2.22.0 [7] GenomicRanges_1.6.4 IRanges_1.12.5 loaded via a namespace (and not attached): [1] Biobase_2.14.0 bitops_1.0-4.1 BSgenome_1.22.0 grid_2.14.0 [5] hwriter_1.3 RCurl_1.8-0 rtracklayer_1.14.4 tools_2.14.0 [9] XML_3.6-2 zlibbioc_1.0.0 > -----Original Message----- From: Harris A. Jaffee [mailto:hj@jhu.edu] Sent: Monday, December 19, 2011 4:58 AM To: Kunbin Qu Cc: bioconductor at r-project.org Subject: Re: [BioC] trim adapter with mismatch tolerance Please give a read example and exactly what you want to happen and not happen, and, optionally, exactly you tried and what did happen, along with the output of sessionInfo() in case that may be relevant. The spec and examples in the Biostrings::trimLRPatterns help page are not ideal. You are correct that a single integer in max.Lmismatch should prevent matches/trimming by anything but the entire Lpattern. I can't tell whether you are saying that you need with.Lindels=TRUE or Lfixed="pattern". On Dec 18, 2011, at 11:38 PM, Kunbin Qu wrote: > Hi, > > I have some reads with 25 bp adapter at the five prime end. Some of > the reads do not have perfect match with the adapter, even with some > indels. I tried to use trimLRPatterns to trim them off, but I got > confused with the max.Lmismatch parameter, as it is doing sub- string > trimming too. The adapter in my reads are the full length of 25 bp, > without shortening to any substring. How should I specify the > parameter then? Or maybe I could use some other functions from > ShortRead? Please help. Thanks. > > -Kunbin > > code I used, but this code cannot accommodate the mismatches. > > reads<-readFastq("test.fq") > adapter<-DNAString("CTGGCCTTTGAGACTGGCCCACCTC") > seqs<-sread(reads) > trimCoordsAR<-trimLRPatterns(Lpattern=adapter, subject=seqs, > max.Lmismatch =1, ranges=T) > > ______________________________________________________________________ > The contents of this electronic message, including any attachments, > are intended only for the use of the individual or entity to which > they are addressed and may contain confidential information. If you > are not the intended recipient, you are hereby notified that any use, > dissemination, distribution, or copying of this message or any > attachment is strictly prohibited. If you have received this > transmission in error, please send an e-mail to > postmaster at genomichealth.com and delete this message, along with any > attachments, from your computer. > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/ > gmane.science.biology.informatics.conductor ______________________________________________________________________ The contents of this electronic message, including any attachments, are intended only for the use of the individual or entity to which they are addressed and may contain confidential information. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this message or any attachment is strictly prohibited. If you have received this transmission in error, please send an e-mail to postmaster at genomichealth.com and delete this message, along with any attachments, from your computer.

ADD REPLY • link 14.1 years ago Kunbin Qu ▴ 40

0

Entering edit mode

See below. On Dec 19, 2011, at 2:43 PM, Kunbin Qu wrote: > Harris, the followings are my code and intention and thanks a lot > for your help! > > For read 2, it will be trimmed > For read 3, there is a deletion at the 4th bp, CTG-CC, it was not > trimmed by my code, but I wanted to > For read 4, there are two mismatches starting from the 4th bp: > CTGCAC, comparing with CTGGCC. I wanted this to be trimmed too, but > it did not > > -Kunbin > > > Reads input: > > @SEA055486C_0096:7:1101:2898:2204#ACAGTG/1 > CTGTGTGCTCAGGGGGCCTGGTGCCACACTCCCCCGCAGAGGGTTGTATTGGTTCGGCACCATGCCGCTC > TGCAGCCGGGACAGCCACTCGCA > + > gggggiiihihiiiiiiiiiiiihiiihiiiiiiigeecW`cc > \`aabcdcccccccccccccccccaaccccccccccaaccccccccbcc] > @SEA055486C_0096:7:1101:3620:2209#ACAGTG/1 > CTGGCCTTTGAGACTGGCCCACCTCGTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGG > GCTC > + > ccccc_ad`[``ed_bRP^aZbccbbc\_cccccchhhdbb`bM\`\``bbc]V^RZ^____\^\^W > [YY^^^^ > @SEA055486C_0096:7:1101:3316:2429#ACAGTG/1 > CTGCCTTTGAGACTGGCCCACCTCGTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGG > CTCCAGCAGCGGAGGGGCCCTG > + > gggggiiiiiiiiiiiiegiiiiiihiiiih`eghi > [ghhiiiiiihigggggfeeeeccddccccccccc`bcccccccaccT]acaaa]a > @SEA055486C_0096:7:1101:3316:2439#ACAGTG/1 > CTGCACTTTGAGACTGGCCCACCTCGTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGG > GCTCCAGCAGCGGAGGGGCCCTG > + > gggggiiiiiiiiiiiiegiiiiiihiiiih`eghi > [ghhiiiiiihigggggfeeeeccddccccccccc`bcccccccaccT]acaaa]aa > > > code: > > adapter<-DNAString("CTGGCCTTTGAGACTGGCCCACCTC") > reads<-readFastq("t2.fq") > seqs<-sread(reads) > trimCoordsAR<-trimLRPatterns(Lpattern=adapter, subject=seqs, > max.Lmismatch = 1, ranges=T) > seqsAR <- DNAStringSet(seqs, start=start(trimCoordsAR), end=end > (trimCoordsAR)) > seqsAR > A DNAStringSet instance of length 5 > width seq > [1] 93 > CTGTGTGCTCAGGGGGCCTGGTGCCACACTCCCC...CATGCCGCTCTGCAGCCGGGACAGCCACTCGCA > [2] 49 GTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTC > [3] 92 > CTGCCTTTGAGACTGGCCCACCTCGTCTCTCCCA...ACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCTG > [4] 93 > CTGCACTTTGAGACTGGCCCACCTCGTCTCTCCC...ACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCTG Basically, you just need (1) to explicitly allow indels, since unfortunately the default is not to allow them, and (2) your mismatch tolerance isn't large enough. To get your feet wet, start with some exploratory data analysis such as: > neditStartingAtstarting.at=1, pattern=adapter, subject=seqs) [,1] [,2] [,3] [,4] [1,] 16 0 15 2 > neditStartingAtstarting.at=1, pattern=adapter, subject=seqs, with.indels=TRUE) [,1] [,2] [,3] [,4] [1,] 12 0 1 2 So then, aside from possible non-calls ("N"), of which there aren't any in these 4 reads, we can now see pretty much what we need at least for reads 2, 3, and 4: > trimLRPatterns(Lpattern=adapter, subject=seqs, max.Lmismatch=2, with.Lindels=TRUE) A DNAStringSet instance of length 4 width seq [1] 93 CTGTGTGCTCAGGGGGCCTGGTGCCACACTCCCC...CATGCCGCTCTGCAGCCGGGACAGCCACTCGCA [2] 49 GTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTC [3] 67 TCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCTG [4] 68 GTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCTG Read 1 could be more subtle, or (more likely, I guess) just does not contain any adapter to trim in the first place. >> sessionInfo() > R version 2.14.0 (2011-10-31) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > other attached packages: > [1] ShortRead_1.12.3 latticeExtra_0.6-19 RColorBrewer_1.0-5 > [4] Rsamtools_1.6.3 lattice_0.20-0 Biostrings_2.22.0 > [7] GenomicRanges_1.6.4 IRanges_1.12.5 > > loaded via a namespace (and not attached): > [1] Biobase_2.14.0 bitops_1.0-4.1 BSgenome_1.22.0 > grid_2.14.0 > [5] hwriter_1.3 RCurl_1.8-0 rtracklayer_1.14.4 > tools_2.14.0 > [9] XML_3.6-2 zlibbioc_1.0.0 >> > > -----Original Message----- > From: Harris A. Jaffee [mailto:hj at jhu.edu] > Sent: Monday, December 19, 2011 4:58 AM > To: Kunbin Qu > Cc: bioconductor at r-project.org > Subject: Re: [BioC] trim adapter with mismatch tolerance > > Please give a read example and exactly what you want to happen and > not happen, and, optionally, exactly you tried and what did happen, > along with the output of sessionInfo() in case that may be > relevant. The spec and examples in the Biostrings::trimLRPatterns > help page are not ideal. > > You are correct that a single integer in max.Lmismatch should > prevent matches/trimming by anything but the entire Lpattern. > > I can't tell whether you are saying that you need with.Lindels=TRUE > or Lfixed="pattern". > > On Dec 18, 2011, at 11:38 PM, Kunbin Qu wrote: > >> Hi, >> >> I have some reads with 25 bp adapter at the five prime end. Some of >> the reads do not have perfect match with the adapter, even with some >> indels. I tried to use trimLRPatterns to trim them off, but I got >> confused with the max.Lmismatch parameter, as it is doing sub- string >> trimming too. The adapter in my reads are the full length of 25 bp, >> without shortening to any substring. How should I specify the >> parameter then? Or maybe I could use some other functions from >> ShortRead? Please help. Thanks. >> >> -Kunbin >> >> code I used, but this code cannot accommodate the mismatches. >> >> reads<-readFastq("test.fq") >> adapter<-DNAString("CTGGCCTTTGAGACTGGCCCACCTC") >> seqs<-sread(reads) >> trimCoordsAR<-trimLRPatterns(Lpattern=adapter, subject=seqs, >> max.Lmismatch =1, ranges=T) >> >> _____________________________________________________________________ >> _ >> The contents of this electronic message, including any attachments, >> are intended only for the use of the individual or entity to which >> they are addressed and may contain confidential information. If you >> are not the intended recipient, you are hereby notified that any use, >> dissemination, distribution, or copying of this message or any >> attachment is strictly prohibited. If you have received this >> transmission in error, please send an e-mail to >> postmaster at genomichealth.com and delete this message, along with any >> attachments, from your computer. >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/ >> gmane.science.biology.informatics.conductor > > > ______________________________________________________________________ > The contents of this electronic message, including any attachments, > are intended only for the use of the individual or entity to which > they are addressed and may contain confidential information. If you > are not the intended recipient, you are hereby notified that any > use, dissemination, distribution, or copying of this message or any > attachment is strictly prohibited. If you have received this > transmission in error, please send an e-mail to > postmaster at genomichealth.com and delete this message, along with > any attachments, from your computer.

ADD REPLY • link 14.1 years ago Harris A. Jaffee ▴ 590

0

Entering edit mode

Another way, below. On Dec 19, 2011, at 4:14 PM, Harris A. Jaffee wrote: > See below. > > On Dec 19, 2011, at 2:43 PM, Kunbin Qu wrote: >> Harris, the followings are my code and intention and thanks a lot >> for your help! >> >> For read 2, it will be trimmed >> For read 3, there is a deletion at the 4th bp, CTG-CC, it was not >> trimmed by my code, but I wanted to >> For read 4, there are two mismatches starting from the 4th bp: >> CTGCAC, comparing with CTGGCC. I wanted this to be trimmed too, >> but it did not >> >> -Kunbin >> >> >> Reads input: >> >> @SEA055486C_0096:7:1101:2898:2204#ACAGTG/1 >> CTGTGTGCTCAGGGGGCCTGGTGCCACACTCCCCCGCAGAGGGTTGTATTGGTTCGGCACCATGCCGCT >> CTGCAGCCGGGACAGCCACTCGCA >> + >> gggggiiihihiiiiiiiiiiiihiiihiiiiiiigeecW`cc >> \`aabcdcccccccccccccccccaaccccccccccaaccccccccbcc] >> @SEA055486C_0096:7:1101:3620:2209#ACAGTG/1 >> CTGGCCTTTGAGACTGGCCCACCTCGTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTG >> GGCTC >> + >> ccccc_ad`[``ed_bRP^aZbccbbc\_cccccchhhdbb`bM\`\``bbc]V^RZ^____\^\^W >> [YY^^^^ >> @SEA055486C_0096:7:1101:3316:2429#ACAGTG/1 >> CTGCCTTTGAGACTGGCCCACCTCGTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGG >> GCTCCAGCAGCGGAGGGGCCCTG >> + >> gggggiiiiiiiiiiiiegiiiiiihiiiih`eghi >> [ghhiiiiiihigggggfeeeeccddccccccccc`bcccccccaccT]acaaa]a >> @SEA055486C_0096:7:1101:3316:2439#ACAGTG/1 >> CTGCACTTTGAGACTGGCCCACCTCGTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTG >> GGCTCCAGCAGCGGAGGGGCCCTG >> + >> gggggiiiiiiiiiiiiegiiiiiihiiiih`eghi >> [ghhiiiiiihigggggfeeeeccddccccccccc`bcccccccaccT]acaaa]aa >> >> >> code: >> >> adapter<-DNAString("CTGGCCTTTGAGACTGGCCCACCTC") >> reads<-readFastq("t2.fq") >> seqs<-sread(reads) >> trimCoordsAR<-trimLRPatterns(Lpattern=adapter, subject=seqs, >> max.Lmismatch = 1, ranges=T) >> seqsAR <- DNAStringSet(seqs, start=start(trimCoordsAR), end=end >> (trimCoordsAR)) >> seqsAR >> A DNAStringSet instance of length 5 >> width seq >> [1] 93 >> CTGTGTGCTCAGGGGGCCTGGTGCCACACTCCCC...CATGCCGCTCTGCAGCCGGGACAGCCACTCGC >> A >> [2] 49 GTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTC >> [3] 92 >> CTGCCTTTGAGACTGGCCCACCTCGTCTCTCCCA...ACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCT >> G >> [4] 93 >> CTGCACTTTGAGACTGGCCCACCTCGTCTCTCCC...ACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCT >> G > > Basically, you just need (1) to explicitly allow indels, since > unfortunately the > default is not to allow them, and (2) your mismatch tolerance isn't > large enough. > > To get your feet wet, start with some exploratory data analysis > such as: > > > neditStartingAtstarting.at=1, pattern=adapter, subject=seqs) > [,1] [,2] [,3] [,4] > [1,] 16 0 15 2 > > > neditStartingAtstarting.at=1, pattern=adapter, subject=seqs, > with.indels=TRUE) > [,1] [,2] [,3] [,4] > [1,] 12 0 1 2 Or just use trimLRPatterns itself with very liberal matching, for example trimLRPatterns(Lpattern=adapter, subject=seqs, max.Lmismatch = 6, with.Lindels=TRUE) If your reads can contain partial adaptors, set max.Lmismatch to a "rate": trimLRPatterns(Lpattern=adapter, subject=seqs, max.Lmismatch = .2, with.Lindels=TRUE) > So then, aside from possible non-calls ("N"), of which there aren't > any in these > 4 reads, we can now see pretty much what we need at least for reads > 2, 3, and 4: > > > trimLRPatterns(Lpattern=adapter, subject=seqs, max.Lmismatch=2, > with.Lindels=TRUE) > A DNAStringSet instance of length 4 > width seq > [1] 93 > CTGTGTGCTCAGGGGGCCTGGTGCCACACTCCCC...CATGCCGCTCTGCAGCCGGGACAGCCACTCGCA > [2] 49 GTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTC > [3] 67 > TCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCTG > [4] 68 > GTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCTG > > Read 1 could be more subtle, or (more likely, I guess) just does > not contain any > adapter to trim in the first place. > >>> sessionInfo() >> R version 2.14.0 (2011-10-31) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> other attached packages: >> [1] ShortRead_1.12.3 latticeExtra_0.6-19 RColorBrewer_1.0-5 >> [4] Rsamtools_1.6.3 lattice_0.20-0 Biostrings_2.22.0 >> [7] GenomicRanges_1.6.4 IRanges_1.12.5 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.14.0 bitops_1.0-4.1 BSgenome_1.22.0 >> grid_2.14.0 >> [5] hwriter_1.3 RCurl_1.8-0 rtracklayer_1.14.4 >> tools_2.14.0 >> [9] XML_3.6-2 zlibbioc_1.0.0 >>> >> >> -----Original Message----- >> From: Harris A. Jaffee [mailto:hj at jhu.edu] >> Sent: Monday, December 19, 2011 4:58 AM >> To: Kunbin Qu >> Cc: bioconductor at r-project.org >> Subject: Re: [BioC] trim adapter with mismatch tolerance >> >> Please give a read example and exactly what you want to happen and >> not happen, and, optionally, exactly you tried and what did >> happen, along with the output of sessionInfo() in case that may be >> relevant. The spec and examples in the Biostrings::trimLRPatterns >> help page are not ideal. >> >> You are correct that a single integer in max.Lmismatch should >> prevent matches/trimming by anything but the entire Lpattern. >> >> I can't tell whether you are saying that you need >> with.Lindels=TRUE or Lfixed="pattern". >> >> On Dec 18, 2011, at 11:38 PM, Kunbin Qu wrote: >> >>> Hi, >>> >>> I have some reads with 25 bp adapter at the five prime end. Some of >>> the reads do not have perfect match with the adapter, even with some >>> indels. I tried to use trimLRPatterns to trim them off, but I got >>> confused with the max.Lmismatch parameter, as it is doing sub- >>> string >>> trimming too. The adapter in my reads are the full length of 25 bp, >>> without shortening to any substring. How should I specify the >>> parameter then? Or maybe I could use some other functions from >>> ShortRead? Please help. Thanks. >>> >>> -Kunbin >>> >>> code I used, but this code cannot accommodate the mismatches. >>> >>> reads<-readFastq("test.fq") >>> adapter<-DNAString("CTGGCCTTTGAGACTGGCCCACCTC") >>> seqs<-sread(reads) >>> trimCoordsAR<-trimLRPatterns(Lpattern=adapter, subject=seqs, >>> max.Lmismatch =1, ranges=T) >>> >>> ____________________________________________________________________ >>> __ >>> The contents of this electronic message, including any attachments, >>> are intended only for the use of the individual or entity to which >>> they are addressed and may contain confidential information. If you >>> are not the intended recipient, you are hereby notified that any >>> use, >>> dissemination, distribution, or copying of this message or any >>> attachment is strictly prohibited. If you have received this >>> transmission in error, please send an e-mail to >>> postmaster at genomichealth.com and delete this message, along with any >>> attachments, from your computer. >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/ >>> gmane.science.biology.informatics.conductor >> >> >> _____________________________________________________________________ >> _ >> The contents of this electronic message, including any >> attachments, are intended only for the use of the individual or >> entity to which they are addressed and may contain confidential >> information. If you are not the intended recipient, you are hereby >> notified that any use, dissemination, distribution, or copying of >> this message or any attachment is strictly prohibited. If you have >> received this transmission in error, please send an e-mail to >> postmaster at genomichealth.com and delete this message, along with >> any attachments, from your computer. >

ADD REPLY • link 14.1 years ago Harris A. Jaffee ▴ 590

Login before adding your answer.