6
0
Entering edit mode
@taylor-sean-d-5800
Last seen 8.4 years ago
0
Entering edit mode
@ryan-c-thompson-5618
Last seen 2.4 years ago
Scripps Research, La Jolla, CA
0
Entering edit mode
@james-w-macdonald-5106
Last seen 41 minutes ago
United States
0
Entering edit mode
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States
0
Entering edit mode
0
Entering edit mode
Thanks all for your input. I will give some of these solutions a try and will probably go with whichever is faster and integrates well with the remainder of our analyses in R. Herve, thanks for the clarification. Looking back I see that I hadn't explored the function fully. I think it will work, but I have a few follow-up questions. Here is the sample data set that I'm playing around with: library(Rsamtools) bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools") param <- ScanBamParam(what=c("seq", "qual")) gal <- readGappedAlignments(bamfile, use.names=TRUE, param=param) These sequences are all relatively short, about 35nts each. I set the first read as my "adaptor" and pulled out all the reads that have a start position overlapping the 'adaptor' sequence: gal<-gal[which(start(gal)<=width(gal[1]))] reads <- setNames(mcols(gal)$seq, names(gal)) seqnames<-seqnames(gal) adaptor<-reads[[1]] The data are from two 'chromosomes', seq1 and seq2. My adaptor is from seq1, so I expect it to overlap on all seq1, but not on seq2. This lets me approximate best and worst case scenarios: seq1<-reads[seqnames=='seq1'] seq2<-reads[seqnames=='seq2'] > adaptor 36-letter "DNAString" instance seq: CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG > seq1 A DNAStringSet instance of length 16 width seq names [1] 36 CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG B7_591:4:96:693:509 [2] 35 CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT EAS54_65:7:152:36... [3] 35 AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC EAS51_64:8:5:734:57 [4] 36 GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT B7_591:1:289:587:906 [5] 35 GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG EAS56_59:8:38:671... ... ... ... [12] 36 GTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGC B7_591:3:188:662:155 [13] 35 TTTAACTCGTCCATGGCCCAGCATTAGGGATCTGT EAS56_59:2:225:60... [14] 35 TTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTG EAS51_66:7:328:39... [15] 35 AACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGA EAS51_64:5:257:96... [16] 35 GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG EAS54_61:4:143:69... > seq2 A DNAStringSet instance of length 25 width seq names [1] 36 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA B7_591:8:4:841:340 [2] 35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA EAS54_67:4:142:94... [3] 35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA EAS54_67:6:43:859... [4] 35 TCAAATGAACTTCTGTAATTGAAAAATTCATTTAA EAS1_93:2:286:923... [5] 35 AATGAACTTCTGTAATTGAAAAATTCATTTAAGAA EAS1_99:8:117:578... ... ... ... [21] 35 AAATTCATTTAAGAAATTACAAAATATAGTTGAAA EAS54_65:8:305:81... [22] 35 AATTCATTTAAGAAATTACAAAATATAGTTGAAAG EAS114_26:7:13:17... [23] 35 CATTTAAGAAATTACAAAATATAGTTGAAAGCTCT EAS56_63:7:34:334... [24] 35 TTAAGAAATTACAAAATATAGTTGAAAGCTCTAAC EAS114_45:3:32:13... [25] 40 TAAGAAATTACAAAATATAGTTGAAAGCTCTAACAATAGA EAS139_19:5:70:31... If I use the basic parameters for trimLRPatterns on seq1: > trimmed_reads1<-trimLRPatterns(Lpattern=adaptor, subject=seq1) My expected sequence widths after trimming are: > width(seq1)-(length(adaptor)-start(gal)[which(seqnames=='seq1')]+1) [1] 0 1 3 5 7 11 12 13 16 20 20 23 26 27 29 34 Actually trimmed sequence widths: > width(trimmed_reads1) [1] 0 1 3 35 7 11 12 13 16 20 20 23 26 27 29 34 Here, the 4th sequence isn't trimmed because of some mismatches in the sequence. I expect that my data will have some mismatches, so I need to be able to work with those. So I tried setting the max.Lmismatch=5 to allow for up to 5 mismatches, but got no trimming: > trimmed_reads2<-trimLRPatterns(Lpattern=adaptor, subject=seq1, max.Lmismatch=5) > width(trimmed_reads2) [1] 0 35 35 36 35 35 36 35 35 35 35 36 35 35 35 35 In my experimenting, I had gone straight to this and saw that it didn't trim partial sequence with these settings, so I had concluded that it wouldn't do the job. After playing with it, I tried adjusting trimLRPatterns=0.5 to allow 50% mismatches (if I understand that correctly) and this time get too much trimming on the last sequence: > trimmed_reads3<-trimLRPatterns(Lpattern=adaptor, subject=seq1, max.Lmismatch=0.5) > width(trimmed_reads3) [1] 0 1 3 5 7 11 12 13 16 20 20 23 26 27 29 27 I tried setting max.Lmismatch =0.1 but got too little trimming again: > trimmed_reads4<-trimLRPatterns(Lpattern=adaptor, subject=seq1, max.Lmismatch=0.1) > width(trimmed_reads4) [1] 0 1 3 35 7 11 12 13 16 20 20 23 26 27 29 34 I could probably optimize until I got it just right, but with a large data set that might not be too practical. If I try these same settings against seq2, where there is not supposed to be overlap with the adaptor, I get erroneous trimming with mismatch =0.5. With mismatch = 0.1, it is better, but it decides to trim off the first nucleotide on a few of them: > trimmed_reads5<-trimLRPatterns(Lpattern=adaptor, subject=seq2, max.Lmismatch=0.5) > trimmed_reads6<-trimLRPatterns(Lpattern=adaptor, subject=seq2, max.Lmismatch=0.1) > width(seq2) [1] 36 35 35 35 35 35 35 35 36 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35 40 > width(trimmed_reads5) [1] 27 26 26 27 11 26 26 26 14 13 13 13 28 28 31 23 27 28 29 27 28 29 33 27 40 > width(trimmed_reads6) [1] 36 35 35 35 35 35 35 35 36 35 35 35 34 34 35 35 34 35 35 35 35 35 35 35 40 So what is the best way to optimize this? Obviously you are trying to balance recognizing the whole adaptor sequence without trimming off too many false positives. Also, can I set a limit that requires to have at least n bases overlap with the adaptor suffix? Thanks for your help (again)! Sean > -----Original Message----- > From: Hervé Pagès [mailto:hpages at fhcrc.org] > Sent: Monday, July 22, 2013 5:37 PM > To: Taylor, Sean D > Cc: bioconductor at r-project.org > Subject: Re: Trimming of partial adaptor sequences > > Hi Sean, > > On 07/22/2013 01:02 PM, Taylor, Sean D wrote: > > We have been experimenting with a NGS protocol in which we insert > > sheared genomic fragments into a custom plasmid for sequencing on an > > Illumina MiSeq instrument. The insertion site of this plasmid is > > flanked by our own custom barcodes (N7) and ~80 nt Illumina-based > > adaptor sequence. We then PCR out the insert with barcodes and > > adaptors for sequencing. Our adaptor sequence is similar to the > > Illumina adaptor, but we use custom primer binding sites. We are not > > sure if the Illumina software will be able to recognize and trim our > > custom adaptors. We are trying to figure out the best way to trim read > through into the 3' > > adaptor ourselves. We have roughly three scenarios: > > > > (1) The insert is long enough that we have no read through > > > > (2) The vector is empty, in which case the entire adaptor sequence is > > present > > > > (3) The insert is long enough to have useful data, but we get > > read-through into the 3' adaptor sequence that must be trimmed. > > > > The solution we are currently working on is to identify the minimal > > sequence that is recognizable as the adaptor sequence and trim that > > using trimLRPatterns() in the Biostrings package. Ideally we would > > like it if we could give trimLRPatterns() the entire adaptor sequence > > and have it recognize it on our reads even if it is only partially present. > > May be I misunderstand what you are trying to do exactly but yes, you can > give the entire adaptor sequence to trimLRPatterns() and it will recognize it > on our reads even if it's only partially present: > > library(Biostrings) > > adaptor <- DNAString("ACCAGGACAA") # entire adaptor > reads <- DNAStringSet(c( > "GACAATTTATTT", # adaptor partially present on the left > "CAATTTATTTGC", # adaptor partially present on the left > "TTTATTTACCAG", # adaptor partially present on the right > "CAATTTTTTACC" # adaptor partially present on both ends > )) > > Then: > > > trimLRPatterns(Lpattern=adaptor, Rpattern=adaptor, subject=reads) > A DNAStringSet instance of length 4 > width seq > [1] 7 TTTATTT > [2] 9 TTTATTTGC > [3] 7 TTTATTT > [4] 6 TTTTTT > > Note that trimLRPatterns() expects that, when the adaptor is partially > present on the left (resp. right), what's present is a suffix (resp. > prefix) of the adaptor, and not an arbitrary substring of it. Is it what you > expect too? > > Thanks, > H. > > > However, in my experimenting it did not seem to be able to this. I > > thought I would ask the Bioconductor community if there are any better > > solutions to recognizing and trimming partial adaptor sequences. > > > > Thanks in advance for any input. > > > > Sean Taylor > > > > Post-doctoral Fellow > > > > Fred Hutchinson Cancer Research Center > > > > 206-667-5544 > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 ADD REPLY 0 Entering edit mode Hi Sean, On 07/23/2013 03:26 PM, Taylor, Sean D wrote: > Thanks all for your input. I will give some of these solutions a try and will probably go with whichever is faster and integrates well with the remainder of our analyses in R. > > Herve, thanks for the clarification. Looking back I see that I hadn't explored the function fully. I think it will work, but I have a few follow-up questions. Here is the sample data set that I'm playing around with: > > library(Rsamtools) > bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools") > param <- ScanBamParam(what=c("seq", "qual")) > gal <- readGappedAlignments(bamfile, use.names=TRUE, param=param) > > These sequences are all relatively short, about 35nts each. I set the first read as my "adaptor" and pulled out all the reads that have a start position overlapping the 'adaptor' sequence: > > gal<-gal[which(start(gal)<=width(gal[1]))] > reads <- setNames(mcols(gal)$seq, names(gal)) > seqnames<-seqnames(gal) > adaptor<-reads[[1]] > > The data are from two 'chromosomes', seq1 and seq2. My adaptor is from seq1, so I expect it to overlap on all seq1, but not on seq2. This lets me approximate best and worst case scenarios: > > seq1<-reads[seqnames=='seq1'] > seq2<-reads[seqnames=='seq2'] > >> adaptor > 36-letter "DNAString" instance > seq: CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG > >> seq1 > A DNAStringSet instance of length 16 > width seq names > [1] 36 CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG B7_591:4:96:693:509 > [2] 35 CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT EAS54_65:7:152:36... > [3] 35 AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC EAS51_64:8:5:734:57 > [4] 36 GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT B7_591:1:289:587:906 > [5] 35 GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG EAS56_59:8:38:671... > ... ... ... > [12] 36 GTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGC B7_591:3:188:662:155 > [13] 35 TTTAACTCGTCCATGGCCCAGCATTAGGGATCTGT EAS56_59:2:225:60... > [14] 35 TTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTG EAS51_66:7:328:39... > [15] 35 AACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGA EAS51_64:5:257:96... > [16] 35 GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG EAS54_61:4:143:69... > >> seq2 > A DNAStringSet instance of length 25 > width seq names > [1] 36 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA B7_591:8:4:841:340 > [2] 35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA EAS54_67:4:142:94... > [3] 35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA EAS54_67:6:43:859... > [4] 35 TCAAATGAACTTCTGTAATTGAAAAATTCATTTAA EAS1_93:2:286:923... > [5] 35 AATGAACTTCTGTAATTGAAAAATTCATTTAAGAA EAS1_99:8:117:578... > ... ... ... > [21] 35 AAATTCATTTAAGAAATTACAAAATATAGTTGAAA EAS54_65:8:305:81... > [22] 35 AATTCATTTAAGAAATTACAAAATATAGTTGAAAG EAS114_26:7:13:17... > [23] 35 CATTTAAGAAATTACAAAATATAGTTGAAAGCTCT EAS56_63:7:34:334... > [24] 35 TTAAGAAATTACAAAATATAGTTGAAAGCTCTAAC EAS114_45:3:32:13... > [25] 40 TAAGAAATTACAAAATATAGTTGAAAGCTCTAACAATAGA EAS139_19:5:70:31... > > If I use the basic parameters for trimLRPatterns on seq1: > >> trimmed_reads1<-trimLRPatterns(Lpattern=adaptor, subject=seq1) > > My expected sequence widths after trimming are: >> width(seq1)-(length(adaptor)-start(gal)[which(seqnames=='seq1')]+1) > [1] 0 1 3 5 7 11 12 13 16 20 20 23 26 27 29 34 > > Actually trimmed sequence widths: >> width(trimmed_reads1) > [1] 0 1 3 35 7 11 12 13 16 20 20 23 26 27 29 34 > > Here, the 4th sequence isn't trimmed because of some mismatches in the sequence. I expect that my data will have some mismatches, so I need to be able to work with those. So I tried setting the max.Lmismatch=5 to allow for up to 5 mismatches, but got no trimming: > >> trimmed_reads2<-trimLRPatterns(Lpattern=adaptor, subject=seq1, max.Lmismatch=5) >> width(trimmed_reads2) > [1] 0 35 35 36 35 35 36 35 35 35 35 36 35 35 35 35 Very strange indeed! I initially thought it was a bug and I was about to commit a fix but then I double-checked the man page and discovered the following "feature": max.Lmismatch: Either an integer vector of length ?nLp = nchar(Lpattern)? representing an absolute number of mismatches (or edit distance if ?with.Lindels? is ?TRUE?) or a single numeric value in the interval ?[0, 1)? representing a mismatch rate when aligning terminal substrings (suffixes) of ?Lpattern? with the beginning (prefix) of ?subject? following the conventions set by ?neditStartingAt?, ?isMatchingStartingAt?, etc. When ?max.Lmismatch? is ?0L? or a numeric value in the interval ?[0, 1)?, it is taken as a "rate" and is converted to ?as.integer(1:nLp * max.Lmismatch)?, analogous to agrep (which, however, employs ?ceiling?). Otherwise, ?max.Lmismatch? is treated as an integer vector where negative numbers are used to prevent trimming at the ?i?-th location. When an input integer vector is shorter than ?nLp?, it is augmented with enough ?-1?s at the beginning to bring its length up to ?nLp?. Elements of ?max.Lmismatch? beyond the first ?nLp? are ignored. I know, all this is a little bit tedious to read and not intuitive at all (at least to me, but I'm probably not the only one), but basically it means that if you want to allow up to 5 mismatches, you need to specify a vector of length 'Lpattern' and filled with the value 5: > trimLRPatterns(Lpattern=adaptor, subject=seq1, max.Lmismatch=rep(5, length(adaptor))) A DNAStringSet instance of length 16 width seq names [1] 0 B7_591:4:96:693:509 [2] 1 T EAS54_65:7:152:36... [3] 3 TCC EAS51_64:8:5:734:57 [4] 5 TCTCT B7_591:1:289:587:906 [5] 7 TCCATGG EAS56_59:8:38:671... ... ... ... [12] 23 TCCATGGCCCAGCATTAGGGAGC B7_591:3:188:662:155 [13] 26 TCCATGGCCCAGCATTAGGGATCTGT EAS56_59:2:225:60... [14] 27 TCCATGGCCCAGCATTAGGGAGCTGTG EAS51_66:7:328:39... [15] 29 TCCATGGCCCAGCATTAGGGAGCTGTGGA EAS51_64:5:257:96... [16] 27 CCCAGCATTAGGGAGCTGTGGACCCCG EAS54_61:4:143:69... Unfortunately this feature has been in trimLRPatterns() since the function was added to Biostrings 4 years ago so I guess I shouldn't touch it. Let's just say that "if it's documented then it's not a bug" and try to find some reconfort with that :-/ H. > > In my experimenting, I had gone straight to this and saw that it didn't trim partial sequence with these settings, so I had concluded that it wouldn't do the job. After playing with it, I tried adjusting trimLRPatterns=0.5 to allow 50% mismatches (if I understand that correctly) and this time get too much trimming on the last sequence: > >> trimmed_reads3<-trimLRPatterns(Lpattern=adaptor, subject=seq1, max.Lmismatch=0.5) >> width(trimmed_reads3) > [1] 0 1 3 5 7 11 12 13 16 20 20 23 26 27 29 27 > > I tried setting max.Lmismatch =0.1 but got too little trimming again: > >> trimmed_reads4<-trimLRPatterns(Lpattern=adaptor, subject=seq1, max.Lmismatch=0.1) >> width(trimmed_reads4) > [1] 0 1 3 35 7 11 12 13 16 20 20 23 26 27 29 34 > > I could probably optimize until I got it just right, but with a large data set that might not be too practical. > If I try these same settings against seq2, where there is not supposed to be overlap with the adaptor, I get erroneous trimming with mismatch =0.5. With mismatch = 0.1, it is better, but it decides to trim off the first nucleotide on a few of them: > >> trimmed_reads5<-trimLRPatterns(Lpattern=adaptor, subject=seq2, max.Lmismatch=0.5) >> trimmed_reads6<-trimLRPatterns(Lpattern=adaptor, subject=seq2, max.Lmismatch=0.1) >> width(seq2) > [1] 36 35 35 35 35 35 35 35 36 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35 40 >> width(trimmed_reads5) > [1] 27 26 26 27 11 26 26 26 14 13 13 13 28 28 31 23 27 28 29 27 28 29 33 27 40 >> width(trimmed_reads6) > [1] 36 35 35 35 35 35 35 35 36 35 35 35 34 34 35 35 34 35 35 35 35 35 35 35 40 > > So what is the best way to optimize this? Obviously you are trying to balance recognizing the whole adaptor sequence without trimming off too many false positives. Also, can I set a limit that requires to have at least n bases overlap with the adaptor suffix? > > Thanks for your help (again)! > Sean > > >> -----Original Message----- >> From: Hervé Pagès [mailto:hpages at fhcrc.org] >> Sent: Monday, July 22, 2013 5:37 PM >> To: Taylor, Sean D >> Cc: bioconductor at r-project.org >> Subject: Re: Trimming of partial adaptor sequences >> >> Hi Sean, >> >> On 07/22/2013 01:02 PM, Taylor, Sean D wrote: >>> We have been experimenting with a NGS protocol in which we insert >>> sheared genomic fragments into a custom plasmid for sequencing on an >>> Illumina MiSeq instrument. The insertion site of this plasmid is >>> flanked by our own custom barcodes (N7) and ~80 nt Illumina-based >>> adaptor sequence. We then PCR out the insert with barcodes and >>> adaptors for sequencing. Our adaptor sequence is similar to the >>> Illumina adaptor, but we use custom primer binding sites. We are not >>> sure if the Illumina software will be able to recognize and trim our >>> custom adaptors. We are trying to figure out the best way to trim read >> through into the 3' >>> adaptor ourselves. We have roughly three scenarios: >>> >>> (1) The insert is long enough that we have no read through >>> >>> (2) The vector is empty, in which case the entire adaptor sequence is >>> present >>> >>> (3) The insert is long enough to have useful data, but we get >>> read-through into the 3' adaptor sequence that must be trimmed. >>> >>> The solution we are currently working on is to identify the minimal >>> sequence that is recognizable as the adaptor sequence and trim that >>> using trimLRPatterns() in the Biostrings package. Ideally we would >>> like it if we could give trimLRPatterns() the entire adaptor sequence >>> and have it recognize it on our reads even if it is only partially present. >> >> May be I misunderstand what you are trying to do exactly but yes, you can >> give the entire adaptor sequence to trimLRPatterns() and it will recognize it >> on our reads even if it's only partially present: >> >> library(Biostrings) >> >> adaptor <- DNAString("ACCAGGACAA") # entire adaptor >> reads <- DNAStringSet(c( >> "GACAATTTATTT", # adaptor partially present on the left >> "CAATTTATTTGC", # adaptor partially present on the left >> "TTTATTTACCAG", # adaptor partially present on the right >> "CAATTTTTTACC" # adaptor partially present on both ends >> )) >> >> Then: >> >> > trimLRPatterns(Lpattern=adaptor, Rpattern=adaptor, subject=reads) >> A DNAStringSet instance of length 4 >> width seq >> [1] 7 TTTATTT >> [2] 9 TTTATTTGC >> [3] 7 TTTATTT >> [4] 6 TTTTTT >> >> Note that trimLRPatterns() expects that, when the adaptor is partially >> present on the left (resp. right), what's present is a suffix (resp. >> prefix) of the adaptor, and not an arbitrary substring of it. Is it what you >> expect too? >> >> Thanks, >> H. >> >>> However, in my experimenting it did not seem to be able to this. I >>> thought I would ask the Bioconductor community if there are any better >>> solutions to recognizing and trimming partial adaptor sequences. >>> >>> Thanks in advance for any input. >>> >>> Sean Taylor >>> >>> Post-doctoral Fellow >>> >>> Fred Hutchinson Cancer Research Center >>> >>> 206-667-5544 >>> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
0
Entering edit mode
See below. On Jul 24, 2013, at 1:34 AM, Hervé Pagès wrote: > Hi Sean, > > On 07/23/2013 03:26 PM, Taylor, Sean D wrote: >> Thanks all for your input. I will give some of these solutions a try and will probably go with whichever is faster and integrates well with the remainder of our analyses in R. >> >> Herve, thanks for the clarification. Looking back I see that I hadn't explored the function fully. I think it will work, but I have a few follow-up questions. Here is the sample data set that I'm playing around with: >> >> library(Rsamtools) >> bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools") >> param <- ScanBamParam(what=c("seq", "qual")) >> gal <- readGappedAlignments(bamfile, use.names=TRUE, param=param) >> >> These sequences are all relatively short, about 35nts each. I set the first read as my "adaptor" and pulled out all the reads that have a start position overlapping the 'adaptor' sequence: >> >> gal<-gal[which(start(gal)<=width(gal[1]))] >> reads <- setNames(mcols(gal)seq, names(gal)) >> seqnames<-seqnames(gal) >> adaptor<-reads[[1]] >> >> The data are from two 'chromosomes', seq1 and seq2. My adaptor is from seq1, so I expect it to overlap on all seq1, but not on seq2. This lets me approximate best and worst case scenarios: >> >> seq1<-reads[seqnames=='seq1'] >> seq2<-reads[seqnames=='seq2'] >> >>> adaptor >> 36-letter "DNAString" instance >> seq: CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG >> >>> seq1 >> A DNAStringSet instance of length 16 >> width seq names >> [1] 36 CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG B7_591:4:96:693:509 >> [2] 35 CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT EAS54_65:7:152:36... >> [3] 35 AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC EAS51_64:8:5:734:57 >> [4] 36 GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT B7_591:1:289:587:906 >> [5] 35 GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG EAS56_59:8:38:671... >> ... ... ... >> [12] 36 GTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGC B7_591:3:188:662:155 >> [13] 35 TTTAACTCGTCCATGGCCCAGCATTAGGGATCTGT EAS56_59:2:225:60... >> [14] 35 TTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTG EAS51_66:7:328:39... >> [15] 35 AACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGA EAS51_64:5:257:96... >> [16] 35 GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG EAS54_61:4:143:69... >> >>> seq2 >> A DNAStringSet instance of length 25 >> width seq names >> [1] 36 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA B7_591:8:4:841:340 >> [2] 35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA EAS54_67:4:142:94... >> [3] 35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA EAS54_67:6:43:859... >> [4] 35 TCAAATGAACTTCTGTAATTGAAAAATTCATTTAA EAS1_93:2:286:923... >> [5] 35 AATGAACTTCTGTAATTGAAAAATTCATTTAAGAA EAS1_99:8:117:578... >> ... ... ... >> [21] 35 AAATTCATTTAAGAAATTACAAAATATAGTTGAAA EAS54_65:8:305:81... >> [22] 35 AATTCATTTAAGAAATTACAAAATATAGTTGAAAG EAS114_26:7:13:17... >> [23] 35 CATTTAAGAAATTACAAAATATAGTTGAAAGCTCT EAS56_63:7:34:334... >> [24] 35 TTAAGAAATTACAAAATATAGTTGAAAGCTCTAAC EAS114_45:3:32:13... >> [25] 40 TAAGAAATTACAAAATATAGTTGAAAGCTCTAACAATAGA EAS139_19:5:70:31... >> >> If I use the basic parameters for trimLRPatterns on seq1: >> >>> trimmed_reads1<-trimLRPatterns(Lpattern=adaptor, subject=seq1) >> >> My expected sequence widths after trimming are: >>> width(seq1)-(length(adaptor)-start(gal)[which(seqnames=='seq1')]+1) >> [1] 0 1 3 5 7 11 12 13 16 20 20 23 26 27 29 34 >> >> Actually trimmed sequence widths: >>> width(trimmed_reads1) >> [1] 0 1 3 35 7 11 12 13 16 20 20 23 26 27 29 34 >> >> Here, the 4th sequence isn't trimmed because of some mismatches in the sequence. I expect that my data will have some mismatches, so I need to be able to work with those. So I tried setting the max.Lmismatch=5 to allow for up to 5 mismatches, but got no trimming: >> >>> trimmed_reads2<-trimLRPatterns(Lpattern=adaptor, subject=seq1, max.Lmismatch=5) >>> width(trimmed_reads2) >> [1] 0 35 35 36 35 35 36 35 35 35 35 36 35 35 35 35 > > Very strange indeed! I initially thought it was a bug and I was about > to commit a fix but then I double-checked the man page and discovered > the following "feature": > > max.Lmismatch: Either an integer vector of length ?nLp = > nchar(Lpattern)? representing an absolute number of > mismatches (or edit distance if ?with.Lindels? is ?TRUE?) or > a single numeric value in the interval ?[0, 1)? representing > a mismatch rate when aligning terminal substrings (suffixes) > of ?Lpattern? with the beginning (prefix) of ?subject? > following the conventions set by ?neditStartingAt?, > ?isMatchingStartingAt?, etc. > > When ?max.Lmismatch? is ?0L? or a numeric value in the > interval ?[0, 1)?, it is taken as a "rate" and is converted > to ?as.integer(1:nLp * max.Lmismatch)?, analogous to agrep > (which, however, employs ?ceiling?). > > Otherwise, ?max.Lmismatch? is treated as an integer vector > where negative numbers are used to prevent trimming at the > ?i?-th location. When an input integer vector is shorter than > ?nLp?, it is augmented with enough ?-1?s at the beginning to > bring its length up to ?nLp?. Elements of ?max.Lmismatch? > beyond the first ?nLp? are ignored. > > I know, all this is a little bit tedious to read and not intuitive > at all (at least to me, but I'm probably not the only one), but > basically it means that if you want to allow up to 5 mismatches, > you need to specify a vector of length 'Lpattern' and filled with > the value 5: > > > trimLRPatterns(Lpattern=adaptor, subject=seq1, max.Lmismatch=rep(5, length(adaptor))) > A DNAStringSet instance of length 16 > width seq names > [1] 0 B7_591:4:96:693:509 > [2] 1 T EAS54_65:7:152:36... > [3] 3 TCC EAS51_64:8:5:734:57 > [4] 5 TCTCT B7_591:1:289:587:906 > [5] 7 TCCATGG EAS56_59:8:38:671... > ... ... ... > [12] 23 TCCATGGCCCAGCATTAGGGAGC B7_591:3:188:662:155 > [13] 26 TCCATGGCCCAGCATTAGGGATCTGT EAS56_59:2:225:60... > [14] 27 TCCATGGCCCAGCATTAGGGAGCTGTG EAS51_66:7:328:39... > [15] 29 TCCATGGCCCAGCATTAGGGAGCTGTGGA EAS51_64:5:257:96... > [16] 27 CCCAGCATTAGGGAGCTGTGGACCCCG EAS54_61:4:143:69... > > Unfortunately this feature has been in trimLRPatterns() since the > function was added to Biostrings 4 years ago so I guess I shouldn't > touch it. Let's just say that "if it's documented then it's not a > bug" and try to find some reconfort with that :-/ Thanks, I always forgot this. I believe it was Patrick's design. The vector giving the mismatch limits does not have to be constant, of course. Its last value could be 5, the number of mismatches allowed for whole matches of the pattern, but it would normally get smaller toward the low end, probably containing some zeros and then -1's at the bottom. > H. > > >> >> In my experimenting, I had gone straight to this and saw that it didn't trim partial sequence with these settings, so I had concluded that it wouldn't do the job. After playing with it, I tried adjusting trimLRPatterns=0.5 to allow 50% mismatches (if I understand that correctly) and this time get too much trimming on the last sequence: >> >>> trimmed_reads3<-trimLRPatterns(Lpattern=adaptor, subject=seq1, max.Lmismatch=0.5) >>> width(trimmed_reads3) >> [1] 0 1 3 5 7 11 12 13 16 20 20 23 26 27 29 27 >> >> I tried setting max.Lmismatch =0.1 but got too little trimming again: >> >>> trimmed_reads4<-trimLRPatterns(Lpattern=adaptor, subject=seq1, max.Lmismatch=0.1) >>> width(trimmed_reads4) >> [1] 0 1 3 35 7 11 12 13 16 20 20 23 26 27 29 34 >> >> I could probably optimize until I got it just right, but with a large data set that might not be too practical. >> If I try these same settings against seq2, where there is not supposed to be overlap with the adaptor, I get erroneous trimming with mismatch =0.5. With mismatch = 0.1, it is better, but it decides to trim off the first nucleotide on a few of them: >> >>> trimmed_reads5<-trimLRPatterns(Lpattern=adaptor, subject=seq2, max.Lmismatch=0.5) >>> trimmed_reads6<-trimLRPatterns(Lpattern=adaptor, subject=seq2, max.Lmismatch=0.1) >>> width(seq2) >> [1] 36 35 35 35 35 35 35 35 36 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35 40 >>> width(trimmed_reads5) >> [1] 27 26 26 27 11 26 26 26 14 13 13 13 28 28 31 23 27 28 29 27 28 29 33 27 40 >>> width(trimmed_reads6) >> [1] 36 35 35 35 35 35 35 35 36 35 35 35 34 34 35 35 34 35 35 35 35 35 35 35 40 >> >> So what is the best way to optimize this? Obviously you are trying to balance recognizing the whole adaptor sequence without trimming off too many false positives. Also, can I set a limit that requires to have at least n bases overlap with the adaptor suffix? >> >> Thanks for your help (again)! >> Sean >> >> >>> -----Original Message----- >>> From: Hervé Pagès [mailto:hpages at fhcrc.org] >>> Sent: Monday, July 22, 2013 5:37 PM >>> To: Taylor, Sean D >>> Cc: bioconductor at r-project.org >>> Subject: Re: Trimming of partial adaptor sequences >>> >>> Hi Sean, >>> >>> On 07/22/2013 01:02 PM, Taylor, Sean D wrote: >>>> We have been experimenting with a NGS protocol in which we insert >>>> sheared genomic fragments into a custom plasmid for sequencing on an >>>> Illumina MiSeq instrument. The insertion site of this plasmid is >>>> flanked by our own custom barcodes (N7) and ~80 nt Illumina-based >>>> adaptor sequence. We then PCR out the insert with barcodes and >>>> adaptors for sequencing. Our adaptor sequence is similar to the >>>> Illumina adaptor, but we use custom primer binding sites. We are not >>>> sure if the Illumina software will be able to recognize and trim our >>>> custom adaptors. We are trying to figure out the best way to trim read >>> through into the 3' >>>> adaptor ourselves. We have roughly three scenarios: >>>> >>>> (1) The insert is long enough that we have no read through >>>> >>>> (2) The vector is empty, in which case the entire adaptor sequence is >>>> present >>>> >>>> (3) The insert is long enough to have useful data, but we get >>>> read-through into the 3' adaptor sequence that must be trimmed. >>>> >>>> The solution we are currently working on is to identify the minimal >>>> sequence that is recognizable as the adaptor sequence and trim that >>>> using trimLRPatterns() in the Biostrings package. Ideally we would >>>> like it if we could give trimLRPatterns() the entire adaptor sequence >>>> and have it recognize it on our reads even if it is only partially present. >>> >>> May be I misunderstand what you are trying to do exactly but yes, you can >>> give the entire adaptor sequence to trimLRPatterns() and it will recognize it >>> on our reads even if it's only partially present: >>> >>> library(Biostrings) >>> >>> adaptor <- DNAString("ACCAGGACAA") # entire adaptor >>> reads <- DNAStringSet(c( >>> "GACAATTTATTT", # adaptor partially present on the left >>> "CAATTTATTTGC", # adaptor partially present on the left >>> "TTTATTTACCAG", # adaptor partially present on the right >>> "CAATTTTTTACC" # adaptor partially present on both ends >>> )) >>> >>> Then: >>> >>> > trimLRPatterns(Lpattern=adaptor, Rpattern=adaptor, subject=reads) >>> A DNAStringSet instance of length 4 >>> width seq >>> [1] 7 TTTATTT >>> [2] 9 TTTATTTGC >>> [3] 7 TTTATTT >>> [4] 6 TTTTTT >>> >>> Note that trimLRPatterns() expects that, when the adaptor is partially >>> present on the left (resp. right), what's present is a suffix (resp. >>> prefix) of the adaptor, and not an arbitrary substring of it. Is it what you >>> expect too? >>> >>> Thanks, >>> H. >>> >>>> However, in my experimenting it did not seem to be able to this. I >>>> thought I would ask the Bioconductor community if there are any better >>>> solutions to recognizing and trimming partial adaptor sequences. >>>> >>>> Thanks in advance for any input. >>>> >>>> Sean Taylor >>>> >>>> Post-doctoral Fellow >>>> >>>> Fred Hutchinson Cancer Research Center >>>> >>>> 206-667-5544 >>>> >>> >>> -- >>> Hervé Pagès >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M1-B514 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpages at fhcrc.org >>> Phone: (206) 667-5791 >>> Fax: (206) 667-1319 > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ADD REPLY 0 Entering edit mode Sorry to chime in late; I've been on the road. I'm still not clear on your specifications. Herve's answer showed you how to do trimming of the same adaptor, so to speak, at both ends of your reads, but I took your original question to be about trimming at the 3' end only (in which case your adaptor goes as 'Rpattern', with nothing in 'Lpattern'). But I could be all wrong. Please explain in more detail what you want, if you aren't happy yet. You are right about a given "max.L/Rmismatch" value not being suitable for all your data. There is no way to beat that except to subdivide your data. You may be able to gather some intuition using neditStartingAt() or neditEndingAt(). To prevent (tiny) "false positive" trimming, construct a max.mismatch *vector* of the same length as your adaptor string, with -1 in the low spots. By "low", I mean as many letters as you want never to be trimmed. For example, to set a trimming lower limit to 3 letters, do something along these lines: rate <- .25 # pretty big rate max.Lmismatch (or *R*, whichever you really want) <- rate * 1:nchar(adaptor) max.Lmismatch[1:2] <- -1 Equivalently, send a shorter vector ("missing the low spots"): max.Lmismatch <- (rate * 1:nchar(adaptor)) [-(1:2)] There isn't an argument to do this for you. Maybe there should be. There is also a trick to locate adaptors in "accidental places" in your data, such as 3' adaptors in the middle of reads, like [good_data][adaptor]AAAAAAAAAAAAAAAAAAAAA For your Rpattern value, append many N's to your adaptor string, and set Rfixed="subject" (or Rfixed=FALSE, just *not* Rfixed="pattern"). I suppose, if you knew that the only garbage that might occur to the right of such an accident was always poly-*A*, then you could just append A's instead of N's, and then you wouldn't need to relax Rfixed. This line of attack, with the N's, is due to Patrick Aboyoun, the original author. Let me know if I'm way off. On Jul 23, 2013, at 6:26 PM, Taylor, Sean D wrote: > Thanks all for your input. I will give some of these solutions a try and will probably go with whichever is faster and integrates well with the remainder of our analyses in R. > > Herve, thanks for the clarification. Looking back I see that I hadn't explored the function fully. I think it will work, but I have a few follow-up questions. Here is the sample data set that I'm playing around with: > > library(Rsamtools) > bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools") > param <- ScanBamParam(what=c("seq", "qual")) > gal <- readGappedAlignments(bamfile, use.names=TRUE, param=param) > > These sequences are all relatively short, about 35nts each. I set the first read as my "adaptor" and pulled out all the reads that have a start position overlapping the 'adaptor' sequence: > > gal<-gal[which(start(gal)<=width(gal[1]))] > reads <- setNames(mcols(gal)seq, names(gal)) > seqnames<-seqnames(gal) > adaptor<-reads[[1]] > > The data are from two 'chromosomes', seq1 and seq2. My adaptor is from seq1, so I expect it to overlap on all seq1, but not on seq2. This lets me approximate best and worst case scenarios: > > seq1<-reads[seqnames=='seq1'] > seq2<-reads[seqnames=='seq2'] > >> adaptor > 36-letter "DNAString" instance > seq: CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG > >> seq1 > A DNAStringSet instance of length 16 > width seq names > [1] 36 CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG B7_591:4:96:693:509 > [2] 35 CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT EAS54_65:7:152:36... > [3] 35 AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC EAS51_64:8:5:734:57 > [4] 36 GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT B7_591:1:289:587:906 > [5] 35 GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG EAS56_59:8:38:671... > ... ... ... > [12] 36 GTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGC B7_591:3:188:662:155 > [13] 35 TTTAACTCGTCCATGGCCCAGCATTAGGGATCTGT EAS56_59:2:225:60... > [14] 35 TTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTG EAS51_66:7:328:39... > [15] 35 AACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGA EAS51_64:5:257:96... > [16] 35 GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG EAS54_61:4:143:69... > >> seq2 > A DNAStringSet instance of length 25 > width seq names > [1] 36 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA B7_591:8:4:841:340 > [2] 35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA EAS54_67:4:142:94... > [3] 35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA EAS54_67:6:43:859... > [4] 35 TCAAATGAACTTCTGTAATTGAAAAATTCATTTAA EAS1_93:2:286:923... > [5] 35 AATGAACTTCTGTAATTGAAAAATTCATTTAAGAA EAS1_99:8:117:578... > ... ... ... > [21] 35 AAATTCATTTAAGAAATTACAAAATATAGTTGAAA EAS54_65:8:305:81... > [22] 35 AATTCATTTAAGAAATTACAAAATATAGTTGAAAG EAS114_26:7:13:17... > [23] 35 CATTTAAGAAATTACAAAATATAGTTGAAAGCTCT EAS56_63:7:34:334... > [24] 35 TTAAGAAATTACAAAATATAGTTGAAAGCTCTAAC EAS114_45:3:32:13... > [25] 40 TAAGAAATTACAAAATATAGTTGAAAGCTCTAACAATAGA EAS139_19:5:70:31... > > If I use the basic parameters for trimLRPatterns on seq1: > >> trimmed_reads1<-trimLRPatterns(Lpattern=adaptor, subject=seq1) > > My expected sequence widths after trimming are: >> width(seq1)-(length(adaptor)-start(gal)[which(seqnames=='seq1')]+1) > [1] 0 1 3 5 7 11 12 13 16 20 20 23 26 27 29 34 > > Actually trimmed sequence widths: >> width(trimmed_reads1) > [1] 0 1 3 35 7 11 12 13 16 20 20 23 26 27 29 34 > > Here, the 4th sequence isn't trimmed because of some mismatches in the sequence. I expect that my data will have some mismatches, so I need to be able to work with those. So I tried setting the max.Lmismatch=5 to allow for up to 5 mismatches, but got no trimming: > >> trimmed_reads2<-trimLRPatterns(Lpattern=adaptor, subject=seq1, max.Lmismatch=5) >> width(trimmed_reads2) > [1] 0 35 35 36 35 35 36 35 35 35 35 36 35 35 35 35 > > In my experimenting, I had gone straight to this and saw that it didn't trim partial sequence with these settings, so I had concluded that it wouldn't do the job. After playing with it, I tried adjusting trimLRPatterns=0.5 to allow 50% mismatches (if I understand that correctly) and this time get too much trimming on the last sequence: > >> trimmed_reads3<-trimLRPatterns(Lpattern=adaptor, subject=seq1, max.Lmismatch=0.5) >> width(trimmed_reads3) > [1] 0 1 3 5 7 11 12 13 16 20 20 23 26 27 29 27 > > I tried setting max.Lmismatch =0.1 but got too little trimming again: > >> trimmed_reads4<-trimLRPatterns(Lpattern=adaptor, subject=seq1, max.Lmismatch=0.1) >> width(trimmed_reads4) > [1] 0 1 3 35 7 11 12 13 16 20 20 23 26 27 29 34 > > I could probably optimize until I got it just right, but with a large data set that might not be too practical. > If I try these same settings against seq2, where there is not supposed to be overlap with the adaptor, I get erroneous trimming with mismatch =0.5. With mismatch = 0.1, it is better, but it decides to trim off the first nucleotide on a few of them: > >> trimmed_reads5<-trimLRPatterns(Lpattern=adaptor, subject=seq2, max.Lmismatch=0.5) >> trimmed_reads6<-trimLRPatterns(Lpattern=adaptor, subject=seq2, max.Lmismatch=0.1) >> width(seq2) > [1] 36 35 35 35 35 35 35 35 36 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35 40 >> width(trimmed_reads5) > [1] 27 26 26 27 11 26 26 26 14 13 13 13 28 28 31 23 27 28 29 27 28 29 33 27 40 >> width(trimmed_reads6) > [1] 36 35 35 35 35 35 35 35 36 35 35 35 34 34 35 35 34 35 35 35 35 35 35 35 40 > > So what is the best way to optimize this? Obviously you are trying to balance recognizing the whole adaptor sequence without trimming off too many false positives. Also, can I set a limit that requires to have at least n bases overlap with the adaptor suffix? > > Thanks for your help (again)! > Sean > > >> -----Original Message----- >> From: Hervé Pagès [mailto:hpages at fhcrc.org] >> Sent: Monday, July 22, 2013 5:37 PM >> To: Taylor, Sean D >> Cc: bioconductor at r-project.org >> Subject: Re: Trimming of partial adaptor sequences >> >> Hi Sean, >> >> On 07/22/2013 01:02 PM, Taylor, Sean D wrote: >>> We have been experimenting with a NGS protocol in which we insert >>> sheared genomic fragments into a custom plasmid for sequencing on an >>> Illumina MiSeq instrument. The insertion site of this plasmid is >>> flanked by our own custom barcodes (N7) and ~80 nt Illumina-based >>> adaptor sequence. We then PCR out the insert with barcodes and >>> adaptors for sequencing. Our adaptor sequence is similar to the >>> Illumina adaptor, but we use custom primer binding sites. We are not >>> sure if the Illumina software will be able to recognize and trim our >>> custom adaptors. We are trying to figure out the best way to trim read >> through into the 3' >>> adaptor ourselves. We have roughly three scenarios: >>> >>> (1) The insert is long enough that we have no read through >>> >>> (2) The vector is empty, in which case the entire adaptor sequence is >>> present >>> >>> (3) The insert is long enough to have useful data, but we get >>> read-through into the 3' adaptor sequence that must be trimmed. >>> >>> The solution we are currently working on is to identify the minimal >>> sequence that is recognizable as the adaptor sequence and trim that >>> using trimLRPatterns() in the Biostrings package. Ideally we would >>> like it if we could give trimLRPatterns() the entire adaptor sequence >>> and have it recognize it on our reads even if it is only partially present. >> >> May be I misunderstand what you are trying to do exactly but yes, you can >> give the entire adaptor sequence to trimLRPatterns() and it will recognize it >> on our reads even if it's only partially present: >> >> library(Biostrings) >> >> adaptor <- DNAString("ACCAGGACAA") # entire adaptor >> reads <- DNAStringSet(c( >> "GACAATTTATTT", # adaptor partially present on the left >> "CAATTTATTTGC", # adaptor partially present on the left >> "TTTATTTACCAG", # adaptor partially present on the right >> "CAATTTTTTACC" # adaptor partially present on both ends >> )) >> >> Then: >> >>> trimLRPatterns(Lpattern=adaptor, Rpattern=adaptor, subject=reads) >> A DNAStringSet instance of length 4 >> width seq >> [1] 7 TTTATTT >> [2] 9 TTTATTTGC >> [3] 7 TTTATTT >> [4] 6 TTTTTT >> >> Note that trimLRPatterns() expects that, when the adaptor is partially >> present on the left (resp. right), what's present is a suffix (resp. >> prefix) of the adaptor, and not an arbitrary substring of it. Is it what you >> expect too? >> >> Thanks, >> H. >> >>> However, in my experimenting it did not seem to be able to this. I >>> thought I would ask the Bioconductor community if there are any better >>> solutions to recognizing and trimming partial adaptor sequences. >>> >>> Thanks in advance for any input. >>> >>> Sean Taylor >>> >>> Post-doctoral Fellow >>> >>> Fred Hutchinson Cancer Research Center >>> >>> 206-667-5544 >>> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
0
Entering edit mode
Hi Harris, No you are right, I will mostly do 3' trimming. But for my test data it just seemed more convenient to test using Lpattern and I assumed (correctly I hope) that all the same things apply to Rpattern so long as you keep your prefixes and suffixes straight. It took me a little while to figure out what was going on with the max.Lmismatch vector and the negative values (@Herve, I'm glad I'm not the only one who finds the language on the man page to be a bit opaque at times!) but with some playing around and with the help of your example I think I understand how it's working now. Just to clarify, the low spots are always set at the beginning of the vector, regardless of if you are doing L or R patterns, correct? Also thanks for the tip on finding the internal adaptors. I tried it out and it worked pretty well, although I discovered that it is easy to overkill with a bunch of Ns, so I'll have to use it conservatively. My adaptor sequence is pretty big though (~80nts) so I should be able to tolerate a decent number of N's on the end, but I will experiment and see. When you use this trick, are the N's being counted as mismatches and should I alter max.L/Rmismatch? Otherwise, I think I'm pretty happy with these solutions. Thanks so much for your help! Sean > -----Original Message----- > From: Harris A. Jaffee [mailto:hj at jhu.edu] > Sent: Tuesday, July 23, 2013 9:23 PM > To: Taylor, Sean D > Cc: Pages, Herve; bioconductor at r-project.org; James W. MacDonald; > dpryan79 at gmail.com > Subject: Re: [BioC] Trimming of partial adaptor sequences > > Sorry to chime in late; I've been on the road. I'm still not clear on your > specifications. > Herve's answer showed you how to do trimming of the same adaptor, so to > speak, at both ends of your reads, but I took your original question to be > about trimming at the 3' end only (in which case your adaptor goes as > 'Rpattern', with nothing in 'Lpattern'). But I could be all wrong. Please > explain in more detail what you want, if you aren't happy yet. > > You are right about a given "max.L/Rmismatch" value not being suitable for > all your data. > There is no way to beat that except to subdivide your data. You may be able > to gather some intuition using neditStartingAt() or neditEndingAt(). > > To prevent (tiny) "false positive" trimming, construct a max.mismatch > *vector* of the same length as your adaptor string, with -1 in the low spots. > By "low", I mean as many letters as you want never to be trimmed. For > example, to set a trimming lower limit to 3 letters, do something along these > lines: > > rate <- .25 # pretty big rate > max.Lmismatch (or *R*, whichever you really want) <- rate * > 1:nchar(adaptor) > max.Lmismatch[1:2] <- -1 > > Equivalently, send a shorter vector ("missing the low spots"): > > max.Lmismatch <- (rate * 1:nchar(adaptor)) [-(1:2)] > > There isn't an argument to do this for you. Maybe there should be. > > There is also a trick to locate adaptors in "accidental places" in your data, such > as 3' > adaptors in the middle of reads, like > > [good_data][adaptor]AAAAAAAAAAAAAAAAAAAAA > > For your Rpattern value, append many N's to your adaptor string, and set > Rfixed="subject" > (or Rfixed=FALSE, just *not* Rfixed="pattern"). I suppose, if you knew that > the only garbage that might occur to the right of such an accident was always > poly-*A*, then you could just append A's instead of N's, and then you > wouldn't need to relax Rfixed. This line of attack, with the N's, is due to > Patrick Aboyoun, the original author. > > Let me know if I'm way off. > > On Jul 23, 2013, at 6:26 PM, Taylor, Sean D wrote: > > > Thanks all for your input. I will give some of these solutions a try and will > probably go with whichever is faster and integrates well with the remainder > of our analyses in R. > > > > Herve, thanks for the clarification. Looking back I see that I hadn't explored > the function fully. I think it will work, but I have a few follow-up questions. > Here is the sample data set that I'm playing around with: > > > > library(Rsamtools) > > bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools") > > param <- ScanBamParam(what=c("seq", "qual")) gal <- > > readGappedAlignments(bamfile, use.names=TRUE, param=param) > > > > These sequences are all relatively short, about 35nts each. I set the first > read as my "adaptor" and pulled out all the reads that have a start position > overlapping the 'adaptor' sequence: > > > > gal<-gal[which(start(gal)<=width(gal[1]))] > > reads <- setNames(mcols(gal)$seq, names(gal)) > > seqnames<-seqnames(gal) > > adaptor<-reads[[1]] > > > > The data are from two 'chromosomes', seq1 and seq2. My adaptor is from > seq1, so I expect it to overlap on all seq1, but not on seq2. This lets me > approximate best and worst case scenarios: > > > > seq1<-reads[seqnames=='seq1'] > > seq2<-reads[seqnames=='seq2'] > > > >> adaptor > > 36-letter "DNAString" instance > > seq: CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG > > > >> seq1 > > A DNAStringSet instance of length 16 > > width seq names > > [1] 36 CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG > B7_591:4:96:693:509 > > [2] 35 CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT > EAS54_65:7:152:36... > > [3] 35 AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC > EAS51_64:8:5:734:57 > > [4] 36 GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT > B7_591:1:289:587:906 > > [5] 35 GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG > EAS56_59:8:38:671... > > ... ... ... > > [12] 36 GTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGC > B7_591:3:188:662:155 > > [13] 35 TTTAACTCGTCCATGGCCCAGCATTAGGGATCTGT > EAS56_59:2:225:60... > > [14] 35 TTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTG > EAS51_66:7:328:39... > > [15] 35 AACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGA > EAS51_64:5:257:96... > > [16] 35 GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG > EAS54_61:4:143:69... > > > >> seq2 > > A DNAStringSet instance of length 25 > > width seq names > > [1] 36 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA > B7_591:8:4:841:340 > > [2] 35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA > EAS54_67:4:142:94... > > [3] 35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA > EAS54_67:6:43:859... > > [4] 35 TCAAATGAACTTCTGTAATTGAAAAATTCATTTAA > EAS1_93:2:286:923... > > [5] 35 AATGAACTTCTGTAATTGAAAAATTCATTTAAGAA > EAS1_99:8:117:578... > > ... ... ... > > [21] 35 AAATTCATTTAAGAAATTACAAAATATAGTTGAAA > EAS54_65:8:305:81... > > [22] 35 AATTCATTTAAGAAATTACAAAATATAGTTGAAAG > EAS114_26:7:13:17... > > [23] 35 CATTTAAGAAATTACAAAATATAGTTGAAAGCTCT > EAS56_63:7:34:334... > > [24] 35 TTAAGAAATTACAAAATATAGTTGAAAGCTCTAAC > EAS114_45:3:32:13... > > [25] 40 TAAGAAATTACAAAATATAGTTGAAAGCTCTAACAATAGA > EAS139_19:5:70:31... > > > > If I use the basic parameters for trimLRPatterns on seq1: > > > >> trimmed_reads1<-trimLRPatterns(Lpattern=adaptor, subject=seq1) > > > > My expected sequence widths after trimming are: > >> width(seq1)-(length(adaptor)-start(gal)[which(seqnames=='seq1')]+1) > > [1] 0 1 3 5 7 11 12 13 16 20 20 23 26 27 29 34 > > > > Actually trimmed sequence widths: > >> width(trimmed_reads1) > > [1] 0 1 3 35 7 11 12 13 16 20 20 23 26 27 29 34 > > > > Here, the 4th sequence isn't trimmed because of some mismatches in the > sequence. I expect that my data will have some mismatches, so I need to be > able to work with those. So I tried setting the max.Lmismatch=5 to allow for > up to 5 mismatches, but got no trimming: > > > >> trimmed_reads2<-trimLRPatterns(Lpattern=adaptor, subject=seq1, > >> max.Lmismatch=5) > >> width(trimmed_reads2) > > [1] 0 35 35 36 35 35 36 35 35 35 35 36 35 35 35 35 > > > > In my experimenting, I had gone straight to this and saw that it didn't trim > partial sequence with these settings, so I had concluded that it wouldn't do > the job. After playing with it, I tried adjusting trimLRPatterns=0.5 to allow > 50% mismatches (if I understand that correctly) and this time get too much > trimming on the last sequence: > > > >> trimmed_reads3<-trimLRPatterns(Lpattern=adaptor, subject=seq1, > >> max.Lmismatch=0.5) > >> width(trimmed_reads3) > > [1] 0 1 3 5 7 11 12 13 16 20 20 23 26 27 29 27 > > > > I tried setting max.Lmismatch =0.1 but got too little trimming again: > > > >> trimmed_reads4<-trimLRPatterns(Lpattern=adaptor, subject=seq1, > >> max.Lmismatch=0.1) > >> width(trimmed_reads4) > > [1] 0 1 3 35 7 11 12 13 16 20 20 23 26 27 29 34 > > > > I could probably optimize until I got it just right, but with a large data set > that might not be too practical. > > If I try these same settings against seq2, where there is not supposed to be > overlap with the adaptor, I get erroneous trimming with mismatch =0.5. With > mismatch = 0.1, it is better, but it decides to trim off the first nucleotide on a > few of them: > > > >> trimmed_reads5<-trimLRPatterns(Lpattern=adaptor, subject=seq2, > >> max.Lmismatch=0.5) trimmed_reads6<- > trimLRPatterns(Lpattern=adaptor, > >> subject=seq2, max.Lmismatch=0.1) > >> width(seq2) > > [1] 36 35 35 35 35 35 35 35 36 35 35 35 35 35 35 35 35 35 35 35 35 35 > > 35 35 40 > >> width(trimmed_reads5) > > [1] 27 26 26 27 11 26 26 26 14 13 13 13 28 28 31 23 27 28 29 27 28 29 > > 33 27 40 > >> width(trimmed_reads6) > > [1] 36 35 35 35 35 35 35 35 36 35 35 35 34 34 35 35 34 35 35 35 35 35 > > 35 35 40 > > > > So what is the best way to optimize this? Obviously you are trying to > balance recognizing the whole adaptor sequence without trimming off too > many false positives. Also, can I set a limit that requires to have at least n > bases overlap with the adaptor suffix? > > > > Thanks for your help (again)! > > Sean > > > > > >> -----Original Message----- > >> From: Hervé Pagès [mailto:hpages at fhcrc.org] > >> Sent: Monday, July 22, 2013 5:37 PM > >> To: Taylor, Sean D > >> Cc: bioconductor at r-project.org > >> Subject: Re: Trimming of partial adaptor sequences > >> > >> Hi Sean, > >> > >> On 07/22/2013 01:02 PM, Taylor, Sean D wrote: > >>> We have been experimenting with a NGS protocol in which we insert > >>> sheared genomic fragments into a custom plasmid for sequencing on an > >>> Illumina MiSeq instrument. The insertion site of this plasmid is > >>> flanked by our own custom barcodes (N7) and ~80 nt Illumina- based > >>> adaptor sequence. We then PCR out the insert with barcodes and > >>> adaptors for sequencing. Our adaptor sequence is similar to the > >>> Illumina adaptor, but we use custom primer binding sites. We are not > >>> sure if the Illumina software will be able to recognize and trim our > >>> custom adaptors. We are trying to figure out the best way to trim > >>> read > >> through into the 3' > >>> adaptor ourselves. We have roughly three scenarios: > >>> > >>> (1) The insert is long enough that we have no read through > >>> > >>> (2) The vector is empty, in which case the entire adaptor sequence > >>> is present > >>> > >>> (3) The insert is long enough to have useful data, but we get > >>> read-through into the 3' adaptor sequence that must be trimmed. > >>> > >>> The solution we are currently working on is to identify the minimal > >>> sequence that is recognizable as the adaptor sequence and trim that > >>> using trimLRPatterns() in the Biostrings package. Ideally we would > >>> like it if we could give trimLRPatterns() the entire adaptor > >>> sequence and have it recognize it on our reads even if it is only partially > present. > >> > >> May be I misunderstand what you are trying to do exactly but yes, you > >> can give the entire adaptor sequence to trimLRPatterns() and it will > >> recognize it on our reads even if it's only partially present: > >> > >> library(Biostrings) > >> > >> adaptor <- DNAString("ACCAGGACAA") # entire adaptor > >> reads <- DNAStringSet(c( > >> "GACAATTTATTT", # adaptor partially present on the left > >> "CAATTTATTTGC", # adaptor partially present on the left > >> "TTTATTTACCAG", # adaptor partially present on the right > >> "CAATTTTTTACC" # adaptor partially present on both ends > >> )) > >> > >> Then: > >> > >>> trimLRPatterns(Lpattern=adaptor, Rpattern=adaptor, subject=reads) > >> A DNAStringSet instance of length 4 > >> width seq > >> [1] 7 TTTATTT > >> [2] 9 TTTATTTGC > >> [3] 7 TTTATTT > >> [4] 6 TTTTTT > >> > >> Note that trimLRPatterns() expects that, when the adaptor is > >> partially present on the left (resp. right), what's present is a suffix (resp. > >> prefix) of the adaptor, and not an arbitrary substring of it. Is it > >> what you expect too? > >> > >> Thanks, > >> H. > >> > >>> However, in my experimenting it did not seem to be able to this. I > >>> thought I would ask the Bioconductor community if there are any > >>> better solutions to recognizing and trimming partial adaptor sequences. > >>> > >>> Thanks in advance for any input. > >>> > >>> Sean Taylor > >>> > >>> Post-doctoral Fellow > >>> > >>> Fred Hutchinson Cancer Research Center > >>> > >>> 206-667-5544 > >>> > >> > >> -- > >> Hervé Pagès > >> > >> Program in Computational Biology > >> Division of Public Health Sciences > >> Fred Hutchinson Cancer Research Center > >> 1100 Fairview Ave. N, M1-B514 > >> P.O. Box 19024 > >> Seattle, WA 98109-1024 > >> > >> E-mail: hpages at fhcrc.org > >> Phone: (206) 667-5791 > >> Fax: (206) 667-1319 > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor ADD REPLY 0 Entering edit mode On Jul 24, 2013, at 12:26 PM, Taylor, Sean D wrote: > Hi Harris, > > No you are right, I will mostly do 3' trimming. But for my test data it just seemed more convenient to test using Lpattern and I assumed (correctly I hope) that all the same things apply to Rpattern so long as you keep your prefixes and suffixes straight. Yes. > It took me a little while to figure out what was going on with the max.Lmismatch vector and the negative values (@Herve, I'm glad I'm not the only one who finds the language on the man page to be a bit opaque at times!) Opaqueness due to my writing skills and an effort not to overload with detail. It needs work and examples. > but with some playing around and with the help of your example I think I understand how it's working now. Just to clarify, the low spots are always set at the beginning of the vector, regardless of if you are doing L or R patterns, correct? Yes, it (and supporting code underneath) is symmetric, at least intended to be. > Also thanks for the tip on finding the internal adaptors. I tried it out and it worked pretty well, although I discovered that it is easy to overkill with a bunch of Ns, so I'll have to use it conservatively. My adaptor sequence is pretty big though (~80nts) so I should be able to tolerate a decent number of N's on the end, but I will experiment and see. When you use this trick, are the N's being counted as mismatches and should I alter max.L/Rmismatch? No. The "non-fixed" matching of IUPAC wild-card letters is free. (Use neditAt to see this.) I point out 'with.indels=TRUE' in case you missed it. They count as 1 mismatch per indel needed. Previously (and still in the release branch, I think), you could not get indels and non-fixed at the same time. But it was just implemented last week, in Biostrings devel anyway, by Herve. In my limited view, you always want to allow indels and almost always some degree of non-fixed (i.e. fixed="pattern" or, if you add N's to your adaptor, fixed=FALSE), because there will be N's in your data. Although, you have to decide whether these N's should match your adaptor freely or by using your allowed mismatch limit. > Otherwise, I think I'm pretty happy with these solutions. Thanks so much for your help! > Sean > >> -----Original Message----- >> From: Harris A. Jaffee [mailto:hj at jhu.edu] >> Sent: Tuesday, July 23, 2013 9:23 PM >> To: Taylor, Sean D >> Cc: Pages, Herve; bioconductor at r-project.org; James W. MacDonald; >> dpryan79 at gmail.com >> Subject: Re: [BioC] Trimming of partial adaptor sequences >> >> Sorry to chime in late; I've been on the road. I'm still not clear on your >> specifications. >> Herve's answer showed you how to do trimming of the same adaptor, so to >> speak, at both ends of your reads, but I took your original question to be >> about trimming at the 3' end only (in which case your adaptor goes as >> 'Rpattern', with nothing in 'Lpattern'). But I could be all wrong. Please >> explain in more detail what you want, if you aren't happy yet. >> >> You are right about a given "max.L/Rmismatch" value not being suitable for >> all your data. >> There is no way to beat that except to subdivide your data. You may be able >> to gather some intuition using neditStartingAt() or neditEndingAt(). >> >> To prevent (tiny) "false positive" trimming, construct a max.mismatch >> *vector* of the same length as your adaptor string, with -1 in the low spots. >> By "low", I mean as many letters as you want never to be trimmed. For >> example, to set a trimming lower limit to 3 letters, do something along these >> lines: >> >> rate <- .25 # pretty big rate >> max.Lmismatch (or *R*, whichever you really want) <- rate * >> 1:nchar(adaptor) >> max.Lmismatch[1:2] <- -1 >> >> Equivalently, send a shorter vector ("missing the low spots"): >> >> max.Lmismatch <- (rate * 1:nchar(adaptor)) [-(1:2)] >> >> There isn't an argument to do this for you. Maybe there should be. >> >> There is also a trick to locate adaptors in "accidental places" in your data, such >> as 3' >> adaptors in the middle of reads, like >> >> [good_data][adaptor]AAAAAAAAAAAAAAAAAAAAA >> >> For your Rpattern value, append many N's to your adaptor string, and set >> Rfixed="subject" >> (or Rfixed=FALSE, just *not* Rfixed="pattern"). I suppose, if you knew that >> the only garbage that might occur to the right of such an accident was always >> poly-*A*, then you could just append A's instead of N's, and then you >> wouldn't need to relax Rfixed. This line of attack, with the N's, is due to >> Patrick Aboyoun, the original author. >> >> Let me know if I'm way off. >> >> On Jul 23, 2013, at 6:26 PM, Taylor, Sean D wrote: >> >>> Thanks all for your input. I will give some of these solutions a try and will >> probably go with whichever is faster and integrates well with the remainder >> of our analyses in R. >>> >>> Herve, thanks for the clarification. Looking back I see that I hadn't explored >> the function fully. I think it will work, but I have a few follow- up questions. >> Here is the sample data set that I'm playing around with: >>> >>> library(Rsamtools) >>> bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools") >>> param <- ScanBamParam(what=c("seq", "qual")) gal <- >>> readGappedAlignments(bamfile, use.names=TRUE, param=param) >>> >>> These sequences are all relatively short, about 35nts each. I set the first >> read as my "adaptor" and pulled out all the reads that have a start position >> overlapping the 'adaptor' sequence: >>> >>> gal<-gal[which(start(gal)<=width(gal[1]))] >>> reads <- setNames(mcols(gal)$seq, names(gal)) >>> seqnames<-seqnames(gal) >>> adaptor<-reads[[1]] >>> >>> The data are from two 'chromosomes', seq1 and seq2. My adaptor is from >> seq1, so I expect it to overlap on all seq1, but not on seq2. This lets me >> approximate best and worst case scenarios: >>> >>> seq1<-reads[seqnames=='seq1'] >>> seq2<-reads[seqnames=='seq2'] >>> >>>> adaptor >>> 36-letter "DNAString" instance >>> seq: CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG >>> >>>> seq1 >>> A DNAStringSet instance of length 16 >>> width seq names >>> [1] 36 CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG >> B7_591:4:96:693:509 >>> [2] 35 CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT >> EAS54_65:7:152:36... >>> [3] 35 AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC >> EAS51_64:8:5:734:57 >>> [4] 36 GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT >> B7_591:1:289:587:906 >>> [5] 35 GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG >> EAS56_59:8:38:671... >>> ... ... ... >>> [12] 36 GTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGC >> B7_591:3:188:662:155 >>> [13] 35 TTTAACTCGTCCATGGCCCAGCATTAGGGATCTGT >> EAS56_59:2:225:60... >>> [14] 35 TTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTG >> EAS51_66:7:328:39... >>> [15] 35 AACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGA >> EAS51_64:5:257:96... >>> [16] 35 GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG >> EAS54_61:4:143:69... >>> >>>> seq2 >>> A DNAStringSet instance of length 25 >>> width seq names >>> [1] 36 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA >> B7_591:8:4:841:340 >>> [2] 35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA >> EAS54_67:4:142:94... >>> [3] 35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA >> EAS54_67:6:43:859... >>> [4] 35 TCAAATGAACTTCTGTAATTGAAAAATTCATTTAA >> EAS1_93:2:286:923... >>> [5] 35 AATGAACTTCTGTAATTGAAAAATTCATTTAAGAA >> EAS1_99:8:117:578... >>> ... ... ... >>> [21] 35 AAATTCATTTAAGAAATTACAAAATATAGTTGAAA >> EAS54_65:8:305:81... >>> [22] 35 AATTCATTTAAGAAATTACAAAATATAGTTGAAAG >> EAS114_26:7:13:17... >>> [23] 35 CATTTAAGAAATTACAAAATATAGTTGAAAGCTCT >> EAS56_63:7:34:334... >>> [24] 35 TTAAGAAATTACAAAATATAGTTGAAAGCTCTAAC >> EAS114_45:3:32:13... >>> [25] 40 TAAGAAATTACAAAATATAGTTGAAAGCTCTAACAATAGA >> EAS139_19:5:70:31... >>> >>> If I use the basic parameters for trimLRPatterns on seq1: >>> >>>> trimmed_reads1<-trimLRPatterns(Lpattern=adaptor, subject=seq1) >>> >>> My expected sequence widths after trimming are: >>>> width(seq1)-(length(adaptor)-start(gal)[which(seqnames=='seq1')]+1) >>> [1] 0 1 3 5 7 11 12 13 16 20 20 23 26 27 29 34 >>> >>> Actually trimmed sequence widths: >>>> width(trimmed_reads1) >>> [1] 0 1 3 35 7 11 12 13 16 20 20 23 26 27 29 34 >>> >>> Here, the 4th sequence isn't trimmed because of some mismatches in the >> sequence. I expect that my data will have some mismatches, so I need to be >> able to work with those. So I tried setting the max.Lmismatch=5 to allow for >> up to 5 mismatches, but got no trimming: >>> >>>> trimmed_reads2<-trimLRPatterns(Lpattern=adaptor, subject=seq1, >>>> max.Lmismatch=5) >>>> width(trimmed_reads2) >>> [1] 0 35 35 36 35 35 36 35 35 35 35 36 35 35 35 35 >>> >>> In my experimenting, I had gone straight to this and saw that it didn't trim >> partial sequence with these settings, so I had concluded that it wouldn't do >> the job. After playing with it, I tried adjusting trimLRPatterns=0.5 to allow >> 50% mismatches (if I understand that correctly) and this time get too much >> trimming on the last sequence: >>> >>>> trimmed_reads3<-trimLRPatterns(Lpattern=adaptor, subject=seq1, >>>> max.Lmismatch=0.5) >>>> width(trimmed_reads3) >>> [1] 0 1 3 5 7 11 12 13 16 20 20 23 26 27 29 27 >>> >>> I tried setting max.Lmismatch =0.1 but got too little trimming again: >>> >>>> trimmed_reads4<-trimLRPatterns(Lpattern=adaptor, subject=seq1, >>>> max.Lmismatch=0.1) >>>> width(trimmed_reads4) >>> [1] 0 1 3 35 7 11 12 13 16 20 20 23 26 27 29 34 >>> >>> I could probably optimize until I got it just right, but with a large data set >> that might not be too practical. >>> If I try these same settings against seq2, where there is not supposed to be >> overlap with the adaptor, I get erroneous trimming with mismatch =0.5. With >> mismatch = 0.1, it is better, but it decides to trim off the first nucleotide on a >> few of them: >>> >>>> trimmed_reads5<-trimLRPatterns(Lpattern=adaptor, subject=seq2, >>>> max.Lmismatch=0.5) trimmed_reads6<- >> trimLRPatterns(Lpattern=adaptor, >>>> subject=seq2, max.Lmismatch=0.1) >>>> width(seq2) >>> [1] 36 35 35 35 35 35 35 35 36 35 35 35 35 35 35 35 35 35 35 35 35 35 >>> 35 35 40 >>>> width(trimmed_reads5) >>> [1] 27 26 26 27 11 26 26 26 14 13 13 13 28 28 31 23 27 28 29 27 28 29 >>> 33 27 40 >>>> width(trimmed_reads6) >>> [1] 36 35 35 35 35 35 35 35 36 35 35 35 34 34 35 35 34 35 35 35 35 35 >>> 35 35 40 >>> >>> So what is the best way to optimize this? Obviously you are trying to >> balance recognizing the whole adaptor sequence without trimming off too >> many false positives. Also, can I set a limit that requires to have at least n >> bases overlap with the adaptor suffix? >>> >>> Thanks for your help (again)! >>> Sean >>> >>> >>>> -----Original Message----- >>>> From: Hervé Pagès [mailto:hpages at fhcrc.org] >>>> Sent: Monday, July 22, 2013 5:37 PM >>>> To: Taylor, Sean D >>>> Cc: bioconductor at r-project.org >>>> Subject: Re: Trimming of partial adaptor sequences >>>> >>>> Hi Sean, >>>> >>>> On 07/22/2013 01:02 PM, Taylor, Sean D wrote: >>>>> We have been experimenting with a NGS protocol in which we insert >>>>> sheared genomic fragments into a custom plasmid for sequencing on an >>>>> Illumina MiSeq instrument. The insertion site of this plasmid is >>>>> flanked by our own custom barcodes (N7) and ~80 nt Illumina- based >>>>> adaptor sequence. We then PCR out the insert with barcodes and >>>>> adaptors for sequencing. Our adaptor sequence is similar to the >>>>> Illumina adaptor, but we use custom primer binding sites. We are not >>>>> sure if the Illumina software will be able to recognize and trim our >>>>> custom adaptors. We are trying to figure out the best way to trim >>>>> read >>>> through into the 3' >>>>> adaptor ourselves. We have roughly three scenarios: >>>>> >>>>> (1) The insert is long enough that we have no read through >>>>> >>>>> (2) The vector is empty, in which case the entire adaptor sequence >>>>> is present >>>>> >>>>> (3) The insert is long enough to have useful data, but we get >>>>> read-through into the 3' adaptor sequence that must be trimmed. >>>>> >>>>> The solution we are currently working on is to identify the minimal >>>>> sequence that is recognizable as the adaptor sequence and trim that >>>>> using trimLRPatterns() in the Biostrings package. Ideally we would >>>>> like it if we could give trimLRPatterns() the entire adaptor >>>>> sequence and have it recognize it on our reads even if it is only partially >> present. >>>> >>>> May be I misunderstand what you are trying to do exactly but yes, you >>>> can give the entire adaptor sequence to trimLRPatterns() and it will >>>> recognize it on our reads even if it's only partially present: >>>> >>>> library(Biostrings) >>>> >>>> adaptor <- DNAString("ACCAGGACAA") # entire adaptor >>>> reads <- DNAStringSet(c( >>>> "GACAATTTATTT", # adaptor partially present on the left >>>> "CAATTTATTTGC", # adaptor partially present on the left >>>> "TTTATTTACCAG", # adaptor partially present on the right >>>> "CAATTTTTTACC" # adaptor partially present on both ends >>>> )) >>>> >>>> Then: >>>> >>>>> trimLRPatterns(Lpattern=adaptor, Rpattern=adaptor, subject=reads) >>>> A DNAStringSet instance of length 4 >>>> width seq >>>> [1] 7 TTTATTT >>>> [2] 9 TTTATTTGC >>>> [3] 7 TTTATTT >>>> [4] 6 TTTTTT >>>> >>>> Note that trimLRPatterns() expects that, when the adaptor is >>>> partially present on the left (resp. right), what's present is a suffix (resp. >>>> prefix) of the adaptor, and not an arbitrary substring of it. Is it >>>> what you expect too? >>>> >>>> Thanks, >>>> H. >>>> >>>>> However, in my experimenting it did not seem to be able to this. I >>>>> thought I would ask the Bioconductor community if there are any >>>>> better solutions to recognizing and trimming partial adaptor sequences. >>>>> >>>>> Thanks in advance for any input. >>>>> >>>>> Sean Taylor >>>>> >>>>> Post-doctoral Fellow >>>>> >>>>> Fred Hutchinson Cancer Research Center >>>>> >>>>> 206-667-5544 >>>>> >>>> >>>> -- >>>> Hervé Pagès >>>> >>>> Program in Computational Biology >>>> Division of Public Health Sciences >>>> Fred Hutchinson Cancer Research Center >>>> 1100 Fairview Ave. N, M1-B514 >>>> P.O. Box 19024 >>>> Seattle, WA 98109-1024 >>>> >>>> E-mail: hpages at fhcrc.org >>>> Phone: (206) 667-5791 >>>> Fax: (206) 667-1319 >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >
0
Entering edit mode
Hi Harris, On 07/24/2013 10:41 AM, Harris A. Jaffee wrote: > > On Jul 24, 2013, at 12:26 PM, Taylor, Sean D wrote: > >> Hi Harris, >> >> No you are right, I will mostly do 3' trimming. But for my test data it just seemed more convenient to test using Lpattern and I assumed (correctly I hope) that all the same things apply to Rpattern so long as you keep your prefixes and suffixes straight. > > Yes. > >> It took me a little while to figure out what was going on with the max.Lmismatch vector and the negative values (@Herve, I'm glad I'm not the only one who finds the language on the man page to be a bit opaque at times!) > > Opaqueness due to my writing skills and an effort not to overload with detail. It needs work and examples. > >> but with some playing around and with the help of your example I think I understand how it's working now. Just to clarify, the low spots are always set at the beginning of the vector, regardless of if you are doing L or R patterns, correct? > > Yes, it (and supporting code underneath) is symmetric, at least intended to be. > >> Also thanks for the tip on finding the internal adaptors. I tried it out and it worked pretty well, although I discovered that it is easy to overkill with a bunch of Ns, so I'll have to use it conservatively. My adaptor sequence is pretty big though (~80nts) so I should be able to tolerate a decent number of N's on the end, but I will experiment and see. When you use this trick, are the N's being counted as mismatches and should I alter max.L/Rmismatch? > > No. The "non-fixed" matching of IUPAC wild-card letters is free. (Use neditAt to see this.) > > I point out 'with.indels=TRUE' in case you missed it. They count as 1 mismatch per indel needed. > Previously (and still in the release branch, I think), you could not get indels and non-fixed at > the same time. But it was just implemented last week, in Biostrings devel anyway, by Herve. Yes the matching functions in Biostrings (in devel) finally support indels and IUPAC ambiguities *together*. You get the credit for requesting this feature off-list and sending me an initial patch. Thanks! H. > In > my limited view, you always want to allow indels and almost always some degree of non-fixed (i.e. > fixed="pattern" or, if you add N's to your adaptor, fixed=FALSE), because there will be N's in > your data. Although, you have to decide whether these N's should match your adaptor freely or by > using your allowed mismatch limit. > >> Otherwise, I think I'm pretty happy with these solutions. Thanks so much for your help! >> Sean >> >>> -----Original Message----- >>> From: Harris A. Jaffee [mailto:hj at jhu.edu] >>> Sent: Tuesday, July 23, 2013 9:23 PM >>> To: Taylor, Sean D >>> Cc: Pages, Herve; bioconductor at r-project.org; James W. MacDonald; >>> dpryan79 at gmail.com >>> Subject: Re: [BioC] Trimming of partial adaptor sequences >>> >>> Sorry to chime in late; I've been on the road. I'm still not clear on your >>> specifications. >>> Herve's answer showed you how to do trimming of the same adaptor, so to >>> speak, at both ends of your reads, but I took your original question to be >>> about trimming at the 3' end only (in which case your adaptor goes as >>> 'Rpattern', with nothing in 'Lpattern'). But I could be all wrong. Please >>> explain in more detail what you want, if you aren't happy yet. >>> >>> You are right about a given "max.L/Rmismatch" value not being suitable for >>> all your data. >>> There is no way to beat that except to subdivide your data. You may be able >>> to gather some intuition using neditStartingAt() or neditEndingAt(). >>> >>> To prevent (tiny) "false positive" trimming, construct a max.mismatch >>> *vector* of the same length as your adaptor string, with -1 in the low spots. >>> By "low", I mean as many letters as you want never to be trimmed. For >>> example, to set a trimming lower limit to 3 letters, do something along these >>> lines: >>> >>> rate <- .25 # pretty big rate >>> max.Lmismatch (or *R*, whichever you really want) <- rate * >>> 1:nchar(adaptor) >>> max.Lmismatch[1:2] <- -1 >>> >>> Equivalently, send a shorter vector ("missing the low spots"): >>> >>> max.Lmismatch <- (rate * 1:nchar(adaptor)) [-(1:2)] >>> >>> There isn't an argument to do this for you. Maybe there should be. >>> >>> There is also a trick to locate adaptors in "accidental places" in your data, such >>> as 3' >>> adaptors in the middle of reads, like >>> >>> [good_data][adaptor]AAAAAAAAAAAAAAAAAAAAA >>> >>> For your Rpattern value, append many N's to your adaptor string, and set >>> Rfixed="subject" >>> (or Rfixed=FALSE, just *not* Rfixed="pattern"). I suppose, if you knew that >>> the only garbage that might occur to the right of such an accident was always >>> poly-*A*, then you could just append A's instead of N's, and then you >>> wouldn't need to relax Rfixed. This line of attack, with the N's, is due to >>> Patrick Aboyoun, the original author. >>> >>> Let me know if I'm way off. >>> >>> On Jul 23, 2013, at 6:26 PM, Taylor, Sean D wrote: >>> >>>> Thanks all for your input. I will give some of these solutions a try and will >>> probably go with whichever is faster and integrates well with the remainder >>> of our analyses in R. >>>> >>>> Herve, thanks for the clarification. Looking back I see that I hadn't explored >>> the function fully. I think it will work, but I have a few follow- up questions. >>> Here is the sample data set that I'm playing around with: >>>> >>>> library(Rsamtools) >>>> bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools") >>>> param <- ScanBamParam(what=c("seq", "qual")) gal <- >>>> readGappedAlignments(bamfile, use.names=TRUE, param=param) >>>> >>>> These sequences are all relatively short, about 35nts each. I set the first >>> read as my "adaptor" and pulled out all the reads that have a start position >>> overlapping the 'adaptor' sequence: >>>> >>>> gal<-gal[which(start(gal)<=width(gal[1]))] >>>> reads <- setNames(mcols(gal)$seq, names(gal)) >>>> seqnames<-seqnames(gal) >>>> adaptor<-reads[[1]] >>>> >>>> The data are from two 'chromosomes', seq1 and seq2. My adaptor is from >>> seq1, so I expect it to overlap on all seq1, but not on seq2. This lets me >>> approximate best and worst case scenarios: >>>> >>>> seq1<-reads[seqnames=='seq1'] >>>> seq2<-reads[seqnames=='seq2'] >>>> >>>>> adaptor >>>> 36-letter "DNAString" instance >>>> seq: CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG >>>> >>>>> seq1 >>>> A DNAStringSet instance of length 16 >>>> width seq names >>>> [1] 36 CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG >>> B7_591:4:96:693:509 >>>> [2] 35 CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT >>> EAS54_65:7:152:36... >>>> [3] 35 AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC >>> EAS51_64:8:5:734:57 >>>> [4] 36 GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT >>> B7_591:1:289:587:906 >>>> [5] 35 GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG >>> EAS56_59:8:38:671... >>>> ... ... ... >>>> [12] 36 GTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGC >>> B7_591:3:188:662:155 >>>> [13] 35 TTTAACTCGTCCATGGCCCAGCATTAGGGATCTGT >>> EAS56_59:2:225:60... >>>> [14] 35 TTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTG >>> EAS51_66:7:328:39... >>>> [15] 35 AACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGA >>> EAS51_64:5:257:96... >>>> [16] 35 GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG >>> EAS54_61:4:143:69... >>>> >>>>> seq2 >>>> A DNAStringSet instance of length 25 >>>> width seq names >>>> [1] 36 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA >>> B7_591:8:4:841:340 >>>> [2] 35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA >>> EAS54_67:4:142:94... >>>> [3] 35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA >>> EAS54_67:6:43:859... >>>> [4] 35 TCAAATGAACTTCTGTAATTGAAAAATTCATTTAA >>> EAS1_93:2:286:923... >>>> [5] 35 AATGAACTTCTGTAATTGAAAAATTCATTTAAGAA >>> EAS1_99:8:117:578... >>>> ... ... ... >>>> [21] 35 AAATTCATTTAAGAAATTACAAAATATAGTTGAAA >>> EAS54_65:8:305:81... >>>> [22] 35 AATTCATTTAAGAAATTACAAAATATAGTTGAAAG >>> EAS114_26:7:13:17... >>>> [23] 35 CATTTAAGAAATTACAAAATATAGTTGAAAGCTCT >>> EAS56_63:7:34:334... >>>> [24] 35 TTAAGAAATTACAAAATATAGTTGAAAGCTCTAAC >>> EAS114_45:3:32:13... >>>> [25] 40 TAAGAAATTACAAAATATAGTTGAAAGCTCTAACAATAGA >>> EAS139_19:5:70:31... >>>> >>>> If I use the basic parameters for trimLRPatterns on seq1: >>>> >>>>> trimmed_reads1<-trimLRPatterns(Lpattern=adaptor, subject=seq1) >>>> >>>> My expected sequence widths after trimming are: >>>>> width(seq1)-(length(adaptor)-start(gal)[which(seqnames=='seq1')]+1) >>>> [1] 0 1 3 5 7 11 12 13 16 20 20 23 26 27 29 34 >>>> >>>> Actually trimmed sequence widths: >>>>> width(trimmed_reads1) >>>> [1] 0 1 3 35 7 11 12 13 16 20 20 23 26 27 29 34 >>>> >>>> Here, the 4th sequence isn't trimmed because of some mismatches in the >>> sequence. I expect that my data will have some mismatches, so I need to be >>> able to work with those. So I tried setting the max.Lmismatch=5 to allow for >>> up to 5 mismatches, but got no trimming: >>>> >>>>> trimmed_reads2<-trimLRPatterns(Lpattern=adaptor, subject=seq1, >>>>> max.Lmismatch=5) >>>>> width(trimmed_reads2) >>>> [1] 0 35 35 36 35 35 36 35 35 35 35 36 35 35 35 35 >>>> >>>> In my experimenting, I had gone straight to this and saw that it didn't trim >>> partial sequence with these settings, so I had concluded that it wouldn't do >>> the job. After playing with it, I tried adjusting trimLRPatterns=0.5 to allow >>> 50% mismatches (if I understand that correctly) and this time get too much >>> trimming on the last sequence: >>>> >>>>> trimmed_reads3<-trimLRPatterns(Lpattern=adaptor, subject=seq1, >>>>> max.Lmismatch=0.5) >>>>> width(trimmed_reads3) >>>> [1] 0 1 3 5 7 11 12 13 16 20 20 23 26 27 29 27 >>>> >>>> I tried setting max.Lmismatch =0.1 but got too little trimming again: >>>> >>>>> trimmed_reads4<-trimLRPatterns(Lpattern=adaptor, subject=seq1, >>>>> max.Lmismatch=0.1) >>>>> width(trimmed_reads4) >>>> [1] 0 1 3 35 7 11 12 13 16 20 20 23 26 27 29 34 >>>> >>>> I could probably optimize until I got it just right, but with a large data set >>> that might not be too practical. >>>> If I try these same settings against seq2, where there is not supposed to be >>> overlap with the adaptor, I get erroneous trimming with mismatch =0.5. With >>> mismatch = 0.1, it is better, but it decides to trim off the first nucleotide on a >>> few of them: >>>> >>>>> trimmed_reads5<-trimLRPatterns(Lpattern=adaptor, subject=seq2, >>>>> max.Lmismatch=0.5) trimmed_reads6<- >>> trimLRPatterns(Lpattern=adaptor, >>>>> subject=seq2, max.Lmismatch=0.1) >>>>> width(seq2) >>>> [1] 36 35 35 35 35 35 35 35 36 35 35 35 35 35 35 35 35 35 35 35 35 35 >>>> 35 35 40 >>>>> width(trimmed_reads5) >>>> [1] 27 26 26 27 11 26 26 26 14 13 13 13 28 28 31 23 27 28 29 27 28 29 >>>> 33 27 40 >>>>> width(trimmed_reads6) >>>> [1] 36 35 35 35 35 35 35 35 36 35 35 35 34 34 35 35 34 35 35 35 35 35 >>>> 35 35 40 >>>> >>>> So what is the best way to optimize this? Obviously you are trying to >>> balance recognizing the whole adaptor sequence without trimming off too >>> many false positives. Also, can I set a limit that requires to have at least n >>> bases overlap with the adaptor suffix? >>>> >>>> Thanks for your help (again)! >>>> Sean >>>> >>>> >>>>> -----Original Message----- >>>>> From: Hervé Pagès [mailto:hpages at fhcrc.org] >>>>> Sent: Monday, July 22, 2013 5:37 PM >>>>> To: Taylor, Sean D >>>>> Cc: bioconductor at r-project.org >>>>> Subject: Re: Trimming of partial adaptor sequences >>>>> >>>>> Hi Sean, >>>>> >>>>> On 07/22/2013 01:02 PM, Taylor, Sean D wrote: >>>>>> We have been experimenting with a NGS protocol in which we insert >>>>>> sheared genomic fragments into a custom plasmid for sequencing on an >>>>>> Illumina MiSeq instrument. The insertion site of this plasmid is >>>>>> flanked by our own custom barcodes (N7) and ~80 nt Illumina- based >>>>>> adaptor sequence. We then PCR out the insert with barcodes and >>>>>> adaptors for sequencing. Our adaptor sequence is similar to the >>>>>> Illumina adaptor, but we use custom primer binding sites. We are not >>>>>> sure if the Illumina software will be able to recognize and trim our >>>>>> custom adaptors. We are trying to figure out the best way to trim >>>>>> read >>>>> through into the 3' >>>>>> adaptor ourselves. We have roughly three scenarios: >>>>>> >>>>>> (1) The insert is long enough that we have no read through >>>>>> >>>>>> (2) The vector is empty, in which case the entire adaptor sequence >>>>>> is present >>>>>> >>>>>> (3) The insert is long enough to have useful data, but we get >>>>>> read-through into the 3' adaptor sequence that must be trimmed. >>>>>> >>>>>> The solution we are currently working on is to identify the minimal >>>>>> sequence that is recognizable as the adaptor sequence and trim that >>>>>> using trimLRPatterns() in the Biostrings package. Ideally we would >>>>>> like it if we could give trimLRPatterns() the entire adaptor >>>>>> sequence and have it recognize it on our reads even if it is only partially >>> present. >>>>> >>>>> May be I misunderstand what you are trying to do exactly but yes, you >>>>> can give the entire adaptor sequence to trimLRPatterns() and it will >>>>> recognize it on our reads even if it's only partially present: >>>>> >>>>> library(Biostrings) >>>>> >>>>> adaptor <- DNAString("ACCAGGACAA") # entire adaptor >>>>> reads <- DNAStringSet(c( >>>>> "GACAATTTATTT", # adaptor partially present on the left >>>>> "CAATTTATTTGC", # adaptor partially present on the left >>>>> "TTTATTTACCAG", # adaptor partially present on the right >>>>> "CAATTTTTTACC" # adaptor partially present on both ends >>>>> )) >>>>> >>>>> Then: >>>>> >>>>>> trimLRPatterns(Lpattern=adaptor, Rpattern=adaptor, subject=reads) >>>>> A DNAStringSet instance of length 4 >>>>> width seq >>>>> [1] 7 TTTATTT >>>>> [2] 9 TTTATTTGC >>>>> [3] 7 TTTATTT >>>>> [4] 6 TTTTTT >>>>> >>>>> Note that trimLRPatterns() expects that, when the adaptor is >>>>> partially present on the left (resp. right), what's present is a suffix (resp. >>>>> prefix) of the adaptor, and not an arbitrary substring of it. Is it >>>>> what you expect too? >>>>> >>>>> Thanks, >>>>> H. >>>>> >>>>>> However, in my experimenting it did not seem to be able to this. I >>>>>> thought I would ask the Bioconductor community if there are any >>>>>> better solutions to recognizing and trimming partial adaptor sequences. >>>>>> >>>>>> Thanks in advance for any input. >>>>>> >>>>>> Sean Taylor >>>>>> >>>>>> Post-doctoral Fellow >>>>>> >>>>>> Fred Hutchinson Cancer Research Center >>>>>> >>>>>> 206-667-5544 >>>>>> >>>>> >>>>> -- >>>>> Hervé Pagès >>>>> >>>>> Program in Computational Biology >>>>> Division of Public Health Sciences >>>>> Fred Hutchinson Cancer Research Center >>>>> 1100 Fairview Ave. N, M1-B514 >>>>> P.O. Box 19024 >>>>> Seattle, WA 98109-1024 >>>>> >>>>> E-mail: hpages at fhcrc.org >>>>> Phone: (206) 667-5791 >>>>> Fax: (206) 667-1319 >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319 ADD REPLY 0 Entering edit mode Devon Ryan ▴ 200 @devon-ryan-6054 Last seen 7.1 years ago Germany Hi Sean, Have you tried just using a read trimmer, such as trim_galore or trimmomatic? That would seem much easier than rolling your own solution in R. Cheers, Devon ____________________________________________ Devon Ryan, Ph.D. Email: dpryan at dpryan.com Molecular and Cellular Cognition Lab German Centre for Neurodegenerative Diseases (DZNE) Ludwig-Erhard-Allee 2 53175 Bonn, Germany On Jul 22, 2013, at 10:02 PM, Taylor, Sean D wrote: > We have been experimenting with a NGS protocol in which we insert sheared genomic fragments into a custom plasmid for sequencing on an Illumina MiSeq instrument. The insertion site of this plasmid is flanked by our own custom barcodes (N7) and ~80 nt Illumina-based adaptor sequence. We then PCR out the insert with barcodes and adaptors for sequencing. Our adaptor sequence is similar to the Illumina adaptor, but we use custom primer binding sites. We are not sure if the Illumina software will be able to recognize and trim our custom adaptors. We are trying to figure out the best way to trim read through into the 3' adaptor ourselves. We have roughly three scenarios: > > (1) The insert is long enough that we have no read through > (2) The vector is empty, in which case the entire adaptor sequence is present > (3) The insert is long enough to have useful data, but we get read- through into the 3' adaptor sequence that must be trimmed. > > The solution we are currently working on is to identify the minimal sequence that is recognizable as the adaptor sequence and trim that using trimLRPatterns() in the Biostrings package. Ideally we would like it if we could give trimLRPatterns() the entire adaptor sequence and have it recognize it on our reads even if it is only partially present. However, in my experimenting it did not seem to be able to this. I thought I would ask the Bioconductor community if there are any better solutions to recognizing and trimming partial adaptor sequences. > > Thanks in advance for any input. > > Sean Taylor > Post-doctoral Fellow > Fred Hutchinson Cancer Research Center > 206-667-5544 > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ADD COMMENT 0 Entering edit mode @stijn-van-dongen-3896 Last seen 7.9 years ago United Kingdom This is an answer I sent off-list to Sean, but I am sending to the list as some other responses may indicate this is topical: We recently published 'Kraken', a suite of tools for NGS processing (the umpteenth, indeed). It has extensive facities for adapter trimming though, as we do a lot of small RNA sequencing, and adpters are often present but sometimes partially. The tool that does this is reaper. It has a mechanism where it recognises successively weaker types of matches, all under control of the user. 1) good match; a stretch of N nucletodides matches with at most E edit distance and optionally G gaps between adapter and read (N,E,G user-controlled) 2) prefix match; similar as above, but between the end of the read and the start of the adapter. Separate parameters are provided for this. 3) exact head-to-tail match, for example where the last two or three bases of the read match the adapter start. This is user- controlled too. reaper will read (optionally gzipp'ed) fastq and output fastq, and has many more filtering options (e.g. quality, low-complexity). It can do somewhere between 60M-200M reads per hour, depending on read length. More information: http://www.sciencedirect.com/science/article/pii/S1046202313002399 http://www.ebi.ac.uk/research/enright/software/kraken ftp://ftp.ebi.ac.uk/pub/contrib/enrightlab/kraken/reaper/src /reaper-latest/doc/reaper.html best, Stijn On Mon, Jul 22, 2013 at 08:02:24PM +0000, Taylor, Sean D wrote: > We have been experimenting with a NGS protocol in which we insert sheared genomic fragments into a custom plasmid for sequencing on an Illumina MiSeq instrument. The insertion site of this plasmid is flanked by our own custom barcodes (N7) and ~80 nt Illumina-based adaptor sequence. We then PCR out the insert with barcodes and adaptors for sequencing. Our adaptor sequence is similar to the Illumina adaptor, but we use custom primer binding sites. We are not sure if the Illumina software will be able to recognize and trim our custom adaptors. We are trying to figure out the best way to trim read through into the 3' adaptor ourselves. We have roughly three scenarios: > > (1) The insert is long enough that we have no read through > (2) The vector is empty, in which case the entire adaptor sequence is present > (3) The insert is long enough to have useful data, but we get read- through into the 3' adaptor sequence that must be trimmed. > > The solution we are currently working on is to identify the minimal sequence that is recognizable as the adaptor sequence and trim that using trimLRPatterns() in the Biostrings package. Ideally we would like it if we could give trimLRPatterns() the entire adaptor sequence and have it recognize it on our reads even if it is only partially present. However, in my experimenting it did not seem to be able to this. I thought I would ask the Bioconductor community if there are any better solutions to recognizing and trimming partial adaptor sequences. > > Thanks in advance for any input. > > Sean Taylor > Post-doctoral Fellow > Fred Hutchinson Cancer Research Center > 206-667-5544 > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Stijn van Dongen >8< -o) O< forename pronunciation: [Stan] EMBL-EBI /\\ Tel: +44-(0)1223-492675 Hinxton, Cambridge, CB10 1SD, UK _\_/ http://micans.org/stijn ADD COMMENT 0 Entering edit mode @taylor-sean-d-5800 Last seen 8.4 years ago Another question: Can trimLRPatterns() recognize internal sequences or just terminal ones? For example, using the toy data I generated before: library(Rsamtools) bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools") param <- ScanBamParam(what=c("seq", "qual")) gal <- readGappedAlignments(bamfile, use.names=TRUE, param=param) gal<-gal[which(start(gal)<=width(gal[1]))] reads <- setNames(mcols(gal)$seq, names(gal)) seqnames<-seqnames(gal) seq1<-reads[seqnames=='seq1'] seq2<-reads[seqnames=='seq2'] adaptor2<-subseq(seq1[[10]],start=1, end=10) This simulates an adaptor sequence pattern that is internal. This is a relevant scenario for us as we are cloning our sheared library into a vector that has the adaptor sequences. We then PCR amplify out the library with adaptors. However, we know we get instances of empty vector and other instances where the adaptor sequence is not terminal. When I use this adaptor sequence, I only trim the reads that have the sequence on the end rather than internally: > trimmed_reads7<-trimLRPatterns(Lpattern=adaptor2, subject=seq1, max.Lmismatch=0.2) > width(seq1) [1] 36 35 35 36 35 35 36 35 35 35 35 36 35 35 35 35 > width(trimmed_reads7) [1] 36 35 34 36 35 34 35 35 34 25 25 28 31 32 34 35 Sean > -----Original Message----- > From: Taylor, Sean D > Sent: Tuesday, July 23, 2013 3:27 PM > To: 'Hervé Pagès' > Cc: bioconductor at r-project.org; 'dpryan79 at gmail.com'; 'James W. > MacDonald'; 'Ryan C. Thompson'; 'Stijn van Dongen' > Subject: RE: Trimming of partial adaptor sequences > > Thanks all for your input. I will give some of these solutions a try and will > probably go with whichever is faster and integrates well with the remainder > of our analyses in R. > > Herve, thanks for the clarification. Looking back I see that I hadn't explored > the function fully. I think it will work, but I have a few follow-up questions. > Here is the sample data set that I'm playing around with: > > library(Rsamtools) > bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools") param > <- ScanBamParam(what=c("seq", "qual")) gal <- > readGappedAlignments(bamfile, use.names=TRUE, param=param) > > These sequences are all relatively short, about 35nts each. I set the first read > as my "adaptor" and pulled out all the reads that have a start position > overlapping the 'adaptor' sequence: > > gal<-gal[which(start(gal)<=width(gal[1]))] > reads <- setNames(mcols(gal)\$seq, names(gal)) > seqnames<-seqnames(gal) > adaptor<-reads[[1]] > > The data are from two 'chromosomes', seq1 and seq2. My adaptor is from > seq1, so I expect it to overlap on all seq1, but not on seq2. This lets me > approximate best and worst case scenarios: > > seq1<-reads[seqnames=='seq1'] > seq2<-reads[seqnames=='seq2'] > > > adaptor > 36-letter "DNAString" instance > seq: CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG > > > seq1 > A DNAStringSet instance of length 16 > width seq names > [1] 36 CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG > B7_591:4:96:693:509 > [2] 35 CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT > EAS54_65:7:152:36... > [3] 35 AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC > EAS51_64:8:5:734:57 > [4] 36 GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT > B7_591:1:289:587:906 > [5] 35 GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG > EAS56_59:8:38:671... > ... ... ... > [12] 36 GTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGC > B7_591:3:188:662:155 > [13] 35 TTTAACTCGTCCATGGCCCAGCATTAGGGATCTGT > EAS56_59:2:225:60... > [14] 35 TTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTG > EAS51_66:7:328:39... > [15] 35 AACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGA > EAS51_64:5:257:96... > [16] 35 GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG > EAS54_61:4:143:69... > > > seq2 > A DNAStringSet instance of length 25 > width seq names > [1] 36 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA > B7_591:8:4:841:340 > [2] 35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA > EAS54_67:4:142:94... > [3] 35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA > EAS54_67:6:43:859... > [4] 35 TCAAATGAACTTCTGTAATTGAAAAATTCATTTAA > EAS1_93:2:286:923... > [5] 35 AATGAACTTCTGTAATTGAAAAATTCATTTAAGAA > EAS1_99:8:117:578... > ... ... ... > [21] 35 AAATTCATTTAAGAAATTACAAAATATAGTTGAAA > EAS54_65:8:305:81... > [22] 35 AATTCATTTAAGAAATTACAAAATATAGTTGAAAG > EAS114_26:7:13:17... > [23] 35 CATTTAAGAAATTACAAAATATAGTTGAAAGCTCT > EAS56_63:7:34:334... > [24] 35 TTAAGAAATTACAAAATATAGTTGAAAGCTCTAAC > EAS114_45:3:32:13... > [25] 40 TAAGAAATTACAAAATATAGTTGAAAGCTCTAACAATAGA > EAS139_19:5:70:31... > > If I use the basic parameters for trimLRPatterns on seq1: > > > trimmed_reads1<-trimLRPatterns(Lpattern=adaptor, subject=seq1) > > My expected sequence widths after trimming are: > > width(seq1)-(length(adaptor)-start(gal)[which(seqnames=='seq1')]+1) > [1] 0 1 3 5 7 11 12 13 16 20 20 23 26 27 29 34 > > Actually trimmed sequence widths: > > width(trimmed_reads1) > [1] 0 1 3 35 7 11 12 13 16 20 20 23 26 27 29 34 > > Here, the 4th sequence isn't trimmed because of some mismatches in the > sequence. I expect that my data will have some mismatches, so I need to be > able to work with those. So I tried setting the max.Lmismatch=5 to allow for > up to 5 mismatches, but got no trimming: > > > trimmed_reads2<-trimLRPatterns(Lpattern=adaptor, subject=seq1, > > max.Lmismatch=5) > > width(trimmed_reads2) > [1] 0 35 35 36 35 35 36 35 35 35 35 36 35 35 35 35 > > In my experimenting, I had gone straight to this and saw that it didn't trim > partial sequence with these settings, so I had concluded that it wouldn't do > the job. After playing with it, I tried adjusting trimLRPatterns=0.5 to allow > 50% mismatches (if I understand that correctly) and this time get too much > trimming on the last sequence: > > > trimmed_reads3<-trimLRPatterns(Lpattern=adaptor, subject=seq1, > > max.Lmismatch=0.5) > > width(trimmed_reads3) > [1] 0 1 3 5 7 11 12 13 16 20 20 23 26 27 29 27 > > I tried setting max.Lmismatch =0.1 but got too little trimming again: > > > trimmed_reads4<-trimLRPatterns(Lpattern=adaptor, subject=seq1, > > max.Lmismatch=0.1) > > width(trimmed_reads4) > [1] 0 1 3 35 7 11 12 13 16 20 20 23 26 27 29 34 > > I could probably optimize until I got it just right, but with a large data set that > might not be too practical. > If I try these same settings against seq2, where there is not supposed to be > overlap with the adaptor, I get erroneous trimming with mismatch =0.5. With > mismatch = 0.1, it is better, but it decides to trim off the first nucleotide on a > few of them: > > > trimmed_reads5<-trimLRPatterns(Lpattern=adaptor, subject=seq2, > > max.Lmismatch=0.5) trimmed_reads6<-trimLRPatterns(Lpattern=adaptor, > > subject=seq2, max.Lmismatch=0.1) > > width(seq2) > [1] 36 35 35 35 35 35 35 35 36 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35 40 > > width(trimmed_reads5) > [1] 27 26 26 27 11 26 26 26 14 13 13 13 28 28 31 23 27 28 29 27 28 29 33 27 40 > > width(trimmed_reads6) > [1] 36 35 35 35 35 35 35 35 36 35 35 35 34 34 35 35 34 35 35 35 35 35 35 35 40 > > So what is the best way to optimize this? Obviously you are trying to balance > recognizing the whole adaptor sequence without trimming off too many > false positives. Also, can I set a limit that requires to have at least n bases > overlap with the adaptor suffix? > > Thanks for your help (again)! > Sean > > > > -----Original Message----- > > From: Hervé Pagès [mailto:hpages at fhcrc.org] > > Sent: Monday, July 22, 2013 5:37 PM > > To: Taylor, Sean D > > Cc: bioconductor at r-project.org > > Subject: Re: Trimming of partial adaptor sequences > > > > Hi Sean, > > > > On 07/22/2013 01:02 PM, Taylor, Sean D wrote: > > > We have been experimenting with a NGS protocol in which we insert > > > sheared genomic fragments into a custom plasmid for sequencing on an > > > Illumina MiSeq instrument. The insertion site of this plasmid is > > > flanked by our own custom barcodes (N7) and ~80 nt Illumina- based > > > adaptor sequence. We then PCR out the insert with barcodes and > > > adaptors for sequencing. Our adaptor sequence is similar to the > > > Illumina adaptor, but we use custom primer binding sites. We are not > > > sure if the Illumina software will be able to recognize and trim our > > > custom adaptors. We are trying to figure out the best way to trim > > > read > > through into the 3' > > > adaptor ourselves. We have roughly three scenarios: > > > > > > (1) The insert is long enough that we have no read through > > > > > > (2) The vector is empty, in which case the entire adaptor sequence > > > is present > > > > > > (3) The insert is long enough to have useful data, but we get > > > read-through into the 3' adaptor sequence that must be trimmed. > > > > > > The solution we are currently working on is to identify the minimal > > > sequence that is recognizable as the adaptor sequence and trim that > > > using trimLRPatterns() in the Biostrings package. Ideally we would > > > like it if we could give trimLRPatterns() the entire adaptor > > > sequence and have it recognize it on our reads even if it is only partially > present. > > > > May be I misunderstand what you are trying to do exactly but yes, you > > can give the entire adaptor sequence to trimLRPatterns() and it will > > recognize it on our reads even if it's only partially present: > > > > library(Biostrings) > > > > adaptor <- DNAString("ACCAGGACAA") # entire adaptor > > reads <- DNAStringSet(c( > > "GACAATTTATTT", # adaptor partially present on the left > > "CAATTTATTTGC", # adaptor partially present on the left > > "TTTATTTACCAG", # adaptor partially present on the right > > "CAATTTTTTACC" # adaptor partially present on both ends > > )) > > > > Then: > > > > > trimLRPatterns(Lpattern=adaptor, Rpattern=adaptor, subject=reads) > > A DNAStringSet instance of length 4 > > width seq > > [1] 7 TTTATTT > > [2] 9 TTTATTTGC > > [3] 7 TTTATTT > > [4] 6 TTTTTT > > > > Note that trimLRPatterns() expects that, when the adaptor is partially > > present on the left (resp. right), what's present is a suffix (resp. > > prefix) of the adaptor, and not an arbitrary substring of it. Is it > > what you expect too? > > > > Thanks, > > H. > > > > > However, in my experimenting it did not seem to be able to this. I > > > thought I would ask the Bioconductor community if there are any > > > better solutions to recognizing and trimming partial adaptor sequences. > > > > > > Thanks in advance for any input. > > > > > > Sean Taylor > > > > > > Post-doctoral Fellow > > > > > > Fred Hutchinson Cancer Research Center > > > > > > 206-667-5544 > > > > > > > -- > > Hervé Pagès > > > > Program in Computational Biology > > Division of Public Health Sciences > > Fred Hutchinson Cancer Research Center > > 1100 Fairview Ave. N, M1-B514 > > P.O. Box 19024 > > Seattle, WA 98109-1024 > > > > E-mail: hpages at fhcrc.org > > Phone: (206) 667-5791 > > Fax: (206) 667-1319