Another question: Can trimLRPatterns() recognize internal sequences or
just terminal ones?
For example, using the toy data I generated before:
library(Rsamtools)
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")
param <- ScanBamParam(what=c("seq", "qual"))
gal <- readGappedAlignments(bamfile, use.names=TRUE, param=param)
gal<-gal[which(start(gal)<=width(gal[1]))]
reads <- setNames(mcols(gal)$seq, names(gal))
seqnames<-seqnames(gal)
seq1<-reads[seqnames=='seq1']
seq2<-reads[seqnames=='seq2']
adaptor2<-subseq(seq1[[10]],start=1, end=10)
This simulates an adaptor sequence pattern that is internal. This is a
relevant scenario for us as we are cloning our sheared library into a
vector that has the adaptor sequences. We then PCR amplify out the
library with adaptors. However, we know we get instances of empty
vector and other instances where the adaptor sequence is not terminal.
When I use this adaptor sequence, I only trim the reads that have the
sequence on the end rather than internally:
> trimmed_reads7<-trimLRPatterns(Lpattern=adaptor2, subject=seq1,
max.Lmismatch=0.2)
> width(seq1)
[1] 36 35 35 36 35 35 36 35 35 35 35 36 35 35 35 35
> width(trimmed_reads7)
[1] 36 35 34 36 35 34 35 35 34 25 25 28 31 32 34 35
Sean
> -----Original Message-----
> From: Taylor, Sean D
> Sent: Tuesday, July 23, 2013 3:27 PM
> To: 'Hervé Pagès'
> Cc: bioconductor at r-project.org; 'dpryan79 at gmail.com'; 'James
W.
> MacDonald'; 'Ryan C. Thompson'; 'Stijn van Dongen'
> Subject: RE: Trimming of partial adaptor sequences
>
> Thanks all for your input. I will give some of these solutions a try
and will
> probably go with whichever is faster and integrates well with the
remainder
> of our analyses in R.
>
> Herve, thanks for the clarification. Looking back I see that I
hadn't explored
> the function fully. I think it will work, but I have a few follow-up
questions.
> Here is the sample data set that I'm playing around with:
>
> library(Rsamtools)
> bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")
param
> <- ScanBamParam(what=c("seq", "qual")) gal <-
> readGappedAlignments(bamfile, use.names=TRUE, param=param)
>
> These sequences are all relatively short, about 35nts each. I set
the first read
> as my "adaptor" and pulled out all the reads that have a start
position
> overlapping the 'adaptor' sequence:
>
> gal<-gal[which(start(gal)<=width(gal[1]))]
> reads <- setNames(mcols(gal)$seq, names(gal))
> seqnames<-seqnames(gal)
> adaptor<-reads[[1]]
>
> The data are from two 'chromosomes', seq1 and seq2. My adaptor is
from
> seq1, so I expect it to overlap on all seq1, but not on seq2. This
lets me
> approximate best and worst case scenarios:
>
> seq1<-reads[seqnames=='seq1']
> seq2<-reads[seqnames=='seq2']
>
> > adaptor
> 36-letter "DNAString" instance
> seq: CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
>
> > seq1
> A DNAStringSet instance of length 16
> width seq names
> [1] 36 CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
> B7_591:4:96:693:509
> [2] 35 CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT
> EAS54_65:7:152:36...
> [3] 35 AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC
> EAS51_64:8:5:734:57
> [4] 36 GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT
> B7_591:1:289:587:906
> [5] 35 GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG
> EAS56_59:8:38:671...
> ... ... ...
> [12] 36 GTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGC
> B7_591:3:188:662:155
> [13] 35 TTTAACTCGTCCATGGCCCAGCATTAGGGATCTGT
> EAS56_59:2:225:60...
> [14] 35 TTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTG
> EAS51_66:7:328:39...
> [15] 35 AACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGA
> EAS51_64:5:257:96...
> [16] 35 GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG
> EAS54_61:4:143:69...
>
> > seq2
> A DNAStringSet instance of length 25
> width seq names
> [1] 36 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA
> B7_591:8:4:841:340
> [2] 35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA
> EAS54_67:4:142:94...
> [3] 35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA
> EAS54_67:6:43:859...
> [4] 35 TCAAATGAACTTCTGTAATTGAAAAATTCATTTAA
> EAS1_93:2:286:923...
> [5] 35 AATGAACTTCTGTAATTGAAAAATTCATTTAAGAA
> EAS1_99:8:117:578...
> ... ... ...
> [21] 35 AAATTCATTTAAGAAATTACAAAATATAGTTGAAA
> EAS54_65:8:305:81...
> [22] 35 AATTCATTTAAGAAATTACAAAATATAGTTGAAAG
> EAS114_26:7:13:17...
> [23] 35 CATTTAAGAAATTACAAAATATAGTTGAAAGCTCT
> EAS56_63:7:34:334...
> [24] 35 TTAAGAAATTACAAAATATAGTTGAAAGCTCTAAC
> EAS114_45:3:32:13...
> [25] 40 TAAGAAATTACAAAATATAGTTGAAAGCTCTAACAATAGA
> EAS139_19:5:70:31...
>
> If I use the basic parameters for trimLRPatterns on seq1:
>
> > trimmed_reads1<-trimLRPatterns(Lpattern=adaptor, subject=seq1)
>
> My expected sequence widths after trimming are:
> >
width(seq1)-(length(adaptor)-start(gal)[which(seqnames=='seq1')]+1)
> [1] 0 1 3 5 7 11 12 13 16 20 20 23 26 27 29 34
>
> Actually trimmed sequence widths:
> > width(trimmed_reads1)
> [1] 0 1 3 35 7 11 12 13 16 20 20 23 26 27 29 34
>
> Here, the 4th sequence isn't trimmed because of some mismatches in
the
> sequence. I expect that my data will have some mismatches, so I need
to be
> able to work with those. So I tried setting the max.Lmismatch=5 to
allow for
> up to 5 mismatches, but got no trimming:
>
> > trimmed_reads2<-trimLRPatterns(Lpattern=adaptor, subject=seq1,
> > max.Lmismatch=5)
> > width(trimmed_reads2)
> [1] 0 35 35 36 35 35 36 35 35 35 35 36 35 35 35 35
>
> In my experimenting, I had gone straight to this and saw that it
didn't trim
> partial sequence with these settings, so I had concluded that it
wouldn't do
> the job. After playing with it, I tried adjusting trimLRPatterns=0.5
to allow
> 50% mismatches (if I understand that correctly) and this time get
too much
> trimming on the last sequence:
>
> > trimmed_reads3<-trimLRPatterns(Lpattern=adaptor, subject=seq1,
> > max.Lmismatch=0.5)
> > width(trimmed_reads3)
> [1] 0 1 3 5 7 11 12 13 16 20 20 23 26 27 29 27
>
> I tried setting max.Lmismatch =0.1 but got too little trimming
again:
>
> > trimmed_reads4<-trimLRPatterns(Lpattern=adaptor, subject=seq1,
> > max.Lmismatch=0.1)
> > width(trimmed_reads4)
> [1] 0 1 3 35 7 11 12 13 16 20 20 23 26 27 29 34
>
> I could probably optimize until I got it just right, but with a
large data set that
> might not be too practical.
> If I try these same settings against seq2, where there is not
supposed to be
> overlap with the adaptor, I get erroneous trimming with mismatch
=0.5. With
> mismatch = 0.1, it is better, but it decides to trim off the first
nucleotide on a
> few of them:
>
> > trimmed_reads5<-trimLRPatterns(Lpattern=adaptor, subject=seq2,
> > max.Lmismatch=0.5)
trimmed_reads6<-trimLRPatterns(Lpattern=adaptor,
> > subject=seq2, max.Lmismatch=0.1)
> > width(seq2)
> [1] 36 35 35 35 35 35 35 35 36 35 35 35 35 35 35 35 35 35 35 35 35
35 35 35 40
> > width(trimmed_reads5)
> [1] 27 26 26 27 11 26 26 26 14 13 13 13 28 28 31 23 27 28 29 27 28
29 33 27 40
> > width(trimmed_reads6)
> [1] 36 35 35 35 35 35 35 35 36 35 35 35 34 34 35 35 34 35 35 35 35
35 35 35 40
>
> So what is the best way to optimize this? Obviously you are trying
to balance
> recognizing the whole adaptor sequence without trimming off too many
> false positives. Also, can I set a limit that requires to have at
least n bases
> overlap with the adaptor suffix?
>
> Thanks for your help (again)!
> Sean
>
>
> > -----Original Message-----
> > From: Hervé Pagès [mailto:hpages at fhcrc.org]
> > Sent: Monday, July 22, 2013 5:37 PM
> > To: Taylor, Sean D
> > Cc: bioconductor at r-project.org
> > Subject: Re: Trimming of partial adaptor sequences
> >
> > Hi Sean,
> >
> > On 07/22/2013 01:02 PM, Taylor, Sean D wrote:
> > > We have been experimenting with a NGS protocol in which we
insert
> > > sheared genomic fragments into a custom plasmid for sequencing
on an
> > > Illumina MiSeq instrument. The insertion site of this plasmid is
> > > flanked by our own custom barcodes (N7) and ~80 nt Illumina-
based
> > > adaptor sequence. We then PCR out the insert with barcodes and
> > > adaptors for sequencing. Our adaptor sequence is similar to the
> > > Illumina adaptor, but we use custom primer binding sites. We are
not
> > > sure if the Illumina software will be able to recognize and trim
our
> > > custom adaptors. We are trying to figure out the best way to
trim
> > > read
> > through into the 3'
> > > adaptor ourselves. We have roughly three scenarios:
> > >
> > > (1) The insert is long enough that we have no read through
> > >
> > > (2) The vector is empty, in which case the entire adaptor
sequence
> > > is present
> > >
> > > (3) The insert is long enough to have useful data, but we get
> > > read-through into the 3' adaptor sequence that must be trimmed.
> > >
> > > The solution we are currently working on is to identify the
minimal
> > > sequence that is recognizable as the adaptor sequence and trim
that
> > > using trimLRPatterns() in the Biostrings package. Ideally we
would
> > > like it if we could give trimLRPatterns() the entire adaptor
> > > sequence and have it recognize it on our reads even if it is
only partially
> present.
> >
> > May be I misunderstand what you are trying to do exactly but yes,
you
> > can give the entire adaptor sequence to trimLRPatterns() and it
will
> > recognize it on our reads even if it's only partially present:
> >
> > library(Biostrings)
> >
> > adaptor <- DNAString("ACCAGGACAA") # entire adaptor
> > reads <- DNAStringSet(c(
> > "GACAATTTATTT", # adaptor partially present on the left
> > "CAATTTATTTGC", # adaptor partially present on the left
> > "TTTATTTACCAG", # adaptor partially present on the right
> > "CAATTTTTTACC" # adaptor partially present on both ends
> > ))
> >
> > Then:
> >
> > > trimLRPatterns(Lpattern=adaptor, Rpattern=adaptor,
subject=reads)
> > A DNAStringSet instance of length 4
> > width seq
> > [1] 7 TTTATTT
> > [2] 9 TTTATTTGC
> > [3] 7 TTTATTT
> > [4] 6 TTTTTT
> >
> > Note that trimLRPatterns() expects that, when the adaptor is
partially
> > present on the left (resp. right), what's present is a suffix
(resp.
> > prefix) of the adaptor, and not an arbitrary substring of it. Is
it
> > what you expect too?
> >
> > Thanks,
> > H.
> >
> > > However, in my experimenting it did not seem to be able to this.
I
> > > thought I would ask the Bioconductor community if there are any
> > > better solutions to recognizing and trimming partial adaptor
sequences.
> > >
> > > Thanks in advance for any input.
> > >
> > > Sean Taylor
> > >
> > > Post-doctoral Fellow
> > >
> > > Fred Hutchinson Cancer Research Center
> > >
> > > 206-667-5544
> > >
> >
> > --
> > Hervé Pagès
> >
> > Program in Computational Biology
> > Division of Public Health Sciences
> > Fred Hutchinson Cancer Research Center
> > 1100 Fairview Ave. N, M1-B514
> > P.O. Box 19024
> > Seattle, WA 98109-1024
> >
> > E-mail: hpages at fhcrc.org
> > Phone: (206) 667-5791
> > Fax: (206) 667-1319