fast iterator over DNAString's?

0

Entering edit mode

Paul Shannon ★ 1.1k

@paul-shannon-578

Last seen 11.3 years ago

I wish to trim a variable length sequence from the end of many thousands of DNAStrings in a DNAStringSet. The sequence to be trimmed is any recognizable chunk of a solexa short read adapter, which ends up on the end of, for example, 22nt miRNAs. The adapter chunk might be found in the middle of a 35 base read, or it might be closer to the end. In every case, I want to delete every base from the start of the adapter chunk to the end of the read. I imagine there might be a BString operation equivalent to sed. See could be used ike this: echo 'CGAAGCGGGATGATCTATCTCGTATGCCGTCTTCT' | sed s/TCGTATGCCGTC.*$// --> GAAGCGGGATGATCTATC (where TCGTATGCCGTC is only part of the 21-base adapter, but is probably a long enough portion to be representative) Any way to do this with BStrings and friends? Thanks! - Paul

• 1.9k views

ADD COMMENT • link updated 15.8 years ago by Paul Leo ▴ 970 • written 15.8 years ago by Paul Shannon ★ 1.1k

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 5 weeks ago

United States

Hi Paul, On Wed, Mar 10, 2010 at 7:30 PM, Paul Shannon <pshannon at="" systemsbiology.org=""> wrote: > I wish to trim a variable length sequence from the end of many thousands of DNAStrings in a DNAStringSet. > > The sequence to be trimmed is any recognizable chunk of a solexa short read adapter, which ends up on the end of, for example, 22nt miRNAs. ?The adapter chunk might be found in the middle of a 35 base read, or it might be closer to the end. ?In every case, I want to delete every base from the start of the adapter chunk to the end of the read. > > I imagine there might be a BString operation equivalent to sed. ?See could be used ike this: > > ?echo 'CGAAGCGGGATGATCTATCTCGTATGCCGTCTTCT' | sed s/TCGTATGCCGTC.*$// ? ? ?--> GAAGCGGGATGATCTATC > > (where TCGTATGCCGTC is only part of the 21-base adapter, but is probably a long enough portion to be representative) > > Any way to do this with BStrings and friends? There are a couple of ways you can go about this. Before discovering the Biostrings::trimLRPatterns function, I rigged together something to do this like so: 1. use a call to vmatchPattern(my.adapter.sequence, myDNAStringSet) with some fast&loose values for max.mismatch to find the position in each read in myDNAStringSet that the adapter might be in. 2. you can call the "startIndex" function on the object returned my vmatchPattern to get a vector of start positions to cut against your dnastringset. I'll leave it as an exercise to the reader to stitch that together, but as I mentioned before, I think this is what the trimLRPatterns function is supposed to do, so you might just want to start/play with that. Also, there was some discussion about doing this in the bioc-sig-sequencing mailing list. You might want to subscribe to that and search the archive for some inspiration: https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing Hope that helps, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD COMMENT • link 15.8 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 1 hour ago

Seattle, WA, United States

Hi Paul, Paul Shannon wrote: > I wish to trim a variable length sequence from the end of many thousands of DNAStrings in a DNAStringSet. > > The sequence to be trimmed is any recognizable chunk of a solexa short read adapter, which ends up on the end of, for example, 22nt miRNAs. The adapter chunk might be found in the middle of a 35 base read, or it might be closer to the end. In every case, I want to delete every base from the start of the adapter chunk to the end of the read. > > I imagine there might be a BString operation equivalent to sed. See could be used ike this: > > echo 'CGAAGCGGGATGATCTATCTCGTATGCCGTCTTCT' | sed s/TCGTATGCCGTC.*$// --> GAAGCGGGATGATCTATC With standard string functions: x <- "CGAAGCGGGATGATCTATCTCGTATGCCGTCTTCT" pattern <- "TCGTATGCCGTC" > sub(paste(pattern, ".*$", sep=""), "", x) [1] "CGAAGCGGGATGATCTATC" Note that the 1st letter ("C") has disappeared in your example. > > (where TCGTATGCCGTC is only part of the 21-base adapter, but is probably a long enough portion to be representative) > > Any way to do this with BStrings and friends? Here is the solution to Steve's exercise. As he suggested, the building blocks are vmatchPattern(), startIndex() and subseq<-(): lefttrimFromFirstPatternStart <- function(pattern, subject) { mi <- vmatchPattern(pattern, subject) start_at <- sapply(startIndex(mi), function(start) start[1L]) start_at[is.na(start_at)] <- width(subject)[is.na(start_at)] + 1L subseq(subject, start=start_at) <- "" subject } Note that it could be that the pattern has 0 or more than 1 hit in a given subject element. The above function tries to accommodate this. Then: x <- DNAStringSet(c(x, pasterep.int("A", 35), collapse=""))) > x A DNAStringSet instance of length 2 width seq [1] 35 CGAAGCGGGATGATCTATCTCGTATGCCGTCTTCT [2] 35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA > lefttrimFromFirstPatternStart(pattern, x) A DNAStringSet instance of length 2 width seq [1] 19 CGAAGCGGGATGATCTATC [2] 35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA Also note that vmatchPattern() lets the user have some control over mismatches/indels (see ?vmatchPattern) so you can add this level of control to lefttrimFromFirstPatternStart() too by adding the 'max.mismatch', 'min.mismatch', 'with.indels' and 'fixed' arguments to it (with the same default values as in vmatchPattern) and just passing their values to vmatchPattern(). Cheers, H. > > Thanks! > > - Paul > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 15.8 years ago Hervé Pagès 16k

0

Entering edit mode

One more thing. If you want exact matches, the sub() solution is *much* faster. Here the main advantage of using a Biostring-based solution like lefttrimFromFirstPatternStart() is if you want to support mismatches/indels, which, AFAIK, can't easily be described with a regular expression. H. Hervé Pagès wrote: > Hi Paul, > > Paul Shannon wrote: >> I wish to trim a variable length sequence from the end of many >> thousands of DNAStrings in a DNAStringSet. >> The sequence to be trimmed is any recognizable chunk of a solexa short >> read adapter, which ends up on the end of, for example, 22nt miRNAs. >> The adapter chunk might be found in the middle of a 35 base read, or >> it might be closer to the end. In every case, I want to delete every >> base from the start of the adapter chunk to the end of the read. >> >> I imagine there might be a BString operation equivalent to sed. See >> could be used ike this: >> >> echo 'CGAAGCGGGATGATCTATCTCGTATGCCGTCTTCT' | sed >> s/TCGTATGCCGTC.*$// --> GAAGCGGGATGATCTATC > > With standard string functions: > > x <- "CGAAGCGGGATGATCTATCTCGTATGCCGTCTTCT" > pattern <- "TCGTATGCCGTC" > > > sub(paste(pattern, ".*$", sep=""), "", x) > [1] "CGAAGCGGGATGATCTATC" > > Note that the 1st letter ("C") has disappeared in your example. > >> >> (where TCGTATGCCGTC is only part of the 21-base adapter, but is >> probably a long enough portion to be representative) >> >> Any way to do this with BStrings and friends? > > Here is the solution to Steve's exercise. As he suggested, the > building blocks are vmatchPattern(), startIndex() and subseq<-(): > > lefttrimFromFirstPatternStart <- function(pattern, subject) > { > mi <- vmatchPattern(pattern, subject) > start_at <- sapply(startIndex(mi), function(start) start[1L]) > start_at[is.na(start_at)] <- width(subject)[is.na(start_at)] + 1L > subseq(subject, start=start_at) <- "" > subject > } > > Note that it could be that the pattern has 0 or more than 1 hit in a > given subject element. The above function tries to accommodate this. > > Then: > > x <- DNAStringSet(c(x, pasterep.int("A", 35), collapse=""))) > > > x > A DNAStringSet instance of length 2 > width seq > [1] 35 CGAAGCGGGATGATCTATCTCGTATGCCGTCTTCT > [2] 35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA > > > lefttrimFromFirstPatternStart(pattern, x) > A DNAStringSet instance of length 2 > width seq > [1] 19 CGAAGCGGGATGATCTATC > [2] 35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA > > Also note that vmatchPattern() lets the user have some control over > mismatches/indels (see ?vmatchPattern) so you can add this level > of control to lefttrimFromFirstPatternStart() too by adding > the 'max.mismatch', 'min.mismatch', 'with.indels' and 'fixed' > arguments to it (with the same default values as in vmatchPattern) > and just passing their values to vmatchPattern(). > > Cheers, > H. > > >> >> Thanks! >> >> - Paul >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 15.8 years ago Hervé Pagès 16k

0

Entering edit mode

Paul Leo ▴ 970

@paul-leo-2092

Last seen 11.3 years ago

Think there has been some discussion about trimming specifically on Bioc-sig-seq (recently about 08/02/10) ago so you might want to look at that. IN GENERAL you want to use a "iterate" fast , use XStringViews, then it goes very fast! For example for motif searching in many regions: myb<-"YAACKG" myb.dna<-DNAString(myb) myb.dna.comp.rev<-reverseComplement(myb.dna) all.genomic<-getSeq(Mmusculus, the.chrom, starts, ends) # get some sequences can read fasta etc length(all.genomic) > [1] 81639 > class(all.genomic) [1] "character" > starts[1:5] [1] 3187620 3487470 3777371 4144203 4274129 > ends[1:5] [1] 3187780 3487630 3777531 4144363 4274289 > the.chrom[1:5] [1] "chr1" "chr1" "chr1" "chr1" "chr1" system.time(x<- XStringViews(all.genomic, "DNAString")) x.labels<-paste(the.chrom,starts,ends,sep=":") # give the sequences/reads names if you want names(x)<-x.labels ###################### forward counts ################# all.matches<-matchPattern(myb.dna,x,max.mismatch=0, fixed=FALSE) # needs a stringView to vectorize the.cov<-coverage(all.matches) counts<-aggregate(the.cov,start=start(x),end=end(x),FUN=sum)/length(my b.dna) ###################################################### ###################### reverse counts ################# all.matches.comp<-matchPattern(myb.dna.comp.rev,x,max.mismatch=0, fixed=FALSE) # needs a stringView to vectorize the.cov.comp<-coverage(all.matches.comp) counts.comp<-aggregate(the.cov.comp,start=start(x),end=end(x),FUN=sum) /length(myb.dna) ###################################################### counts.t<-counts+counts.comp # all sequences/reads with a match ####################### forward posns ################# loc.f=rep(NA,times=dim(peaks)[1]) loc.hit<-start(all.matches) region.start<-start(x) hits<-grep(TRUE,counts>0) # places where a hits is found k<-1 no.hits<-counts[counts>0] for(i in 1:length(hits)){ loc.f[hits[i]]<-toString((loc.hit[k:(k +no.hits[i]-1)]-region.start[hits[i]])+1) k<-k+no.hits[i] } loc.f[1:50] ###################################################### ###################### revearse posns ################# loc.c=rep(NA,times=dim(peaks)[1]) loc.hit<-start(all.matches.comp) region.start<-start(x) hits<-grep(TRUE,counts.comp>0) # places where a hits is found k<-1 no.hits<-counts.comp[counts.comp>0] for(i in 1:length(hits)){ loc.c[hits[i]]<-toString((loc.hit[k:(k +no.hits[i]-1)]-region.start[hits[i]])+1) k<-k+no.hits[i] } ###################################################### loc.f and loc.c are locations of the pattern w.r.t forward direction Any If you want to do trimming see the bioc-sig-seq discussion too....too they probably have a more elegant solution Cheers Paul -----Original Message----- From: Paul Shannon <pshannon@systemsbiology.org> To: bioc <bioconductor@stat.math.ethz.ch> Subject: [BioC] fast iterator over DNAString's? Date: Wed, 10 Mar 2010 16:30:40 -0800 I wish to trim a variable length sequence from the end of many thousands of DNAStrings in a DNAStringSet. The sequence to be trimmed is any recognizable chunk of a solexa short read adapter, which ends up on the end of, for example, 22nt miRNAs. The adapter chunk might be found in the middle of a 35 base read, or it might be closer to the end. In every case, I want to delete every base from the start of the adapter chunk to the end of the read. I imagine there might be a BString operation equivalent to sed. See could be used ike this: echo 'CGAAGCGGGATGATCTATCTCGTATGCCGTCTTCT' | sed s/TCGTATGCCGTC.*$// --> GAAGCGGGATGATCTATC (where TCGTATGCCGTC is only part of the 21-base adapter, but is probably a long enough portion to be representative) Any way to do this with BStrings and friends? Thanks! - Paul _______________________________________________ Bioconductor mailing list Bioconductor@stat.math.ethz.ch https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD COMMENT • link 15.8 years ago Paul Leo ▴ 970

Login before adding your answer.