Question

Biostrings::matchPattern,extract sequences

0

Entering edit mode

Julie Zhu ★ 4.3k

@julie-zhu-3596

Last seen 6 months ago

United States

Herve, Is there a more elegant way to get all matched reference sequences besides using subject(matches)[start:end], e.g, subject(matches)[3010894 : 3010916] for each matched record? Thanks! 23-letter "DNAString" instance seq: GCGGAGCCTGAGGCAGAAACCTC matches is the object returned by matchPattern function call. matches Views on a 197195432-letter DNAString subject subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNN...CCTATTCTAGTAAAAATTTTATTTCATTCTGTAAAGAATTTGGTATTAAACTTAAAACTGGA ATTC views: start end width [1] 3010894 3010916 23 [GCGGAGCCTGAGGCAGAAACCTC] [2] 3299593 3299615 23 [GCTGTGGCTGAGATGAATACTGG] [3] 3637189 3637211 23 [CCTGCTTCTGCCTCTGCCACCGG] [4] 3660740 3660762 23 [GCTGTTGCTGCCGCTGTTGGTGG] [5] 3661169 3661191 23 [GCTGCCCCGGCCGCCGCCGCCCG] [6] 3661721 3661743 23 [CCCGCGGCTGCAGCACGAGCCGC] .... Best regards, Julie sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] grid parallel stats graphics grDevices utils datasets methods base other attached packages: [1] BSgenome.Mmusculus.UCSC.mm9_1.3.19 BiocInstaller_1.10.3 REDseq_1.6.0 [4] ChIPpeakAnno_2.8.0 GenomicFeatures_1.12.3 limma_3.16.7 [7] org.Hs.eg.db_2.9.0 GO.db_2.9.0 RSQLite_0.11.4 [10] DBI_0.2-7 AnnotationDbi_1.22.6 BSgenome.Ecoli.NCBI.20080805_1.3.17 [13] biomaRt_2.16.0 VennDiagram_1.6.5 multtest_2.16.0 [16] Biobase_2.20.1 BSgenome.Celegans.UCSC.ce2_1.3.19 BSgenome_1.28.0 [19] ShortRead_1.18.0 latticeExtra_0.6-26 RColorBrewer_1.0-5 [22] Rsamtools_1.12.4 lattice_0.20-23 Biostrings_2.28.0 [25] GenomicRanges_1.12.5 IRanges_1.18.3 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] bitops_1.0-6 hwriter_1.3 MASS_7.3-29 RCurl_1.95-4.1 rtracklayer_1.20.4 splines_3.0.1 stats4_3.0.1 [8] survival_2.37-4 tools_3.0.1 XML_3.95-0.2 zlibbioc_1.6.0 [[alternative HTML version deleted]]

GO BSgenome BSgenome GO BSgenome BSgenome • 2.9k views

ADD COMMENT • link updated 10.6 years ago by Hervé Pagès 16k • written 10.6 years ago by Julie Zhu ★ 4.3k

score 0 · Answer 1 · 2013-09-17

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 29 minutes ago

Seattle, WA, United States

Hi Julie, Sorry for the late answer. On 09/11/2013 02:43 PM, Zhu, Lihua (Julie) wrote: > Herve, > > Is there a more elegant way to get all matched reference sequences > besides using subject(matches)[start:end], e.g, subject(matches)[3010894 > : 3010916] for each matched record? Thanks! > 23-letter "DNAString" instance > seq: GCGGAGCCTGAGGCAGAAACCTC > > matches is the object returned by matchPattern function call. > > matches > Views on a 197195432-letter DNAString subject > subject: > NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN.. .CCTATTCTAGTAAAAATTTTATTTCATTCTGTAAAGAATTTGGTATTAAACTTAAAACTGGAATTC > views: > start end width > [1] 3010894 3010916 23 [GCGGAGCCTGAGGCAGAAACCTC] > [2] 3299593 3299615 23 [GCTGTGGCTGAGATGAATACTGG] > [3] 3637189 3637211 23 [CCTGCTTCTGCCTCTGCCACCGG] > [4] 3660740 3660762 23 [GCTGTTGCTGCCGCTGTTGGTGG] > [5] 3661169 3661191 23 [GCTGCCCCGGCCGCCGCCGCCCG] > [6] 3661721 3661743 23 [CCCGCGGCTGCAGCACGAGCCGC] > .... Just turn this into a DNAStringSet object with a coercion: as(matches, "DNAStringSet") or by calling the DNAStringSet() constructor on it: DNAStringSet(matches) Cheers, H. > > Best regards, > > Julie > > sessionInfo() > R version 3.0.1 (2013-05-16) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] grid parallel stats graphics grDevices utils datasets > methods base > > other attached packages: > [1] BSgenome.Mmusculus.UCSC.mm9_1.3.19 BiocInstaller_1.10.3 > REDseq_1.6.0 > [4] ChIPpeakAnno_2.8.0 GenomicFeatures_1.12.3 > limma_3.16.7 > [7] org.Hs.eg.db_2.9.0 GO.db_2.9.0 > RSQLite_0.11.4 > [10] DBI_0.2-7 AnnotationDbi_1.22.6 > BSgenome.Ecoli.NCBI.20080805_1.3.17 > [13] biomaRt_2.16.0 VennDiagram_1.6.5 > multtest_2.16.0 > [16] Biobase_2.20.1 > BSgenome.Celegans.UCSC.ce2_1.3.19 BSgenome_1.28.0 > [19] ShortRead_1.18.0 latticeExtra_0.6-26 > RColorBrewer_1.0-5 > [22] Rsamtools_1.12.4 lattice_0.20-23 > Biostrings_2.28.0 > [25] GenomicRanges_1.12.5 IRanges_1.18.3 > BiocGenerics_0.6.0 > > loaded via a namespace (and not attached): > [1] bitops_1.0-6 hwriter_1.3 MASS_7.3-29 > RCurl_1.95-4.1 rtracklayer_1.20.4 splines_3.0.1 > stats4_3.0.1 > [8] survival_2.37-4 tools_3.0.1 XML_3.95-0.2 > zlibbioc_1.6.0 -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 10.6 years ago Hervé Pagès 16k

0

Entering edit mode

Cool. Thanks, Herve! Is there a method to extract the mismatch position for all the matches? Right now, I am using pairwiseAlignment for each matched subsequence. However, this could become very slow when the number of matched sequences gets large. Best regards, Julie On 9/17/13 6:52 PM, "Hervé Pagès" <hpages at="" fhcrc.org=""> wrote: > Hi Julie, > > Sorry for the late answer. > > On 09/11/2013 02:43 PM, Zhu, Lihua (Julie) wrote: >> Herve, >> >> Is there a more elegant way to get all matched reference sequences >> besides using subject(matches)[start:end], e.g, subject(matches)[3010894 >> : 3010916] for each matched record? Thanks! >> 23-letter "DNAString" instance >> seq: GCGGAGCCTGAGGCAGAAACCTC >> >> matches is the object returned by matchPattern function call. >> >> matches >> Views on a 197195432-letter DNAString subject >> subject: >> NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN. ..CCTATTCT >> AGTAAAAATTTTATTTCATTCTGTAAAGAATTTGGTATTAAACTTAAAACTGGAATTC >> views: >> start end width >> [1] 3010894 3010916 23 [GCGGAGCCTGAGGCAGAAACCTC] >> [2] 3299593 3299615 23 [GCTGTGGCTGAGATGAATACTGG] >> [3] 3637189 3637211 23 [CCTGCTTCTGCCTCTGCCACCGG] >> [4] 3660740 3660762 23 [GCTGTTGCTGCCGCTGTTGGTGG] >> [5] 3661169 3661191 23 [GCTGCCCCGGCCGCCGCCGCCCG] >> [6] 3661721 3661743 23 [CCCGCGGCTGCAGCACGAGCCGC] >> .... > > Just turn this into a DNAStringSet object with a coercion: > > as(matches, "DNAStringSet") > > or by calling the DNAStringSet() constructor on it: > > DNAStringSet(matches) > > Cheers, > H. > >> >> Best regards, >> >> Julie >> >> sessionInfo() >> R version 3.0.1 (2013-05-16) >> Platform: x86_64-apple-darwin10.8.0 (64-bit) >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >> >> attached base packages: >> [1] grid parallel stats graphics grDevices utils datasets >> methods base >> >> other attached packages: >> [1] BSgenome.Mmusculus.UCSC.mm9_1.3.19 BiocInstaller_1.10.3 >> REDseq_1.6.0 >> [4] ChIPpeakAnno_2.8.0 GenomicFeatures_1.12.3 >> limma_3.16.7 >> [7] org.Hs.eg.db_2.9.0 GO.db_2.9.0 >> RSQLite_0.11.4 >> [10] DBI_0.2-7 AnnotationDbi_1.22.6 >> BSgenome.Ecoli.NCBI.20080805_1.3.17 >> [13] biomaRt_2.16.0 VennDiagram_1.6.5 >> multtest_2.16.0 >> [16] Biobase_2.20.1 >> BSgenome.Celegans.UCSC.ce2_1.3.19 BSgenome_1.28.0 >> [19] ShortRead_1.18.0 latticeExtra_0.6-26 >> RColorBrewer_1.0-5 >> [22] Rsamtools_1.12.4 lattice_0.20-23 >> Biostrings_2.28.0 >> [25] GenomicRanges_1.12.5 IRanges_1.18.3 >> BiocGenerics_0.6.0 >> >> loaded via a namespace (and not attached): >> [1] bitops_1.0-6 hwriter_1.3 MASS_7.3-29 >> RCurl_1.95-4.1 rtracklayer_1.20.4 splines_3.0.1 >> stats4_3.0.1 >> [8] survival_2.37-4 tools_3.0.1 XML_3.95-0.2 >> zlibbioc_1.6.0

ADD REPLY • link 10.6 years ago Julie Zhu ★ 4.3k

0

Entering edit mode

On 09/17/2013 04:51 PM, Zhu, Lihua (Julie) wrote: > Cool. Thanks, Herve! > > Is there a method to extract the mismatch position for all the matches? > Right now, I am using pairwiseAlignment for each matched subsequence. No easy way at the moment (and this has been on the TODO list for a while). A faster workaround than pairwiseAlignment() if you don't use with.indels=TRUE or fixed=FALSE is: is_mismatch <- relist(as.raw(unlist(matches)) != as.raw(DNAString(pattern)), matches) 'is_mismatch' should be a LogicalList object with the same shape as 'matches'. Then you can do 'which(is_mismatch)' to get the mismatch positions. The result will be an IntegerList object. HTH, H. > However, this could become very slow when the number of matched sequences > gets large. > > > Best regards, > > Julie > > > On 9/17/13 6:52 PM, "Hervé Pagès" <hpages at="" fhcrc.org=""> wrote: > >> Hi Julie, >> >> Sorry for the late answer. >> >> On 09/11/2013 02:43 PM, Zhu, Lihua (Julie) wrote: >>> Herve, >>> >>> Is there a more elegant way to get all matched reference sequences >>> besides using subject(matches)[start:end], e.g, subject(matches)[3010894 >>> : 3010916] for each matched record? Thanks! >>> 23-letter "DNAString" instance >>> seq: GCGGAGCCTGAGGCAGAAACCTC >>> >>> matches is the object returned by matchPattern function call. >>> >>> matches >>> Views on a 197195432-letter DNAString subject >>> subject: >>> NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN ...CCTATTCT >>> AGTAAAAATTTTATTTCATTCTGTAAAGAATTTGGTATTAAACTTAAAACTGGAATTC >>> views: >>> start end width >>> [1] 3010894 3010916 23 [GCGGAGCCTGAGGCAGAAACCTC] >>> [2] 3299593 3299615 23 [GCTGTGGCTGAGATGAATACTGG] >>> [3] 3637189 3637211 23 [CCTGCTTCTGCCTCTGCCACCGG] >>> [4] 3660740 3660762 23 [GCTGTTGCTGCCGCTGTTGGTGG] >>> [5] 3661169 3661191 23 [GCTGCCCCGGCCGCCGCCGCCCG] >>> [6] 3661721 3661743 23 [CCCGCGGCTGCAGCACGAGCCGC] >>> .... >> >> Just turn this into a DNAStringSet object with a coercion: >> >> as(matches, "DNAStringSet") >> >> or by calling the DNAStringSet() constructor on it: >> >> DNAStringSet(matches) >> >> Cheers, >> H. >> >>> >>> Best regards, >>> >>> Julie >>> >>> sessionInfo() >>> R version 3.0.1 (2013-05-16) >>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>> >>> locale: >>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >>> >>> attached base packages: >>> [1] grid parallel stats graphics grDevices utils datasets >>> methods base >>> >>> other attached packages: >>> [1] BSgenome.Mmusculus.UCSC.mm9_1.3.19 BiocInstaller_1.10.3 >>> REDseq_1.6.0 >>> [4] ChIPpeakAnno_2.8.0 GenomicFeatures_1.12.3 >>> limma_3.16.7 >>> [7] org.Hs.eg.db_2.9.0 GO.db_2.9.0 >>> RSQLite_0.11.4 >>> [10] DBI_0.2-7 AnnotationDbi_1.22.6 >>> BSgenome.Ecoli.NCBI.20080805_1.3.17 >>> [13] biomaRt_2.16.0 VennDiagram_1.6.5 >>> multtest_2.16.0 >>> [16] Biobase_2.20.1 >>> BSgenome.Celegans.UCSC.ce2_1.3.19 BSgenome_1.28.0 >>> [19] ShortRead_1.18.0 latticeExtra_0.6-26 >>> RColorBrewer_1.0-5 >>> [22] Rsamtools_1.12.4 lattice_0.20-23 >>> Biostrings_2.28.0 >>> [25] GenomicRanges_1.12.5 IRanges_1.18.3 >>> BiocGenerics_0.6.0 >>> >>> loaded via a namespace (and not attached): >>> [1] bitops_1.0-6 hwriter_1.3 MASS_7.3-29 >>> RCurl_1.95-4.1 rtracklayer_1.20.4 splines_3.0.1 >>> stats4_3.0.1 >>> [8] survival_2.37-4 tools_3.0.1 XML_3.95-0.2 >>> zlibbioc_1.6.0 > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 10.6 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Julie, also check out help("lowlevel-matching"). I used hasLetterAt() as an equivalent test in the example I created below which again requires the match to be the same shape. Marcus mpattern <- pattern <- "AATCGGATC" ## mpattern contains a mismatch subseq(mpattern,2,2) <- "G" subject <- DNAString(paste("ATTCG", pattern, "NATTCGGAGGTC", mpattern, "NNNGN",mpattern, "CAAA", sep="")) matches <- matchPattern(pattern, DNAString(subject), max.mismatch=1) print(matches) Lmismatch <- (!hasLetterAt(as(matches, "DNAStringSet"), pattern, seq(nchar(pattern)))) print(Lmismatch) Lmismatch2 <- relist(as.raw(unlist(matches)) != as.raw(DNAString(pattern)), matches) print(Lmismatch2) ## 'row' is the Matches views index position, 'col' is the nucleotide mismatch position ind <- which(Lmismatch, arr.ind=TRUE) print(ind) ## Sanity check: subseq(as(matches[ind[,1]], "DNAStringSet"), ind[,2], ind[,2]) On Wed, Sep 18, 2013 at 12:08 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > On 09/17/2013 04:51 PM, Zhu, Lihua (Julie) wrote: > >> Cool. Thanks, Herve! >> >> Is there a method to extract the mismatch position for all the matches? >> Right now, I am using pairwiseAlignment for each matched subsequence. >> > > No easy way at the moment (and this has been on the TODO list for a > while). A faster workaround than pairwiseAlignment() if you don't use > with.indels=TRUE or fixed=FALSE is: > > is_mismatch <- relist(as.raw(unlist(matches)) != > as.raw(DNAString(pattern)), matches) > > 'is_mismatch' should be a LogicalList object with the same shape as > 'matches'. Then you can do 'which(is_mismatch)' to get the mismatch > positions. The result will be an IntegerList object. > > HTH, > > H. > > However, this could become very slow when the number of matched sequences >> gets large. >> >> >> Best regards, >> >> Julie >> >> >> On 9/17/13 6:52 PM, "Hervé Pagès" <hpages@fhcrc.org> wrote: >> >> Hi Julie, >>> >>> Sorry for the late answer. >>> >>> On 09/11/2013 02:43 PM, Zhu, Lihua (Julie) wrote: >>> >>>> Herve, >>>> >>>> Is there a more elegant way to get all matched reference sequences >>>> besides using subject(matches)[start:end], e.g, subject(matches)[3010894 >>>> : 3010916] for each matched record? Thanks! >>>> 23-letter "DNAString" instance >>>> seq: GCGGAGCCTGAGGCAGAAACCTC >>>> >>>> matches is the object returned by matchPattern function call. >>>> >>>> matches >>>> Views on a 197195432-letter DNAString subject >>>> subject: >>>> NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN**NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN** >>>> NNNNNN...CCTATTCT >>>> AGTAAAAATTTTATTTCATTCTGTAAAGAA**TTTGGTATTAAACTTAAAACTGGAATTC >>>> views: >>>> start end width >>>> [1] 3010894 3010916 23 [GCGGAGCCTGAGGCAGAAACCTC] >>>> [2] 3299593 3299615 23 [GCTGTGGCTGAGATGAATACTGG] >>>> [3] 3637189 3637211 23 [CCTGCTTCTGCCTCTGCCACCGG] >>>> [4] 3660740 3660762 23 [GCTGTTGCTGCCGCTGTTGGTGG] >>>> [5] 3661169 3661191 23 [GCTGCCCCGGCCGCCGCCGCCCG] >>>> [6] 3661721 3661743 23 [CCCGCGGCTGCAGCACGAGCCGC] >>>> .... >>>> >>> >>> Just turn this into a DNAStringSet object with a coercion: >>> >>> as(matches, "DNAStringSet") >>> >>> or by calling the DNAStringSet() constructor on it: >>> >>> DNAStringSet(matches) >>> >>> Cheers, >>> H. >>> >>> >>>> Best regards, >>>> >>>> Julie >>>> >>>> sessionInfo() >>>> R version 3.0.1 (2013-05-16) >>>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>>> >>>> locale: >>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.**UTF-8/C/en_US.UTF-8/en_US.UTF-**8 >>>> >>>> attached base packages: >>>> [1] grid parallel stats graphics grDevices utils datasets >>>> methods base >>>> >>>> other attached packages: >>>> [1] BSgenome.Mmusculus.UCSC.mm9_1.**3.19 BiocInstaller_1.10.3 >>>> REDseq_1.6.0 >>>> [4] ChIPpeakAnno_2.8.0 GenomicFeatures_1.12.3 >>>> limma_3.16.7 >>>> [7] org.Hs.eg.db_2.9.0 GO.db_2.9.0 >>>> RSQLite_0.11.4 >>>> [10] DBI_0.2-7 AnnotationDbi_1.22.6 >>>> BSgenome.Ecoli.NCBI.20080805_**1.3.17 >>>> [13] biomaRt_2.16.0 VennDiagram_1.6.5 >>>> multtest_2.16.0 >>>> [16] Biobase_2.20.1 >>>> BSgenome.Celegans.UCSC.ce2_1.**3.19 >>>> BSgenome_1.28.0 >>>> [19] ShortRead_1.18.0 latticeExtra_0.6-26 >>>> RColorBrewer_1.0-5 >>>> [22] Rsamtools_1.12.4 lattice_0.20-23 >>>> Biostrings_2.28.0 >>>> [25] GenomicRanges_1.12.5 IRanges_1.18.3 >>>> BiocGenerics_0.6.0 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] bitops_1.0-6 hwriter_1.3 MASS_7.3-29 >>>> RCurl_1.95-4.1 rtracklayer_1.20.4 splines_3.0.1 >>>> stats4_3.0.1 >>>> [8] survival_2.37-4 tools_3.0.1 XML_3.95-0.2 >>>> zlibbioc_1.6.0 >>>> >>> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > [[alternative HTML version deleted]]

ADD REPLY • link 10.6 years ago Marcus Davy ▴ 390

0

Entering edit mode

Herve and Marcus, Thank you both for the great suggestions! Best regards, Julie On 9/17/13 8:27 PM, "Marcus Davy" <mdavy86@gmail.com> wrote: Hi Julie, also check out help("lowlevel-matching"). I used hasLetterAt() as an equivalent test in the example I created below which again requires the match to be the same shape. Marcus mpattern <- pattern <- "AATCGGATC" ## mpattern contains a mismatch subseq(mpattern,2,2) <- "G" subject <- DNAString(paste("ATTCG", pattern, "NATTCGGAGGTC", mpattern, "NNNGN",mpattern, "CAAA", sep="")) matches <- matchPattern(pattern, DNAString(subject), max.mismatch=1) print(matches) Lmismatch <- (!hasLetterAt(as(matches, "DNAStringSet"), pattern, seq(nchar(pattern)))) print(Lmismatch) Lmismatch2 <- relist(as.raw(unlist(matches)) != as.raw(DNAString(pattern)), matches) print(Lmismatch2) ## 'row' is the Matches views index position, 'col' is the nucleotide mismatch position ind <- which(Lmismatch, arr.ind=TRUE) print(ind) ## Sanity check: subseq(as(matches[ind[,1]], "DNAStringSet"), ind[,2], ind[,2]) On Wed, Sep 18, 2013 at 12:08 PM, Hervé Pagès <hpages@fhcrc.org> wrote: On 09/17/2013 04:51 PM, Zhu, Lihua (Julie) wrote: Cool. Thanks, Herve! Is there a method to extract the mismatch position for all the matches? Right now, I am using pairwiseAlignment for each matched subsequence. No easy way at the moment (and this has been on the TODO list for a while). A faster workaround than pairwiseAlignment() if you don't use with.indels=TRUE or fixed=FALSE is: is_mismatch <- relist(as.raw(unlist(matches)) != as.raw(DNAString(pattern)), matches) 'is_mismatch' should be a LogicalList object with the same shape as 'matches'. Then you can do 'which(is_mismatch)' to get the mismatch positions. The result will be an IntegerList object. HTH, H. However, this could become very slow when the number of matched sequences gets large. Best regards, Julie On 9/17/13 6:52 PM, "Hervé Pagès" <hpages@fhcrc.org> wrote: Hi Julie, Sorry for the late answer. On 09/11/2013 02:43 PM, Zhu, Lihua (Julie) wrote: Herve, Is there a more elegant way to get all matched reference sequences besides using subject(matches)[start:end], e.g, subject(matches)[3010894 : 3010916] for each matched record? Thanks! 23-letter "DNAString" instance seq: GCGGAGCCTGAGGCAGAAACCTC matches is the object returned by matchPattern function call. matches Views on a 197195432-letter DNAString subject subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...C CTATTCT AGTAAAAATTTTATTTCATTCTGTAAAGAATTTGGTATTAAACTTAAAACTGGAATTC views: start end width [1] 3010894 3010916 23 [GCGGAGCCTGAGGCAGAAACCTC] [2] 3299593 3299615 23 [GCTGTGGCTGAGATGAATACTGG] [3] 3637189 3637211 23 [CCTGCTTCTGCCTCTGCCACCGG] [4] 3660740 3660762 23 [GCTGTTGCTGCCGCTGTTGGTGG] [5] 3661169 3661191 23 [GCTGCCCCGGCCGCCGCCGCCCG] [6] 3661721 3661743 23 [CCCGCGGCTGCAGCACGAGCCGC] .... Just turn this into a DNAStringSet object with a coercion: as(matches, "DNAStringSet") or by calling the DNAStringSet() constructor on it: DNAStringSet(matches) Cheers, H. Best regards, Julie sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] grid parallel stats graphics grDevices utils datasets methods base other attached packages: [1] BSgenome.Mmusculus.UCSC.mm9_1.3.19 BiocInstaller_1.10.3 REDseq_1.6.0 [4] ChIPpeakAnno_2.8.0 GenomicFeatures_1.12.3 limma_3.16.7 [7] org.Hs.eg.db_2.9.0 GO.db_2.9.0 RSQLite_0.11.4 [10] DBI_0.2-7 AnnotationDbi_1.22.6 BSgenome.Ecoli.NCBI.20080805_1.3.17 [13] biomaRt_2.16.0 VennDiagram_1.6.5 multtest_2.16.0 [16] Biobase_2.20.1 BSgenome.Celegans.UCSC.ce2_1.3.19 BSgenome_1.28.0 [19] ShortRead_1.18.0 latticeExtra_0.6-26 RColorBrewer_1.0-5 [22] Rsamtools_1.12.4 lattice_0.20-23 Biostrings_2.28.0 [25] GenomicRanges_1.12.5 IRanges_1.18.3 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] bitops_1.0-6 hwriter_1.3 MASS_7.3-29 RCurl_1.95-4.1 rtracklayer_1.20.4 splines_3.0.1 stats4_3.0.1 [8] survival_2.37-4 tools_3.0.1 XML_3.95-0.2 zlibbioc_1.6.0 [[alternative HTML version deleted]]

ADD REPLY • link 10.6 years ago Julie Zhu ★ 4.3k