Biostrings::matchPattern,extract sequences
1
0
Entering edit mode
Julie Zhu ★ 4.3k
@julie-zhu-3596
Last seen 13 months ago
United States
Herve, Is there a more elegant way to get all matched reference sequences besides using subject(matches)[start:end], e.g, subject(matches)[3010894 : 3010916] for each matched record? Thanks! 23-letter "DNAString" instance seq: GCGGAGCCTGAGGCAGAAACCTC matches is the object returned by matchPattern function call. matches Views on a 197195432-letter DNAString subject subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNN...CCTATTCTAGTAAAAATTTTATTTCATTCTGTAAAGAATTTGGTATTAAACTTAAAACTGGA ATTC views: start end width [1] 3010894 3010916 23 [GCGGAGCCTGAGGCAGAAACCTC] [2] 3299593 3299615 23 [GCTGTGGCTGAGATGAATACTGG] [3] 3637189 3637211 23 [CCTGCTTCTGCCTCTGCCACCGG] [4] 3660740 3660762 23 [GCTGTTGCTGCCGCTGTTGGTGG] [5] 3661169 3661191 23 [GCTGCCCCGGCCGCCGCCGCCCG] [6] 3661721 3661743 23 [CCCGCGGCTGCAGCACGAGCCGC] .... Best regards, Julie sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] grid parallel stats graphics grDevices utils datasets methods base other attached packages: [1] BSgenome.Mmusculus.UCSC.mm9_1.3.19 BiocInstaller_1.10.3 REDseq_1.6.0 [4] ChIPpeakAnno_2.8.0 GenomicFeatures_1.12.3 limma_3.16.7 [7] org.Hs.eg.db_2.9.0 GO.db_2.9.0 RSQLite_0.11.4 [10] DBI_0.2-7 AnnotationDbi_1.22.6 BSgenome.Ecoli.NCBI.20080805_1.3.17 [13] biomaRt_2.16.0 VennDiagram_1.6.5 multtest_2.16.0 [16] Biobase_2.20.1 BSgenome.Celegans.UCSC.ce2_1.3.19 BSgenome_1.28.0 [19] ShortRead_1.18.0 latticeExtra_0.6-26 RColorBrewer_1.0-5 [22] Rsamtools_1.12.4 lattice_0.20-23 Biostrings_2.28.0 [25] GenomicRanges_1.12.5 IRanges_1.18.3 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] bitops_1.0-6 hwriter_1.3 MASS_7.3-29 RCurl_1.95-4.1 rtracklayer_1.20.4 splines_3.0.1 stats4_3.0.1 [8] survival_2.37-4 tools_3.0.1 XML_3.95-0.2 zlibbioc_1.6.0 [[alternative HTML version deleted]]
GO BSgenome BSgenome GO BSgenome BSgenome • 3.1k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 8 hours ago
Seattle, WA, United States
Hi Julie, Sorry for the late answer. On 09/11/2013 02:43 PM, Zhu, Lihua (Julie) wrote: > Herve, > > Is there a more elegant way to get all matched reference sequences > besides using subject(matches)[start:end], e.g, subject(matches)[3010894 > : 3010916] for each matched record? Thanks! > 23-letter "DNAString" instance > seq: GCGGAGCCTGAGGCAGAAACCTC > > matches is the object returned by matchPattern function call. > > matches > Views on a 197195432-letter DNAString subject > subject: > NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN.. .CCTATTCTAGTAAAAATTTTATTTCATTCTGTAAAGAATTTGGTATTAAACTTAAAACTGGAATTC > views: > start end width > [1] 3010894 3010916 23 [GCGGAGCCTGAGGCAGAAACCTC] > [2] 3299593 3299615 23 [GCTGTGGCTGAGATGAATACTGG] > [3] 3637189 3637211 23 [CCTGCTTCTGCCTCTGCCACCGG] > [4] 3660740 3660762 23 [GCTGTTGCTGCCGCTGTTGGTGG] > [5] 3661169 3661191 23 [GCTGCCCCGGCCGCCGCCGCCCG] > [6] 3661721 3661743 23 [CCCGCGGCTGCAGCACGAGCCGC] > .... Just turn this into a DNAStringSet object with a coercion: as(matches, "DNAStringSet") or by calling the DNAStringSet() constructor on it: DNAStringSet(matches) Cheers, H. > > Best regards, > > Julie > > sessionInfo() > R version 3.0.1 (2013-05-16) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] grid parallel stats graphics grDevices utils datasets > methods base > > other attached packages: > [1] BSgenome.Mmusculus.UCSC.mm9_1.3.19 BiocInstaller_1.10.3 > REDseq_1.6.0 > [4] ChIPpeakAnno_2.8.0 GenomicFeatures_1.12.3 > limma_3.16.7 > [7] org.Hs.eg.db_2.9.0 GO.db_2.9.0 > RSQLite_0.11.4 > [10] DBI_0.2-7 AnnotationDbi_1.22.6 > BSgenome.Ecoli.NCBI.20080805_1.3.17 > [13] biomaRt_2.16.0 VennDiagram_1.6.5 > multtest_2.16.0 > [16] Biobase_2.20.1 > BSgenome.Celegans.UCSC.ce2_1.3.19 BSgenome_1.28.0 > [19] ShortRead_1.18.0 latticeExtra_0.6-26 > RColorBrewer_1.0-5 > [22] Rsamtools_1.12.4 lattice_0.20-23 > Biostrings_2.28.0 > [25] GenomicRanges_1.12.5 IRanges_1.18.3 > BiocGenerics_0.6.0 > > loaded via a namespace (and not attached): > [1] bitops_1.0-6 hwriter_1.3 MASS_7.3-29 > RCurl_1.95-4.1 rtracklayer_1.20.4 splines_3.0.1 > stats4_3.0.1 > [8] survival_2.37-4 tools_3.0.1 XML_3.95-0.2 > zlibbioc_1.6.0 -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT
0
Entering edit mode
Cool. Thanks, Herve! Is there a method to extract the mismatch position for all the matches? Right now, I am using pairwiseAlignment for each matched subsequence. However, this could become very slow when the number of matched sequences gets large. Best regards, Julie On 9/17/13 6:52 PM, "Hervé Pagès" <hpages at="" fhcrc.org=""> wrote: > Hi Julie, > > Sorry for the late answer. > > On 09/11/2013 02:43 PM, Zhu, Lihua (Julie) wrote: >> Herve, >> >> Is there a more elegant way to get all matched reference sequences >> besides using subject(matches)[start:end], e.g, subject(matches)[3010894 >> : 3010916] for each matched record? Thanks! >> 23-letter "DNAString" instance >> seq: GCGGAGCCTGAGGCAGAAACCTC >> >> matches is the object returned by matchPattern function call. >> >> matches >> Views on a 197195432-letter DNAString subject >> subject: >> NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN. ..CCTATTCT >> AGTAAAAATTTTATTTCATTCTGTAAAGAATTTGGTATTAAACTTAAAACTGGAATTC >> views: >> start end width >> [1] 3010894 3010916 23 [GCGGAGCCTGAGGCAGAAACCTC] >> [2] 3299593 3299615 23 [GCTGTGGCTGAGATGAATACTGG] >> [3] 3637189 3637211 23 [CCTGCTTCTGCCTCTGCCACCGG] >> [4] 3660740 3660762 23 [GCTGTTGCTGCCGCTGTTGGTGG] >> [5] 3661169 3661191 23 [GCTGCCCCGGCCGCCGCCGCCCG] >> [6] 3661721 3661743 23 [CCCGCGGCTGCAGCACGAGCCGC] >> .... > > Just turn this into a DNAStringSet object with a coercion: > > as(matches, "DNAStringSet") > > or by calling the DNAStringSet() constructor on it: > > DNAStringSet(matches) > > Cheers, > H. > >> >> Best regards, >> >> Julie >> >> sessionInfo() >> R version 3.0.1 (2013-05-16) >> Platform: x86_64-apple-darwin10.8.0 (64-bit) >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >> >> attached base packages: >> [1] grid parallel stats graphics grDevices utils datasets >> methods base >> >> other attached packages: >> [1] BSgenome.Mmusculus.UCSC.mm9_1.3.19 BiocInstaller_1.10.3 >> REDseq_1.6.0 >> [4] ChIPpeakAnno_2.8.0 GenomicFeatures_1.12.3 >> limma_3.16.7 >> [7] org.Hs.eg.db_2.9.0 GO.db_2.9.0 >> RSQLite_0.11.4 >> [10] DBI_0.2-7 AnnotationDbi_1.22.6 >> BSgenome.Ecoli.NCBI.20080805_1.3.17 >> [13] biomaRt_2.16.0 VennDiagram_1.6.5 >> multtest_2.16.0 >> [16] Biobase_2.20.1 >> BSgenome.Celegans.UCSC.ce2_1.3.19 BSgenome_1.28.0 >> [19] ShortRead_1.18.0 latticeExtra_0.6-26 >> RColorBrewer_1.0-5 >> [22] Rsamtools_1.12.4 lattice_0.20-23 >> Biostrings_2.28.0 >> [25] GenomicRanges_1.12.5 IRanges_1.18.3 >> BiocGenerics_0.6.0 >> >> loaded via a namespace (and not attached): >> [1] bitops_1.0-6 hwriter_1.3 MASS_7.3-29 >> RCurl_1.95-4.1 rtracklayer_1.20.4 splines_3.0.1 >> stats4_3.0.1 >> [8] survival_2.37-4 tools_3.0.1 XML_3.95-0.2 >> zlibbioc_1.6.0
ADD REPLY
0
Entering edit mode
On 09/17/2013 04:51 PM, Zhu, Lihua (Julie) wrote: > Cool. Thanks, Herve! > > Is there a method to extract the mismatch position for all the matches? > Right now, I am using pairwiseAlignment for each matched subsequence. No easy way at the moment (and this has been on the TODO list for a while). A faster workaround than pairwiseAlignment() if you don't use with.indels=TRUE or fixed=FALSE is: is_mismatch <- relist(as.raw(unlist(matches)) != as.raw(DNAString(pattern)), matches) 'is_mismatch' should be a LogicalList object with the same shape as 'matches'. Then you can do 'which(is_mismatch)' to get the mismatch positions. The result will be an IntegerList object. HTH, H. > However, this could become very slow when the number of matched sequences > gets large. > > > Best regards, > > Julie > > > On 9/17/13 6:52 PM, "Hervé Pagès" <hpages at="" fhcrc.org=""> wrote: > >> Hi Julie, >> >> Sorry for the late answer. >> >> On 09/11/2013 02:43 PM, Zhu, Lihua (Julie) wrote: >>> Herve, >>> >>> Is there a more elegant way to get all matched reference sequences >>> besides using subject(matches)[start:end], e.g, subject(matches)[3010894 >>> : 3010916] for each matched record? Thanks! >>> 23-letter "DNAString" instance >>> seq: GCGGAGCCTGAGGCAGAAACCTC >>> >>> matches is the object returned by matchPattern function call. >>> >>> matches >>> Views on a 197195432-letter DNAString subject >>> subject: >>> NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN ...CCTATTCT >>> AGTAAAAATTTTATTTCATTCTGTAAAGAATTTGGTATTAAACTTAAAACTGGAATTC >>> views: >>> start end width >>> [1] 3010894 3010916 23 [GCGGAGCCTGAGGCAGAAACCTC] >>> [2] 3299593 3299615 23 [GCTGTGGCTGAGATGAATACTGG] >>> [3] 3637189 3637211 23 [CCTGCTTCTGCCTCTGCCACCGG] >>> [4] 3660740 3660762 23 [GCTGTTGCTGCCGCTGTTGGTGG] >>> [5] 3661169 3661191 23 [GCTGCCCCGGCCGCCGCCGCCCG] >>> [6] 3661721 3661743 23 [CCCGCGGCTGCAGCACGAGCCGC] >>> .... >> >> Just turn this into a DNAStringSet object with a coercion: >> >> as(matches, "DNAStringSet") >> >> or by calling the DNAStringSet() constructor on it: >> >> DNAStringSet(matches) >> >> Cheers, >> H. >> >>> >>> Best regards, >>> >>> Julie >>> >>> sessionInfo() >>> R version 3.0.1 (2013-05-16) >>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>> >>> locale: >>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >>> >>> attached base packages: >>> [1] grid parallel stats graphics grDevices utils datasets >>> methods base >>> >>> other attached packages: >>> [1] BSgenome.Mmusculus.UCSC.mm9_1.3.19 BiocInstaller_1.10.3 >>> REDseq_1.6.0 >>> [4] ChIPpeakAnno_2.8.0 GenomicFeatures_1.12.3 >>> limma_3.16.7 >>> [7] org.Hs.eg.db_2.9.0 GO.db_2.9.0 >>> RSQLite_0.11.4 >>> [10] DBI_0.2-7 AnnotationDbi_1.22.6 >>> BSgenome.Ecoli.NCBI.20080805_1.3.17 >>> [13] biomaRt_2.16.0 VennDiagram_1.6.5 >>> multtest_2.16.0 >>> [16] Biobase_2.20.1 >>> BSgenome.Celegans.UCSC.ce2_1.3.19 BSgenome_1.28.0 >>> [19] ShortRead_1.18.0 latticeExtra_0.6-26 >>> RColorBrewer_1.0-5 >>> [22] Rsamtools_1.12.4 lattice_0.20-23 >>> Biostrings_2.28.0 >>> [25] GenomicRanges_1.12.5 IRanges_1.18.3 >>> BiocGenerics_0.6.0 >>> >>> loaded via a namespace (and not attached): >>> [1] bitops_1.0-6 hwriter_1.3 MASS_7.3-29 >>> RCurl_1.95-4.1 rtracklayer_1.20.4 splines_3.0.1 >>> stats4_3.0.1 >>> [8] survival_2.37-4 tools_3.0.1 XML_3.95-0.2 >>> zlibbioc_1.6.0 > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLY
0
Entering edit mode
Hi Julie, also check out help("lowlevel-matching"). I used hasLetterAt() as an equivalent test in the example I created below which again requires the match to be the same shape. Marcus mpattern <- pattern <- "AATCGGATC" ## mpattern contains a mismatch subseq(mpattern,2,2) <- "G" subject <- DNAString(paste("ATTCG", pattern, "NATTCGGAGGTC", mpattern, "NNNGN",mpattern, "CAAA", sep="")) matches <- matchPattern(pattern, DNAString(subject), max.mismatch=1) print(matches) Lmismatch <- (!hasLetterAt(as(matches, "DNAStringSet"), pattern, seq(nchar(pattern)))) print(Lmismatch) Lmismatch2 <- relist(as.raw(unlist(matches)) != as.raw(DNAString(pattern)), matches) print(Lmismatch2) ## 'row' is the Matches views index position, 'col' is the nucleotide mismatch position ind <- which(Lmismatch, arr.ind=TRUE) print(ind) ## Sanity check: subseq(as(matches[ind[,1]], "DNAStringSet"), ind[,2], ind[,2]) On Wed, Sep 18, 2013 at 12:08 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > On 09/17/2013 04:51 PM, Zhu, Lihua (Julie) wrote: > >> Cool. Thanks, Herve! >> >> Is there a method to extract the mismatch position for all the matches? >> Right now, I am using pairwiseAlignment for each matched subsequence. >> > > No easy way at the moment (and this has been on the TODO list for a > while). A faster workaround than pairwiseAlignment() if you don't use > with.indels=TRUE or fixed=FALSE is: > > is_mismatch <- relist(as.raw(unlist(matches)) != > as.raw(DNAString(pattern)), matches) > > 'is_mismatch' should be a LogicalList object with the same shape as > 'matches'. Then you can do 'which(is_mismatch)' to get the mismatch > positions. The result will be an IntegerList object. > > HTH, > > H. > > However, this could become very slow when the number of matched sequences >> gets large. >> >> >> Best regards, >> >> Julie >> >> >> On 9/17/13 6:52 PM, "Hervé Pagès" <hpages@fhcrc.org> wrote: >> >> Hi Julie, >>> >>> Sorry for the late answer. >>> >>> On 09/11/2013 02:43 PM, Zhu, Lihua (Julie) wrote: >>> >>>> Herve, >>>> >>>> Is there a more elegant way to get all matched reference sequences >>>> besides using subject(matches)[start:end], e.g, subject(matches)[3010894 >>>> : 3010916] for each matched record? Thanks! >>>> 23-letter "DNAString" instance >>>> seq: GCGGAGCCTGAGGCAGAAACCTC >>>> >>>> matches is the object returned by matchPattern function call. >>>> >>>> matches >>>> Views on a 197195432-letter DNAString subject >>>> subject: >>>> NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN**NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN** >>>> NNNNNN...CCTATTCT >>>> AGTAAAAATTTTATTTCATTCTGTAAAGAA**TTTGGTATTAAACTTAAAACTGGAATTC >>>> views: >>>> start end width >>>> [1] 3010894 3010916 23 [GCGGAGCCTGAGGCAGAAACCTC] >>>> [2] 3299593 3299615 23 [GCTGTGGCTGAGATGAATACTGG] >>>> [3] 3637189 3637211 23 [CCTGCTTCTGCCTCTGCCACCGG] >>>> [4] 3660740 3660762 23 [GCTGTTGCTGCCGCTGTTGGTGG] >>>> [5] 3661169 3661191 23 [GCTGCCCCGGCCGCCGCCGCCCG] >>>> [6] 3661721 3661743 23 [CCCGCGGCTGCAGCACGAGCCGC] >>>> .... >>>> >>> >>> Just turn this into a DNAStringSet object with a coercion: >>> >>> as(matches, "DNAStringSet") >>> >>> or by calling the DNAStringSet() constructor on it: >>> >>> DNAStringSet(matches) >>> >>> Cheers, >>> H. >>> >>> >>>> Best regards, >>>> >>>> Julie >>>> >>>> sessionInfo() >>>> R version 3.0.1 (2013-05-16) >>>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>>> >>>> locale: >>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.**UTF-8/C/en_US.UTF-8/en_US.UTF-**8 >>>> >>>> attached base packages: >>>> [1] grid parallel stats graphics grDevices utils datasets >>>> methods base >>>> >>>> other attached packages: >>>> [1] BSgenome.Mmusculus.UCSC.mm9_1.**3.19 BiocInstaller_1.10.3 >>>> REDseq_1.6.0 >>>> [4] ChIPpeakAnno_2.8.0 GenomicFeatures_1.12.3 >>>> limma_3.16.7 >>>> [7] org.Hs.eg.db_2.9.0 GO.db_2.9.0 >>>> RSQLite_0.11.4 >>>> [10] DBI_0.2-7 AnnotationDbi_1.22.6 >>>> BSgenome.Ecoli.NCBI.20080805_**1.3.17 >>>> [13] biomaRt_2.16.0 VennDiagram_1.6.5 >>>> multtest_2.16.0 >>>> [16] Biobase_2.20.1 >>>> BSgenome.Celegans.UCSC.ce2_1.**3.19 >>>> BSgenome_1.28.0 >>>> [19] ShortRead_1.18.0 latticeExtra_0.6-26 >>>> RColorBrewer_1.0-5 >>>> [22] Rsamtools_1.12.4 lattice_0.20-23 >>>> Biostrings_2.28.0 >>>> [25] GenomicRanges_1.12.5 IRanges_1.18.3 >>>> BiocGenerics_0.6.0 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] bitops_1.0-6 hwriter_1.3 MASS_7.3-29 >>>> RCurl_1.95-4.1 rtracklayer_1.20.4 splines_3.0.1 >>>> stats4_3.0.1 >>>> [8] survival_2.37-4 tools_3.0.1 XML_3.95-0.2 >>>> zlibbioc_1.6.0 >>>> >>> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Herve and Marcus, Thank you both for the great suggestions! Best regards, Julie On 9/17/13 8:27 PM, "Marcus Davy" <mdavy86@gmail.com> wrote: Hi Julie, also check out help("lowlevel-matching"). I used hasLetterAt() as an equivalent test in the example I created below which again requires the match to be the same shape. Marcus mpattern <- pattern <- "AATCGGATC" ## mpattern contains a mismatch subseq(mpattern,2,2) <- "G" subject <- DNAString(paste("ATTCG", pattern, "NATTCGGAGGTC", mpattern, "NNNGN",mpattern, "CAAA", sep="")) matches <- matchPattern(pattern, DNAString(subject), max.mismatch=1) print(matches) Lmismatch <- (!hasLetterAt(as(matches, "DNAStringSet"), pattern, seq(nchar(pattern)))) print(Lmismatch) Lmismatch2 <- relist(as.raw(unlist(matches)) != as.raw(DNAString(pattern)), matches) print(Lmismatch2) ## 'row' is the Matches views index position, 'col' is the nucleotide mismatch position ind <- which(Lmismatch, arr.ind=TRUE) print(ind) ## Sanity check: subseq(as(matches[ind[,1]], "DNAStringSet"), ind[,2], ind[,2]) On Wed, Sep 18, 2013 at 12:08 PM, Hervé Pagès <hpages@fhcrc.org> wrote: On 09/17/2013 04:51 PM, Zhu, Lihua (Julie) wrote: Cool. Thanks, Herve! Is there a method to extract the mismatch position for all the matches? Right now, I am using pairwiseAlignment for each matched subsequence. No easy way at the moment (and this has been on the TODO list for a while). A faster workaround than pairwiseAlignment() if you don't use with.indels=TRUE or fixed=FALSE is: is_mismatch <- relist(as.raw(unlist(matches)) != as.raw(DNAString(pattern)), matches) 'is_mismatch' should be a LogicalList object with the same shape as 'matches'. Then you can do 'which(is_mismatch)' to get the mismatch positions. The result will be an IntegerList object. HTH, H. However, this could become very slow when the number of matched sequences gets large. Best regards, Julie On 9/17/13 6:52 PM, "Hervé Pagès" <hpages@fhcrc.org> wrote: Hi Julie, Sorry for the late answer. On 09/11/2013 02:43 PM, Zhu, Lihua (Julie) wrote: Herve, Is there a more elegant way to get all matched reference sequences besides using subject(matches)[start:end], e.g, subject(matches)[3010894 : 3010916] for each matched record? Thanks! 23-letter "DNAString" instance seq: GCGGAGCCTGAGGCAGAAACCTC matches is the object returned by matchPattern function call. matches Views on a 197195432-letter DNAString subject subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...C CTATTCT AGTAAAAATTTTATTTCATTCTGTAAAGAATTTGGTATTAAACTTAAAACTGGAATTC views: start end width [1] 3010894 3010916 23 [GCGGAGCCTGAGGCAGAAACCTC] [2] 3299593 3299615 23 [GCTGTGGCTGAGATGAATACTGG] [3] 3637189 3637211 23 [CCTGCTTCTGCCTCTGCCACCGG] [4] 3660740 3660762 23 [GCTGTTGCTGCCGCTGTTGGTGG] [5] 3661169 3661191 23 [GCTGCCCCGGCCGCCGCCGCCCG] [6] 3661721 3661743 23 [CCCGCGGCTGCAGCACGAGCCGC] .... Just turn this into a DNAStringSet object with a coercion: as(matches, "DNAStringSet") or by calling the DNAStringSet() constructor on it: DNAStringSet(matches) Cheers, H. Best regards, Julie sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] grid parallel stats graphics grDevices utils datasets methods base other attached packages: [1] BSgenome.Mmusculus.UCSC.mm9_1.3.19 BiocInstaller_1.10.3 REDseq_1.6.0 [4] ChIPpeakAnno_2.8.0 GenomicFeatures_1.12.3 limma_3.16.7 [7] org.Hs.eg.db_2.9.0 GO.db_2.9.0 RSQLite_0.11.4 [10] DBI_0.2-7 AnnotationDbi_1.22.6 BSgenome.Ecoli.NCBI.20080805_1.3.17 [13] biomaRt_2.16.0 VennDiagram_1.6.5 multtest_2.16.0 [16] Biobase_2.20.1 BSgenome.Celegans.UCSC.ce2_1.3.19 BSgenome_1.28.0 [19] ShortRead_1.18.0 latticeExtra_0.6-26 RColorBrewer_1.0-5 [22] Rsamtools_1.12.4 lattice_0.20-23 Biostrings_2.28.0 [25] GenomicRanges_1.12.5 IRanges_1.18.3 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] bitops_1.0-6 hwriter_1.3 MASS_7.3-29 RCurl_1.95-4.1 rtracklayer_1.20.4 splines_3.0.1 stats4_3.0.1 [8] survival_2.37-4 tools_3.0.1 XML_3.95-0.2 zlibbioc_1.6.0 [[alternative HTML version deleted]]
ADD REPLY

Login before adding your answer.

Traffic: 753 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6