Question: Biostrings::matchPattern,extract sequences
0
gravatar for Julie Zhu
5.6 years ago by
Julie Zhu4.0k
United States
Julie Zhu4.0k wrote:
Herve, Is there a more elegant way to get all matched reference sequences besides using subject(matches)[start:end], e.g, subject(matches)[3010894 : 3010916] for each matched record? Thanks! 23-letter "DNAString" instance seq: GCGGAGCCTGAGGCAGAAACCTC matches is the object returned by matchPattern function call. matches Views on a 197195432-letter DNAString subject subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNN...CCTATTCTAGTAAAAATTTTATTTCATTCTGTAAAGAATTTGGTATTAAACTTAAAACTGGA ATTC views: start end width [1] 3010894 3010916 23 [GCGGAGCCTGAGGCAGAAACCTC] [2] 3299593 3299615 23 [GCTGTGGCTGAGATGAATACTGG] [3] 3637189 3637211 23 [CCTGCTTCTGCCTCTGCCACCGG] [4] 3660740 3660762 23 [GCTGTTGCTGCCGCTGTTGGTGG] [5] 3661169 3661191 23 [GCTGCCCCGGCCGCCGCCGCCCG] [6] 3661721 3661743 23 [CCCGCGGCTGCAGCACGAGCCGC] .... Best regards, Julie sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] grid parallel stats graphics grDevices utils datasets methods base other attached packages: [1] BSgenome.Mmusculus.UCSC.mm9_1.3.19 BiocInstaller_1.10.3 REDseq_1.6.0 [4] ChIPpeakAnno_2.8.0 GenomicFeatures_1.12.3 limma_3.16.7 [7] org.Hs.eg.db_2.9.0 GO.db_2.9.0 RSQLite_0.11.4 [10] DBI_0.2-7 AnnotationDbi_1.22.6 BSgenome.Ecoli.NCBI.20080805_1.3.17 [13] biomaRt_2.16.0 VennDiagram_1.6.5 multtest_2.16.0 [16] Biobase_2.20.1 BSgenome.Celegans.UCSC.ce2_1.3.19 BSgenome_1.28.0 [19] ShortRead_1.18.0 latticeExtra_0.6-26 RColorBrewer_1.0-5 [22] Rsamtools_1.12.4 lattice_0.20-23 Biostrings_2.28.0 [25] GenomicRanges_1.12.5 IRanges_1.18.3 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] bitops_1.0-6 hwriter_1.3 MASS_7.3-29 RCurl_1.95-4.1 rtracklayer_1.20.4 splines_3.0.1 stats4_3.0.1 [8] survival_2.37-4 tools_3.0.1 XML_3.95-0.2 zlibbioc_1.6.0 [[alternative HTML version deleted]]
go bsgenome • 1.4k views
ADD COMMENTlink modified 5.6 years ago by Hervé Pagès ♦♦ 13k • written 5.6 years ago by Julie Zhu4.0k
Answer: Biostrings::matchPattern,extract sequences
0
gravatar for Hervé Pagès
5.6 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:
Hi Julie, Sorry for the late answer. On 09/11/2013 02:43 PM, Zhu, Lihua (Julie) wrote: > Herve, > > Is there a more elegant way to get all matched reference sequences > besides using subject(matches)[start:end], e.g, subject(matches)[3010894 > : 3010916] for each matched record? Thanks! > 23-letter "DNAString" instance > seq: GCGGAGCCTGAGGCAGAAACCTC > > matches is the object returned by matchPattern function call. > > matches > Views on a 197195432-letter DNAString subject > subject: > NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN.. .CCTATTCTAGTAAAAATTTTATTTCATTCTGTAAAGAATTTGGTATTAAACTTAAAACTGGAATTC > views: > start end width > [1] 3010894 3010916 23 [GCGGAGCCTGAGGCAGAAACCTC] > [2] 3299593 3299615 23 [GCTGTGGCTGAGATGAATACTGG] > [3] 3637189 3637211 23 [CCTGCTTCTGCCTCTGCCACCGG] > [4] 3660740 3660762 23 [GCTGTTGCTGCCGCTGTTGGTGG] > [5] 3661169 3661191 23 [GCTGCCCCGGCCGCCGCCGCCCG] > [6] 3661721 3661743 23 [CCCGCGGCTGCAGCACGAGCCGC] > .... Just turn this into a DNAStringSet object with a coercion: as(matches, "DNAStringSet") or by calling the DNAStringSet() constructor on it: DNAStringSet(matches) Cheers, H. > > Best regards, > > Julie > > sessionInfo() > R version 3.0.1 (2013-05-16) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] grid parallel stats graphics grDevices utils datasets > methods base > > other attached packages: > [1] BSgenome.Mmusculus.UCSC.mm9_1.3.19 BiocInstaller_1.10.3 > REDseq_1.6.0 > [4] ChIPpeakAnno_2.8.0 GenomicFeatures_1.12.3 > limma_3.16.7 > [7] org.Hs.eg.db_2.9.0 GO.db_2.9.0 > RSQLite_0.11.4 > [10] DBI_0.2-7 AnnotationDbi_1.22.6 > BSgenome.Ecoli.NCBI.20080805_1.3.17 > [13] biomaRt_2.16.0 VennDiagram_1.6.5 > multtest_2.16.0 > [16] Biobase_2.20.1 > BSgenome.Celegans.UCSC.ce2_1.3.19 BSgenome_1.28.0 > [19] ShortRead_1.18.0 latticeExtra_0.6-26 > RColorBrewer_1.0-5 > [22] Rsamtools_1.12.4 lattice_0.20-23 > Biostrings_2.28.0 > [25] GenomicRanges_1.12.5 IRanges_1.18.3 > BiocGenerics_0.6.0 > > loaded via a namespace (and not attached): > [1] bitops_1.0-6 hwriter_1.3 MASS_7.3-29 > RCurl_1.95-4.1 rtracklayer_1.20.4 splines_3.0.1 > stats4_3.0.1 > [8] survival_2.37-4 tools_3.0.1 XML_3.95-0.2 > zlibbioc_1.6.0 -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENTlink written 5.6 years ago by Hervé Pagès ♦♦ 13k
Cool. Thanks, Herve! Is there a method to extract the mismatch position for all the matches? Right now, I am using pairwiseAlignment for each matched subsequence. However, this could become very slow when the number of matched sequences gets large. Best regards, Julie On 9/17/13 6:52 PM, "Hervé Pagès" <hpages at="" fhcrc.org=""> wrote: > Hi Julie, > > Sorry for the late answer. > > On 09/11/2013 02:43 PM, Zhu, Lihua (Julie) wrote: >> Herve, >> >> Is there a more elegant way to get all matched reference sequences >> besides using subject(matches)[start:end], e.g, subject(matches)[3010894 >> : 3010916] for each matched record? Thanks! >> 23-letter "DNAString" instance >> seq: GCGGAGCCTGAGGCAGAAACCTC >> >> matches is the object returned by matchPattern function call. >> >> matches >> Views on a 197195432-letter DNAString subject >> subject: >> NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN. ..CCTATTCT >> AGTAAAAATTTTATTTCATTCTGTAAAGAATTTGGTATTAAACTTAAAACTGGAATTC >> views: >> start end width >> [1] 3010894 3010916 23 [GCGGAGCCTGAGGCAGAAACCTC] >> [2] 3299593 3299615 23 [GCTGTGGCTGAGATGAATACTGG] >> [3] 3637189 3637211 23 [CCTGCTTCTGCCTCTGCCACCGG] >> [4] 3660740 3660762 23 [GCTGTTGCTGCCGCTGTTGGTGG] >> [5] 3661169 3661191 23 [GCTGCCCCGGCCGCCGCCGCCCG] >> [6] 3661721 3661743 23 [CCCGCGGCTGCAGCACGAGCCGC] >> .... > > Just turn this into a DNAStringSet object with a coercion: > > as(matches, "DNAStringSet") > > or by calling the DNAStringSet() constructor on it: > > DNAStringSet(matches) > > Cheers, > H. > >> >> Best regards, >> >> Julie >> >> sessionInfo() >> R version 3.0.1 (2013-05-16) >> Platform: x86_64-apple-darwin10.8.0 (64-bit) >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >> >> attached base packages: >> [1] grid parallel stats graphics grDevices utils datasets >> methods base >> >> other attached packages: >> [1] BSgenome.Mmusculus.UCSC.mm9_1.3.19 BiocInstaller_1.10.3 >> REDseq_1.6.0 >> [4] ChIPpeakAnno_2.8.0 GenomicFeatures_1.12.3 >> limma_3.16.7 >> [7] org.Hs.eg.db_2.9.0 GO.db_2.9.0 >> RSQLite_0.11.4 >> [10] DBI_0.2-7 AnnotationDbi_1.22.6 >> BSgenome.Ecoli.NCBI.20080805_1.3.17 >> [13] biomaRt_2.16.0 VennDiagram_1.6.5 >> multtest_2.16.0 >> [16] Biobase_2.20.1 >> BSgenome.Celegans.UCSC.ce2_1.3.19 BSgenome_1.28.0 >> [19] ShortRead_1.18.0 latticeExtra_0.6-26 >> RColorBrewer_1.0-5 >> [22] Rsamtools_1.12.4 lattice_0.20-23 >> Biostrings_2.28.0 >> [25] GenomicRanges_1.12.5 IRanges_1.18.3 >> BiocGenerics_0.6.0 >> >> loaded via a namespace (and not attached): >> [1] bitops_1.0-6 hwriter_1.3 MASS_7.3-29 >> RCurl_1.95-4.1 rtracklayer_1.20.4 splines_3.0.1 >> stats4_3.0.1 >> [8] survival_2.37-4 tools_3.0.1 XML_3.95-0.2 >> zlibbioc_1.6.0
ADD REPLYlink written 5.6 years ago by Julie Zhu4.0k
On 09/17/2013 04:51 PM, Zhu, Lihua (Julie) wrote: > Cool. Thanks, Herve! > > Is there a method to extract the mismatch position for all the matches? > Right now, I am using pairwiseAlignment for each matched subsequence. No easy way at the moment (and this has been on the TODO list for a while). A faster workaround than pairwiseAlignment() if you don't use with.indels=TRUE or fixed=FALSE is: is_mismatch <- relist(as.raw(unlist(matches)) != as.raw(DNAString(pattern)), matches) 'is_mismatch' should be a LogicalList object with the same shape as 'matches'. Then you can do 'which(is_mismatch)' to get the mismatch positions. The result will be an IntegerList object. HTH, H. > However, this could become very slow when the number of matched sequences > gets large. > > > Best regards, > > Julie > > > On 9/17/13 6:52 PM, "Hervé Pagès" <hpages at="" fhcrc.org=""> wrote: > >> Hi Julie, >> >> Sorry for the late answer. >> >> On 09/11/2013 02:43 PM, Zhu, Lihua (Julie) wrote: >>> Herve, >>> >>> Is there a more elegant way to get all matched reference sequences >>> besides using subject(matches)[start:end], e.g, subject(matches)[3010894 >>> : 3010916] for each matched record? Thanks! >>> 23-letter "DNAString" instance >>> seq: GCGGAGCCTGAGGCAGAAACCTC >>> >>> matches is the object returned by matchPattern function call. >>> >>> matches >>> Views on a 197195432-letter DNAString subject >>> subject: >>> NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN ...CCTATTCT >>> AGTAAAAATTTTATTTCATTCTGTAAAGAATTTGGTATTAAACTTAAAACTGGAATTC >>> views: >>> start end width >>> [1] 3010894 3010916 23 [GCGGAGCCTGAGGCAGAAACCTC] >>> [2] 3299593 3299615 23 [GCTGTGGCTGAGATGAATACTGG] >>> [3] 3637189 3637211 23 [CCTGCTTCTGCCTCTGCCACCGG] >>> [4] 3660740 3660762 23 [GCTGTTGCTGCCGCTGTTGGTGG] >>> [5] 3661169 3661191 23 [GCTGCCCCGGCCGCCGCCGCCCG] >>> [6] 3661721 3661743 23 [CCCGCGGCTGCAGCACGAGCCGC] >>> .... >> >> Just turn this into a DNAStringSet object with a coercion: >> >> as(matches, "DNAStringSet") >> >> or by calling the DNAStringSet() constructor on it: >> >> DNAStringSet(matches) >> >> Cheers, >> H. >> >>> >>> Best regards, >>> >>> Julie >>> >>> sessionInfo() >>> R version 3.0.1 (2013-05-16) >>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>> >>> locale: >>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >>> >>> attached base packages: >>> [1] grid parallel stats graphics grDevices utils datasets >>> methods base >>> >>> other attached packages: >>> [1] BSgenome.Mmusculus.UCSC.mm9_1.3.19 BiocInstaller_1.10.3 >>> REDseq_1.6.0 >>> [4] ChIPpeakAnno_2.8.0 GenomicFeatures_1.12.3 >>> limma_3.16.7 >>> [7] org.Hs.eg.db_2.9.0 GO.db_2.9.0 >>> RSQLite_0.11.4 >>> [10] DBI_0.2-7 AnnotationDbi_1.22.6 >>> BSgenome.Ecoli.NCBI.20080805_1.3.17 >>> [13] biomaRt_2.16.0 VennDiagram_1.6.5 >>> multtest_2.16.0 >>> [16] Biobase_2.20.1 >>> BSgenome.Celegans.UCSC.ce2_1.3.19 BSgenome_1.28.0 >>> [19] ShortRead_1.18.0 latticeExtra_0.6-26 >>> RColorBrewer_1.0-5 >>> [22] Rsamtools_1.12.4 lattice_0.20-23 >>> Biostrings_2.28.0 >>> [25] GenomicRanges_1.12.5 IRanges_1.18.3 >>> BiocGenerics_0.6.0 >>> >>> loaded via a namespace (and not attached): >>> [1] bitops_1.0-6 hwriter_1.3 MASS_7.3-29 >>> RCurl_1.95-4.1 rtracklayer_1.20.4 splines_3.0.1 >>> stats4_3.0.1 >>> [8] survival_2.37-4 tools_3.0.1 XML_3.95-0.2 >>> zlibbioc_1.6.0 > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLYlink written 5.6 years ago by Hervé Pagès ♦♦ 13k
Hi Julie, also check out help("lowlevel-matching"). I used hasLetterAt() as an equivalent test in the example I created below which again requires the match to be the same shape. Marcus mpattern <- pattern <- "AATCGGATC" ## mpattern contains a mismatch subseq(mpattern,2,2) <- "G" subject <- DNAString(paste("ATTCG", pattern, "NATTCGGAGGTC", mpattern, "NNNGN",mpattern, "CAAA", sep="")) matches <- matchPattern(pattern, DNAString(subject), max.mismatch=1) print(matches) Lmismatch <- (!hasLetterAt(as(matches, "DNAStringSet"), pattern, seq(nchar(pattern)))) print(Lmismatch) Lmismatch2 <- relist(as.raw(unlist(matches)) != as.raw(DNAString(pattern)), matches) print(Lmismatch2) ## 'row' is the Matches views index position, 'col' is the nucleotide mismatch position ind <- which(Lmismatch, arr.ind=TRUE) print(ind) ## Sanity check: subseq(as(matches[ind[,1]], "DNAStringSet"), ind[,2], ind[,2]) On Wed, Sep 18, 2013 at 12:08 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > On 09/17/2013 04:51 PM, Zhu, Lihua (Julie) wrote: > >> Cool. Thanks, Herve! >> >> Is there a method to extract the mismatch position for all the matches? >> Right now, I am using pairwiseAlignment for each matched subsequence. >> > > No easy way at the moment (and this has been on the TODO list for a > while). A faster workaround than pairwiseAlignment() if you don't use > with.indels=TRUE or fixed=FALSE is: > > is_mismatch <- relist(as.raw(unlist(matches)) != > as.raw(DNAString(pattern)), matches) > > 'is_mismatch' should be a LogicalList object with the same shape as > 'matches'. Then you can do 'which(is_mismatch)' to get the mismatch > positions. The result will be an IntegerList object. > > HTH, > > H. > > However, this could become very slow when the number of matched sequences >> gets large. >> >> >> Best regards, >> >> Julie >> >> >> On 9/17/13 6:52 PM, "Hervé Pagès" <hpages@fhcrc.org> wrote: >> >> Hi Julie, >>> >>> Sorry for the late answer. >>> >>> On 09/11/2013 02:43 PM, Zhu, Lihua (Julie) wrote: >>> >>>> Herve, >>>> >>>> Is there a more elegant way to get all matched reference sequences >>>> besides using subject(matches)[start:end], e.g, subject(matches)[3010894 >>>> : 3010916] for each matched record? Thanks! >>>> 23-letter "DNAString" instance >>>> seq: GCGGAGCCTGAGGCAGAAACCTC >>>> >>>> matches is the object returned by matchPattern function call. >>>> >>>> matches >>>> Views on a 197195432-letter DNAString subject >>>> subject: >>>> NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN**NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN** >>>> NNNNNN...CCTATTCT >>>> AGTAAAAATTTTATTTCATTCTGTAAAGAA**TTTGGTATTAAACTTAAAACTGGAATTC >>>> views: >>>> start end width >>>> [1] 3010894 3010916 23 [GCGGAGCCTGAGGCAGAAACCTC] >>>> [2] 3299593 3299615 23 [GCTGTGGCTGAGATGAATACTGG] >>>> [3] 3637189 3637211 23 [CCTGCTTCTGCCTCTGCCACCGG] >>>> [4] 3660740 3660762 23 [GCTGTTGCTGCCGCTGTTGGTGG] >>>> [5] 3661169 3661191 23 [GCTGCCCCGGCCGCCGCCGCCCG] >>>> [6] 3661721 3661743 23 [CCCGCGGCTGCAGCACGAGCCGC] >>>> .... >>>> >>> >>> Just turn this into a DNAStringSet object with a coercion: >>> >>> as(matches, "DNAStringSet") >>> >>> or by calling the DNAStringSet() constructor on it: >>> >>> DNAStringSet(matches) >>> >>> Cheers, >>> H. >>> >>> >>>> Best regards, >>>> >>>> Julie >>>> >>>> sessionInfo() >>>> R version 3.0.1 (2013-05-16) >>>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>>> >>>> locale: >>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.**UTF-8/C/en_US.UTF-8/en_US.UTF-**8 >>>> >>>> attached base packages: >>>> [1] grid parallel stats graphics grDevices utils datasets >>>> methods base >>>> >>>> other attached packages: >>>> [1] BSgenome.Mmusculus.UCSC.mm9_1.**3.19 BiocInstaller_1.10.3 >>>> REDseq_1.6.0 >>>> [4] ChIPpeakAnno_2.8.0 GenomicFeatures_1.12.3 >>>> limma_3.16.7 >>>> [7] org.Hs.eg.db_2.9.0 GO.db_2.9.0 >>>> RSQLite_0.11.4 >>>> [10] DBI_0.2-7 AnnotationDbi_1.22.6 >>>> BSgenome.Ecoli.NCBI.20080805_**1.3.17 >>>> [13] biomaRt_2.16.0 VennDiagram_1.6.5 >>>> multtest_2.16.0 >>>> [16] Biobase_2.20.1 >>>> BSgenome.Celegans.UCSC.ce2_1.**3.19 >>>> BSgenome_1.28.0 >>>> [19] ShortRead_1.18.0 latticeExtra_0.6-26 >>>> RColorBrewer_1.0-5 >>>> [22] Rsamtools_1.12.4 lattice_0.20-23 >>>> Biostrings_2.28.0 >>>> [25] GenomicRanges_1.12.5 IRanges_1.18.3 >>>> BiocGenerics_0.6.0 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] bitops_1.0-6 hwriter_1.3 MASS_7.3-29 >>>> RCurl_1.95-4.1 rtracklayer_1.20.4 splines_3.0.1 >>>> stats4_3.0.1 >>>> [8] survival_2.37-4 tools_3.0.1 XML_3.95-0.2 >>>> zlibbioc_1.6.0 >>>> >>> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > [[alternative HTML version deleted]]
ADD REPLYlink written 5.6 years ago by Marcus Davy390
Herve and Marcus, Thank you both for the great suggestions! Best regards, Julie On 9/17/13 8:27 PM, "Marcus Davy" <mdavy86@gmail.com> wrote: Hi Julie, also check out help("lowlevel-matching"). I used hasLetterAt() as an equivalent test in the example I created below which again requires the match to be the same shape. Marcus mpattern <- pattern <- "AATCGGATC" ## mpattern contains a mismatch subseq(mpattern,2,2) <- "G" subject <- DNAString(paste("ATTCG", pattern, "NATTCGGAGGTC", mpattern, "NNNGN",mpattern, "CAAA", sep="")) matches <- matchPattern(pattern, DNAString(subject), max.mismatch=1) print(matches) Lmismatch <- (!hasLetterAt(as(matches, "DNAStringSet"), pattern, seq(nchar(pattern)))) print(Lmismatch) Lmismatch2 <- relist(as.raw(unlist(matches)) != as.raw(DNAString(pattern)), matches) print(Lmismatch2) ## 'row' is the Matches views index position, 'col' is the nucleotide mismatch position ind <- which(Lmismatch, arr.ind=TRUE) print(ind) ## Sanity check: subseq(as(matches[ind[,1]], "DNAStringSet"), ind[,2], ind[,2]) On Wed, Sep 18, 2013 at 12:08 PM, Hervé Pagès <hpages@fhcrc.org> wrote: On 09/17/2013 04:51 PM, Zhu, Lihua (Julie) wrote: Cool. Thanks, Herve! Is there a method to extract the mismatch position for all the matches? Right now, I am using pairwiseAlignment for each matched subsequence. No easy way at the moment (and this has been on the TODO list for a while). A faster workaround than pairwiseAlignment() if you don't use with.indels=TRUE or fixed=FALSE is: is_mismatch <- relist(as.raw(unlist(matches)) != as.raw(DNAString(pattern)), matches) 'is_mismatch' should be a LogicalList object with the same shape as 'matches'. Then you can do 'which(is_mismatch)' to get the mismatch positions. The result will be an IntegerList object. HTH, H. However, this could become very slow when the number of matched sequences gets large. Best regards, Julie On 9/17/13 6:52 PM, "Hervé Pagès" <hpages@fhcrc.org> wrote: Hi Julie, Sorry for the late answer. On 09/11/2013 02:43 PM, Zhu, Lihua (Julie) wrote: Herve, Is there a more elegant way to get all matched reference sequences besides using subject(matches)[start:end], e.g, subject(matches)[3010894 : 3010916] for each matched record? Thanks! 23-letter "DNAString" instance seq: GCGGAGCCTGAGGCAGAAACCTC matches is the object returned by matchPattern function call. matches Views on a 197195432-letter DNAString subject subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...C CTATTCT AGTAAAAATTTTATTTCATTCTGTAAAGAATTTGGTATTAAACTTAAAACTGGAATTC views: start end width [1] 3010894 3010916 23 [GCGGAGCCTGAGGCAGAAACCTC] [2] 3299593 3299615 23 [GCTGTGGCTGAGATGAATACTGG] [3] 3637189 3637211 23 [CCTGCTTCTGCCTCTGCCACCGG] [4] 3660740 3660762 23 [GCTGTTGCTGCCGCTGTTGGTGG] [5] 3661169 3661191 23 [GCTGCCCCGGCCGCCGCCGCCCG] [6] 3661721 3661743 23 [CCCGCGGCTGCAGCACGAGCCGC] .... Just turn this into a DNAStringSet object with a coercion: as(matches, "DNAStringSet") or by calling the DNAStringSet() constructor on it: DNAStringSet(matches) Cheers, H. Best regards, Julie sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] grid parallel stats graphics grDevices utils datasets methods base other attached packages: [1] BSgenome.Mmusculus.UCSC.mm9_1.3.19 BiocInstaller_1.10.3 REDseq_1.6.0 [4] ChIPpeakAnno_2.8.0 GenomicFeatures_1.12.3 limma_3.16.7 [7] org.Hs.eg.db_2.9.0 GO.db_2.9.0 RSQLite_0.11.4 [10] DBI_0.2-7 AnnotationDbi_1.22.6 BSgenome.Ecoli.NCBI.20080805_1.3.17 [13] biomaRt_2.16.0 VennDiagram_1.6.5 multtest_2.16.0 [16] Biobase_2.20.1 BSgenome.Celegans.UCSC.ce2_1.3.19 BSgenome_1.28.0 [19] ShortRead_1.18.0 latticeExtra_0.6-26 RColorBrewer_1.0-5 [22] Rsamtools_1.12.4 lattice_0.20-23 Biostrings_2.28.0 [25] GenomicRanges_1.12.5 IRanges_1.18.3 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] bitops_1.0-6 hwriter_1.3 MASS_7.3-29 RCurl_1.95-4.1 rtracklayer_1.20.4 splines_3.0.1 stats4_3.0.1 [8] survival_2.37-4 tools_3.0.1 XML_3.95-0.2 zlibbioc_1.6.0 [[alternative HTML version deleted]]
ADD REPLYlink written 5.6 years ago by Julie Zhu4.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 124 users visited in the last hour