Question

Longest continuous sequence from multiple alignment

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 11.4 years ago

I have a number of ShortRead sequences aligned and stored in a DNAMultipleAlignment object. I???m trying to find a way to extract the common longest sequence between the reads. i.e., it is a combination of consensus sequence and longest read here is an example of 6 reads that I have aligned: [1] "CTATGGGAGGGAAAAAGGATCCAGTATTAGGGGGAGGCAGGAAGTGCATGTGAGGCCACAGTGTC AACA-------------------------------------------------------------" [2] "CTATGGGAGGG-AAAAGGATCCAGTATTA- GGGGAGGCAGGAAGTGCATGTGAGGCCACAGTGTCAACAACATACTTTCACTTGGC- AAAAAAAACATGGTGACTGCAAAGAG-----------------" [3] "CTATGGGAGGG-AAAAGGATCCAGTATTA-GGGGAGGCAGGAAGTGCATGTGAGGCCACAGTGTC AACAACATACTTTCACTTGGC--------------------------------------------" [4] "CTATGGGAGGG-AAAAGGATCCAGTATTA-GGGGAGGCAGGAAGTGCATGTGAGGCCACAGTGTC AAC--------------------------------------------------------------" [5] "CTATGGGAGGG-AAAAGGATCCAGTATTA-GGGGAGGCAGGAAGTGCATGTGAGGCCACAGTGTC AACAACATACTTTCACTTGGCAAAAAAAAACATGGTGACTGCAAAGAGAGCTCTGTCTGGCTTCT??? [6] "CTATGGGAGGG-AAAAGGATCCAGTATTA-GGGGAGGCAGGAAGTGCATGTGAGGCCACAGTGTC AAC--------------------------------------------------------------" >From these, I would like to go extract the longest common sequence, but retain ambiguity information where available i.e producing the following sequence: "CTATGGGAGGG-AAAAGGATCCAGTATTA- GGGGAGGCAGGAAGTGCATGTGAGGCCACAGTGTCAACAACATACTTTCACTTGGC- AAAAAAAACATGGTGACTGCAAAGAGAGCTCTGTCTGGCTTCT??? The consensusString function does not work well for me as it truncates the sequence in the 3??? end in this example. Thank you for your help All the best Tomas -- output of sessionInfo(): R version 3.1.0 (2014-04-10) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] Rsubread_1.14.0 Rlibstree_0.3-2 BiocInstaller_1.14.2 muscle_3.8.31-1 ShortRead_1.22.0 GenomicAlignments_1.0.1 [7] BSgenome_1.32.0 Rsamtools_1.16.0 GenomicRanges_1.16.3 GenomeInfoDb_1.0.2 Biostrings_2.32.0 XVector_0.4.0 [13] IRanges_1.22.6 BiocParallel_0.6.0 BiocGenerics_0.10.0 loaded via a namespace (and not attached): [1] BatchJobs_1.2 BBmisc_1.6 Biobase_2.24.0 bitops_1.0-6 brew_1.0-6 codetools_0.2-8 DBI_0.2-7 [8] digest_0.6.4 fail_1.2 foreach_1.4.2 grid_3.1.0 hwriter_1.3 iterators_1.0.7 lattice_0.20-29 [15] latticeExtra_0.6-26 plyr_1.8.1 RColorBrewer_1.0-5 Rcpp_0.11.1 RSQLite_0.11.4 sendmailR_1.1-2 stats4_3.1.0 [22] stringr_0.6.2 tools_3.1.0 zlibbioc_1.10.0 -- Sent via the guest posting facility at bioconductor.org.

GO ShortRead GO ShortRead • 1.4k views

ADD COMMENT • link 11.7 years ago Guest User ★ 13k