Biostrings DNAMultipleAlignment vs AlignedXStringSet

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 9.7 years ago

Hi, I have what I thought should be simple procedure... I have a multiple alignment. I want to find the locations (and sequences) of all the differences. I can read the file into a DNAMultipleAlignment class: aln = read.DNAMultipleAlignment(filepath=alnFiles[i], format="fasta") This is a 4-species alignment, 300bp long. I finally figured out a way find all the "differences" (or, bases where the 4 species are not identical), like so: which(apply(consensusMatrix(aln), 2, max) !=4) It seems like there should be a more straightforward way, but this just looks at the count matrix and finds all the places that don't have all the species in the same nucleotide... Now, I found no easy way to subset the aln object to pull out the sequences at those positions. Is there one? I think maybe I want something like "mismatchTable", but according to the documentation, this won't work on a DNAMultipleAlignment... mismatchTable(aln) Error in function (classes, fdef, mtable) : unable to find an inherited method for function "mismatchTable", for signature "DNAMultipleAlignment" Instead, it works on a AlignedXStringSet. Well, I can't figure out the difference between an AlignedXStringSet and a DNAMultipleAlignment, which both appear to hold aligned strings, even XStringSets...and I cannot convert one to another. Can anyone tell me where I should start to figure out the differences between these objects? And, how to solve my original problem of getting the the sequences at the locations that are different in an alignment? Thanks for your help. -- output of sessionInfo(): R version 2.13.0 (2011-04-13) Platform: x86_64-redhat-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] grid stats graphics grDevices utils datasets methods [8] base other attached packages: [1] seqLogo_1.18.0 MotIV_1.6.0 rtracklayer_1.12.2 [4] RCurl_1.6-4 bitops_1.0-4.1 GenomicRanges_1.4.3 [7] Biostrings_2.20.1 IRanges_1.10.3 loaded via a namespace (and not attached): [1] BSgenome_1.20.0 lattice_0.19-23 rGADEM_1.4.0 tools_2.13.0 [5] XML_3.4-0 -- Sent via the guest posting facility at bioconductor.org.

Alignment convert Alignment convert • 2.0k views

ADD COMMENT • link updated 11.7 years ago by Martin Morgan 25k • written 11.7 years ago by Guest User ★ 13k

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 16 days ago

United States

Hi Nathan -- On 08/31/2012 08:10 AM, Nathan Sheffield [guest] wrote: > > Hi, I have what I thought should be simple procedure... > > I have a multiple alignment. I want to find the locations (and > sequences) of all the differences. > > I can read the file into a DNAMultipleAlignment class: > > aln = read.DNAMultipleAlignment(filepath=alnFiles[i], > format="fasta") > > This is a 4-species alignment, 300bp long. I finally figured out a > way find all the "differences" (or, bases where the 4 species are not > identical), like so: > > which(apply(consensusMatrix(aln), 2, max) !=4) > > It seems like there should be a more straightforward way, but this > just looks at the count matrix and finds all the places that don't > have all the species in the same nucleotide... For a reproducible example, I grabbed aln <- readDNAMultipleAlignment(filepath = system.file("extdata", "msx2_mRNA.aln", package="Biostrings"), format="clustal") from the help page, ?MultipleAlignment. This has 8 rows. I did cidx = colSums(consensusMatrix(aln) == nrow(aln)) == 1 to get a logical vector, but same idea as your 'apply' solution. I ended up with > table(cidx) cidx FALSE TRUE 2024 319 so 2024 columns that are not consistent across all rows. > Now, I found no easy way to subset the aln object to pull out the > sequences at those positions. Is there one? I applied a mask to the columns colmask(aln, "replace") = cidx and then coerced my aln to a DNAStringSet > as(aln, "DNAStringSet") A DNAStringSet instance of length 8 width seq names [1] 2024 -----TCCCGTCTCCGCA...TAAAAAAAAAAAAAAAAA gi|84452153|ref|N... [2] 2024 ------------------...------------------ gi|208431713|ref|... [3] 2024 ------------------...------------------ gi|118601823|ref|... [4] 2024 ------------------...------------------ gi|114326503|ref|... [5] 2024 ------------------...------------------ gi|119220589|ref|... [6] 2024 ------------------...------------------ gi|148540149|ref|... [7] 2024 --------------CGGC...------------------ gi|45383056|ref|N... [8] 2024 GGGGGAGACTTCAGAAGT...------------------ gi|213515133|ref|... Also, I coerced the logical vector to an IRanges > (crng = as(!cidx, "IRanges")) IRanges of length 166 start end width [1] 1 98 98 [2] 100 109 10 [3] 111 120 10 [4] 122 122 1 [5] 124 127 4 [6] 129 142 14 [7] 144 155 12 [8] 157 159 3 [9] 161 166 6 ... ... ... ... [158] 876 876 1 [159] 878 882 5 [160] 885 886 2 [161] 888 889 2 [162] 892 892 1 [163] 894 896 3 [ reached getOption("max.print") -- omitted 3 rows ] This allowed me to look at one range of non-consensus at a time, e.g., > colmask(aln, "replace", invert=TRUE) = crng[1] > as(aln, "DNAStringSet") A DNAStringSet instance of length 8 width seq names [1] 98 -----TCCCGTCTCCGCA...GGGCAGAGAAGTCA-TGG gi|84452153|ref|N... [2] 98 ------------------...-------------A-TGG gi|208431713|ref|... [3] 98 ------------------...---GAGAGAAGTCA-TGG gi|118601823|ref|... [4] 98 ------------------...GCGCAGA-AAGTCA-TGG gi|114326503|ref|... [5] 98 ------------------...-------------A-TGG gi|119220589|ref|... [6] 98 ------------------...-------------A-TGG gi|148540149|ref|... [7] 98 --------------CGGC...GGGCGGCCCCGCTC-CAG gi|45383056|ref|N... [8] 98 GGGGGAGACTTCAGAAGT...GAATGTGTTCGTCAACAT gi|213515133|ref|... I haven't worked with these much, so I might not be doing things in the most efficient way; mostly I'm going from ?MultipleAlignment; there is also vignette(package="Biostrings", "MultipleAlignments"). > > I think maybe I want something like "mismatchTable", but according to > the documentation, this won't work on a DNAMultipleAlignment... > > mismatchTable(aln) Error in function (classes, fdef, mtable) : > unable to find an inherited method for function "mismatchTable", for > signature "DNAMultipleAlignment" > > Instead, it works on a AlignedXStringSet. Well, I can't figure out > the difference between an AlignedXStringSet and a > DNAMultipleAlignment, which both appear to hold aligned strings, even > XStringSets...and I cannot convert one to another. > > Can anyone tell me where I should start to figure out the differences > between these objects? And, how to solve my original problem of > getting the the sequences at the locations that are different in an > alignment? I think the AlignedXStringSet is strictly for pairwise alignments returned by the ?pairwiseAlignment function. Hope that helps, Martin > > > Thanks for your help. > > > > -- output of sessionInfo(): > > R version 2.13.0 (2011-04-13) Platform: x86_64-redhat-linux-gnu > (64-bit) > > locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] > LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=C > LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] > LC_ADDRESS=C LC_TELEPHONE=C [11] > LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: [1] grid stats graphics grDevices > utils datasets methods [8] base > > other attached packages: [1] seqLogo_1.18.0 MotIV_1.6.0 > rtracklayer_1.12.2 [4] RCurl_1.6-4 bitops_1.0-4.1 > GenomicRanges_1.4.3 [7] Biostrings_2.20.1 IRanges_1.10.3 > > loaded via a namespace (and not attached): [1] BSgenome_1.20.0 > lattice_0.19-23 rGADEM_1.4.0 tools_2.13.0 [5] XML_3.4-0 > > > -- Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ Bioconductor mailing > list Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor Search the > archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD COMMENT • link 11.7 years ago Martin Morgan 25k

Login before adding your answer.