issue with Views on PairwiseAlignmentsSingleSubject?

0

Entering edit mode

Janet Young ▴ 740

@janet-young-2360

Last seen 4.5 years ago

Fred Hutchinson Cancer Research Center,…

Hi there, I've been using Biostrings' pairwiseAlignment for various things. I just saw the Views feature described in help("PairwiseAlignments- class") and thought I'd give it a try, but it's not behaving as I'd expect in cases where my subject sequence is one sequence selected from a DNAStringSet (rather than starting with a DNAString). I think I found a bug (?) although I could also be misunderstanding what Views is supposed to do. I updated to the new Bioc devel and new R today, but had the same issue before updating. I think the code chunk below will explain - does it make sense? thanks very much, Janet library(Biostrings) ## example seqs (taken from ?pairwiseAlignment) s1 <- DNAString("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAAGGAAACGCAAAGTTTTCAAG") s2 <- DNAString("GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAAT TTTCATC") ### make some sequence sets s1a <- DNAStringSet(c(as.character(s1),as.character(s1))) s2a <- DNAStringSet(c(as.character(s2),as.character(s2))) ### align myAln <- pairwiseAlignment ( s1a[1], s2a[1]) ### I think the next line of code should give me a Views on s1a[1], but instead it gives me Views on all the seqs of s1a concatenated together Views(myAln) ##### and, an aside, it would seem intuitive to me that the code below should work to make a stringset of two sequences, but instead it concetenates them to one sequence s1b <- DNAStringSet(c(s1,s1)) ######### sessionInfo() R Under development (unstable) (2012-10-03 r60868) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Biostrings_2.27.1 IRanges_1.17.0 BiocGenerics_0.5.0 loaded via a namespace (and not attached): [1] parallel_2.16.0 stats4_2.16.0

• 782 views

ADD COMMENT • link updated 11.5 years ago by Hervé Pagès 16k • written 11.5 years ago by Janet Young ▴ 740

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 1 day ago

Seattle, WA, United States

Hi Janet, Yes, this is clearly a bug. Thanks for reporting it. Just to make the problem even more apparent: subject1 <- paste(rep("N", 60), collapse="") subject2 <- "GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAATTTTCATC" subject <- DNAStringSet(c(subject1, subject2)) pattern <- DNAStringSet(c("ATATAAAAATTAATT", "CTTCCATTTCG")) myAln <- pairwiseAlignment(pattern, subject[2]) Then: > myAln Global PairwiseAlignmentsSingleSubject (1 of 2) pattern: [1] ATATAAAAAT-TAATT subject: [39] ATATAAAAATATAATT score: -180.2737 > Views(myAln) Views on a 120-letter DNAString subject subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...AAATATATAAATATATAAAAATATAATTTTCAT C views: start end width [1] 39 54 16 [NNNNNNNNNNNNNNNN] [2] 1 19 19 [NNNNNNNNNNNNNNNNNNN] A fix is coming (in Biostrings 2.26.1 and 2.27.2). With this fix: > Views(myAln) Views on a 60-letter DNAString subject subject: GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAATTTTCATC views: start end width [1] 39 54 16 [ATATAAAAATATAATT] [2] 1 19 19 [GTTTCACTACTTCCTTTCG] And looking closely at 'myAln' with writePairwiseAlignments() will confirm those views: > writePairwiseAlignments(myAln) ######################################## # Program: Biostrings (version 2.26.0), a Bioconductor package # Rundate: Fri Oct 5 16:15:47 2012 ######################################## #======================================= # # Aligned_sequences: 2 # 1: P1 # 2: S1 # Matrix: NA # Gap_penalty: 14.0 # Extend_penalty: 4.0 # # Length: 60 # Identity: 15/60 (25.0%) # Similarity: NA/60 (NA%) # Gaps: 45/60 (75.0%) # Score: -180.2737 # # #======================================= P1 1 --------------------------------------ATATAAAAAT-T 11 |||||||||| | S1 1 GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATAT 50 P1 12 AATT------ 15 |||| S1 51 AATTTTCATC 60 #======================================= # # Aligned_sequences: 2 # 1: P2 # 2: S1 # Matrix: NA # Gap_penalty: 14.0 # Extend_penalty: 4.0 # # Length: 60 # Identity: 9/60 (15.0%) # Similarity: NA/60 (NA%) # Gaps: 49/60 (81.7%) # Score: -209.9628 # # #======================================= P2 1 CTTCCA-------- TTTCG------------------------------- 11 || || ||||| S1 1 GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATAT 50 P2 12 ---------- 11 S1 51 AATTTTCATC 60 #--------------------------------------- #--------------------------------------- More below... On 10/05/2012 03:31 PM, Janet Young wrote: > Hi there, > > I've been using Biostrings' pairwiseAlignment for various things. I just saw the Views feature described in help("PairwiseAlignments- class") and thought I'd give it a try, but it's not behaving as I'd expect in cases where my subject sequence is one sequence selected from a DNAStringSet (rather than starting with a DNAString). > > I think I found a bug (?) although I could also be misunderstanding what Views is supposed to do. I updated to the new Bioc devel and new R today, but had the same issue before updating. > > I think the code chunk below will explain - does it make sense? > > thanks very much, > > Janet > > > > library(Biostrings) > > ## example seqs (taken from ?pairwiseAlignment) > s1 <- DNAString("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAAGGAAACGCAAAGTTTTCAAG") > s2 <- DNAString("GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATA ATTTTCATC") > > ### make some sequence sets > s1a <- DNAStringSet(c(as.character(s1),as.character(s1))) > s2a <- DNAStringSet(c(as.character(s2),as.character(s2))) > > > ### align > myAln <- pairwiseAlignment ( s1a[1], s2a[1]) > > > ### I think the next line of code should give me a Views on s1a[1], but instead it gives me Views on all the seqs of s1a concatenated together > Views(myAln) > > ##### and, an aside, it would seem intuitive to me that the code below should work to make a stringset of two sequences, but instead it concetenates them to one sequence > s1b <- DNAStringSet(c(s1,s1)) Combining (with c()) 2 vectors of n1 and n2 elements, respectively, is expected to return a vector of n1 + n2 elements (where the elements of the 2nd vector have been placed after the elements of the 1st vector), and of the same class as the original vectors. In a DNAString object, the elements are the letters and the length of the object is the number of letters: > x <- DNAString("AAAAACTTA") > x 9-letter "DNAString" instance seq: AAAAACTTA > length(x) [1] 9 > x[6:7] 2-letter "DNAString" instance seq: CT So doing c() on DNAString objects does: > c(x, DNAString("NN")) 11-letter "DNAString" instance seq: AAAAACTTANN In a DNAStringSet object, the elements are the sequences (represented as DNAString objects) and the length of the object is the number of sequences: > dna <- DNAStringSet("AAAAACTTA") > dna A DNAStringSet instance of length 1 width seq [1] 9 AAAAACTTA > length(dna) [1] 1 So doing c() on DNAStringSet objects does: > dna2 <- c(dna, dna) > dna2 A DNAStringSet instance of length 2 width seq [1] 9 AAAAACTTA [2] 9 AAAAACTTA > length(dna2) [1] 2 > dna2[[1]] 9-letter "DNAString" instance seq: AAAAACTTA > dna2[[2]] 9-letter "DNAString" instance seq: AAAAACTTA > dna2[[3]] Error in checkAndTranslateDbleBracketSubscript(x, i) : subscript out of bounds So if you had made 's1' a DNAStringSet instead of a DNAString, you would have gotten what you wanted: s1 <- DNAStringSet("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAAGGAAACGCAAAGTTTTCAAG") s1b <- c(s1,s1) For those reasons a DNAStringSet object should be seen as the analog of an ordinary character vector. But not DNAString. There is actually no good analogy for DNAString in R ordinary objects. The closest to it is maybe a raw vector: a DNAString object can also be seen as a vector of bytes. HTH, H. > > ######### > sessionInfo() > > R Under development (unstable) (2012-10-03 r60868) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] Biostrings_2.27.1 IRanges_1.17.0 BiocGenerics_0.5.0 > > loaded via a namespace (and not attached): > [1] parallel_2.16.0 stats4_2.16.0 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 11.5 years ago Hervé Pagès 16k

Login before adding your answer.