matchTable for pairwiseAlignment in Biostrings?
1
0
Entering edit mode
Brian Herb ▴ 80
@brian-herb-3002
Last seen 10.2 years ago
Hi all, I feel like I'm missing some obvious function, but is there a simple way to output a table of matching start and end positions between the subject and pattern in a pairwiseAlignment result (say from a global-local alignment)? For example: s.pos1 s.pos2 p.pos1 p.pos2 [1,] 354 359 1 6 [2,] 360 364 9 19 where s.pos1 and s.pos2 are the start and end of a section of the subject that corresponds to the start and end (p.pos1 and p.pos2) of the pattern, taking into account insertions and deletions. I know the insertion and deletion information is easily available, but I haven't found a function in the literature that that summarizes them into a mummer like output of corresponding start and end positions. -- Brian Herb, Ph.D. Johns Hopkins School of Medicine Dr. Andrew Feinberg Laboratory Rangos 580 855 N. Wolfe St. Baltimore, MD 21205 Phone:410-614-3478 Fax: 410-614-9819 [[alternative HTML version deleted]]
• 1.0k views
ADD COMMENT
0
Entering edit mode
@valerie-obenchain-4275
Last seen 2.9 years ago
United States
Hi Brian, I don't know of a function that creates such a table but I think you can create this with the start() and end() accessors. pat <- DNAStringSet(c("CACCAGCT", "TTCGCC", "CGGTC")) sub <- DNAString("ACTTCACCAGCTGACCTCG") res <- pairwiseAlignment(pat, sub) > res Global PairwiseAlignmentsSingleSubject (1 of 3) pattern: [1] CACCAGCT subject: [5] CACCAGCT score: -48.14596 DataFrame(subjectStart=start(subject(res)), subjectEnd=end(subject(res)), patternStart=start(pattern(res)), patternEnd=end(pattern(res))) DataFrame with 3 rows and 4 columns subjectStart subjectEnd patternStart patternEnd <integer> <integer> <integer> <integer> 1 5 12 1 8 2 3 8 1 6 3 1 5 1 5 Valerie On 08/21/2013 08:40 AM, Brian Herb wrote: > Hi all, > > I feel like I'm missing some obvious function, but is there a simple way to > output a table of matching start and end positions between the subject and > pattern in a pairwiseAlignment result (say from a global-local alignment)? > For example: > > s.pos1 s.pos2 p.pos1 p.pos2 > [1,] 354 359 1 6 > [2,] 360 364 9 19 > > where s.pos1 and s.pos2 are the start and end of a section of the subject > that corresponds to the start and end (p.pos1 and p.pos2) of the pattern, > taking into account insertions and deletions. I know the insertion and > deletion information is easily available, but I haven't found a function in > the literature that that summarizes them into a mummer like output of > corresponding start and end positions. >
ADD COMMENT
0
Entering edit mode
Hi Valerie, Thank you for looking into this! I actually didn't see my post show up so I did a little more tinkering and came up with the following function, see what you think: (input is PairwiseAlignmentsSingleSubject) matchTable <- function(pwa) { p.start <- start(pattern(pwa)) p.end <- end(pattern(pwa)) s.start <- start(subject(pwa)) s.end <- end(subject(pwa)) inserts <- as.matrix(insertion(pwa)[[1]]) dels <- as.matrix(deletion(pwa)[[1]]) s.pos1 <- c(s.start,(s.start+inserts[,1]-1)) p.pos1 <- c(p.start,(p.start + cumsum(inserts[,2])+cumsum(diff(s.pos1)))) s.pos2=c(s.pos1[-1]-1,s.end) p.pos2=c((head(p.pos1,n=(length(p.pos1)-1))+diff(s.pos1)-1),p.end) for(i in 1:nrow(dels)){ tmp=dels[i,] p.pos1[which(p.pos1>tmp[1])]=p.pos1[which(p.pos1>tmp[1])]-tmp[2] p.pos2[which(p.pos2>tmp[1])]=p.pos2[which(p.pos2>tmp[1])]-tmp[2] } p.pos2[length(p.pos2)]=p.end tot=cbind(p.pos1,p.pos2,(p.pos2-p.pos1),s.pos1,s.pos2,(s.pos2-s.pos1)) colnames(tot)=c("PatStart","PatEnd","PatDist","SubStart","SubEnd","Sub Dist") tot } example: > tmp1="GTATCGATACGTCGATCGTCGATGCAGCAGCAGACTGATGCATGACTGC" > tmp2="CACTCGATCGATGCTCTACATGATCGTGCAGCAACTGATGCATGACTGC" > test=pairwiseAlignment(pattern=tmp1,subject=tmp2,type="global- local") > matchTable(test) PatStart PatEnd PatDist SubStart SubEnd SubDist [1,] 1 21 20 5 26 21 [2,] 23 29 6 27 33 6 [3,] 34 49 15 34 49 15 For my work I am aligning longer sequences (~1kb) and I care more about where chunks of these sequences align rather than exactly which bases match within these chunks. Since I'm comparing sequences from different species, I'm trying to find where sequence from one species matches to the other, allowing for insertions and deletions. The function could probably be written better, but that is the hazard of letting biologists learn R. Thanks again for your help. writePairwiseAlignment output is attached On Thu, Aug 22, 2013 at 4:38 PM, Valerie Obenchain <vobencha at="" fhcrc.org="">wrote: > Hi Brian, > > I don't know of a function that creates such a table but I think you can > create this with the start() and end() accessors. > > pat <- DNAStringSet(c("CACCAGCT", "TTCGCC", "CGGTC")) > sub <- DNAString("**ACTTCACCAGCTGACCTCG") > res <- pairwiseAlignment(pat, sub) > > > res > Global PairwiseAlignmentsSingleSubjec**t (1 of 3) > pattern: [1] CACCAGCT > subject: [5] CACCAGCT > score: -48.14596 > > DataFrame(subjectStart=start(**subject(res)), > subjectEnd=end(subject(res)), > patternStart=start(pattern(**res)), > patternEnd=end(pattern(res))) > > DataFrame with 3 rows and 4 columns > subjectStart subjectEnd patternStart patternEnd > <integer> <integer> <integer> <integer> > 1 5 12 1 8 > 2 3 8 1 6 > 3 1 5 1 5 > > > Valerie > > > On 08/21/2013 08:40 AM, Brian Herb wrote: > >> Hi all, >> >> I feel like I'm missing some obvious function, but is there a simple way >> to >> output a table of matching start and end positions between the subject and >> pattern in a pairwiseAlignment result (say from a global-local alignment)? >> For example: >> >> s.pos1 s.pos2 p.pos1 p.pos2 >> [1,] 354 359 1 6 >> [2,] 360 364 9 19 >> >> where s.pos1 and s.pos2 are the start and end of a section of the subject >> that corresponds to the start and end (p.pos1 and p.pos2) of the pattern, >> taking into account insertions and deletions. I know the insertion and >> deletion information is easily available, but I haven't found a function >> in >> the literature that that summarizes them into a mummer like output of >> corresponding start and end positions. >> >> > > > -- Brian Herb, Ph.D. Johns Hopkins School of Medicine Dr. Andrew Feinberg Laboratory Rangos 580 855 N. Wolfe St. Baltimore, MD 21205 Phone:410-614-3478 Fax: 410-614-9819 -------------- next part -------------- ######################################## # Program: Biostrings (version 2.28.0), a Bioconductor package # Rundate: Thu Aug 22 16:49:37 2013 ######################################## #======================================= # # Aligned_sequences: 2 # 1: P1 # 2: S1 # Matrix: NA # Gap_penalty: 14.0 # Extend_penalty: 4.0 # # Length: 50 # Identity: 36/50 (72.0%) # Similarity: NA/50 (NA%) # Gaps: 6/50 (12.0%) # Score: -29.85107 # # #======================================= P1 1 GTATCGATACG- TCGATCGTCGATGCAGCAGCAGACTGATGCATGACTGC 49 |||||| | | || ||| ||||||| |||||||||||||||| S1 5 CGATCGATGCTCTACATGATCG-TGCAGCA---- ACTGATGCATGACTGC 49 #--------------------------------------- #---------------------------------------
ADD REPLY
0
Entering edit mode
This response from Herve didn't get posted - email troubles while traveling. Valerie -------- Original Message -------- Subject: Re: [BioC] matchTable for pairwiseAlignment in Biostrings? Date: Thu, 22 Aug 2013 23:58:51 -0700 From: Hervé Pagès <hpages@fhcrc.org> To: Brian Herb <brianherb at="" jhmi.edu=""> CC: Valerie Obenchain <vobencha at="" fhcrc.org="">, bioconductor at r-project.org Hi Brian, Global-local alignments can also be described in terms of position and cigar (POS and CIGAR fields in the SAM/BAM formats). In your example: POS = 5 CIGAR = 11M1D10M1I7M4I16M (generated by hand, sorry) The advantage of the POS + CIGAR representation is that one can make use of the cigar utilities in GenomicRanges to implement something equivalent to your matchTable() function: library(GenomicRanges) matchTable2 <- function(pos, cigar) { pat_ranges <- cigarRangesAlongQuerySpace(cigar, with.ops=TRUE)[[1]] sub_ranges <- cigarRangesAlongReferenceSpace(cigar, pos=pos, with.ops=TRUE)[[1]] ops <- names(pat_ranges) # guaranteed to be the same as names(sub_ranges) I_idx <- which(ops == "I") start_idx <- c(1L, I_idx + 1L) end_idx <- c(I_idx - 1L, length(pat_ranges)) PatStart <- start(pat_ranges)[start_idx] PatEnd <- end(pat_ranges)[end_idx] SubStart <- start(sub_ranges)[start_idx] SubEnd <- end(sub_ranges)[end_idx] data.frame(PatStart, PatEnd, SubStart, SubEnd) } Then: > matchTable2(5, "11M1D10M1I7M4I16M") PatStart PatEnd SubStart SubEnd 1 1 21 5 26 2 23 29 27 33 3 34 49 34 49 So it sounds like it would maybe be helpful if we had a cigar() getter that computes the cigars from a PairwiseAlignments object of type "global-local". Once we have this, we could even support exporting this kind of objects to the SAM/BAM format. Cheers, H. On 08/22/2013 02:01 PM, Brian Herb wrote: > Hi Valerie, > > Thank you for looking into this! I actually didn't see my post show up > so I did a little more tinkering and came up with the following > function, see what you think: > > (input is PairwiseAlignmentsSingleSubject) > > matchTable <- function(pwa) { > p.start <- start(pattern(pwa)) > p.end <- end(pattern(pwa)) > s.start <- start(subject(pwa)) > s.end <- end(subject(pwa)) > inserts <- as.matrix(insertion(pwa)[[1]]) > dels <- as.matrix(deletion(pwa)[[1]]) > s.pos1 <- c(s.start,(s.start+inserts[,1]-1)) > p.pos1 <- c(p.start,(p.start + cumsum(inserts[,2])+cumsum(diff(s.pos1)))) > s.pos2=c(s.pos1[-1]-1,s.end) > p.pos2=c((head(p.pos1,n=(length(p.pos1)-1))+diff(s.pos1)-1),p.end) > > for(i in 1:nrow(dels)){ > tmp=dels[i,] > p.pos1[which(p.pos1>tmp[1])]=p.pos1[which(p.pos1>tmp[1])]-tmp[2] > p.pos2[which(p.pos2>tmp[1])]=p.pos2[which(p.pos2>tmp[1])]-tmp[2] > } > > p.pos2[length(p.pos2)]=p.end > > tot=cbind(p.pos1,p.pos2,(p.pos2-p.pos1),s.pos1,s.pos2,(s.pos2-s.pos1)) > colnames(tot)=c("PatStart","PatEnd","PatDist","SubStart","SubEnd","S ubDist") > tot > } > > example: > > > tmp1="GTATCGATACGTCGATCGTCGATGCAGCAGCAGACTGATGCATGACTGC" > > > tmp2="CACTCGATCGATGCTCTACATGATCGTGCAGCAACTGATGCATGACTGC" > > > test=pairwiseAlignment(pattern=tmp1,subject=tmp2,type="global- local") > > > matchTable(test) > PatStart PatEnd PatDist SubStart SubEnd SubDist > [1,] 1 21 20 5 26 21 > [2,] 23 29 6 27 33 6 > [3,] 34 49 15 34 49 15 > For my work I am aligning longer sequences (~1kb) and I care more about > where chunks of these sequences align rather than exactly which bases > match within these chunks. Since I'm comparing sequences from different > species, I'm trying to find where sequence from one species matches to > the other, allowing for insertions and deletions. The function could > probably be written better, but that is the hazard of letting biologists > learn R. Thanks again for your help. > > writePairwiseAlignment output is attached > > > > > On Thu, Aug 22, 2013 at 4:38 PM, Valerie Obenchain <vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org="">> wrote: > > Hi Brian, > > I don't know of a function that creates such a table but I think you > can create this with the start() and end() accessors. > > pat <- DNAStringSet(c("CACCAGCT", "TTCGCC", "CGGTC")) > sub <- DNAString("__ACTTCACCAGCTGACCTCG") > res <- pairwiseAlignment(pat, sub) > > > res > Global PairwiseAlignmentsSingleSubjec__t (1 of 3) > pattern: [1] CACCAGCT > subject: [5] CACCAGCT > score: -48.14596 > > DataFrame(subjectStart=start(__subject(res)), > subjectEnd=end(subject(res)), > patternStart=start(pattern(__res)), > patternEnd=end(pattern(res))) > > DataFrame with 3 rows and 4 columns > subjectStart subjectEnd patternStart patternEnd > <integer> <integer> <integer> <integer> > 1 5 12 1 8 > 2 3 8 1 6 > 3 1 5 1 5 > > > Valerie > > > On 08/21/2013 08:40 AM, Brian Herb wrote: > > Hi all, > > I feel like I'm missing some obvious function, but is there a > simple way to > output a table of matching start and end positions between the > subject and > pattern in a pairwiseAlignment result (say from a global- local > alignment)? > For example: > > s.pos1 s.pos2 p.pos1 p.pos2 > [1,] 354 359 1 6 > [2,] 360 364 9 19 > > where s.pos1 and s.pos2 are the start and end of a section of > the subject > that corresponds to the start and end (p.pos1 and p.pos2) of > the pattern, > taking into account insertions and deletions. I know the > insertion and > deletion information is easily available, but I haven't found a > function in > the literature that that summarizes them into a mummer like > output of > corresponding start and end positions. > > > > > > > > -- > Brian Herb, Ph.D. > Johns Hopkins School of Medicine > Dr. Andrew Feinberg Laboratory > Rangos 580 > 855 N. Wolfe St. > Baltimore, MD 21205 > Phone:410-614-3478 > Fax: 410-614-9819
ADD REPLY

Login before adding your answer.

Traffic: 490 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6