Parse sequences so they align properly
1
0
Entering edit mode
Guest User ★ 12k
@guest-user-4897
Last seen 8.4 years ago
Dear listserv, I have genetic sequence data in the following form: ref.sequence <- "ATAGCCGCA" sequence1 <- "AT[G][C][C]AGCCG[T]CA" sequence2 <- "ATAGCCGC[C][A][C]A" sequence3 <- "AT[GCC]AGCCGCA" The brackets indicate nucleotide insertions relative to the reference sequence ("ref.sequence"). Some sequences may have some/all of the insertions, some may not. What I want is for all of the loci to "align" properly. Therefore, the sequences lacking a particular insertion should get scored with a dash (or dashes) at that locus. I want to end up with this: ref.sequence should look like this: "AT---AGCCG-C---A" sequence1 should look like this: "AT[G][C][C]AGCCG[T]C---A" sequence2 should look like this: "AT---AGCCG-C[C][A][C]A" sequence3 should look like this: "AT[G][C][C]AGCCG-C---A" So how can I make this happen efficiently? This may require only "regular" R commands, and nothing specific to Bioconductor. I do not know. In any case, you are the folks that would know how to do it. If you could help me out with the syntax to get some working code I would be very appreciative! Thanks very much in advance, ----------------------------------- Josh Banta, Ph.D Assistant Professor Department of Biology The University of Texas at Tyler Tyler, TX 75799 Tel: (903) 565-5655 http://plantevolutionaryecology.org -- output of sessionInfo(): > ref.sequence <- "ATAGCCGCA" > sequence1 <- "AT[G][C][C]AGCCG[T]CA" > sequence2 <- "ATAGCCGC[C][A][C]A" > sequence3 <- "AT[GCC]AGCCGCA" >#now what? -- Sent via the guest posting facility at bioconductor.org.
• 648 views
0
Entering edit mode
@herve-pages-1542
Last seen 2 hours ago
Seattle, WA, United States
Hi Josh, On 01/03/2014 01:46 PM, Josh Banta [guest] wrote: > > Dear listserv, > > I have genetic sequence data in the following form: > > ref.sequence <- "ATAGCCGCA" > sequence1 <- "AT[G][C][C]AGCCG[T]CA" > sequence2 <- "ATAGCCGC[C][A][C]A" > sequence3 <- "AT[GCC]AGCCGCA" > > The brackets indicate nucleotide insertions relative to the reference sequence ("ref.sequence"). Some sequences may have some/all of the insertions, some may not. The fact that you represent the same 3-letter insertion in 2 different ways (e.g. [G][C][C] in sequence1 and [GCC] in sequence3) makes things a little bit more complicated than they need to be. So I'm going to assume that brackets are around all *individual* inserted letters, that is: ref.sequence <- "ATAGCCGCA" sequences <- c("AT[G][C][C]AGCCG[T]CA", "ATAGCCGC[C][A][C]A", "AT[G][C][C]AGCCGCA") > > What I want is for all of the loci to "align" properly. Therefore, the sequences lacking a particular insertion should get scored with a dash (or dashes) at that locus. > > I want to end up with this: > > ref.sequence should look like this: "AT---AGCCG-C---A" > sequence1 should look like this: "AT[G][C][C]AGCCG[T]C---A" > sequence2 should look like this: "AT---AGCCG-C[C][A][C]A" > sequence3 should look like this: "AT[G][C][C]AGCCG-C---A" > > So how can I make this happen efficiently? This may require only "regular" R commands, and nothing specific to Bioconductor. I do not know. In any case, you are the folks that would know how to do it. If you could help me out with the syntax to get some working code I would be very appreciative! Why do you want to keep the brackets in the "stretched" versions of the sequences? They're not needed. It makes the alignments really hard to see and a lot of downstream tools are probably going to choke on those brackets. It would be better to just tend up with: stretched.ref.sequence <- "AT---AGCCG-C---A" stretched.sequences <- c("ATGCCAGCCGTC---A", "AT---AGCCG-CCACA", "ATGCCAGCCG-C---A") Note that now all the stretched sequences have the same length. So we need 2 functions, stretchRefSequence() and stretchSequences(), that will perform this stretching: > stretchRefSequence(ref.sequence, sequences) [1] "AT---AGCCG-C---A" > stretchSequences(ref.sequence, sequences) [1] "ATGCCAGCCGTC---A" "AT---AGCCG-CCACA" "ATGCCAGCCG-C---A" Here is how they can be implemented (only lightly tested): library(IRanges) # for the isSingleString() and # IRanges:::fancy_mseq() functions .computeInsSizes <- function(sequence) { tmp <- strsplit(gsub("\$.\$", "-", sequence), "", fixed=TRUE)[[1L]] head(runLength(Rle(cumsum(tmp != "-"))) - 1L, n=-1L) } computeInsSizesInRef <- function(ref.sequence, sequences) { if (!isSingleString(ref.sequence)) stop("'ref.sequence' must be a single string") if (!is.character(sequences)) stop("'sequences' must be a character vector") ans <- integer(nchar(ref.sequence) - 1L) for (i in seq_along(sequences)) { seq <- sequences[i] if (gsub("\$.\$", "", seq) != ref.sequence) stop("'sequences[", i, "]' is not compatible with 'ref.sequence'") ins_sizes <- .computeInsSizes(seq) ans <- pmax(ans, ins_sizes) } ans } .insertGapsAt <- function(x, at, gap_sizes) { ans_len <- nchar(x) + sum(gap_sizes) ans <- character(ans_len) ans[] <- "-" offset <- at + c(0L, cumsum(head(gap_sizes, n=-1L))) idx <- IRanges:::fancy_mseq(gap_sizes, offset=offset) ans[-idx] <- strsplit(x, "", fixed=TRUE)[[1L]] paste(ans, collapse="") } stretchRefSequence <- function(ref.sequence, sequences) { ins_sizes_in_ref <- computeInsSizesInRef(ref.sequence, sequences) ins_at <- seq_len(nchar(ref.sequence) - 1L) .insertGapsAt(ref.sequence, ins_at, ins_sizes_in_ref) } stretchSequences <- function(ref.sequence, sequences) { ins_sizes_in_ref <- computeInsSizesInRef(ref.sequence, sequences) sapply(sequences, function(seq) { ins_sizes <- .computeInsSizes(seq) naked_seq <- gsub("[\$\$]", "", seq, perl=TRUE) gap_sizes <- ins_sizes_in_ref - ins_sizes ins_at <- seq_along(ins_sizes_in_ref) + cumsum(ins_sizes) .insertGapsAt(naked_seq, ins_at, gap_sizes) }, USE.NAMES=FALSE) } Cheers, H. > > Thanks very much in advance, > ----------------------------------- > Josh Banta, Ph.D > Assistant Professor > Department of Biology > The University of Texas at Tyler > Tyler, TX 75799 > Tel: (903) 565-5655 > http://plantevolutionaryecology.org > > -- output of sessionInfo(): > >> ref.sequence <- "ATAGCCGCA" >> sequence1 <- "AT[G][C][C]AGCCG[T]CA" >> sequence2 <- "ATAGCCGC[C][A][C]A" >> sequence3 <- "AT[GCC]AGCCGCA" >> #now what? > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
0
Entering edit mode
Dear Hervé, Thank you very much for your response. Your script has a line of code that reads: if (gsub("\$.\$", "", seq) != ref.sequence) stop("'sequences[", i, "]' is not compatible with 'ref.sequence'") I take it this means that your functions do not allow for polymorphisms and other variations. For instance: ref.sequence <- "GCAAATT" sequences <- read.table(textConnection("name sequence sequence1 GC-AATT sequence2 GCAAAT[T]T sequence3 GCAAAGT"), header = TRUE, as.is<http: as.is=""> = TRUE) closeAllConnections() Sequence1 has a deletion (relative to the reference sequence) at position 3, sequence 2 has an insertion (relative to the reference sequence) between positions 5 and 6, and sequence3 has a G instead of a T (relative to the reference sequence) at position 6. (I am numbering the positions from left to right, using the reference sequence to assign the position numbers.) What I want to get is this: reference.sequence: "GCAAAT-T sequence1: GC-AAT-T sequence2: GCAAATTT sequence3: GCAAAG-T When I apply your functions to this new data, I get the following errors: > stretchRefSequence(ref.sequence, as.character(sequences[,2])) Error in computeInsSizesInRef(ref.sequence, sequences) : 'sequences[1]' is not compatible with 'ref.sequence' > stretchSequences(ref.sequence, as.character(sequences[,2])) Error in computeInsSizesInRef(ref.sequence, sequences) : 'sequences[1]' is not compatible with 'ref.sequence' I tried deleting the following lines of code from your script: if (gsub("\$.\$", "", seq) != ref.sequence) stop("'sequences[", i, "]' is not compatible with 'ref.sequence'") Reapplying the functions, I get: > stretchRefSequence(ref.sequence, as.character(sequences[,2])) [1] "GC-AAA-TT" > #not correct > stretchSequences(ref.sequence, as.character(sequences[,2])) [1] "GC-AAT-T" "GC-AAATTT" "GC-AAA-GT" Warning messages: 1: In .Method(..., na.rm = na.rm) : an argument will be fractionally recycled 2: In ins_sizes_in_ref - ins_sizes : longer object length is not a multiple of shorter object length 3: In seq_along(ins_sizes_in_ref) + cumsum(ins_sizes) : longer object length is not a multiple of shorter object length > #also not correct Is there a way to modify your codes to make my new toy data give me the answers I'm looking for? Again, I really appreciate your help, because this parsing syntax is way over my head at the moment. ----------------------------------- Josh Banta, Ph.D Assistant Professor Department of Biology The University of Texas at Tyler Tyler, TX 75799 Tel: (903) 565-5655 http://plantevolutionaryecology.org [[alternative HTML version deleted]]
0
Entering edit mode
On 01/04/2014 02:05 PM, Joshua Banta wrote: > Dear Herv?, > > Thank you very much for your response. > > Your script has a line of code that reads: > > if (gsub("\$.\$", "", seq) != ref.sequence) > stop("'sequences[", i, "]' is not compatible with 'ref.sequence'") > > I take it this means that your functions do not allow for polymorphisms and other variations. For instance: > > ref.sequence <- "GCAAATT" > sequences <- read.table(textConnection("name sequence > sequence1 GC-AATT > sequence2 GCAAAT[T]T > sequence3 GCAAAGT"), header = TRUE, as.is<http: as.is=""> = TRUE) > closeAllConnections() > > Sequence1 has a deletion (relative to the reference sequence) at position 3, sequence 2 has an insertion (relative to the reference sequence) between positions 5 and 6, and sequence3 has a G instead of a T (relative to the reference sequence) at position 6. (I am numbering the positions from left to right, using the reference sequence to assign the position numbers.) > > What I want to get is this: > > reference.sequence: "GCAAAT-T > sequence1: GC-AAT-T > sequence2: GCAAATTT > sequence3: GCAAAG-T I took a different approach, not super efficient. Here's the original data, modified so that [GCC] is replaced by [G][C][C], with a single vector of 'sequences', and the ref.sequence included as the first (I don't think it's handled specially, other than not having any insertions). ref.sequence <- "ATAGCCGCA" sequences <- c(ref.sequence, "AT[G][C][C]AGCCG[T]CA", "ATAGCCGC[C][A][C]A", "AT[G][C][C]AGCCGCA") I create two copies of the sequences, one with the gap characters replaced by '+' and another with the raw sequence (no [] marks) gapped <- gsub("\$.\$", "+", sequences) raw <- gsub("(\$|\$)", "", sequences) I'm going to look at each letter in each 'gapped' sequence and decide whether it's a gap or not, and build up 'result' strings that are made of either an induced gap or the actual character at that position. Here's the index into the character positions of the gapped string, the result strings that I'm going to build up, and a stopping criterion idx <- rep(1, length(gapped)) result <- character(length(gapped)) maxit <- nchar(gapped) Now I'll look at each character of each string. When there are insertions, I'll either add '-' or the inserted ('raw') character to the result string, and increment the index into the string(s) with the insertion. repeat { if (all(idx > maxit)) break ins <- substr(gapped, idx, idx) == "+" char <- substr(raw, idx, idx) if (any(ins)) { char <- ifelse(ins, char, "-") idx <- ifelse(ins, idx + 1, idx) } else idx <- idx + 1 result <- sprintf("%s%s", result, char) } and then print the result out cat(paste(result, collapse="\n"), "\n") ## AT---AGCCG-C---A ## ATGCCAGCCGTC---A ## AT---AGCCG-CCACA ## ATGCCAGCCG-C---A or for your second example cat(paste(result, collapse="\n"), "\n") ## GCAAAT-T ## GC-AAT-T ## GCAAATTT ## GCAAAG-T > > When I apply your functions to this new data, I get the following errors: > >> stretchRefSequence(ref.sequence, as.character(sequences[,2])) > Error in computeInsSizesInRef(ref.sequence, sequences) : > 'sequences[1]' is not compatible with 'ref.sequence' > >> stretchSequences(ref.sequence, as.character(sequences[,2])) > Error in computeInsSizesInRef(ref.sequence, sequences) : > 'sequences[1]' is not compatible with 'ref.sequence' > > I tried deleting the following lines of code from your script: > > if (gsub("\$.\$", "", seq) != ref.sequence) > stop("'sequences[", i, "]' is not compatible with 'ref.sequence'") > > Reapplying the functions, I get: > >> stretchRefSequence(ref.sequence, as.character(sequences[,2])) > [1] "GC-AAA-TT" >> #not correct > >> stretchSequences(ref.sequence, as.character(sequences[,2])) > [1] "GC-AAT-T" "GC-AAATTT" "GC-AAA-GT" > Warning messages: > 1: In .Method(..., na.rm = na.rm) : > an argument will be fractionally recycled > 2: In ins_sizes_in_ref - ins_sizes : > longer object length is not a multiple of shorter object length > 3: In seq_along(ins_sizes_in_ref) + cumsum(ins_sizes) : > longer object length is not a multiple of shorter object length >> #also not correct > > Is there a way to modify your codes to make my new toy data give me the answers I'm looking for? Again, I really appreciate your help, because this parsing syntax is way over my head at the moment. > > ----------------------------------- > Josh Banta, Ph.D > Assistant Professor > Department of Biology > The University of Texas at Tyler > Tyler, TX 75799 > Tel: (903) 565-5655 > http://plantevolutionaryecology.org > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
0
Entering edit mode
Hi Martin, On 01/04/2014 03:14 PM, Martin Morgan wrote: > On 01/04/2014 02:05 PM, Joshua Banta wrote: >> Dear Herv?, >> >> Thank you very much for your response. >> >> Your script has a line of code that reads: >> >> if (gsub("\$.\$", "", seq) != ref.sequence) >> stop("'sequences[", i, "]' is not compatible with 'ref.sequence'") >> >> I take it this means that your functions do not allow for >> polymorphisms and other variations. For instance: >> >> ref.sequence <- "GCAAATT" >> sequences <- read.table(textConnection("name sequence >> sequence1 GC-AATT >> sequence2 GCAAAT[T]T >> sequence3 GCAAAGT"), header = TRUE, as.is<http: as.is=""> = TRUE) >> closeAllConnections() >> >> Sequence1 has a deletion (relative to the reference sequence) at >> position 3, sequence 2 has an insertion (relative to the reference >> sequence) between positions 5 and 6, and sequence3 has a G instead of >> a T (relative to the reference sequence) at position 6. (I am >> numbering the positions from left to right, using the reference >> sequence to assign the position numbers.) >> >> What I want to get is this: >> >> reference.sequence: "GCAAAT-T >> sequence1: GC-AAT-T >> sequence2: GCAAATTT >> sequence3: GCAAAG-T > > I took a different approach, not super efficient. Here's the original > data, modified so that [GCC] is replaced by [G][C][C], with a single > vector of 'sequences', and the ref.sequence included as the first (I > don't think it's handled specially, other than not having any insertions). > > ref.sequence <- "ATAGCCGCA" > sequences <- c(ref.sequence, > "AT[G][C][C]AGCCG[T]CA", > "ATAGCCGC[C][A][C]A", > "AT[G][C][C]AGCCGCA") > > I create two copies of the sequences, one with the gap characters > replaced by '+' and another with the raw sequence (no [] marks) > > gapped <- gsub("\$.\$", "+", sequences) > raw <- gsub("(\$|\$)", "", sequences) > > I'm going to look at each letter in each 'gapped' sequence and decide > whether it's a gap or not, and build up 'result' strings that are made > of either an induced gap or the actual character at that position. > Here's the index into the character positions of the gapped string, the > result strings that I'm going to build up, and a stopping criterion > > idx <- rep(1, length(gapped)) > result <- character(length(gapped)) > maxit <- nchar(gapped) > > Now I'll look at each character of each string. When there are > insertions, I'll either add '-' or the inserted ('raw') character to the > result string, and increment the index into the string(s) with the > insertion. > > repeat { > if (all(idx > maxit)) > break > ins <- substr(gapped, idx, idx) == "+" > char <- substr(raw, idx, idx) > if (any(ins)) { > char <- ifelse(ins, char, "-") > idx <- ifelse(ins, idx + 1, idx) > } else > idx <- idx + 1 > result <- sprintf("%s%s", result, char) > } > > and then print the result out > > cat(paste(result, collapse="\n"), "\n") > ## AT---AGCCG-C---A > ## ATGCCAGCCGTC---A > ## AT---AGCCG-CCACA > ## ATGCCAGCCG-C---A > > or for your second example > > cat(paste(result, collapse="\n"), "\n") > ## GCAAAT-T > ## GC-AAT-T > ## GCAAATTT > ## GCAAAG-T Nice. Much simpler than my solution. It's interesting to note that the main difference between your solution and mine is that your main loop (your repeat statement) is on the "columns" of the dataset while my main loop (my for and sapply statements, because I actually make 2 passes) is on the "rows" of the dataset (i.e. on the individual sequences). As a consequence, your solution is faster than mine when the dataset is made of a big number (typically thousands) of short sequences (typically < 1000 chars) while mine is faster than yours when the dataset is made of a small number (e.g. only a few hundreds) of long sequences (typically > 10000 chars). So an optimized pure R solution would require to integrate the 2 implementations and to add an internal switch to one or the other depending on the shape of the dataset. Probably ugly enough to justify that one of the 2 implementations is moved to the C level and used in all situations (it'll be fast whatever the shape of the dataset is). H. > >> >> When I apply your functions to this new data, I get the following errors: >> >>> stretchRefSequence(ref.sequence, as.character(sequences[,2])) >> Error in computeInsSizesInRef(ref.sequence, sequences) : >> 'sequences[1]' is not compatible with 'ref.sequence' >> >>> stretchSequences(ref.sequence, as.character(sequences[,2])) >> Error in computeInsSizesInRef(ref.sequence, sequences) : >> 'sequences[1]' is not compatible with 'ref.sequence' >> >> I tried deleting the following lines of code from your script: >> >> if (gsub("\$.\$", "", seq) != ref.sequence) >> stop("'sequences[", i, "]' is not compatible with 'ref.sequence'") >> >> Reapplying the functions, I get: >> >>> stretchRefSequence(ref.sequence, as.character(sequences[,2])) >> [1] "GC-AAA-TT" >>> #not correct >> >>> stretchSequences(ref.sequence, as.character(sequences[,2])) >> [1] "GC-AAT-T" "GC-AAATTT" "GC-AAA-GT" >> Warning messages: >> 1: In .Method(..., na.rm = na.rm) : >> an argument will be fractionally recycled >> 2: In ins_sizes_in_ref - ins_sizes : >> longer object length is not a multiple of shorter object length >> 3: In seq_along(ins_sizes_in_ref) + cumsum(ins_sizes) : >> longer object length is not a multiple of shorter object length >>> #also not correct >> >> Is there a way to modify your codes to make my new toy data give me >> the answers I'm looking for? Again, I really appreciate your help, >> because this parsing syntax is way over my head at the moment. >> >> ----------------------------------- >> Josh Banta, Ph.D >> Assistant Professor >> Department of Biology >> The University of Texas at Tyler >> Tyler, TX 75799 >> Tel: (903) 565-5655 >> http://plantevolutionaryecology.org >> >> >> [[alternative HTML version deleted]] >> >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
0
Entering edit mode
Hi Joshua, On 01/04/2014 02:05 PM, Joshua Banta wrote: > Dear Herv?, > > Thank you very much for your response. > > Your script has a line of code that reads: > > if (gsub("\$.\$", "", seq) != ref.sequence) > stop("'sequences[", i, "]' is not compatible with 'ref.sequence'") > > I take it this means that your functions do not allow for polymorphisms > and other variations. For instance: > > ref.sequence <- "GCAAATT" > sequences <- read.table(textConnection("name sequence > sequence1 GC-AATT > sequence2 GCAAAT[T]T > sequence3 GCAAAGT"), header = TRUE, as.is <http: as.is=""> = TRUE) > closeAllConnections() > > Sequence1 has a deletion (relative to the reference sequence) at > position 3, sequence 2 has an insertion (relative to the reference > sequence) between positions 5 and 6, and sequence3 has a G instead of a > T (relative to the reference sequence) at position 6. (I am numbering > the positions from left to right, using the reference sequence to assign > the position numbers.) > > What I want to get is this: > > reference.sequence: "GCAAAT-T > sequence1: GC-AAT-T > sequence2: GCAAATTT > sequence3: GCAAAG-T Based on the description you gave of your data, it seemed that your sequences only had insertions with respect to the reference sequence. So the code I provided only supported insertions. Below I modified the original code to support insertions, deletions, and replacements. I also fixed a bug in the .insertGaps() helper function when there is nothing to insert. library(IRanges) # for the Rle(), runLength(), isSingleString(), # and IRanges:::fancy_mseq() functions .isCompatibleWithRef <- function(sequence, ref.sequence) { seq0 <- gsub("\$.\$", "", sequence) nchar(seq0) == nchar(ref.sequence) } .computeInsSizes <- function(sequence) { tmp <- strsplit(gsub("\$.\$", "i", sequence), "", fixed=TRUE)[[1L]] head(runLength(Rle(cumsum(tmp != "i"))) - 1L, n=-1L) } computeInsSizesInRef <- function(ref.sequence, sequences) { if (!isSingleString(ref.sequence)) stop("'ref.sequence' must be a single string") if (!is.character(sequences)) stop("'sequences' must be a character vector") ans <- integer(nchar(ref.sequence) - 1L) for (i in seq_along(sequences)) { seq <- sequences[i] if (!.isCompatibleWithRef(seq, ref.sequence)) stop("'sequences[", i, "]' is not compatible with 'ref.sequence'") ins_sizes <- .computeInsSizes(seq) ans <- pmax(ans, ins_sizes) } ans } .insertGaps <- function(x, after, gap_sizes) { sum_gap_sizes <- sum(gap_sizes) if (sum_gap_sizes == 0L) return(x) ans_len <- nchar(x) + sum_gap_sizes ans <- character(ans_len) ans[] <- "-" offset <- after + c(0L, cumsum(head(gap_sizes, n=-1L))) idx <- IRanges:::fancy_mseq(gap_sizes, offset=offset) ans[-idx] <- strsplit(x, "", fixed=TRUE)[[1L]] paste(ans, collapse="") } stretchRefSequence <- function(ref.sequence, sequences) { ins_sizes_in_ref <- computeInsSizesInRef(ref.sequence, sequences) ins_after <- seq_len(nchar(ref.sequence) - 1L) .insertGaps(ref.sequence, ins_after, ins_sizes_in_ref) } stretchSequences <- function(ref.sequence, sequences) { ins_sizes_in_ref <- computeInsSizesInRef(ref.sequence, sequences) sapply(sequences, function(seq) { ins_sizes <- .computeInsSizes(seq) naked_seq <- gsub("[\$\$]", "", seq, perl=TRUE) gap_sizes <- ins_sizes_in_ref - ins_sizes ins_after <- seq_along(ins_sizes_in_ref) + cumsum(ins_sizes) .insertGaps(naked_seq, ins_after, gap_sizes) }, USE.NAMES=FALSE) } Cheers, H. > > When I apply your functions to this new data, I get the following errors: > > > stretchRefSequence(ref.sequence, as.character(sequences[,2])) > Error in computeInsSizesInRef(ref.sequence, sequences) : > 'sequences[1]' is not compatible with 'ref.sequence' > > > stretchSequences(ref.sequence, as.character(sequences[,2])) > Error in computeInsSizesInRef(ref.sequence, sequences) : > 'sequences[1]' is not compatible with 'ref.sequence' > > I tried deleting the following lines of code from your script: > > if (gsub("\$.\$", "", seq) != ref.sequence) > stop("'sequences[", i, "]' is not compatible with 'ref.sequence'") > > Reapplying the functions, I get: > > > stretchRefSequence(ref.sequence, as.character(sequences[,2])) > [1] "GC-AAA-TT" > > #not correct > > > stretchSequences(ref.sequence, as.character(sequences[,2])) > [1] "GC-AAT-T" "GC-AAATTT" "GC-AAA-GT" > Warning messages: > 1: In .Method(..., na.rm = na.rm) : > an argument will be fractionally recycled > 2: In ins_sizes_in_ref - ins_sizes : > longer object length is not a multiple of shorter object length > 3: In seq_along(ins_sizes_in_ref) + cumsum(ins_sizes) : > longer object length is not a multiple of shorter object length > > #also not correct > > Is there a way to modify your codes to make my new toy data give me the > answers I'm looking for? Again, I really appreciate your help, because > this parsing syntax is way over my head at the moment. > > ----------------------------------- > Josh Banta, Ph.D > Assistant Professor > Department of Biology > The University of Texas at Tyler > Tyler, TX 75799 > Tel: (903) 565-5655 > http://plantevolutionaryecology.org > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319