Distance btw sequences from different sets
1
0
Entering edit mode
@benilton-carvalho-1375
Last seen 4.7 years ago
Brazil/Campinas/UNICAMP
Hi, I was wondering what would be the most efficient strategy to get distances between sequences that belong to two different sets. So far, what I am doing is: library(Biostrings) set.seed(1) alph = DNA_ALPHABET[1:4] set1 = matrix(sample(alph, 30, rep=TRUE), nr=6) set1 = apply(set1, 1, paste, collapse='') set2 = matrix(sample(alph, 20, rep=TRUE), nr=4) set2 = apply(set2, 1, paste, collapse='') outer(set1, set2, function(x, y) apply(cbind(x, y), 1, stringDist)) Thanks a lot for any suggestion, benilton [[alternative HTML version deleted]]
• 1.2k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 10 hours ago
Seattle, WA, United States
Hi Benilton, On 03/18/2013 05:20 PM, Benilton Carvalho wrote: > Hi, > > I was wondering what would be the most efficient strategy to get distances > between sequences that belong to two different sets. So far, what I am > doing is: > > library(Biostrings) > set.seed(1) > alph = DNA_ALPHABET[1:4] > set1 = matrix(sample(alph, 30, rep=TRUE), nr=6) > set1 = apply(set1, 1, paste, collapse='') > set2 = matrix(sample(alph, 20, rep=TRUE), nr=4) > set2 = apply(set2, 1, paste, collapse='') > outer(set1, set2, function(x, y) apply(cbind(x, y), 1, stringDist)) Note that by default, stringDist() computes the Levenshtein distance aka "the edit distance" between strings. This is controlled via the 'method' argument. With the toy data you generate above, where all the strings have the same length, it often makes sense to use the Hamming distance instead i.e. to count the nb of mismatches between strings. This is of course up to the user. However it's important to keep in mind that computing the Hamming distance is much faster than computing the Levenshtein distance. No surprise: the algo for computing the former is trivial, but the algo for computing the latter uses a complicated and expensive dynamic programming approach. That being said, and back to your question, if you don't mind calculating a little bit more distances than necessary (and thus also using more memory than necessary), you can call stringDist() on the big set formed by putting the 2 different sets together, and then extract only the part of the big matrix that corresponds to the outer product of the 2 original sets: outerStringDists1 <- function(x, y, method="levenshtein") { d <- stringDist(c(x, y), method=method) m <- as.matrix(d)[length(x) + seq_along(y), seq_along(x)] if (storage.mode(m) != storage.mode(d)) storage.mode(m) <- storage.mode(d) dimnames(m) <- NULL t(m) } With this solution, there are more loops (looping happens inside stringDist()) but the looping is much faster (it's done in C). In the best case, i.e. when the 2 sets have the same size, this will be hundred times faster than using outer() + apply(). Also, in that case, the big intermediate matrix will be "only" 4 times bigger than the final matrix, which is probably a price that is OK to pay for that kind of speed-up. Nonetheless the situation will quickly degrade if one set is much bigger than the other, e.g. an order of magnitude bigger or more. For example, it would not be efficient at all to use outerStringDists1() if 1 set had 100k strings and the other 100. If you are in this situation, and if you want to compute the Hamming distance, then here is a solution that uses sapply() over the smallest of the 2 sets: outerStringDists2 <- function(x, y) { if (!is(x, "XStringSet")) x <- as(x, "XStringSet") if (!is(y, "XStringSet")) y <- as(y, "XStringSet") if (length(x) < length(y)) return(t(outerStringDists2(y, x))) if (length(unique(c(width(x), width(y)))) != 1L) stop("strings in 'x' and 'y' must have the same length") sapply(seq_along(y), function(j) neditAt(y[[j]], x)) } Still faster than the outer() + apply() solution but can only compute the Hamming distance. Sounds like maybe we should have an outerStringDists() function in Biostrings? Cheers, H. > > Thanks a lot for any suggestion, > benilton > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT
0
Entering edit mode
There is also a recent stringdist package on CRAN. I have not used it though. Kasper On Mon, Mar 18, 2013 at 11:09 PM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: > Hi Benilton, > > > On 03/18/2013 05:20 PM, Benilton Carvalho wrote: >> >> Hi, >> >> I was wondering what would be the most efficient strategy to get distances >> between sequences that belong to two different sets. So far, what I am >> doing is: >> >> library(Biostrings) >> set.seed(1) >> alph = DNA_ALPHABET[1:4] >> set1 = matrix(sample(alph, 30, rep=TRUE), nr=6) >> set1 = apply(set1, 1, paste, collapse='') >> set2 = matrix(sample(alph, 20, rep=TRUE), nr=4) >> set2 = apply(set2, 1, paste, collapse='') >> outer(set1, set2, function(x, y) apply(cbind(x, y), 1, stringDist)) > > > Note that by default, stringDist() computes the Levenshtein distance > aka "the edit distance" between strings. This is controlled via the > 'method' argument. With the toy data you generate above, where all > the strings have the same length, it often makes sense to use the > Hamming distance instead i.e. to count the nb of mismatches between > strings. This is of course up to the user. > > However it's important to keep in mind that computing the Hamming > distance is much faster than computing the Levenshtein distance. > No surprise: the algo for computing the former is trivial, but the > algo for computing the latter uses a complicated and expensive > dynamic programming approach. > > That being said, and back to your question, if you don't mind > calculating a little bit more distances than necessary (and thus > also using more memory than necessary), you can call stringDist() > on the big set formed by putting the 2 different sets together, > and then extract only the part of the big matrix that corresponds > to the outer product of the 2 original sets: > > outerStringDists1 <- function(x, y, method="levenshtein") > { > d <- stringDist(c(x, y), method=method) > m <- as.matrix(d)[length(x) + seq_along(y), seq_along(x)] > if (storage.mode(m) != storage.mode(d)) > storage.mode(m) <- storage.mode(d) > dimnames(m) <- NULL > t(m) > } > > With this solution, there are more loops (looping happens inside > stringDist()) but the looping is much faster (it's done in C). > In the best case, i.e. when the 2 sets have the same size, this > will be hundred times faster than using outer() + apply(). Also, > in that case, the big intermediate matrix will be "only" 4 times > bigger than the final matrix, which is probably a price that is > OK to pay for that kind of speed-up. Nonetheless the situation > will quickly degrade if one set is much bigger than the other, > e.g. an order of magnitude bigger or more. For example, it would > not be efficient at all to use outerStringDists1() if 1 set had > 100k strings and the other 100. > > If you are in this situation, and if you want to compute the Hamming > distance, then here is a solution that uses sapply() over the smallest > of the 2 sets: > > outerStringDists2 <- function(x, y) > { > if (!is(x, "XStringSet")) > x <- as(x, "XStringSet") > if (!is(y, "XStringSet")) > y <- as(y, "XStringSet") > if (length(x) < length(y)) > return(t(outerStringDists2(y, x))) > if (length(unique(c(width(x), width(y)))) != 1L) > stop("strings in 'x' and 'y' must have the same length") > sapply(seq_along(y), function(j) neditAt(y[[j]], x)) > } > > Still faster than the outer() + apply() solution but can only compute > the Hamming distance. > > Sounds like maybe we should have an outerStringDists() function in > Biostrings? > > Cheers, > H. > > > >> >> Thanks a lot for any suggestion, >> benilton >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
On 03/18/2013 08:12 PM, Kasper Daniel Hansen wrote: > There is also a recent stringdist package on CRAN. I have not used it though. Just tried it. 'stringdistmatrix(set1, set2, method="lv")' seems to do the job. And it supports multiple cores. Thanks Kasper! H. > > Kasper > > On Mon, Mar 18, 2013 at 11:09 PM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: >> Hi Benilton, >> >> >> On 03/18/2013 05:20 PM, Benilton Carvalho wrote: >>> >>> Hi, >>> >>> I was wondering what would be the most efficient strategy to get distances >>> between sequences that belong to two different sets. So far, what I am >>> doing is: >>> >>> library(Biostrings) >>> set.seed(1) >>> alph = DNA_ALPHABET[1:4] >>> set1 = matrix(sample(alph, 30, rep=TRUE), nr=6) >>> set1 = apply(set1, 1, paste, collapse='') >>> set2 = matrix(sample(alph, 20, rep=TRUE), nr=4) >>> set2 = apply(set2, 1, paste, collapse='') >>> outer(set1, set2, function(x, y) apply(cbind(x, y), 1, stringDist)) >> >> >> Note that by default, stringDist() computes the Levenshtein distance >> aka "the edit distance" between strings. This is controlled via the >> 'method' argument. With the toy data you generate above, where all >> the strings have the same length, it often makes sense to use the >> Hamming distance instead i.e. to count the nb of mismatches between >> strings. This is of course up to the user. >> >> However it's important to keep in mind that computing the Hamming >> distance is much faster than computing the Levenshtein distance. >> No surprise: the algo for computing the former is trivial, but the >> algo for computing the latter uses a complicated and expensive >> dynamic programming approach. >> >> That being said, and back to your question, if you don't mind >> calculating a little bit more distances than necessary (and thus >> also using more memory than necessary), you can call stringDist() >> on the big set formed by putting the 2 different sets together, >> and then extract only the part of the big matrix that corresponds >> to the outer product of the 2 original sets: >> >> outerStringDists1 <- function(x, y, method="levenshtein") >> { >> d <- stringDist(c(x, y), method=method) >> m <- as.matrix(d)[length(x) + seq_along(y), seq_along(x)] >> if (storage.mode(m) != storage.mode(d)) >> storage.mode(m) <- storage.mode(d) >> dimnames(m) <- NULL >> t(m) >> } >> >> With this solution, there are more loops (looping happens inside >> stringDist()) but the looping is much faster (it's done in C). >> In the best case, i.e. when the 2 sets have the same size, this >> will be hundred times faster than using outer() + apply(). Also, >> in that case, the big intermediate matrix will be "only" 4 times >> bigger than the final matrix, which is probably a price that is >> OK to pay for that kind of speed-up. Nonetheless the situation >> will quickly degrade if one set is much bigger than the other, >> e.g. an order of magnitude bigger or more. For example, it would >> not be efficient at all to use outerStringDists1() if 1 set had >> 100k strings and the other 100. >> >> If you are in this situation, and if you want to compute the Hamming >> distance, then here is a solution that uses sapply() over the smallest >> of the 2 sets: >> >> outerStringDists2 <- function(x, y) >> { >> if (!is(x, "XStringSet")) >> x <- as(x, "XStringSet") >> if (!is(y, "XStringSet")) >> y <- as(y, "XStringSet") >> if (length(x) < length(y)) >> return(t(outerStringDists2(y, x))) >> if (length(unique(c(width(x), width(y)))) != 1L) >> stop("strings in 'x' and 'y' must have the same length") >> sapply(seq_along(y), function(j) neditAt(y[[j]], x)) >> } >> >> Still faster than the outer() + apply() solution but can only compute >> the Hamming distance. >> >> Sounds like maybe we should have an outerStringDists() function in >> Biostrings? >> >> Cheers, >> H. >> >> >> >>> >>> Thanks a lot for any suggestion, >>> benilton >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLY
0
Entering edit mode
Thanks, Kasper, installing it now. b 2013/3/19 Kasper Daniel Hansen <kasperdanielhansen at="" gmail.com="">: > There is also a recent stringdist package on CRAN. I have not used it though. > > Kasper > > On Mon, Mar 18, 2013 at 11:09 PM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: >> Hi Benilton, >> >> >> On 03/18/2013 05:20 PM, Benilton Carvalho wrote: >>> >>> Hi, >>> >>> I was wondering what would be the most efficient strategy to get distances >>> between sequences that belong to two different sets. So far, what I am >>> doing is: >>> >>> library(Biostrings) >>> set.seed(1) >>> alph = DNA_ALPHABET[1:4] >>> set1 = matrix(sample(alph, 30, rep=TRUE), nr=6) >>> set1 = apply(set1, 1, paste, collapse='') >>> set2 = matrix(sample(alph, 20, rep=TRUE), nr=4) >>> set2 = apply(set2, 1, paste, collapse='') >>> outer(set1, set2, function(x, y) apply(cbind(x, y), 1, stringDist)) >> >> >> Note that by default, stringDist() computes the Levenshtein distance >> aka "the edit distance" between strings. This is controlled via the >> 'method' argument. With the toy data you generate above, where all >> the strings have the same length, it often makes sense to use the >> Hamming distance instead i.e. to count the nb of mismatches between >> strings. This is of course up to the user. >> >> However it's important to keep in mind that computing the Hamming >> distance is much faster than computing the Levenshtein distance. >> No surprise: the algo for computing the former is trivial, but the >> algo for computing the latter uses a complicated and expensive >> dynamic programming approach. >> >> That being said, and back to your question, if you don't mind >> calculating a little bit more distances than necessary (and thus >> also using more memory than necessary), you can call stringDist() >> on the big set formed by putting the 2 different sets together, >> and then extract only the part of the big matrix that corresponds >> to the outer product of the 2 original sets: >> >> outerStringDists1 <- function(x, y, method="levenshtein") >> { >> d <- stringDist(c(x, y), method=method) >> m <- as.matrix(d)[length(x) + seq_along(y), seq_along(x)] >> if (storage.mode(m) != storage.mode(d)) >> storage.mode(m) <- storage.mode(d) >> dimnames(m) <- NULL >> t(m) >> } >> >> With this solution, there are more loops (looping happens inside >> stringDist()) but the looping is much faster (it's done in C). >> In the best case, i.e. when the 2 sets have the same size, this >> will be hundred times faster than using outer() + apply(). Also, >> in that case, the big intermediate matrix will be "only" 4 times >> bigger than the final matrix, which is probably a price that is >> OK to pay for that kind of speed-up. Nonetheless the situation >> will quickly degrade if one set is much bigger than the other, >> e.g. an order of magnitude bigger or more. For example, it would >> not be efficient at all to use outerStringDists1() if 1 set had >> 100k strings and the other 100. >> >> If you are in this situation, and if you want to compute the Hamming >> distance, then here is a solution that uses sapply() over the smallest >> of the 2 sets: >> >> outerStringDists2 <- function(x, y) >> { >> if (!is(x, "XStringSet")) >> x <- as(x, "XStringSet") >> if (!is(y, "XStringSet")) >> y <- as(y, "XStringSet") >> if (length(x) < length(y)) >> return(t(outerStringDists2(y, x))) >> if (length(unique(c(width(x), width(y)))) != 1L) >> stop("strings in 'x' and 'y' must have the same length") >> sapply(seq_along(y), function(j) neditAt(y[[j]], x)) >> } >> >> Still faster than the outer() + apply() solution but can only compute >> the Hamming distance. >> >> Sounds like maybe we should have an outerStringDists() function in >> Biostrings? >> >> Cheers, >> H. >> >> >> >>> >>> Thanks a lot for any suggestion, >>> benilton >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
Thanks, Kasper... stringdist does exactly what I need. b 2013/3/19 Benilton Carvalho <beniltoncarvalho at="" gmail.com="">: > Thanks, Kasper, installing it now. b > > 2013/3/19 Kasper Daniel Hansen <kasperdanielhansen at="" gmail.com="">: >> There is also a recent stringdist package on CRAN. I have not used it though. >> >> Kasper >> >> On Mon, Mar 18, 2013 at 11:09 PM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: >>> Hi Benilton, >>> >>> >>> On 03/18/2013 05:20 PM, Benilton Carvalho wrote: >>>> >>>> Hi, >>>> >>>> I was wondering what would be the most efficient strategy to get distances >>>> between sequences that belong to two different sets. So far, what I am >>>> doing is: >>>> >>>> library(Biostrings) >>>> set.seed(1) >>>> alph = DNA_ALPHABET[1:4] >>>> set1 = matrix(sample(alph, 30, rep=TRUE), nr=6) >>>> set1 = apply(set1, 1, paste, collapse='') >>>> set2 = matrix(sample(alph, 20, rep=TRUE), nr=4) >>>> set2 = apply(set2, 1, paste, collapse='') >>>> outer(set1, set2, function(x, y) apply(cbind(x, y), 1, stringDist)) >>> >>> >>> Note that by default, stringDist() computes the Levenshtein distance >>> aka "the edit distance" between strings. This is controlled via the >>> 'method' argument. With the toy data you generate above, where all >>> the strings have the same length, it often makes sense to use the >>> Hamming distance instead i.e. to count the nb of mismatches between >>> strings. This is of course up to the user. >>> >>> However it's important to keep in mind that computing the Hamming >>> distance is much faster than computing the Levenshtein distance. >>> No surprise: the algo for computing the former is trivial, but the >>> algo for computing the latter uses a complicated and expensive >>> dynamic programming approach. >>> >>> That being said, and back to your question, if you don't mind >>> calculating a little bit more distances than necessary (and thus >>> also using more memory than necessary), you can call stringDist() >>> on the big set formed by putting the 2 different sets together, >>> and then extract only the part of the big matrix that corresponds >>> to the outer product of the 2 original sets: >>> >>> outerStringDists1 <- function(x, y, method="levenshtein") >>> { >>> d <- stringDist(c(x, y), method=method) >>> m <- as.matrix(d)[length(x) + seq_along(y), seq_along(x)] >>> if (storage.mode(m) != storage.mode(d)) >>> storage.mode(m) <- storage.mode(d) >>> dimnames(m) <- NULL >>> t(m) >>> } >>> >>> With this solution, there are more loops (looping happens inside >>> stringDist()) but the looping is much faster (it's done in C). >>> In the best case, i.e. when the 2 sets have the same size, this >>> will be hundred times faster than using outer() + apply(). Also, >>> in that case, the big intermediate matrix will be "only" 4 times >>> bigger than the final matrix, which is probably a price that is >>> OK to pay for that kind of speed-up. Nonetheless the situation >>> will quickly degrade if one set is much bigger than the other, >>> e.g. an order of magnitude bigger or more. For example, it would >>> not be efficient at all to use outerStringDists1() if 1 set had >>> 100k strings and the other 100. >>> >>> If you are in this situation, and if you want to compute the Hamming >>> distance, then here is a solution that uses sapply() over the smallest >>> of the 2 sets: >>> >>> outerStringDists2 <- function(x, y) >>> { >>> if (!is(x, "XStringSet")) >>> x <- as(x, "XStringSet") >>> if (!is(y, "XStringSet")) >>> y <- as(y, "XStringSet") >>> if (length(x) < length(y)) >>> return(t(outerStringDists2(y, x))) >>> if (length(unique(c(width(x), width(y)))) != 1L) >>> stop("strings in 'x' and 'y' must have the same length") >>> sapply(seq_along(y), function(j) neditAt(y[[j]], x)) >>> } >>> >>> Still faster than the outer() + apply() solution but can only compute >>> the Hamming distance. >>> >>> Sounds like maybe we should have an outerStringDists() function in >>> Biostrings? >>> >>> Cheers, >>> H. >>> >>> >>> >>>> >>>> Thanks a lot for any suggestion, >>>> benilton >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> -- >>> Hervé Pagès >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M1-B514 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpages at fhcrc.org >>> Phone: (206) 667-5791 >>> Fax: (206) 667-1319 >>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
Hi Herve, Hamming Distance is actually my interest (didn't use it on my example as I was trying to save some typing). Thank you so much for your help, will definitely put your solution to good use. For me, it'd be useful to have such functionality in Biostrings.... With best wishes, benilton 2013/3/19 Hervé Pagès <hpages at="" fhcrc.org="">: > Hi Benilton, > > > On 03/18/2013 05:20 PM, Benilton Carvalho wrote: >> >> Hi, >> >> I was wondering what would be the most efficient strategy to get distances >> between sequences that belong to two different sets. So far, what I am >> doing is: >> >> library(Biostrings) >> set.seed(1) >> alph = DNA_ALPHABET[1:4] >> set1 = matrix(sample(alph, 30, rep=TRUE), nr=6) >> set1 = apply(set1, 1, paste, collapse='') >> set2 = matrix(sample(alph, 20, rep=TRUE), nr=4) >> set2 = apply(set2, 1, paste, collapse='') >> outer(set1, set2, function(x, y) apply(cbind(x, y), 1, stringDist)) > > > Note that by default, stringDist() computes the Levenshtein distance > aka "the edit distance" between strings. This is controlled via the > 'method' argument. With the toy data you generate above, where all > the strings have the same length, it often makes sense to use the > Hamming distance instead i.e. to count the nb of mismatches between > strings. This is of course up to the user. > > However it's important to keep in mind that computing the Hamming > distance is much faster than computing the Levenshtein distance. > No surprise: the algo for computing the former is trivial, but the > algo for computing the latter uses a complicated and expensive > dynamic programming approach. > > That being said, and back to your question, if you don't mind > calculating a little bit more distances than necessary (and thus > also using more memory than necessary), you can call stringDist() > on the big set formed by putting the 2 different sets together, > and then extract only the part of the big matrix that corresponds > to the outer product of the 2 original sets: > > outerStringDists1 <- function(x, y, method="levenshtein") > { > d <- stringDist(c(x, y), method=method) > m <- as.matrix(d)[length(x) + seq_along(y), seq_along(x)] > if (storage.mode(m) != storage.mode(d)) > storage.mode(m) <- storage.mode(d) > dimnames(m) <- NULL > t(m) > } > > With this solution, there are more loops (looping happens inside > stringDist()) but the looping is much faster (it's done in C). > In the best case, i.e. when the 2 sets have the same size, this > will be hundred times faster than using outer() + apply(). Also, > in that case, the big intermediate matrix will be "only" 4 times > bigger than the final matrix, which is probably a price that is > OK to pay for that kind of speed-up. Nonetheless the situation > will quickly degrade if one set is much bigger than the other, > e.g. an order of magnitude bigger or more. For example, it would > not be efficient at all to use outerStringDists1() if 1 set had > 100k strings and the other 100. > > If you are in this situation, and if you want to compute the Hamming > distance, then here is a solution that uses sapply() over the smallest > of the 2 sets: > > outerStringDists2 <- function(x, y) > { > if (!is(x, "XStringSet")) > x <- as(x, "XStringSet") > if (!is(y, "XStringSet")) > y <- as(y, "XStringSet") > if (length(x) < length(y)) > return(t(outerStringDists2(y, x))) > if (length(unique(c(width(x), width(y)))) != 1L) > stop("strings in 'x' and 'y' must have the same length") > sapply(seq_along(y), function(j) neditAt(y[[j]], x)) > } > > Still faster than the outer() + apply() solution but can only compute > the Hamming distance. > > Sounds like maybe we should have an outerStringDists() function in > Biostrings? > > Cheers, > H. > > >> >> Thanks a lot for any suggestion, >> benilton >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319
ADD REPLY

Login before adding your answer.

Traffic: 833 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6