Distance btw sequences from different sets

0

Entering edit mode

Benilton Carvalho ★ 4.3k

@benilton-carvalho-1375

Last seen 4.1 years ago

Brazil/Campinas/UNICAMP

Hi, I was wondering what would be the most efficient strategy to get distances between sequences that belong to two different sets. So far, what I am doing is: library(Biostrings) set.seed(1) alph = DNA_ALPHABET[1:4] set1 = matrix(sample(alph, 30, rep=TRUE), nr=6) set1 = apply(set1, 1, paste, collapse='') set2 = matrix(sample(alph, 20, rep=TRUE), nr=4) set2 = apply(set2, 1, paste, collapse='') outer(set1, set2, function(x, y) apply(cbind(x, y), 1, stringDist)) Thanks a lot for any suggestion, benilton [[alternative HTML version deleted]]

• 1.0k views

ADD COMMENT • link updated 11.1 years ago by Hervé Pagès 16k • written 11.1 years ago by Benilton Carvalho ★ 4.3k

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 20 hours ago

Seattle, WA, United States

Hi Benilton, On 03/18/2013 05:20 PM, Benilton Carvalho wrote: > Hi, > > I was wondering what would be the most efficient strategy to get distances > between sequences that belong to two different sets. So far, what I am > doing is: > > library(Biostrings) > set.seed(1) > alph = DNA_ALPHABET[1:4] > set1 = matrix(sample(alph, 30, rep=TRUE), nr=6) > set1 = apply(set1, 1, paste, collapse='') > set2 = matrix(sample(alph, 20, rep=TRUE), nr=4) > set2 = apply(set2, 1, paste, collapse='') > outer(set1, set2, function(x, y) apply(cbind(x, y), 1, stringDist)) Note that by default, stringDist() computes the Levenshtein distance aka "the edit distance" between strings. This is controlled via the 'method' argument. With the toy data you generate above, where all the strings have the same length, it often makes sense to use the Hamming distance instead i.e. to count the nb of mismatches between strings. This is of course up to the user. However it's important to keep in mind that computing the Hamming distance is much faster than computing the Levenshtein distance. No surprise: the algo for computing the former is trivial, but the algo for computing the latter uses a complicated and expensive dynamic programming approach. That being said, and back to your question, if you don't mind calculating a little bit more distances than necessary (and thus also using more memory than necessary), you can call stringDist() on the big set formed by putting the 2 different sets together, and then extract only the part of the big matrix that corresponds to the outer product of the 2 original sets: outerStringDists1 <- function(x, y, method="levenshtein") { d <- stringDist(c(x, y), method=method) m <- as.matrix(d)[length(x) + seq_along(y), seq_along(x)] if (storage.mode(m) != storage.mode(d)) storage.mode(m) <- storage.mode(d) dimnames(m) <- NULL t(m) } With this solution, there are more loops (looping happens inside stringDist()) but the looping is much faster (it's done in C). In the best case, i.e. when the 2 sets have the same size, this will be hundred times faster than using outer() + apply(). Also, in that case, the big intermediate matrix will be "only" 4 times bigger than the final matrix, which is probably a price that is OK to pay for that kind of speed-up. Nonetheless the situation will quickly degrade if one set is much bigger than the other, e.g. an order of magnitude bigger or more. For example, it would not be efficient at all to use outerStringDists1() if 1 set had 100k strings and the other 100. If you are in this situation, and if you want to compute the Hamming distance, then here is a solution that uses sapply() over the smallest of the 2 sets: outerStringDists2 <- function(x, y) { if (!is(x, "XStringSet")) x <- as(x, "XStringSet") if (!is(y, "XStringSet")) y <- as(y, "XStringSet") if (length(x) < length(y)) return(t(outerStringDists2(y, x))) if (length(unique(c(width(x), width(y)))) != 1L) stop("strings in 'x' and 'y' must have the same length") sapply(seq_along(y), function(j) neditAt(y[[j]], x)) } Still faster than the outer() + apply() solution but can only compute the Hamming distance. Sounds like maybe we should have an outerStringDists() function in Biostrings? Cheers, H. > > Thanks a lot for any suggestion, > benilton > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 11.1 years ago Hervé Pagès 16k

0

Entering edit mode

There is also a recent stringdist package on CRAN. I have not used it though. Kasper On Mon, Mar 18, 2013 at 11:09 PM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: > Hi Benilton, > > > On 03/18/2013 05:20 PM, Benilton Carvalho wrote: >> >> Hi, >> >> I was wondering what would be the most efficient strategy to get distances >> between sequences that belong to two different sets. So far, what I am >> doing is: >> >> library(Biostrings) >> set.seed(1) >> alph = DNA_ALPHABET[1:4] >> set1 = matrix(sample(alph, 30, rep=TRUE), nr=6) >> set1 = apply(set1, 1, paste, collapse='') >> set2 = matrix(sample(alph, 20, rep=TRUE), nr=4) >> set2 = apply(set2, 1, paste, collapse='') >> outer(set1, set2, function(x, y) apply(cbind(x, y), 1, stringDist)) > > > Note that by default, stringDist() computes the Levenshtein distance > aka "the edit distance" between strings. This is controlled via the > 'method' argument. With the toy data you generate above, where all > the strings have the same length, it often makes sense to use the > Hamming distance instead i.e. to count the nb of mismatches between > strings. This is of course up to the user. > > However it's important to keep in mind that computing the Hamming > distance is much faster than computing the Levenshtein distance. > No surprise: the algo for computing the former is trivial, but the > algo for computing the latter uses a complicated and expensive > dynamic programming approach. > > That being said, and back to your question, if you don't mind > calculating a little bit more distances than necessary (and thus > also using more memory than necessary), you can call stringDist() > on the big set formed by putting the 2 different sets together, > and then extract only the part of the big matrix that corresponds > to the outer product of the 2 original sets: > > outerStringDists1 <- function(x, y, method="levenshtein") > { > d <- stringDist(c(x, y), method=method) > m <- as.matrix(d)[length(x) + seq_along(y), seq_along(x)] > if (storage.mode(m) != storage.mode(d)) > storage.mode(m) <- storage.mode(d) > dimnames(m) <- NULL > t(m) > } > > With this solution, there are more loops (looping happens inside > stringDist()) but the looping is much faster (it's done in C). > In the best case, i.e. when the 2 sets have the same size, this > will be hundred times faster than using outer() + apply(). Also, > in that case, the big intermediate matrix will be "only" 4 times > bigger than the final matrix, which is probably a price that is > OK to pay for that kind of speed-up. Nonetheless the situation > will quickly degrade if one set is much bigger than the other, > e.g. an order of magnitude bigger or more. For example, it would > not be efficient at all to use outerStringDists1() if 1 set had > 100k strings and the other 100. > > If you are in this situation, and if you want to compute the Hamming > distance, then here is a solution that uses sapply() over the smallest > of the 2 sets: > > outerStringDists2 <- function(x, y) > { > if (!is(x, "XStringSet")) > x <- as(x, "XStringSet") > if (!is(y, "XStringSet")) > y <- as(y, "XStringSet") > if (length(x) < length(y)) > return(t(outerStringDists2(y, x))) > if (length(unique(c(width(x), width(y)))) != 1L) > stop("strings in 'x' and 'y' must have the same length") > sapply(seq_along(y), function(j) neditAt(y[[j]], x)) > } > > Still faster than the outer() + apply() solution but can only compute > the Hamming distance. > > Sounds like maybe we should have an outerStringDists() function in > Biostrings? > > Cheers, > H. > > > >> >> Thanks a lot for any suggestion, >> benilton >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 11.1 years ago Kasper Daniel Hansen ★ 6.5k

0

Entering edit mode

On 03/18/2013 08:12 PM, Kasper Daniel Hansen wrote: > There is also a recent stringdist package on CRAN. I have not used it though. Just tried it. 'stringdistmatrix(set1, set2, method="lv")' seems to do the job. And it supports multiple cores. Thanks Kasper! H. > > Kasper > > On Mon, Mar 18, 2013 at 11:09 PM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: >> Hi Benilton, >> >> >> On 03/18/2013 05:20 PM, Benilton Carvalho wrote: >>> >>> Hi, >>> >>> I was wondering what would be the most efficient strategy to get distances >>> between sequences that belong to two different sets. So far, what I am >>> doing is: >>> >>> library(Biostrings) >>> set.seed(1) >>> alph = DNA_ALPHABET[1:4] >>> set1 = matrix(sample(alph, 30, rep=TRUE), nr=6) >>> set1 = apply(set1, 1, paste, collapse='') >>> set2 = matrix(sample(alph, 20, rep=TRUE), nr=4) >>> set2 = apply(set2, 1, paste, collapse='') >>> outer(set1, set2, function(x, y) apply(cbind(x, y), 1, stringDist)) >> >> >> Note that by default, stringDist() computes the Levenshtein distance >> aka "the edit distance" between strings. This is controlled via the >> 'method' argument. With the toy data you generate above, where all >> the strings have the same length, it often makes sense to use the >> Hamming distance instead i.e. to count the nb of mismatches between >> strings. This is of course up to the user. >> >> However it's important to keep in mind that computing the Hamming >> distance is much faster than computing the Levenshtein distance. >> No surprise: the algo for computing the former is trivial, but the >> algo for computing the latter uses a complicated and expensive >> dynamic programming approach. >> >> That being said, and back to your question, if you don't mind >> calculating a little bit more distances than necessary (and thus >> also using more memory than necessary), you can call stringDist() >> on the big set formed by putting the 2 different sets together, >> and then extract only the part of the big matrix that corresponds >> to the outer product of the 2 original sets: >> >> outerStringDists1 <- function(x, y, method="levenshtein") >> { >> d <- stringDist(c(x, y), method=method) >> m <- as.matrix(d)[length(x) + seq_along(y), seq_along(x)] >> if (storage.mode(m) != storage.mode(d)) >> storage.mode(m) <- storage.mode(d) >> dimnames(m) <- NULL >> t(m) >> } >> >> With this solution, there are more loops (looping happens inside >> stringDist()) but the looping is much faster (it's done in C). >> In the best case, i.e. when the 2 sets have the same size, this >> will be hundred times faster than using outer() + apply(). Also, >> in that case, the big intermediate matrix will be "only" 4 times >> bigger than the final matrix, which is probably a price that is >> OK to pay for that kind of speed-up. Nonetheless the situation >> will quickly degrade if one set is much bigger than the other, >> e.g. an order of magnitude bigger or more. For example, it would >> not be efficient at all to use outerStringDists1() if 1 set had >> 100k strings and the other 100. >> >> If you are in this situation, and if you want to compute the Hamming >> distance, then here is a solution that uses sapply() over the smallest >> of the 2 sets: >> >> outerStringDists2 <- function(x, y) >> { >> if (!is(x, "XStringSet")) >> x <- as(x, "XStringSet") >> if (!is(y, "XStringSet")) >> y <- as(y, "XStringSet") >> if (length(x) < length(y)) >> return(t(outerStringDists2(y, x))) >> if (length(unique(c(width(x), width(y)))) != 1L) >> stop("strings in 'x' and 'y' must have the same length") >> sapply(seq_along(y), function(j) neditAt(y[[j]], x)) >> } >> >> Still faster than the outer() + apply() solution but can only compute >> the Hamming distance. >> >> Sounds like maybe we should have an outerStringDists() function in >> Biostrings? >> >> Cheers, >> H. >> >> >> >>> >>> Thanks a lot for any suggestion, >>> benilton >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 11.1 years ago Hervé Pagès 16k

0

Entering edit mode

Thanks, Kasper, installing it now. b 2013/3/19 Kasper Daniel Hansen <kasperdanielhansen at="" gmail.com="">: > There is also a recent stringdist package on CRAN. I have not used it though. > > Kasper > > On Mon, Mar 18, 2013 at 11:09 PM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: >> Hi Benilton, >> >> >> On 03/18/2013 05:20 PM, Benilton Carvalho wrote: >>> >>> Hi, >>> >>> I was wondering what would be the most efficient strategy to get distances >>> between sequences that belong to two different sets. So far, what I am >>> doing is: >>> >>> library(Biostrings) >>> set.seed(1) >>> alph = DNA_ALPHABET[1:4] >>> set1 = matrix(sample(alph, 30, rep=TRUE), nr=6) >>> set1 = apply(set1, 1, paste, collapse='') >>> set2 = matrix(sample(alph, 20, rep=TRUE), nr=4) >>> set2 = apply(set2, 1, paste, collapse='') >>> outer(set1, set2, function(x, y) apply(cbind(x, y), 1, stringDist)) >> >> >> Note that by default, stringDist() computes the Levenshtein distance >> aka "the edit distance" between strings. This is controlled via the >> 'method' argument. With the toy data you generate above, where all >> the strings have the same length, it often makes sense to use the >> Hamming distance instead i.e. to count the nb of mismatches between >> strings. This is of course up to the user. >> >> However it's important to keep in mind that computing the Hamming >> distance is much faster than computing the Levenshtein distance. >> No surprise: the algo for computing the former is trivial, but the >> algo for computing the latter uses a complicated and expensive >> dynamic programming approach. >> >> That being said, and back to your question, if you don't mind >> calculating a little bit more distances than necessary (and thus >> also using more memory than necessary), you can call stringDist() >> on the big set formed by putting the 2 different sets together, >> and then extract only the part of the big matrix that corresponds >> to the outer product of the 2 original sets: >> >> outerStringDists1 <- function(x, y, method="levenshtein") >> { >> d <- stringDist(c(x, y), method=method) >> m <- as.matrix(d)[length(x) + seq_along(y), seq_along(x)] >> if (storage.mode(m) != storage.mode(d)) >> storage.mode(m) <- storage.mode(d) >> dimnames(m) <- NULL >> t(m) >> } >> >> With this solution, there are more loops (looping happens inside >> stringDist()) but the looping is much faster (it's done in C). >> In the best case, i.e. when the 2 sets have the same size, this >> will be hundred times faster than using outer() + apply(). Also, >> in that case, the big intermediate matrix will be "only" 4 times >> bigger than the final matrix, which is probably a price that is >> OK to pay for that kind of speed-up. Nonetheless the situation >> will quickly degrade if one set is much bigger than the other, >> e.g. an order of magnitude bigger or more. For example, it would >> not be efficient at all to use outerStringDists1() if 1 set had >> 100k strings and the other 100. >> >> If you are in this situation, and if you want to compute the Hamming >> distance, then here is a solution that uses sapply() over the smallest >> of the 2 sets: >> >> outerStringDists2 <- function(x, y) >> { >> if (!is(x, "XStringSet")) >> x <- as(x, "XStringSet") >> if (!is(y, "XStringSet")) >> y <- as(y, "XStringSet") >> if (length(x) < length(y)) >> return(t(outerStringDists2(y, x))) >> if (length(unique(c(width(x), width(y)))) != 1L) >> stop("strings in 'x' and 'y' must have the same length") >> sapply(seq_along(y), function(j) neditAt(y[[j]], x)) >> } >> >> Still faster than the outer() + apply() solution but can only compute >> the Hamming distance. >> >> Sounds like maybe we should have an outerStringDists() function in >> Biostrings? >> >> Cheers, >> H. >> >> >> >>> >>> Thanks a lot for any suggestion, >>> benilton >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 11.1 years ago Benilton Carvalho ★ 4.3k

0

Entering edit mode

Thanks, Kasper... stringdist does exactly what I need. b 2013/3/19 Benilton Carvalho <beniltoncarvalho at="" gmail.com="">: > Thanks, Kasper, installing it now. b > > 2013/3/19 Kasper Daniel Hansen <kasperdanielhansen at="" gmail.com="">: >> There is also a recent stringdist package on CRAN. I have not used it though. >> >> Kasper >> >> On Mon, Mar 18, 2013 at 11:09 PM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: >>> Hi Benilton, >>> >>> >>> On 03/18/2013 05:20 PM, Benilton Carvalho wrote: >>>> >>>> Hi, >>>> >>>> I was wondering what would be the most efficient strategy to get distances >>>> between sequences that belong to two different sets. So far, what I am >>>> doing is: >>>> >>>> library(Biostrings) >>>> set.seed(1) >>>> alph = DNA_ALPHABET[1:4] >>>> set1 = matrix(sample(alph, 30, rep=TRUE), nr=6) >>>> set1 = apply(set1, 1, paste, collapse='') >>>> set2 = matrix(sample(alph, 20, rep=TRUE), nr=4) >>>> set2 = apply(set2, 1, paste, collapse='') >>>> outer(set1, set2, function(x, y) apply(cbind(x, y), 1, stringDist)) >>> >>> >>> Note that by default, stringDist() computes the Levenshtein distance >>> aka "the edit distance" between strings. This is controlled via the >>> 'method' argument. With the toy data you generate above, where all >>> the strings have the same length, it often makes sense to use the >>> Hamming distance instead i.e. to count the nb of mismatches between >>> strings. This is of course up to the user. >>> >>> However it's important to keep in mind that computing the Hamming >>> distance is much faster than computing the Levenshtein distance. >>> No surprise: the algo for computing the former is trivial, but the >>> algo for computing the latter uses a complicated and expensive >>> dynamic programming approach. >>> >>> That being said, and back to your question, if you don't mind >>> calculating a little bit more distances than necessary (and thus >>> also using more memory than necessary), you can call stringDist() >>> on the big set formed by putting the 2 different sets together, >>> and then extract only the part of the big matrix that corresponds >>> to the outer product of the 2 original sets: >>> >>> outerStringDists1 <- function(x, y, method="levenshtein") >>> { >>> d <- stringDist(c(x, y), method=method) >>> m <- as.matrix(d)[length(x) + seq_along(y), seq_along(x)] >>> if (storage.mode(m) != storage.mode(d)) >>> storage.mode(m) <- storage.mode(d) >>> dimnames(m) <- NULL >>> t(m) >>> } >>> >>> With this solution, there are more loops (looping happens inside >>> stringDist()) but the looping is much faster (it's done in C). >>> In the best case, i.e. when the 2 sets have the same size, this >>> will be hundred times faster than using outer() + apply(). Also, >>> in that case, the big intermediate matrix will be "only" 4 times >>> bigger than the final matrix, which is probably a price that is >>> OK to pay for that kind of speed-up. Nonetheless the situation >>> will quickly degrade if one set is much bigger than the other, >>> e.g. an order of magnitude bigger or more. For example, it would >>> not be efficient at all to use outerStringDists1() if 1 set had >>> 100k strings and the other 100. >>> >>> If you are in this situation, and if you want to compute the Hamming >>> distance, then here is a solution that uses sapply() over the smallest >>> of the 2 sets: >>> >>> outerStringDists2 <- function(x, y) >>> { >>> if (!is(x, "XStringSet")) >>> x <- as(x, "XStringSet") >>> if (!is(y, "XStringSet")) >>> y <- as(y, "XStringSet") >>> if (length(x) < length(y)) >>> return(t(outerStringDists2(y, x))) >>> if (length(unique(c(width(x), width(y)))) != 1L) >>> stop("strings in 'x' and 'y' must have the same length") >>> sapply(seq_along(y), function(j) neditAt(y[[j]], x)) >>> } >>> >>> Still faster than the outer() + apply() solution but can only compute >>> the Hamming distance. >>> >>> Sounds like maybe we should have an outerStringDists() function in >>> Biostrings? >>> >>> Cheers, >>> H. >>> >>> >>> >>>> >>>> Thanks a lot for any suggestion, >>>> benilton >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> -- >>> Hervé Pagès >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M1-B514 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpages at fhcrc.org >>> Phone: (206) 667-5791 >>> Fax: (206) 667-1319 >>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 11.1 years ago Benilton Carvalho ★ 4.3k

0

Entering edit mode

Hi Herve, Hamming Distance is actually my interest (didn't use it on my example as I was trying to save some typing). Thank you so much for your help, will definitely put your solution to good use. For me, it'd be useful to have such functionality in Biostrings.... With best wishes, benilton 2013/3/19 Hervé Pagès <hpages at="" fhcrc.org="">: > Hi Benilton, > > > On 03/18/2013 05:20 PM, Benilton Carvalho wrote: >> >> Hi, >> >> I was wondering what would be the most efficient strategy to get distances >> between sequences that belong to two different sets. So far, what I am >> doing is: >> >> library(Biostrings) >> set.seed(1) >> alph = DNA_ALPHABET[1:4] >> set1 = matrix(sample(alph, 30, rep=TRUE), nr=6) >> set1 = apply(set1, 1, paste, collapse='') >> set2 = matrix(sample(alph, 20, rep=TRUE), nr=4) >> set2 = apply(set2, 1, paste, collapse='') >> outer(set1, set2, function(x, y) apply(cbind(x, y), 1, stringDist)) > > > Note that by default, stringDist() computes the Levenshtein distance > aka "the edit distance" between strings. This is controlled via the > 'method' argument. With the toy data you generate above, where all > the strings have the same length, it often makes sense to use the > Hamming distance instead i.e. to count the nb of mismatches between > strings. This is of course up to the user. > > However it's important to keep in mind that computing the Hamming > distance is much faster than computing the Levenshtein distance. > No surprise: the algo for computing the former is trivial, but the > algo for computing the latter uses a complicated and expensive > dynamic programming approach. > > That being said, and back to your question, if you don't mind > calculating a little bit more distances than necessary (and thus > also using more memory than necessary), you can call stringDist() > on the big set formed by putting the 2 different sets together, > and then extract only the part of the big matrix that corresponds > to the outer product of the 2 original sets: > > outerStringDists1 <- function(x, y, method="levenshtein") > { > d <- stringDist(c(x, y), method=method) > m <- as.matrix(d)[length(x) + seq_along(y), seq_along(x)] > if (storage.mode(m) != storage.mode(d)) > storage.mode(m) <- storage.mode(d) > dimnames(m) <- NULL > t(m) > } > > With this solution, there are more loops (looping happens inside > stringDist()) but the looping is much faster (it's done in C). > In the best case, i.e. when the 2 sets have the same size, this > will be hundred times faster than using outer() + apply(). Also, > in that case, the big intermediate matrix will be "only" 4 times > bigger than the final matrix, which is probably a price that is > OK to pay for that kind of speed-up. Nonetheless the situation > will quickly degrade if one set is much bigger than the other, > e.g. an order of magnitude bigger or more. For example, it would > not be efficient at all to use outerStringDists1() if 1 set had > 100k strings and the other 100. > > If you are in this situation, and if you want to compute the Hamming > distance, then here is a solution that uses sapply() over the smallest > of the 2 sets: > > outerStringDists2 <- function(x, y) > { > if (!is(x, "XStringSet")) > x <- as(x, "XStringSet") > if (!is(y, "XStringSet")) > y <- as(y, "XStringSet") > if (length(x) < length(y)) > return(t(outerStringDists2(y, x))) > if (length(unique(c(width(x), width(y)))) != 1L) > stop("strings in 'x' and 'y' must have the same length") > sapply(seq_along(y), function(j) neditAt(y[[j]], x)) > } > > Still faster than the outer() + apply() solution but can only compute > the Hamming distance. > > Sounds like maybe we should have an outerStringDists() function in > Biostrings? > > Cheers, > H. > > >> >> Thanks a lot for any suggestion, >> benilton >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319

ADD REPLY • link 11.1 years ago Benilton Carvalho ★ 4.3k

Login before adding your answer.