Count differences between sequences

0

Entering edit mode

erikwright@comcast.net ▴ 20

@erikwrightcomcastnet-3998

Last seen 10.4 years ago

Hello all, I have a large DNAStringSet and I am trying to calculate its distance matrix. My DNAStrings are equal width and they are already aligned. I have tried using the stringDist() function, but it is very slow for large DNAStringSets. Is there a way to quickly calculate the number of differences between two DNAString instances? For example, let's say I have two DNAStrings: "ACAC" and "ACAG". I would like to know if their is a function other than stringDist() that will tell me the distance between them is 1. Thanks in advance for any help. - Erik [[alternative HTML version deleted]]

• 2.9k views

ADD COMMENT • link 14.8 years ago erikwright@comcast.net ▴ 20

0

Entering edit mode

Patrick Aboyoun ★ 1.6k

@patrick-aboyoun-6734

Last seen 10.3 years ago

United States

Erik, Could you provide more details on your data? How long are each of the strings and how many strings do you have? Also, do you need the entire N x N distance matrix for downstream analysis or are you just looking for closest relatives? Patrick On 3/25/10 2:29 PM, erikwright at comcast.net wrote: > Hello all, > > > I have a large DNAStringSet and I am trying to calculate its distance matrix. My DNAStrings are equal width and they are already aligned. > > > I have tried using the stringDist() function, but it is very slow for large DNAStringSets. Is there a way to quickly calculate the number of differences between two DNAString instances? > > > For example, let's say I have two DNAStrings: "ACAC" and "ACAG". I would like to know if their is a function other than stringDist() that will tell me the distance between them is 1. > > > Thanks in advance for any help. > > > - Erik > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD COMMENT • link 14.8 years ago Patrick Aboyoun ★ 1.6k

0

Entering edit mode

Patrick Aboyoun ★ 1.6k

@patrick-aboyoun-6734

Last seen 10.3 years ago

United States

Erik, Judging from your data, I would gather that you are not interested in indels. Is that correct? You should look at the neditStartingAt function. Something like the following may meet your needs: N <- length(myStrings) myDists <- matrix(0, nrow = N, ncol = N) for (i in 1:(N-1)) for (j in (i+1):N) myDists[i, j] <- myDists[j, i] <- neditStartingAt(myStrings[[i]], myStrings[[j]]) Patrick On 3/25/10 2:57 PM, erikwright@comcast.net wrote: > I have 500 DNAStrings, all of length 8000. I need the entire N x N > distance matrix. > > Thanks, > Erik > > > > ----- Original Message ----- > From: "Patrick Aboyoun" <paboyoun@fhcrc.org> > To: erikwright@comcast.net > Cc: bioconductor@stat.math.ethz.ch > Sent: Thursday, March 25, 2010 4:45:29 PM GMT -06:00 US/Canada Central > Subject: Re: [BioC] Count differences between sequences > > Erik, > Could you provide more details on your data? How long are each of the > strings and how many strings do you have? Also, do you need the entire > N x N distance matrix for downstream analysis or are you just looking > for closest relatives? > > > Patrick > > > > On 3/25/10 2:29 PM, erikwright@comcast.net wrote: > > Hello all, > > > > > > I have a large DNAStringSet and I am trying to calculate its > distance matrix. My DNAStrings are equal width and they are already > aligned. > > > > > > I have tried using the stringDist() function, but it is very slow > for large DNAStringSets. Is there a way to quickly calculate the > number of differences between two DNAString instances? > > > > > > For example, let's say I have two DNAStrings: "ACAC" and "ACAG". I > would like to know if their is a function other than stringDist() that > will tell me the distance between them is 1. > > > > > > Thanks in advance for any help. > > > > > > - Erik > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@stat.math.ethz.ch > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > [[alternative HTML version deleted]]

ADD COMMENT • link 14.8 years ago Patrick Aboyoun ★ 1.6k

0

Entering edit mode

On 03/25/2010 03:22 PM, Patrick Aboyoun wrote: > Erik, > Judging from your data, I would gather that you are not interested in > indels. Is that correct? You should look at the neditStartingAt > function. Something like the following may meet your needs: > > N <- length(myStrings) > myDists <- matrix(0, nrow = N, ncol = N) > for (i in 1:(N-1)) > for (j in (i+1):N) > myDists[i, j] <- myDists[j, i] <- > neditStartingAt(myStrings[[i]], myStrings[[j]]) Here's a psuedo dataset library(Biostrings) library(drosophila2probe) ch0 <- drosophila2probe[seq_len(500*8000/25), "sequence"] ch <- unlist(strsplit(ch0, "")) m0 <- matrix(ch, 500) dna <- DNAStringSet(apply(m0, 1, paste, collapse="")) Suppose we take dna and convert it to a matrix m of M nucleotides x N sequences. The distance between one sequence (column) of the matrix, say elt = m[,1], and all the other sequences is func <- function(elt, m) colSums(elt != m) so an inefficient (polynomial time, N^2) algorithm that works ok for 'dna' might be dnaDist0 <- function(dna, func=function(elt, m) colSums(elt != m)) { l <- strsplit(as.character(dna), "") m <- matrix(unlist(l), unique(width(dna))) res0 <- lapply(l, func, m) matrix(unlist(res0), ncol=length(res0)) } so > system.time(res <- dnaDist0(dna)) user system elapsed 68.624 0.048 68.828 It parallelizes very nicely on linux machines by preceding the above with > library(multicore) > lapply <- mclapply > system.time(res <- dnaDist0(dna)) user system elapsed 35.394 0.324 39.566 A generalization is to suppose a matrix of edit distances, e.g., dist <- 1 - nucleotideSubstitutionMatrix() and to use indicies into the vector representation of dist (here dist is a 15 x 15 matrix; it is also a length 255 (= 15 * 15) vector, with elements 1:15 the first column of dist, 16:30 the second column, etc.) instead of elt != m as the distance metric. This variant is func1 <- function(elt, m, d, dim) { x <- d[m + elt]; dim(x) <- dim colSums(x) } when the letters of the DNAStringSet in elt, m have been replaced by their row / column indexes as dnaDist1 <- function(dna, dist, func) { l <- lapply(strsplit(as.character(dna), ""), match, rownames(dist)) m <- (unlist(l) - 1L) * nrow(dist) dim <- c(unique(width(dna)), length(dna)) res0 <- lapply(l, func, m, dist, dim) matrix(unlist(res0), ncol=length(res0)) } dnaDist1(dna, dist, func1) is quite a bit slower (most of the time is in func1), taking about 300s as a single thread on my laptop. Memory consumption is about the same. As mentioned, the algorithm is inefficient and polynomial in the number of sequences, so if the problem had been transposed, say 8000 sequences of 500 nucleotides each, the timing would have been unbearably slow. The 'obvious' optimization of only calculating the lower (or upper) triangle of distances might cut the time in half, but that doesn't change the polynomial time. I suspect there is a N * log(N) algorithm that could be implemented in R, and that perhaps there is something clever with PDict. Martin > > > Patrick > > > On 3/25/10 2:57 PM, erikwright at comcast.net wrote: >> I have 500 DNAStrings, all of length 8000. I need the entire N x N >> distance matrix. >> >> Thanks, >> Erik >> >> >> >> ----- Original Message ----- >> From: "Patrick Aboyoun" <paboyoun at="" fhcrc.org=""> >> To: erikwright at comcast.net >> Cc: bioconductor at stat.math.ethz.ch >> Sent: Thursday, March 25, 2010 4:45:29 PM GMT -06:00 US/Canada Central >> Subject: Re: [BioC] Count differences between sequences >> >> Erik, >> Could you provide more details on your data? How long are each of the >> strings and how many strings do you have? Also, do you need the entire >> N x N distance matrix for downstream analysis or are you just looking >> for closest relatives? >> >> >> Patrick >> >> >> >> On 3/25/10 2:29 PM, erikwright at comcast.net wrote: >>> Hello all, >>> >>> >>> I have a large DNAStringSet and I am trying to calculate its >> distance matrix. My DNAStrings are equal width and they are already >> aligned. >>> >>> >>> I have tried using the stringDist() function, but it is very slow >> for large DNAStringSets. Is there a way to quickly calculate the >> number of differences between two DNAString instances? >>> >>> >>> For example, let's say I have two DNAStrings: "ACAC" and "ACAG". I >> would like to know if their is a function other than stringDist() that >> will tell me the distance between them is 1. >>> >>> >>> Thanks in advance for any help. >>> >>> >>> - Erik >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 14.8 years ago Martin Morgan 25k

0

Entering edit mode

Hi Erik, Patrick, Patrick Aboyoun wrote: > Erik, > Judging from your data, I would gather that you are not interested in > indels. Is that correct? You should look at the neditStartingAt > function. Something like the following may meet your needs: > > N <- length(myStrings) > myDists <- matrix(0, nrow = N, ncol = N) > for (i in 1:(N-1)) > for (j in (i+1):N) > myDists[i, j] <- myDists[j, i] <- > neditStartingAt(myStrings[[i]], myStrings[[j]]) > Note that you can take advantage of the fact that neditStartingAt() is vectorized with respect to its second argument: N <- length(myStrings) myDists <- sapply(1:N, function(i) neditStartingAt(myStrings[[i]], myStrings))) That will make things hundred times faster with a big "DNA rectangle" like yours (500x8000): > system.time( myDists <- sapply(1:N, function(i) neditStartingAt(myStrings[[i]], myStrings))) user system elapsed 12.053 0.000 12.723 > myDists[1:10, 1:10] [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] 0 5888 6055 5947 6152 6248 6038 6175 6268 6047 [2,] 5888 0 5849 6113 5926 6148 6117 5956 6167 6204 [3,] 6055 5849 0 6053 6184 5959 6137 6077 5997 6041 [4,] 5947 6113 6053 0 6111 6167 5910 6096 6121 5959 [5,] 6152 5926 6184 6111 0 6038 6209 5906 6019 6194 [6,] 6248 6148 5959 6167 6038 0 6085 6112 5924 6137 [7,] 6038 6117 6137 5910 6209 6085 0 5961 6192 5947 [8,] 6175 5956 6077 6096 5906 6112 5961 0 5899 6183 [9,] 6268 6167 5997 6121 6019 5924 6192 5899 0 5984 [10,] 6047 6204 6041 5959 6194 6137 5947 6183 5984 0 Cheers, H. > > Patrick > > > On 3/25/10 2:57 PM, erikwright at comcast.net wrote: >> I have 500 DNAStrings, all of length 8000. I need the entire N x N >> distance matrix. >> >> Thanks, >> Erik >> >> >> >> ----- Original Message ----- >> From: "Patrick Aboyoun" <paboyoun at="" fhcrc.org=""> >> To: erikwright at comcast.net >> Cc: bioconductor at stat.math.ethz.ch >> Sent: Thursday, March 25, 2010 4:45:29 PM GMT -06:00 US/Canada Central >> Subject: Re: [BioC] Count differences between sequences >> >> Erik, >> Could you provide more details on your data? How long are each of the >> strings and how many strings do you have? Also, do you need the entire >> N x N distance matrix for downstream analysis or are you just looking >> for closest relatives? >> >> >> Patrick >> >> >> >> On 3/25/10 2:29 PM, erikwright at comcast.net wrote: >>> Hello all, >>> >>> >>> I have a large DNAStringSet and I am trying to calculate its >> distance matrix. My DNAStrings are equal width and they are already >> aligned. >>> >>> I have tried using the stringDist() function, but it is very slow >> for large DNAStringSets. Is there a way to quickly calculate the >> number of differences between two DNAString instances? >>> >>> For example, let's say I have two DNAStrings: "ACAC" and "ACAG". I >> would like to know if their is a function other than stringDist() that >> will tell me the distance between them is 1. >>> >>> Thanks in advance for any help. >>> >>> >>> - Erik >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 14.8 years ago Hervé Pagès 16k

0

Entering edit mode

2010/3/26 Hervé Pagès <hpages@fhcrc.org> > Hi Erik, Patrick, > > > Patrick Aboyoun wrote: > >> Erik, >> Judging from your data, I would gather that you are not interested in >> indels. Is that correct? You should look at the neditStartingAt function. >> Something like the following may meet your needs: >> >> N <- length(myStrings) >> myDists <- matrix(0, nrow = N, ncol = N) >> for (i in 1:(N-1)) >> for (j in (i+1):N) >> myDists[i, j] <- myDists[j, i] <- neditStartingAt(myStrings[[i]], >> myStrings[[j]]) >> >> > Note that you can take advantage of the fact that neditStartingAt() is > vectorized with respect to its second argument: > > N <- length(myStrings) > myDists <- sapply(1:N, > function(i) neditStartingAt(myStrings[[i]], myStrings))) > > That will make things hundred times faster with a big "DNA rectangle" > like yours (500x8000): > > > system.time( > myDists <- sapply(1:N, > function(i) neditStartingAt(myStrings[[i]], myStrings))) > user system elapsed > 12.053 0.000 12.723 > > > myDists[1:10, 1:10] > [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] > [1,] 0 5888 6055 5947 6152 6248 6038 6175 6268 6047 > [2,] 5888 0 5849 6113 5926 6148 6117 5956 6167 6204 > [3,] 6055 5849 0 6053 6184 5959 6137 6077 5997 6041 > [4,] 5947 6113 6053 0 6111 6167 5910 6096 6121 5959 > [5,] 6152 5926 6184 6111 0 6038 6209 5906 6019 6194 > [6,] 6248 6148 5959 6167 6038 0 6085 6112 5924 6137 > [7,] 6038 6117 6137 5910 6209 6085 0 5961 6192 5947 > [8,] 6175 5956 6077 6096 5906 6112 5961 0 5899 6183 > [9,] 6268 6167 5997 6121 6019 5924 6192 5899 0 5984 > [10,] 6047 6204 6041 5959 6194 6137 5947 6183 5984 0 > > Wow, nice. It sounds to me like this would be a common use case; how hard would it be to vectorize both arguments? Michael > Cheers, > H. > > > >> Patrick >> >> >> On 3/25/10 2:57 PM, erikwright@comcast.net wrote: >> >>> I have 500 DNAStrings, all of length 8000. I need the entire N x N >>> distance matrix. >>> >>> Thanks, >>> Erik >>> >>> >>> >>> ----- Original Message ----- >>> From: "Patrick Aboyoun" <paboyoun@fhcrc.org> >>> To: erikwright@comcast.net >>> Cc: bioconductor@stat.math.ethz.ch >>> Sent: Thursday, March 25, 2010 4:45:29 PM GMT -06:00 US/Canada Central >>> Subject: Re: [BioC] Count differences between sequences >>> >>> Erik, >>> Could you provide more details on your data? How long are each of the >>> strings and how many strings do you have? Also, do you need the entire N x N >>> distance matrix for downstream analysis or are you just looking for closest >>> relatives? >>> >>> >>> Patrick >>> >>> >>> >>> On 3/25/10 2:29 PM, erikwright@comcast.net wrote: >>> >>>> Hello all, >>>> >>>> >>>> I have a large DNAStringSet and I am trying to calculate its >>>> >>> distance matrix. My DNAStrings are equal width and they are already >>> aligned. >>> >>>> >>>> I have tried using the stringDist() function, but it is very slow >>>> >>> for large DNAStringSets. Is there a way to quickly calculate the number >>> of differences between two DNAString instances? >>> >>>> >>>> For example, let's say I have two DNAStrings: "ACAC" and "ACAG". I >>>> >>> would like to know if their is a function other than stringDist() that >>> will tell me the distance between them is 1. >>> >>>> >>>> Thanks in advance for any help. >>>> >>>> >>>> - Erik >>>> [[alternative HTML version deleted]] >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor@stat.math.ethz.ch >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 14.8 years ago Michael Lawrence ★ 11k

0

Entering edit mode

I just added a Hamming distance metric to the stringDist function in BioC 2.6, which should be available on bioconductor.org in 36 hours. This uses the code underlying the neditStartingAt functions and loops over the lower triangle of the distance matrix in C so it is faster than the sapply method Herve provided. To calculate this newly added distance measure, specify stringDist(x, method = "hamming") where x is an XStringSet object containing equal length strings. library(Biostrings) library(drosophila2probe) ch0 <- drosophila2probe[seq_len(500*8000/25), "sequence"] ch <- unlist(strsplit(ch0, "")) m0 <- matrix(ch, 500) dna <- DNAStringSet(apply(m0, 1, paste, collapse="")) system.time(myDists <- stringDist(dna, method = "hamming")) # user system elapsed # 5.459 0.029 5.541 sessionInfo() R version 2.11.0 Under development (unstable) (2010-03-22 r51355) i386-apple-darwin9.8.0 locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] drosophila2probe_2.5.0 AnnotationDbi_1.9.6 Biobase_2.7.5 [4] Biostrings_2.15.26 IRanges_1.5.72 loaded via a namespace (and not attached): [1] DBI_0.2-5 RSQLite_0.8-4 tools_2.11.0 Patrick On 3/26/10 1:35 PM, Michael Lawrence wrote: > 2010/3/26 Hervé Pagès<hpages@fhcrc.org> > > >> Hi Erik, Patrick, >> >> >> Patrick Aboyoun wrote: >> >> >>> Erik, >>> Judging from your data, I would gather that you are not interested in >>> indels. Is that correct? You should look at the neditStartingAt function. >>> Something like the following may meet your needs: >>> >>> N<- length(myStrings) >>> myDists<- matrix(0, nrow = N, ncol = N) >>> for (i in 1:(N-1)) >>> for (j in (i+1):N) >>> myDists[i, j]<- myDists[j, i]<- neditStartingAt(myStrings[[i]], >>> myStrings[[j]]) >>> >>> >>> >> Note that you can take advantage of the fact that neditStartingAt() is >> vectorized with respect to its second argument: >> >> N<- length(myStrings) >> myDists<- sapply(1:N, >> function(i) neditStartingAt(myStrings[[i]], myStrings))) >> >> That will make things hundred times faster with a big "DNA rectangle" >> like yours (500x8000): >> >> >>> system.time( >>> >> myDists<- sapply(1:N, >> function(i) neditStartingAt(myStrings[[i]], myStrings))) >> user system elapsed >> 12.053 0.000 12.723 >> >> >>> myDists[1:10, 1:10] >>> >> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] >> [1,] 0 5888 6055 5947 6152 6248 6038 6175 6268 6047 >> [2,] 5888 0 5849 6113 5926 6148 6117 5956 6167 6204 >> [3,] 6055 5849 0 6053 6184 5959 6137 6077 5997 6041 >> [4,] 5947 6113 6053 0 6111 6167 5910 6096 6121 5959 >> [5,] 6152 5926 6184 6111 0 6038 6209 5906 6019 6194 >> [6,] 6248 6148 5959 6167 6038 0 6085 6112 5924 6137 >> [7,] 6038 6117 6137 5910 6209 6085 0 5961 6192 5947 >> [8,] 6175 5956 6077 6096 5906 6112 5961 0 5899 6183 >> [9,] 6268 6167 5997 6121 6019 5924 6192 5899 0 5984 >> [10,] 6047 6204 6041 5959 6194 6137 5947 6183 5984 0 >> >> >> > Wow, nice. It sounds to me like this would be a common use case; how hard > would it be to vectorize both arguments? > > Michael > > > >> Cheers, >> H. >> >> >> >> >>> Patrick >>> >>> >>> On 3/25/10 2:57 PM, erikwright@comcast.net wrote: >>> >>> >>>> I have 500 DNAStrings, all of length 8000. I need the entire N x N >>>> distance matrix. >>>> >>>> Thanks, >>>> Erik >>>> >>>> >>>> >>>> ----- Original Message ----- >>>> From: "Patrick Aboyoun"<paboyoun@fhcrc.org> >>>> To: erikwright@comcast.net >>>> Cc: bioconductor@stat.math.ethz.ch >>>> Sent: Thursday, March 25, 2010 4:45:29 PM GMT -06:00 US/Canada Central >>>> Subject: Re: [BioC] Count differences between sequences >>>> >>>> Erik, >>>> Could you provide more details on your data? How long are each of the >>>> strings and how many strings do you have? Also, do you need the entire N x N >>>> distance matrix for downstream analysis or are you just looking for closest >>>> relatives? >>>> >>>> >>>> Patrick >>>> >>>> >>>> >>>> On 3/25/10 2:29 PM, erikwright@comcast.net wrote: >>>> >>>> >>>>> Hello all, >>>>> >>>>> >>>>> I have a large DNAStringSet and I am trying to calculate its >>>>> >>>>> >>>> distance matrix. My DNAStrings are equal width and they are already >>>> aligned. >>>> >>>> >>>>> I have tried using the stringDist() function, but it is very slow >>>>> >>>>> >>>> for large DNAStringSets. Is there a way to quickly calculate the number >>>> of differences between two DNAString instances? >>>> >>>> >>>>> For example, let's say I have two DNAStrings: "ACAC" and "ACAG". I >>>>> >>>>> >>>> would like to know if their is a function other than stringDist() that >>>> will tell me the distance between them is 1. >>>> >>>> >>>>> Thanks in advance for any help. >>>>> >>>>> >>>>> - Erik >>>>> [[alternative HTML version deleted]] >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor@stat.math.ethz.ch >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> >>>>> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > [[alternative HTML version deleted]] > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD REPLY • link 14.8 years ago Patrick Aboyoun ★ 1.6k

0

Entering edit mode

erikwright@comcast.net ▴ 20

@erikwrightcomcastnet-3998

Last seen 10.4 years ago

Hi Patrick, Michael, HervÃ©, and Martin, Wow! Thanks very much everyone for putting so much effort into answering my question. While we wait for Patrick's update of stringDist to become available, I will describe my solution and a related problem I have run into. I am using this piece of code to do most of the work: numF <- length ( myDNAStringSet ) for ( i in 1 :( numF - 1 )) { # use IUPAC_CODE_MAP with fixed=FALSE distMatrix [ i ,( i + 1 ): numF ] <- neditStartingAt ( myDNAStringSet [[ i ]], myDNAStringSet [( i + 1 ): numF ], fixed = FALSE ) distMatrix [( i + 1 ): numF , i ] <- distMatrix [ i ,( i + 1 ): numF ] } diag ( distMatrix ) <- 1 This will populate a matrix with the number of differences between strings. The code works reasonably fast - about 15 seconds for a DNAStringSet with 500 sequences of length 8,000. A DNAStringSet with 8,000 sequences of length 8,000 is exponentially slower - about 45 minutes. Obviously, if Patrick's update of stringDist improves the speed it will greatly help. In playing around with this I have run into a related problem that you all might be able to help me with. As I stated in my original email, I am trying to calculate a Distance Matrix with a large aligned DNAStringSet. My DNAStringSet does not contain ideal data, meaning there are many gap characters ("-"). If I find the edit difference between two strings that have gaps in the same places then the gaps are not counted towards the edit distance. This makes complete sense because "-" == "-". My problem is that I want to include any gap characters in my edit distance. For example: > x <- DNAString("--A") > y <- DNAString("--A") > neditStartingAt(x,y) [1] 0 The distance between these two strings for my distance measurement should be 2. That is because the gap character ("-") gives me no data, so I cannot say that the distance between the two strings is 0. Can anyone think of a good method of counting common gap characters toward the edit distance? Unfortunately I cannot simply count up the number of gap characters in my pattern and add it to my edit distance. For example: > x <- DNAString("AC--") > y <- DNAString("ACC-") > neditStartingAt(x,y) [1] 1 The distance between these two strings for my distance measurement should be 2. If I were to add the number of gap characters in the pattern to the edit distance I would get 2 + 1 = 3. So basically I want to count any gap character in the pattern or subject toward the edit distance because it gives me no information. Does this make sense, and does anyone have an idea for how to do this? Thanks again, Erik ----- Original Message ----- From: "Patrick Aboyoun" <paboyoun@fhcrc.org> To: "Michael Lawrence" <lawrence.michael@gene.com> Cc: "BioC list" <bioconductor@stat.math.ethz.ch> Sent: Saturday, March 27, 2010 7:39:16 PM GMT -06:00 US/Canada Central Subject: Re: [BioC] Count differences between sequences I just added a Hamming distance metric to the stringDist function in BioC 2.6, which should be available on bioconductor.org in 36 hours. This uses the code underlying the neditStartingAt functions and loops over the lower triangle of the distance matrix in C so it is faster than the sapply method Herve provided. To calculate this newly added distance measure, specify stringDist(x, method = "hamming") where x is an XStringSet object containing equal length strings. library(Biostrings) library(drosophila2probe) ch0 <- drosophila2probe[seq_len(500*8000/25), "sequence"] ch <- unlist(strsplit(ch0, "")) m0 <- matrix(ch, 500) dna <- DNAStringSet(apply(m0, 1, paste, collapse="")) system.time(myDists <- stringDist(dna, method = "hamming")) # user system elapsed # 5.459 0.029 5.541 sessionInfo() R version 2.11.0 Under development (unstable) (2010-03-22 r51355) i386-apple-darwin9.8.0 locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] drosophila2probe_2.5.0 AnnotationDbi_1.9.6 Biobase_2.7.5 [4] Biostrings_2.15.26 IRanges_1.5.72 loaded via a namespace (and not attached): [1] DBI_0.2-5 RSQLite_0.8-4 tools_2.11.0 Patrick On 3/26/10 1:35 PM, Michael Lawrence wrote: > 2010/3/26 Hervï¿½ Pagï¿½s > > >> Hi Erik, Patrick, >> >> >> Patrick Aboyoun wrote: >> >> >>> Erik, >>> Judging from your data, I would gather that you are not interested in >>> indels. Is that correct? You should look at the neditStartingAt function. >>> Something like the following may meet your needs: >>> >>> N<- length(myStrings) >>> myDists<- matrix(0, nrow = N, ncol = N) >>> for (i in 1:(N-1)) >>> for (j in (i+1):N) >>> myDists[i, j]<- myDists[j, i]<- neditStartingAt(myStrings[[i]], >>> myStrings[[j]]) >>> >>> >>> >> Note that you can take advantage of the fact that neditStartingAt() is >> vectorized with respect to its second argument: >> >> N<- length(myStrings) >> myDists<- sapply(1:N, >> function(i) neditStartingAt(myStrings[[i]], myStrings))) >> >> That will make things hundred times faster with a big "DNA rectangle" >> like yours (500x8000): >> >> >>> system.time( >>> >> myDists<- sapply(1:N, >> function(i) neditStartingAt(myStrings[[i]], myStrings))) >> user system elapsed >> 12.053 0.000 12.723 >> >> >>> myDists[1:10, 1:10] >>> >> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] >> [1,] 0 5888 6055 5947 6152 6248 6038 6175 6268 6047 >> [2,] 5888 0 5849 6113 5926 6148 6117 5956 6167 6204 >> [3,] 6055 5849 0 6053 6184 5959 6137 6077 5997 6041 >> [4,] 5947 6113 6053 0 6111 6167 5910 6096 6121 5959 >> [5,] 6152 5926 6184 6111 0 6038 6209 5906 6019 6194 >> [6,] 6248 6148 5959 6167 6038 0 6085 6112 5924 6137 >> [7,] 6038 6117 6137 5910 6209 6085 0 5961 6192 5947 >> [8,] 6175 5956 6077 6096 5906 6112 5961 0 5899 6183 >> [9,] 6268 6167 5997 6121 6019 5924 6192 5899 0 5984 >> [10,] 6047 6204 6041 5959 6194 6137 5947 6183 5984 0 >> >> >> > Wow, nice. It sounds to me like this would be a common use case; how hard > would it be to vectorize both arguments? > > Michael > > > >> Cheers, >> H. >> >> >> >> >>> Patrick >>> >>> >>> On 3/25/10 2:57 PM, erikwright@comcast.net wrote: >>> >>> >>>> I have 500 DNAStrings, all of length 8000. I need the entire N x N >>>> distance matrix. >>>> >>>> Thanks, >>>> Erik >>>> >>>> >>>> >>>> ----- Original Message ----- >>>> From: "Patrick Aboyoun" >>>> To: erikwright@comcast.net >>>> Cc: bioconductor@stat.math.ethz.ch >>>> Sent: Thursday, March 25, 2010 4:45:29 PM GMT -06:00 US/Canada Central >>>> Subject: Re: [BioC] Count differences between sequences >>>> >>>> Erik, >>>> Could you provide more details on your data? How long are each of the >>>> strings and how many strings do you have? Also, do you need the entire N x N >>>> distance matrix for downstream analysis or are you just looking for closest >>>> relatives? >>>> >>>> >>>> Patrick >>>> >>>> >>>> >>>> On 3/25/10 2:29 PM, erikwright@comcast.net wrote: >>>> >>>> >>>>> Hello all, >>>>> >>>>> >>>>> I have a large DNAStringSet and I am trying to calculate its >>>>> >>>>> >>>> distance matrix. My DNAStrings are equal width and they are already >>>> aligned. >>>> >>>> >>>>> I have tried using the stringDist() function, but it is very slow >>>>> >>>>> >>>> for large DNAStringSets. Is there a way to quickly calculate the number >>>> of differences between two DNAString instances? >>>> >>>> >>>>> For example, let's say I have two DNAStrings: "ACAC" and "ACAG". I >>>>> >>>>> >>>> would like to know if their is a function other than stringDist() that >>>> will tell me the distance between them is 1. >>>> >>>> >>>>> Thanks in advance for any help. >>>>> >>>>> >>>>> - Erik >>>>> [[alternative HTML version deleted]] >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor@stat.math.ethz.ch >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> >>>>> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> -- >> Hervï¿½ Pagï¿½s >> >> Program in Computational Biology >> Division of Public Health Sciences >> >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > [[alternative HTML version deleted]] > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@stat.math.ethz.ch https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD COMMENT • link 14.8 years ago erikwright@comcast.net ▴ 20

0

Entering edit mode

Hi Erik et al., Not offering anything practical, but... On 03/28/2010 11:39 AM, erikwright at comcast.net wrote: > Hi Patrick, Michael, Herv??, and Martin, > > > Wow! Thanks very much everyone for putting so much effort into > answering my question. While we wait for Patrick's update of > stringDist to become available, I will describe my solution and a > related problem I have run into. > > > I am using this piece of code to do most of the work: > > > > > > > numF <- length ( myDNAStringSet ) > > for ( i in 1 :( numF - 1 )) { > > # use IUPAC_CODE_MAP with fixed=FALSE > > distMatrix [ i ,( i + 1 ): numF ] <- neditStartingAt ( > > myDNAStringSet [[ i ]], > > myDNAStringSet [( i + 1 ): numF ], > > fixed = FALSE ) > > distMatrix [( i + 1 ): numF , i ] <- distMatrix [ i ,( i + 1 ): numF > ] > > > } > > diag ( distMatrix ) <- 1 > > > This will populate a matrix with the number of differences between > strings. The code works reasonably fast - about 15 seconds for a > DNAStringSet with 500 sequences of length 8,000. A DNAStringSet with > 8,000 sequences of length 8,000 is exponentially slower - about 45 > minutes. Obviously, if Patrick's update of stringDist improves the > speed it will greatly help. Not exponential, the algorithm is polynomial in the number of comparisons N (N - 1) / 2. I think Patrick's update will help you through this part of your problem though -- it's still polynomial, but fast enough that you hit memory limits before unbearable time limits. But... Here's a little R code that, ignoring the commented out bit for a moment, groups a single 'column' of nucleotides in linear time -- some small multiple of N f0 <- function(dna) { l <- unique(width(dna)) w <- length(dna) dna0 <- factor(unlist(strsplit(as.character(dna), ""))) idx <- seq(1, l * w, l) # 'column' offsets i <- seq_len(w) d <- matrix(l, w, w) for (j in seq_len(l)-1L) { grp <- split(i, dna0[j+idx]) # a 'column' of nucleotides ## for (k in grp) ## d[k, k] = d[k, k] - 1L } d } Internally, 'split' does its work in 2N comparisons. This important part is that comparisons are accomplished in linear rather than polynomial time. Each element of 'grp' is an integer vector indexing individuals with the same nucleotide. The code above assumes one is interested in Hamming distance. It initializes the distance matrix 'd' so that everyone is maximally distant from one another. and then in the commented out code updates the distance by subtracting one from all individuals with the same nucleotide (i.e., in k). This is unfortunately the slow part of the revised code, and it is polynomial time -- in the best case, for 4 equally frequent nucleotides, we do 4 * ( (N/4)^2 ) = N^2 / 4 updates, and in the worst case (all nucleotides identical) we do N^2 updates. We pay a heavy price for doing this at the R level, and basically get swamped by the update rather than sort cost. But we shouldn't lose track of the fact that we can do the sorting much more efficiently. I've also looked a little at sort.list(, method="radix"), which orders a factor very quickly. I wonder what the down-stream fate of this distance matrix is, because perhaps it doesn't need to be create it at all? > > > In playing around with this I have run into a related problem that > you all might be able to help me with. As I stated in my original > email, I am trying to calculate a Distance Matrix with a large > aligned DNAStringSet. My DNAStringSet does not contain ideal data, > meaning there are many gap characters ("-"). If I find the edit > difference between two strings that have gaps in the same places then > the gaps are not counted towards the edit distance. This makes > complete sense because "-" == "-". My problem is that I want to > include any gap characters in my edit distance. For example: > > > >> x <- DNAString("--A") y <- DNAString("--A") neditStartingAt(x,y) > [1] 0 > > > The distance between these two strings for my distance measurement > should be 2. That is because the gap character ("-") gives me no > data, so I cannot say that the distance between the two strings is 0. > Can anyone think of a good method of counting common gap characters > toward the edit distance? Unfortunately I cannot simply count up the > number of gap characters in my pattern and add it to my edit > distance. For example: > > > > >> x <- DNAString("AC--") y <- DNAString("ACC-") neditStartingAt(x,y) >> > [1] 1 > > > The distance between these two strings for my distance measurement > should be 2. If I were to add the number of gap characters in the > pattern to the edit distance I would get 2 + 1 = 3. So basically I > want to count any gap character in the pattern or subject toward the > edit distance because it gives me no information. Does this make > sense, and does anyone have an idea for how to do this? Again not being helpful in a practical sense here, but the idea is that you have a distance matrix between letters in the Biostrings::IUPAC_CODE_MAP. The implementation Patrick provides is I think a matrix of 0's and 1's, whereas you'd like support for a generalized distance matrix. If that matrix is called 'dist', then I think the following (untested; it needs to not update if length(grp[[l]]) or length(grp[[m]]) is zero) does the trick... f0d <- function(dna, dist) { l <- unique(width(dna)) w <- length(dna) dna0 <- factor(unlist(strsplit(as.character(dna), ""))) idx <- seq(1, l * w, l) i <- seq_len(w) d <- matrix(0, w, w) for (j in seq_len(l)-1L) { grp <- split(i, dna0[j+idx]) for (l in seq_along(grp)) for (m in seq_along(grp)) d[grp[[l]], grp[[m]]] <- d[grp[[l]], grp[[m]]] + dist[l, m] } d } Martin > > > Thanks again, Erik > > > > ----- Original Message ----- From: "Patrick Aboyoun" > <paboyoun at="" fhcrc.org=""> To: "Michael Lawrence" > <lawrence.michael at="" gene.com=""> Cc: "BioC list" > <bioconductor at="" stat.math.ethz.ch=""> Sent: Saturday, March 27, 2010 > 7:39:16 PM GMT -06:00 US/Canada Central Subject: Re: [BioC] Count > differences between sequences > > I just added a Hamming distance metric to the stringDist function in > BioC 2.6, which should be available on bioconductor.org in 36 hours. > This uses the code underlying the neditStartingAt functions and loops > over the lower triangle of the distance matrix in C so it is faster > than the sapply method Herve provided. To calculate this newly added > distance measure, specify stringDist(x, method = "hamming") where x > is an XStringSet object containing equal length strings. > library(Biostrings) library(drosophila2probe) ch0 <- > drosophila2probe[seq_len(500*8000/25), "sequence"] ch <- > unlist(strsplit(ch0, "")) m0 <- matrix(ch, 500) dna <- > DNAStringSet(apply(m0, 1, paste, collapse="")) system.time(myDists <- > stringDist(dna, method = "hamming")) # user system elapsed # 5.459 > 0.029 5.541 sessionInfo() R version 2.11.0 Under development > (unstable) (2010-03-22 r51355) i386-apple-darwin9.8.0 locale: [1] > en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base > packages: [1] stats graphics grDevices utils datasets methods base > other attached packages: [1] drosophila2probe_2.5.0 > AnnotationDbi_1.9.6 Biobase_2.7.5 [4] Biostrings_2.15.26 > IRanges_1.5.72 loaded via a namespace (and not attached): [1] > DBI_0.2-5 RSQLite_0.8-4 tools_2.11.0 Patrick On 3/26/10 1:35 PM, > Michael Lawrence wrote: > 2010/3/26 Herv??? Pag???s > > >> Hi Erik, > Patrick, >> >> >> Patrick Aboyoun wrote: >> >> >>> Erik, >>> Judging > from your data, I would gather that you are not interested in >>> > indels. Is that correct? You should look at the neditStartingAt > function. >>> Something like the following may meet your needs: >>> > >>> N<- length(myStrings) >>> myDists<- matrix(0, nrow = N, ncol = N) > >>> for (i in 1:(N-1)) >>> for (j in (i+1):N) >>> myDists[i, j]<- > myDists[j, i]<- neditStartingAt(myStrings[[i]], >>> myStrings[[j]]) > >>> >>> >>> >> Note that you can take advantage of the fact that > neditStartingAt() is >> vectorized with respect to its second > argument: >> >> N<- length(myStrings) >> myDists<- sapply(1:N, >> > function(i) neditStartingAt(myStrings[[i]], myStrings))) >> >> That > will make things hundred times faster with a big "DNA rectangle" >> > like yours (500x8000): >> >> >>> system.time( >>> >> myDists<- > sapply(1:N, >> function(i) neditStartingAt(myStrings[[i]], > myStrings))) >> user system elapsed >> 12.053 0.000 12.723 >> >> >>> > myDists[1:10, 1:10] >>> >> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] > [,9] [,10] >> [1,] 0 5888 6055 5947 6152 6248 6038 6175 6268 6047 >> > [2,] 5888 0 5849 6113 5926 6148 6117 5956 6167 6204 >> [3,] 6055 5849 > 0 6053 6184 5959 6137 6077 5997 6041 >> [4,] 5947 6113 6053 0 6111 > 6167 5910 6096 6121 5959 >> [5,] 6152 5926 6184 6111 0 6038 6209 5906 > 6019 6194 >> [6,] 6248 6148 5959 6167 6038 0 6085 6112 5924 6137 >> > [7,] 6038 6117 6137 5910 6209 6085 0 5961 6192 5947 >> [8,] 6175 5956 > 6077 6096 5906 6112 5961 0 5899 6183 >> [9,] 6268 6167 5997 6121 6019 > 5924 6192 5899 0 5984 >> [10,] 6047 6204 6041 5959 6194 6137 5947 > 6183 5984 0 >> >> >> > Wow, nice. It sounds to me like this would be > a common use case; how hard > would it be to vectorize both > arguments? > > Michael > > > >> Cheers, >> H. >> >> >> >> >>> Patrick > >>> >>> >>> On 3/25/10 2:57 PM, erikwright at comcast.net wrote: >>> >>> > >>>> I have 500 DNAStrings, all of length 8000. I need the entire N x > N >>>> distance matrix. >>>> >>>> Thanks, >>>> Erik >>>> >>>> >>>> > >>>> ----- Original Message ----- >>>> From: "Patrick Aboyoun" >>>> > To: erikwright at comcast.net >>>> Cc: bioconductor at stat.math.ethz.ch > >>>> Sent: Thursday, March 25, 2010 4:45:29 PM GMT -06:00 US/Canada > Central >>>> Subject: Re: [BioC] Count differences between sequences > >>>> >>>> Erik, >>>> Could you provide more details on your data? How > long are each of the >>>> strings and how many strings do you have? > Also, do you need the entire N x N >>>> distance matrix for > downstream analysis or are you just looking for closest >>>> > relatives? >>>> >>>> >>>> Patrick >>>> >>>> >>>> >>>> On 3/25/10 2:29 > PM, erikwright at comcast.net wrote: >>>> >>>> >>>>> Hello all, >>>>> > >>>>> >>>>> I have a large DNAStringSet and I am trying to calculate > its >>>>> >>>>> >>>> distance matrix. My DNAStrings are equal width > and they are already >>>> aligned. >>>> >>>> >>>>> I have tried using > the stringDist() function, but it is very slow >>>>> >>>>> >>>> for > large DNAStringSets. Is there a way to quickly calculate the number > >>>> of differences between two DNAString instances? >>>> >>>> >>>>> > For example, let's say I have two DNAStrings: "ACAC" and "ACAG". I > >>>>> >>>>> >>>> would like to know if their is a function other than > stringDist() that >>>> will tell me the distance between them is 1. > >>>> >>>> >>>>> Thanks in advance for any help. >>>>> >>>>> >>>>> - > Erik >>>>> [[alternative HTML version deleted]] >>>>> >>>>> > _______________________________________________ >>>>> Bioconductor > mailing list >>>>> Bioconductor at stat.math.ethz.ch >>>>> > https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the > archives: >>>>> >>>>> >>>> > http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>> >>>> >>> >>> [[alternative HTML version deleted]] >>> >>> > _______________________________________________ >>> Bioconductor > mailing list >>> Bioconductor at stat.math.ethz.ch >>> > https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the > archives: >>> > http://news.gmane.org/gmane.science.biology.informatics.conductor >>> > >>> >> -- >> Herv??? Pag???s >> >> Program in Computational Biology > >> Division of Public Health Sciences >> >> Fred Hutchinson Cancer > Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >> > Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) > 667-5791 >> Fax: (206) 667-1319 >> >> >> > _______________________________________________ >> Bioconductor > mailing list >> Bioconductor at stat.math.ethz.ch >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the > archives: >> > http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> > [[alternative HTML version deleted]] > > > > > > _______________________________________________ > Bioconductor > mailing list > Bioconductor at stat.math.ethz.ch > > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the > archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]] > _______________________________________________ Bioconductor mailing > list Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor Search the > archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]] > > > > > _______________________________________________ Bioconductor mailing > list Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor Search the > archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 14.8 years ago Martin Morgan 25k

0

Entering edit mode

On 03/28/2010 01:17 PM, Martin Morgan wrote: > Hi Erik et al., > > Not offering anything practical, but... > > On 03/28/2010 11:39 AM, erikwright at comcast.net wrote: >> Hi Patrick, Michael, Herv??, and Martin, > f0 <- function(dna) > { > l <- unique(width(dna)) > w <- length(dna) > > dna0 <- factor(unlist(strsplit(as.character(dna), ""))) > idx <- seq(1, l * w, l) # 'column' offsets > i <- seq_len(w) > > d <- matrix(l, w, w) > for (j in seq_len(l)-1L) { > grp <- split(i, dna0[j+idx]) # a 'column' of nucleotides > ## for (k in grp) > ## d[k, k] = d[k, k] - 1L > } > d > } > > Internally, 'split' does its work in 2N comparisons. This important part > is that comparisons are accomplished in linear rather than polynomial time. > same nucleotide (i.e., in k). This is unfortunately the slow part of the > revised code, and it is polynomial time -- in the best case, for 4 > equally frequent nucleotides, we do 4 * ( (N/4)^2 ) = N^2 / 4 updates, > and in the worst case (all nucleotides identical) we do N^2 updates. We > pay a heavy price for doing this at the R level, and basically get > swamped by the update rather than sort cost. > But we shouldn't lose track of the fact that we can do the sorting much > more efficiently. I've also looked a little at sort.list(, > method="radix"), which orders a factor very quickly. dampening my enthusiasm a bit here, as even implemented in C the update operation still dominates, and the algorithm remains polynomial. Martin

ADD REPLY • link 14.8 years ago Martin Morgan 25k

0

Entering edit mode

Hi Martin, Thanks for sharing your method of calculating a distance matrix. I would never have thought of implementing the distance matrix in such a way. Correct me if I am wrong, but any algorithm for computing a distance matrix will take at least polynomial time because it inherently requires comparing N to (N-1) strings. My problem has to do with the handling of gap characters in Biostrings methods. I would like to treat the gap character as a wildcard. So, in another sense, what I am looking for is a way to extend the IUPAC_CODE_MAP to include the gap character: IUPAC_CODE_MAP <- c(IUPAC_CODE_MAP, "-"="ACGT") That way neditAt("A","-") would be 0 and not 1, and neditAt("-","-") would remain zero. Perhaps the easiest way to do this would be to somehow replace all my gaps with the "N" character. Thanks!, Erik On Mar 28, 2010, at 4:23 PM, Martin Morgan wrote: > On 03/28/2010 01:17 PM, Martin Morgan wrote: >> Hi Erik et al., >> >> Not offering anything practical, but... >> >> On 03/28/2010 11:39 AM, erikwright at comcast.net wrote: >>> Hi Patrick, Michael, Herv??, and Martin, > >> f0 <- function(dna) >> { >> l <- unique(width(dna)) >> w <- length(dna) >> >> dna0 <- factor(unlist(strsplit(as.character(dna), ""))) >> idx <- seq(1, l * w, l) # 'column' offsets >> i <- seq_len(w) >> >> d <- matrix(l, w, w) >> for (j in seq_len(l)-1L) { >> grp <- split(i, dna0[j+idx]) # a 'column' of nucleotides >> ## for (k in grp) >> ## d[k, k] = d[k, k] - 1L >> } >> d >> } >> >> Internally, 'split' does its work in 2N comparisons. This important part >> is that comparisons are accomplished in linear rather than polynomial time. > >> same nucleotide (i.e., in k). This is unfortunately the slow part of the >> revised code, and it is polynomial time -- in the best case, for 4 >> equally frequent nucleotides, we do 4 * ( (N/4)^2 ) = N^2 / 4 updates, >> and in the worst case (all nucleotides identical) we do N^2 updates. We >> pay a heavy price for doing this at the R level, and basically get >> swamped by the update rather than sort cost. > >> But we shouldn't lose track of the fact that we can do the sorting much >> more efficiently. I've also looked a little at sort.list(, >> method="radix"), which orders a factor very quickly. > > dampening my enthusiasm a bit here, as even implemented in C the update > operation still dominates, and the algorithm remains polynomial. > > Martin > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 14.8 years ago Erik Wright ▴ 210

0

Entering edit mode

On 03/29/2010 05:21 AM, Erik Wright wrote: > Hi Martin, > > Thanks for sharing your method of calculating a distance matrix. I > would never have thought of implementing the distance matrix in such > a way. Correct me if I am wrong, but any algorithm for computing a > distance matrix will take at least polynomial time because it > inherently requires comparing N to (N-1) strings. Say you had 10 nucleotides with ids 0:9 0 1 2 3 4 5 6 7 8 9 A C A T G C A T T A If you grouped them as A: 0, 2, 6, 0 C: 1, 5 G: 4 T: 3, 7, 8 Then you'd know about distances (e.g., Hamming) in a useful way, e.g., you could draw a dendrogram with nodes labeled with nucleotide and corresponding ids, or ask which groups of ids have distance 0 from each other. A naive approach for grouping nucleotides would be to look up each in a 'dictionary' of the four possible nucleotides. The dictionary might be "A", "C", "G", "T", and the look-up would ask, is id 0, with identifier A, equal to the first entry in the dictionary, "A"? If yes, then classify id 0 as an "A", if not then look at the next entry in the dictionary. The worst case scenario would be if the nucleotides were all "T", then you'd have to make 4 lookups for each identifier before classifying it. But that's only 10 x 4 comparisons, and not 10 x (10 - 1). Although the method I mentioned below is still slow compared with a C implementation, it is about twice as fast as an N x (N - 1) comparison. > My problem has to do with the handling of gap characters in > Biostrings methods. I would like to treat the gap character as a > wildcard. So, in another sense, what I am looking for is a way to > extend the IUPAC_CODE_MAP to include the gap character: > IUPAC_CODE_MAP <- c(IUPAC_CODE_MAP, "-"="ACGT") > > That way neditAt("A","-") would be 0 and not 1, and neditAt("-","-") > would remain zero. Perhaps the easiest way to do this would be to > somehow replace all my gaps with the "N" character. A C level solution will have to wait for Patrick or someone else to implement it; an R solution might construct an appropriate matrix dist that includes "-", and use that in f0d (checking that f0d actually works!). Martin > Thanks!, > Erik > > > On Mar 28, 2010, at 4:23 PM, Martin Morgan wrote: > >> On 03/28/2010 01:17 PM, Martin Morgan wrote: >>> Hi Erik et al., >>> >>> Not offering anything practical, but... >>> >>> On 03/28/2010 11:39 AM, erikwright at comcast.net wrote: >>>> Hi Patrick, Michael, Herv??, and Martin, >> >>> f0 <- function(dna) >>> { >>> l <- unique(width(dna)) >>> w <- length(dna) >>> >>> dna0 <- factor(unlist(strsplit(as.character(dna), ""))) >>> idx <- seq(1, l * w, l) # 'column' offsets >>> i <- seq_len(w) >>> >>> d <- matrix(l, w, w) >>> for (j in seq_len(l)-1L) { >>> grp <- split(i, dna0[j+idx]) # a 'column' of nucleotides >>> ## for (k in grp) >>> ## d[k, k] = d[k, k] - 1L >>> } >>> d >>> } >>> >>> Internally, 'split' does its work in 2N comparisons. This important part >>> is that comparisons are accomplished in linear rather than polynomial time. >> >>> same nucleotide (i.e., in k). This is unfortunately the slow part of the >>> revised code, and it is polynomial time -- in the best case, for 4 >>> equally frequent nucleotides, we do 4 * ( (N/4)^2 ) = N^2 / 4 updates, >>> and in the worst case (all nucleotides identical) we do N^2 updates. We >>> pay a heavy price for doing this at the R level, and basically get >>> swamped by the update rather than sort cost. >> >>> But we shouldn't lose track of the fact that we can do the sorting much >>> more efficiently. I've also looked a little at sort.list(, >>> method="radix"), which orders a factor very quickly. >> >> dampening my enthusiasm a bit here, as even implemented in C the update >> operation still dominates, and the algorithm remains polynomial. >> >> Martin >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 14.8 years ago Martin Morgan 25k

0

Entering edit mode

Erik Wright wrote: > Hi Martin, > > Thanks for sharing your method of calculating a distance matrix. I would never have thought of implementing the distance matrix in such a way. Correct me if I am wrong, but any algorithm for computing a distance matrix will take at least polynomial time because it inherently requires comparing N to (N-1) strings. > > My problem has to do with the handling of gap characters in Biostrings methods. I would like to treat the gap character as a wildcard. So, in another sense, what I am looking for is a way to extend the IUPAC_CODE_MAP to include the gap character: > IUPAC_CODE_MAP <- c(IUPAC_CODE_MAP, "-"="ACGT") > > That way neditAt("A","-") would be 0 and not 1, and neditAt("-","-") would remain zero. Perhaps the easiest way to do this would be to somehow replace all my gaps with the "N" character. That's indeed the easy and natural way to do it. But I'm confused now because in a previous email you said you wanted the gap letter to *not* match any other letter, not even itself... H. > > Thanks!, > Erik > > > On Mar 28, 2010, at 4:23 PM, Martin Morgan wrote: > >> On 03/28/2010 01:17 PM, Martin Morgan wrote: >>> Hi Erik et al., >>> >>> Not offering anything practical, but... >>> >>> On 03/28/2010 11:39 AM, erikwright at comcast.net wrote: >>>> Hi Patrick, Michael, Herv??, and Martin, >>> f0 <- function(dna) >>> { >>> l <- unique(width(dna)) >>> w <- length(dna) >>> >>> dna0 <- factor(unlist(strsplit(as.character(dna), ""))) >>> idx <- seq(1, l * w, l) # 'column' offsets >>> i <- seq_len(w) >>> >>> d <- matrix(l, w, w) >>> for (j in seq_len(l)-1L) { >>> grp <- split(i, dna0[j+idx]) # a 'column' of nucleotides >>> ## for (k in grp) >>> ## d[k, k] = d[k, k] - 1L >>> } >>> d >>> } >>> >>> Internally, 'split' does its work in 2N comparisons. This important part >>> is that comparisons are accomplished in linear rather than polynomial time. >>> same nucleotide (i.e., in k). This is unfortunately the slow part of the >>> revised code, and it is polynomial time -- in the best case, for 4 >>> equally frequent nucleotides, we do 4 * ( (N/4)^2 ) = N^2 / 4 updates, >>> and in the worst case (all nucleotides identical) we do N^2 updates. We >>> pay a heavy price for doing this at the R level, and basically get >>> swamped by the update rather than sort cost. >>> But we shouldn't lose track of the fact that we can do the sorting much >>> more efficiently. I've also looked a little at sort.list(, >>> method="radix"), which orders a factor very quickly. >> dampening my enthusiasm a bit here, as even implemented in C the update >> operation still dominates, and the algorithm remains polynomial. >> >> Martin >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 14.8 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Erik, To address your "gap problem", one solution is to replace the "-" in one of the 2 strings to compare with a letter that is guaranteed not to be in the other string, e.g. the "+" letter (yes, this is part of the DNA alphabet, and if it was not, you could always work with BStringSet objects). This can be done with chartr(): numF <- length(myDNAStringSet) distMatrix <- matrix(integer(numF*numF), nrow=numF) for (i in 1:(numF-1)) { pattern <- chartr("-", "+", myDNAStringSet[[i]]) # use IUPAC_CODE_MAP with fixed=FALSE distMatrix[i, (i+1):numF] <- neditStartingAt( pattern, myDNAStringSet[(i+1):numF], fixed=FALSE) distMatrix[(i+1):numF, i] <- distMatrix[i, (i+1):numF] } diag(distMatrix) <- 0L # I think you want 0 here, not 1 Takes about 15 sec. on my machine for your 500x8000 DNA rectangle. Note that I'm using an integer matrix which is twice smaller than a numeric/double matrix. You could maybe get a small speed improvement by doing this replacement once for all at the DNAStringSet level before you start the loop with: myDNAStringSet2 <- chartr("-", "+", myDNAStringSet) and by adapting the code in the loop but I don't expect this to make a big difference. Cheers, H. erikwright at comcast.net wrote: > Hi Patrick, Michael, Herv??, and Martin, > > > Wow! Thanks very much everyone for putting so much effort into answering my question. While we wait for Patrick's update of stringDist to become available, I will describe my solution and a related problem I have run into. > > > I am using this piece of code to do most of the work: > > > > > > > numF <- length ( myDNAStringSet ) > > for ( i in 1 :( numF - 1 )) { > > # use IUPAC_CODE_MAP with fixed=FALSE > > distMatrix [ i ,( i + 1 ): numF ] <- neditStartingAt ( > > myDNAStringSet [[ i ]], > > myDNAStringSet [( i + 1 ): numF ], > > fixed = FALSE ) > > distMatrix [( i + 1 ): numF , i ] <- distMatrix [ i ,( i + 1 ): numF ] > > > } > > diag ( distMatrix ) <- 1 > > > This will populate a matrix with the number of differences between strings. The code works reasonably fast - about 15 seconds for a DNAStringSet with 500 sequences of length 8,000. A DNAStringSet with 8,000 sequences of length 8,000 is exponentially slower - about 45 minutes. Obviously, if Patrick's update of stringDist improves the speed it will greatly help. > > > In playing around with this I have run into a related problem that you all might be able to help me with. As I stated in my original email, I am trying to calculate a Distance Matrix with a large aligned DNAStringSet. My DNAStringSet does not contain ideal data, meaning there are many gap characters ("-"). If I find the edit difference between two strings that have gaps in the same places then the gaps are not counted towards the edit distance. This makes complete sense because "-" == "-". My problem is that I want to include any gap characters in my edit distance. For example: > > > >> x <- DNAString("--A") >> y <- DNAString("--A") >> neditStartingAt(x,y) > [1] 0 > > > The distance between these two strings for my distance measurement should be 2. That is because the gap character ("-") gives me no data, so I cannot say that the distance between the two strings is 0. Can anyone think of a good method of counting common gap characters toward the edit distance? Unfortunately I cannot simply count up the number of gap characters in my pattern and add it to my edit distance. For example: > > > > >> x <- DNAString("AC--") >> y <- DNAString("ACC-") >> neditStartingAt(x,y) > [1] 1 > > > The distance between these two strings for my distance measurement should be 2. If I were to add the number of gap characters in the pattern to the edit distance I would get 2 + 1 = 3. So basically I want to count any gap character in the pattern or subject toward the edit distance because it gives me no information. Does this make sense, and does anyone have an idea for how to do this? > > > Thanks again, > Erik > > > > ----- Original Message ----- > From: "Patrick Aboyoun" <paboyoun at="" fhcrc.org=""> > To: "Michael Lawrence" <lawrence.michael at="" gene.com=""> > Cc: "BioC list" <bioconductor at="" stat.math.ethz.ch=""> > Sent: Saturday, March 27, 2010 7:39:16 PM GMT -06:00 US/Canada Central > Subject: Re: [BioC] Count differences between sequences > > I just added a Hamming distance metric to the stringDist function in BioC 2.6, which should be available on bioconductor.org in 36 hours. This uses the code underlying the neditStartingAt functions and loops over the lower triangle of the distance matrix in C so it is faster than the sapply method Herve provided. To calculate this newly added distance measure, specify stringDist(x, method = "hamming") where x is an XStringSet object containing equal length strings. library(Biostrings) library(drosophila2probe) ch0 <- drosophila2probe[seq_len(500*8000/25), "sequence"] ch <- unlist(strsplit(ch0, "")) m0 <- matrix(ch, 500) dna <- DNAStringSet(apply(m0, 1, paste, collapse="")) system.time(myDists <- stringDist(dna, method = "hamming")) # user system elapsed # 5.459 0.029 5.541 sessionInfo() R version 2.11.0 Under development (unstable) (2010-03-22 r51355) i386-apple-darwin9.8.0 locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats gra phics grDevices utils datasets methods base other attached packages: [1] drosophila2probe_2.5.0 AnnotationDbi_1.9.6 Biobase_2.7.5 [4] Biostrings_2.15.26 IRanges_1.5.72 loaded via a namespace (and not attached): [1] DBI_0.2-5 RSQLite_0.8-4 tools_2.11.0 Patrick On 3/26/10 1:35 PM, Michael Lawrence wrote: > 2010/3/26 Herv??? Pag???s > > >> Hi Erik, Patrick, >> >> >> Patrick Aboyoun wrote: >> >> >>> Erik, >>> Judging from your data, I would gather that you are not interested in >>> indels. Is that correct? You should look at the neditStartingAt function. >>> Something like the following may meet your needs: >>> >>> N<- length(myStrings) >>> myDists<- matrix(0, nrow = N, ncol = N) >>> for (i in 1:(N-1)) >>> for (j in (i+1):N) >>> myDists[i, j]<- myDists[j, i]<- neditStartingAt(myStrings[[i]], >>> myStrings[[j]]) >>> >>> >>> >> Note that you can take advantage of the fact that neditStartingAt() is >> vectorized with respect to its second argument: >> >> N<- length(myStrings) >> myD ists<- sapply(1:N, >> function(i) neditStartingAt(myStrings[[i]], myStrings))) >> >> That will make things hundred times faster with a big "DNA rectangle" >> like yours (500x8000): >> >> >>> system.time( >>> >> myDists<- sapply(1:N, >> function(i) neditStartingAt(myStrings[[i]], myStrings))) >> user system elapsed >> 12.053 0.000 12.723 >> >> >>> myDists[1:10, 1:10] >>> >> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] >> [1,] 0 5888 6055 5947 6152 6248 6038 6175 6268 6047 >> [2,] 5888 0 5849 6113 5926 6148 6117 5956 6167 6204 >> [3,] 6055 5849 0 6053 6184 5959 6137 6077 5997 6041 >> [4,] 5947 6113 6053 0 6111 6167 5910 6096 6121 5959 >> [5,] 6152 5926 6184 6111 0 6038 6209 5906 6019 6194 >> [6,] 6248 6148 5959 6167 6038 0 6085 6112 5924 6137 >> [7,] 6038 6117 6137 5910 6209 6085 0 5961 6192 5947 >> [8,] 6175 5956 6077 6096 5906 6112 5961 0 5899 6183 >> [9,] 6268 6167 5997 6121 6019 5924 6192 5899 0 5984 >> [10,] 6047 6204 6041 5959 6194 6137 5947 6183 5984 0 >> >> >> > W ow, nice. It sounds to me like this would be a common use case; how hard > would it be to vectorize both arguments? > > Michael > > > >> Cheers, >> H. >> >> >> >> >>> Patrick >>> >>> >>> On 3/25/10 2:57 PM, erikwright at comcast.net wrote: >>> >>> >>>> I have 500 DNAStrings, all of length 8000. I need the entire N x N >>>> distance matrix. >>>> >>>> Thanks, >>>> Erik >>>> >>>> >>>> >>>> ----- Original Message ----- >>>> From: "Patrick Aboyoun" >>>> To: erikwright at comcast.net >>>> Cc: bioconductor at stat.math.ethz.ch >>>> Sent: Thursday, March 25, 2010 4:45:29 PM GMT -06:00 US/Canada Central >>>> Subject: Re: [BioC] Count differences between sequences >>>> >>>> Erik, >>>> Could you provide more details on your data? How long are each of the >>>> strings and how many strings do you have? Also, do you need the entire N x N >>>> distance matrix for downstream analysis or are you just looking for closest >>>> relatives? >>>> >>>> >>>> Patrick >>>> >>>> >>>> >>>> On 3/25/10 2:29 PM, eri kwright at comcast.net wrote: >>>> >>>> >>>>> Hello all, >>>>> >>>>> >>>>> I have a large DNAStringSet and I am trying to calculate its >>>>> >>>>> >>>> distance matrix. My DNAStrings are equal width and they are already >>>> aligned. >>>> >>>> >>>>> I have tried using the stringDist() function, but it is very slow >>>>> >>>>> >>>> for large DNAStringSets. Is there a way to quickly calculate the number >>>> of differences between two DNAString instances? >>>> >>>> >>>>> For example, let's say I have two DNAStrings: "ACAC" and "ACAG". I >>>>> >>>>> >>>> would like to know if their is a function other than stringDist() that >>>> will tell me the distance between them is 1. >>>> >>>> >>>>> Thanks in advance for any help. >>>>> >>>>> >>>>> - Erik >>>>> [[alternative HTML version deleted]] >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at stat.math.ethz.ch >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Searc h the archives: >>>>> >>>>> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> -- >> Herv??? Pag???s >> >> Program in Computational Biology >> Division of Public Health Sciences >> >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informati cs.conductor >> >> > [[alternative HTML version deleted]] > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]] > _______________________________________________ Bioconductor mailing list Bioconductor at stat.math.ethz.ch https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]] > > > > -------------------------------------------------------------------- ---- > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 14.8 years ago Hervé Pagès 16k

0

Entering edit mode

Hi everyone, Since I don't want to include terminal gaps into my distance calculation, is there a function that will quickly return the number of leading gaps and/or trailing gaps? For example, seq1 <- DNAString("--AT-CG-") number of leading terminal gaps = 2 number of trailing terminal gaps = 1 Thanks again!, Erik On Mar 29, 2010, at 5:39 PM, Hervé Pagès wrote: > Hi Erik, > > To address your "gap problem", one solution is to replace the "-" in > one of the 2 strings to compare with a letter that is guaranteed not > to be in the other string, e.g. the "+" letter (yes, this is part of > the DNA alphabet, and if it was not, you could always work with > BStringSet objects). This can be done with chartr(): > > numF <- length(myDNAStringSet) > > distMatrix <- matrix(integer(numF*numF), nrow=numF) > > for (i in 1:(numF-1)) { > pattern <- chartr("-", "+", myDNAStringSet[[i]]) > # use IUPAC_CODE_MAP with fixed=FALSE > distMatrix[i, (i+1):numF] <- neditStartingAt( > pattern, > myDNAStringSet[(i+1):numF], > fixed=FALSE) > distMatrix[(i+1):numF, i] <- distMatrix[i, (i+1):numF] > } > > diag(distMatrix) <- 0L # I think you want 0 here, not 1 > > Takes about 15 sec. on my machine for your 500x8000 DNA rectangle. > > Note that I'm using an integer matrix which is twice smaller than a > numeric/double matrix. > > You could maybe get a small speed improvement by doing this replacement > once for all at the DNAStringSet level before you start the loop with: > > myDNAStringSet2 <- chartr("-", "+", myDNAStringSet) > > and by adapting the code in the loop but I don't expect this to make a > big difference. > > Cheers, > H. > > > erikwright at comcast.net wrote: >> Hi Patrick, Michael, Herv??, and Martin, Wow! Thanks very much everyone for putting so much effort into answering my question. While we wait for Patrick's update of stringDist to become available, I will describe my solution and a related problem I have run into. I am using this piece of code to do most of the work: numF <- length ( myDNAStringSet ) for ( i in 1 :( numF - 1 )) { # use IUPAC_CODE_MAP with fixed=FALSE distMatrix [ i ,( i + 1 ): numF ] <- neditStartingAt ( myDNAStringSet [[ i ]], myDNAStringSet [( i + 1 ): numF ], fixed = FALSE ) distMatrix [( i + 1 ): numF , i ] <- distMatrix [ i ,( i + 1 ): numF ] } diag ( distMatrix ) <- 1 This will populate a matrix with the number of differences between strings. The code works reasonably fast - about 15 seconds for a DNAStringSet with 500 sequences of length 8,000. A DNAStringSet with 8,000 sequences of length 8,000 is exponentially slower - about 45 minutes. Obviously, if Patrick's update of stringDist improves the speed it will greatly help. In playing around with this I have run into a related problem that you all might be able to help me with. As I stated in my original email, I am trying to calculate a Distance Matrix with a large aligned DNAStringSet. My DNAStringSet does not contain ideal data, meaning there are many gap characters ("-"). If I find the edit difference between two strings that have gaps in the same places then the gaps are not counted towards the edit distance. This makes complete sense because "-" == "-". My problem is that I want to include any gap characters in my edit distance. For example: >>> x <- DNAString("--A") y <- DNAString("--A") neditStartingAt(x,y) >> [1] 0 The distance between these two strings for my distance measurement should be 2. That is because the gap character ("-") gives me no data, so I cannot say that the distance between the two strings is 0. Can anyone think of a good method of counting common gap characters toward the edit distance? Unfortunately I cannot simply count up the number of gap characters in my pattern and add it to my edit distance. For example: >>> x <- DNAString("AC--") y <- DNAString("ACC-") neditStartingAt(x,y) >> [1] 1 The distance between these two strings for my distance measurement should be 2. If I were to add the number of gap characters in the pattern to the edit distance I would get 2 + 1 = 3. So basically I want to count any gap character in the pattern or subject toward the edit distance because it gives me no information. Does this make sense, and does anyone have an idea for how to do this? Thanks again, Erik ----- Original Message ----- From: "Patrick Aboyoun" <paboyoun at="" fhcrc.org=""> To: "Michael Lawrence" <lawrence.michael at="" gene.com=""> Cc: "BioC list" <bioconductor at="" stat.math.ethz.ch=""> Sent: Saturday, March 27, 2010 7:39:16 PM GMT -06:00 US/Canada Central Subject: Re: [BioC] Count differences between sequences I just added a Hamming distance metric to the stringDist function in BioC 2.6, which should be available on bioconductor.org in 36 hours. This uses the code underlying the neditStartingAt functions and loops over the lower triangle of the distance matrix in C so it is faster than the sapply method Herve provided. To calculate this newly added distance measure, specify stringDist(x, method = "hamming") where x is an XStringSet object containing equal length strings. library(Biostrings) library(drosophila2probe) ch0 <- drosophila2probe[seq_len(500*8000/25), "sequence"] ch <- unlist(strsplit(ch0, "")) m0 <- matrix(ch, 500) dna <- DNAStringSet(apply(m0, 1, paste, collapse="")) system.time(myDists <- stringDist(dna, method = "hamming")) # user system elapsed # 5.459 0.029 5.541 sessionInfo() R version 2.11.0 Under development (unstable) (2010-03-22 r51355) i386-apple-darwin9.8.0 locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats gra > phics grDevices utils datasets methods base other attached packages: [1] drosophila2probe_2.5.0 AnnotationDbi_1.9.6 Biobase_2.7.5 [4] Biostrings_2.15.26 IRanges_1.5.72 loaded via a namespace (and not attached): [1] DBI_0.2-5 RSQLite_0.8-4 tools_2.11.0 Patrick On 3/26/10 1:35 PM, Michael Lawrence wrote: > 2010/3/26 Herv??? Pag???s > > >> Hi Erik, Patrick, >> >> >> Patrick Aboyoun wrote: >> >> >>> Erik, >>> Judging from your data, I would gather that you are not interested in >>> indels. Is that correct? You should look at the neditStartingAt function. >>> Something like the following may meet your needs: >>> >>> N<- length(myStrings) >>> myDists<- matrix(0, nrow = N, ncol = N) >>> for (i in 1:(N-1)) >>> for (j in (i+1):N) >>> myDists[i, j]<- myDists[j, i]<- neditStartingAt(myStrings[[i]], >>> myStrings[[j]]) >>> >>> >>> >> Note that you can take advantage of the fact that neditStartingAt() is >> vectorized with respect to its second argument: >> >> N<- length(myStrings) >> myD > ists<- sapply(1:N, >> function(i) neditStartingAt(myStrings[[i]], myStrings))) >> >> That will make things hundred times faster with a big "DNA rectangle" >> like yours (500x8000): >> >> >>> system.time( >>> >> myDists<- sapply(1:N, >> function(i) neditStartingAt(myStrings[[i]], myStrings))) >> user system elapsed >> 12.053 0.000 12.723 >> >> >>> myDists[1:10, 1:10] >>> >> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] >> [1,] 0 5888 6055 5947 6152 6248 6038 6175 6268 6047 >> [2,] 5888 0 5849 6113 5926 6148 6117 5956 6167 6204 >> [3,] 6055 5849 0 6053 6184 5959 6137 6077 5997 6041 >> [4,] 5947 6113 6053 0 6111 6167 5910 6096 6121 5959 >> [5,] 6152 5926 6184 6111 0 6038 6209 5906 6019 6194 >> [6,] 6248 6148 5959 6167 6038 0 6085 6112 5924 6137 >> [7,] 6038 6117 6137 5910 6209 6085 0 5961 6192 5947 >> [8,] 6175 5956 6077 6096 5906 6112 5961 0 5899 6183 >> [9,] 6268 6167 5997 6121 6019 5924 6192 5899 0 5984 >> [10,] 6047 6204 6041 5959 6194 6137 5947 6183 5984 0 >> >> >> > W > ow, nice. It sounds to me like this would be a common use case; how hard > would it be to vectorize both arguments? > > Michael > > > >> Cheers, >> H. >> >> >> >> >>> Patrick >>> >>> >>> On 3/25/10 2:57 PM, erikwright at comcast.net wrote: >>> >>> >>>> I have 500 DNAStrings, all of length 8000. I need the entire N x N >>>> distance matrix. >>>> >>>> Thanks, >>>> Erik >>>> >>>> >>>> >>>> ----- Original Message ----- >>>> From: "Patrick Aboyoun" >>>> To: erikwright at comcast.net >>>> Cc: bioconductor at stat.math.ethz.ch >>>> Sent: Thursday, March 25, 2010 4:45:29 PM GMT -06:00 US/Canada Central >>>> Subject: Re: [BioC] Count differences between sequences >>>> >>>> Erik, >>>> Could you provide more details on your data? How long are each of the >>>> strings and how many strings do you have? Also, do you need the entire N x N >>>> distance matrix for downstream analysis or are you just looking for closest >>>> relatives? >>>> >>>> >>>> Patrick >>>> >>>> >>>> >>>> On 3/25/10 2:29 PM, eri > kwright at comcast.net wrote: >>>> >>>> >>>>> Hello all, >>>>> >>>>> >>>>> I have a large DNAStringSet and I am trying to calculate its >>>>> >>>>> >>>> distance matrix. My DNAStrings are equal width and they are already >>>> aligned. >>>> >>>> >>>>> I have tried using the stringDist() function, but it is very slow >>>>> >>>>> >>>> for large DNAStringSets. Is there a way to quickly calculate the number >>>> of differences between two DNAString instances? >>>> >>>> >>>>> For example, let's say I have two DNAStrings: "ACAC" and "ACAG". I >>>>> >>>>> >>>> would like to know if their is a function other than stringDist() that >>>> will tell me the distance between them is 1. >>>> >>>> >>>>> Thanks in advance for any help. >>>>> >>>>> >>>>> - Erik >>>>> [[alternative HTML version deleted]] >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at stat.math.ethz.ch >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Searc > h the archives: >>>>> >>>>> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> -- >> Herv??? Pag???s >> >> Program in Computational Biology >> Division of Public Health Sciences >> >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informati > cs.conductor >> >> > [[alternative HTML version deleted]] > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]] >> _______________________________________________ Bioconductor mailing list Bioconductor at stat.math.ethz.ch https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> [[alternative HTML version deleted]] >> ------------------------------------------------------------------- ----- >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 14.8 years ago Erik Wright ▴ 210

0

Entering edit mode

Erik Wright wrote: > Hi everyone, > > Since I don't want to include terminal gaps into my distance calculation, is there a function that will quickly return the number of leading gaps and/or trailing gaps? > For example, > seq1 <- DNAString("--AT-CG-") > number of leading terminal gaps = 2 > number of trailing terminal gaps = 1 Maybe not the most efficient way to do this but the trimLRPatterns() function can be used for this: countLeadingLetter <- function(x, letter="-") { Lpattern <- DNAString(pasterep.int("-", length(x)), collapse="")) rg <- trimLRPatterns(Lpattern=Lpattern , subject=x, ranges=TRUE) start(rg) - 1L } countTrailingLetter <- function(x, letter="-") { Rpattern <- DNAString(pasterep.int("-", length(x)), collapse="")) rg <- trimLRPatterns(Rpattern=Rpattern , subject=x, ranges=TRUE) length(x) - end(rg) } Note that when using this in the context of your distance calculation you need to be careful that adding countLeadingLetter() + countTrailingLetter() won't be correct when the string contains only gaps (very unlikely to happen but still). Also in this context, since the left and right fake patterns are always going to be the same, it would be better to generate them once for all before starting the big "for" loop. I didn't do any timing so please let us know if this is too slow. Cheers, H. > > Thanks again!, > Erik > > > On Mar 29, 2010, at 5:39 PM, Hervé Pagès wrote: > >> Hi Erik, >> >> To address your "gap problem", one solution is to replace the "-" in >> one of the 2 strings to compare with a letter that is guaranteed not >> to be in the other string, e.g. the "+" letter (yes, this is part of >> the DNA alphabet, and if it was not, you could always work with >> BStringSet objects). This can be done with chartr(): >> >> numF <- length(myDNAStringSet) >> >> distMatrix <- matrix(integer(numF*numF), nrow=numF) >> >> for (i in 1:(numF-1)) { >> pattern <- chartr("-", "+", myDNAStringSet[[i]]) >> # use IUPAC_CODE_MAP with fixed=FALSE >> distMatrix[i, (i+1):numF] <- neditStartingAt( >> pattern, >> myDNAStringSet[(i+1):numF], >> fixed=FALSE) >> distMatrix[(i+1):numF, i] <- distMatrix[i, (i+1):numF] >> } >> >> diag(distMatrix) <- 0L # I think you want 0 here, not 1 >> >> Takes about 15 sec. on my machine for your 500x8000 DNA rectangle. >> >> Note that I'm using an integer matrix which is twice smaller than a >> numeric/double matrix. >> >> You could maybe get a small speed improvement by doing this replacement >> once for all at the DNAStringSet level before you start the loop with: >> >> myDNAStringSet2 <- chartr("-", "+", myDNAStringSet) >> >> and by adapting the code in the loop but I don't expect this to make a >> big difference. >> >> Cheers, >> H. >> >> >> erikwright at comcast.net wrote: >>> Hi Patrick, Michael, Herv??, and Martin, Wow! Thanks very much everyone for putting so much effort into answering my question. While we wait for Patrick's update of stringDist to become available, I will describe my solution and a related problem I have run into. I am using this piece of code to do most of the work: numF <- length ( myDNAStringSet ) for ( i in 1 :( numF - 1 )) { # use IUPAC_CODE_MAP with fixed=FALSE distMatrix [ i ,( i + 1 ): numF ] <- neditStartingAt ( myDNAStringSet [[ i ]], myDNAStringSet [( i + 1 ): numF ], fixed = FALSE ) distMatrix [( i + 1 ): numF , i ] <- distMatrix [ i ,( i + 1 ): numF ] } diag ( distMatrix ) <- 1 This will populate a matrix with the number of differences between strings. The code works reasonably fast - about 15 seconds for a DNAStringSet with 500 sequences of length 8,000. A DNAStringSet with 8,000 sequences of length 8,000 is exponentially slower - about 45 minutes. Obviously, if Patrick's update of stringDist improves the spe ed it will greatly help. In playing around with this I have run into a related problem that you all might be able to help me with. As I stated in my original email, I am trying to calculate a Distance Matrix with a large aligned DNAStringSet. My DNAStringSet does not contain ideal data, meaning there are many gap characters ("-"). If I find the edit difference between two strings that have gaps in the same places then the gaps are not counted towards the edit distance. This makes complete sense because "-" == "-". My problem is that I want to include any gap characters in my edit distance. For example: >>>> x <- DNAString("--A") y <- DNAString("--A") neditStartingAt(x,y) >>> [1] 0 The distance between these two strings for my distance measurement should be 2. That is because the gap character ("-") gives me no data, so I cannot say that the distance between the two strings is 0. Can anyone think of a good method of counting common gap characters toward the edit distance? Unfortunately I cannot simply count up the number of gap characters in my pattern and add it to my edit distance. For example: >>>> x <- DNAString("AC--") y <- DNAString("ACC-") neditStartingAt(x,y) >>> [1] 1 The distance between these two strings for my distance measurement should be 2. If I were to add the number of gap characters in the pattern to the edit distance I would get 2 + 1 = 3. So basically I want to count any gap character in the pattern or subject toward the edit distance because it gives me no information. Does this make sense, and does anyone have an idea for how to do this? Thanks again, Erik ----- Original Message ----- From: "Patrick Aboyoun" <paboyoun at="" fhcrc.org=""> To: "Michael Lawrence" <lawrence.michael at="" gene.com=""> Cc: "BioC list" <bioconductor at="" stat.math.ethz.ch=""> Sent: Saturday, March 27, 2010 7:39:16 PM GMT -06:00 US/Canada Central Subject: Re: [BioC] Count differences between sequences I just added a Hamming distance metric to the stringDist function in BioC 2.6, which should be available on bioconductor.org in 36 hours. This uses the code underlying the neditStartingAt functions and loops over the lower triangle of the distance matrix in C so it is f aster than the sapply method Herve provided. To calculate this newly added distance measure, specify stringDist(x, method = "hamming") where x is an XStringSet object containing equal length strings. library(Biostrings) library(drosophila2probe) ch0 <- drosophila2probe[seq_len(500*8000/25), "sequence"] ch <- unlist(strsplit(ch0, "")) m0 <- matrix(ch, 500) dna <- DNAStringSet(apply(m0, 1, paste, collapse="")) system.time(myDists <- stringDist(dna, method = "hamming")) # user system elapsed # 5.459 0.029 5.541 sessionInfo() R version 2.11.0 Under development (unstable) (2010-03-22 r51355) i386-apple-darwin9.8.0 locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats gra >> phics grDevices utils datasets methods base other attached packages: [1] drosophila2probe_2.5.0 AnnotationDbi_1.9.6 Biobase_2.7.5 [4] Biostrings_2.15.26 IRanges_1.5.72 loaded via a namespace (and not attached): [1] DBI_0.2-5 RSQLite_0.8-4 tools_2.11.0 Patrick On 3/26/10 1:35 PM, Michael Lawrence wrote: > 2010/3/26 Herv??? Pag???s > > >> Hi Erik, Patrick, >> >> >> Patrick Aboyoun wrote: >> >> >>> Erik, >>> Judging from your data, I would gather that you are not interested in >>> indels. Is that correct? You should look at the neditStartingAt function. >>> Something like the following may meet your needs: >>> >>> N<- length(myStrings) >>> myDists<- matrix(0, nrow = N, ncol = N) >>> for (i in 1:(N-1)) >>> for (j in (i+1):N) >>> myDists[i, j]<- myDists[j, i]<- neditStartingAt(myStrings[[i]], >>> myStrings[[j]]) >>> >>> >>> >> Note that you can take advantage of the fact that neditStartingAt() is >> vectorized with respect to its second argument: >> >> N<- length(myStrings) >> myD >> ists<- sapply(1:N, >> function(i) neditStartingAt(myStrings[[i]], myStrings))) >> >> That will make things hundred times faster with a big "DNA rectangle" >> like yours (500x8000): >> >> >>> system.time( >>> >> myDists<- sapply(1:N, >> function(i) neditStartingAt(myStrings[[i]], myStrings))) >> user system elapsed >> 12.053 0.000 12.723 >> >> >>> myDists[1:10, 1:10] >>> >> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] >> [1,] 0 5888 6055 5947 6152 6248 6038 6175 6268 6047 >> [2,] 5888 0 5849 6113 5926 6148 6117 5956 6167 6204 >> [3,] 6055 5849 0 6053 6184 5959 6137 6077 5997 6041 >> [4,] 5947 6113 6053 0 6111 6167 5910 6096 6121 5959 >> [5,] 6152 5926 6184 6111 0 6038 6209 5906 6019 6194 >> [6,] 6248 6148 5959 6167 6038 0 6085 6112 5924 6137 >> [7,] 6038 6117 6137 5910 6209 6085 0 5961 6192 5947 >> [8,] 6175 5956 6077 6096 5906 6112 5961 0 5899 6183 >> [9,] 6268 6167 5997 6121 6019 5924 6192 5899 0 5984 >> [10,] 6047 6204 6041 5959 6194 6137 5947 6183 5984 0 >> >> >> > W >> ow, nice. It sounds to me like this would be a common use case; how hard > would it be to vectorize both arguments? > > Michael > > > >> Cheers, >> H. >> >> >> >> >>> Patrick >>> >>> >>> On 3/25/10 2:57 PM, erikwright at comcast.net wrote: >>> >>> >>>> I have 500 DNAStrings, all of length 8000. I need the entire N x N >>>> distance matrix. >>>> >>>> Thanks, >>>> Erik >>>> >>>> >>>> >>>> ----- Original Message ----- >>>> From: "Patrick Aboyoun" >>>> To: erikwright at comcast.net >>>> Cc: bioconductor at stat.math.ethz.ch >>>> Sent: Thursday, March 25, 2010 4:45:29 PM GMT -06:00 US/Canada Central >>>> Subject: Re: [BioC] Count differences between sequences >>>> >>>> Erik, >>>> Could you provide more details on your data? How long are each of the >>>> strings and how many strings do you have? Also, do you need the entire N x N >>>> distance matrix for downstream analysis or are you just looking for closest >>>> relatives? >>>> >>>> >>>> Patrick >>>> >>>> >>>> >>>> On 3/25/10 2:29 PM, eri >> kwright at comcast.net wrote: >>>> >>>> >>>>> Hello all, >>>>> >>>>> >>>>> I have a large DNAStringSet and I am trying to calculate its >>>>> >>>>> >>>> distance matrix. My DNAStrings are equal width and they are already >>>> aligned. >>>> >>>> >>>>> I have tried using the stringDist() function, but it is very slow >>>>> >>>>> >>>> for large DNAStringSets. Is there a way to quickly calculate the number >>>> of differences between two DNAString instances? >>>> >>>> >>>>> For example, let's say I have two DNAStrings: "ACAC" and "ACAG". I >>>>> >>>>> >>>> would like to know if their is a function other than stringDist() that >>>> will tell me the distance between them is 1. >>>> >>>> >>>>> Thanks in advance for any help. >>>>> >>>>> >>>>> - Erik >>>>> [[alternative HTML version deleted]] >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at stat.math.ethz.ch >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Se arc >> h the archives: >>>>> >>>>> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> -- >> Herv??? Pag???s >> >> Program in Computational Biology >> Division of Public Health Sciences >> >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.inform ati >> cs.conductor >> >> > [[alternative HTML version deleted]] > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]] >>> _______________________________________________ Bioconductor mailing list Bioconductor at stat.math.ethz.ch https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>> [[alternative HTML version deleted]] >>> ------------------------------------------------------------------ ------ >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 14.8 years ago Hervé Pagès 16k

0

Entering edit mode

Erik, This is much faster: countLeadingLetter2 <- function(x, letter="-") { ans <- which(!isMatchingAt(letter, x, seq_len(length(x))))[1L] if is.na(ans)) return(length(x)) ans - 1L } countTrailingLetter2 <- function(x, letter="-") countLeadingLetter2(reverse(x), letter=letter) Cheers, H. Hervé Pagès wrote: > Erik Wright wrote: >> Hi everyone, >> >> Since I don't want to include terminal gaps into my distance >> calculation, is there a function that will quickly return the number >> of leading gaps and/or trailing gaps? >> For example, >> seq1 <- DNAString("--AT-CG-") >> number of leading terminal gaps = 2 >> number of trailing terminal gaps = 1 > > Maybe not the most efficient way to do this but the trimLRPatterns() > function can be used for this: > > countLeadingLetter <- function(x, letter="-") > { > Lpattern <- DNAString(pasterep.int("-", length(x)), collapse="")) > rg <- trimLRPatterns(Lpattern=Lpattern , subject=x, ranges=TRUE) > start(rg) - 1L > } > > countTrailingLetter <- function(x, letter="-") > { > Rpattern <- DNAString(pasterep.int("-", length(x)), collapse="")) > rg <- trimLRPatterns(Rpattern=Rpattern , subject=x, ranges=TRUE) > length(x) - end(rg) > } > > Note that when using this in the context of your distance > calculation you need to be careful that adding > countLeadingLetter() + countTrailingLetter() > won't be correct when the string contains only gaps (very > unlikely to happen but still). > > Also in this context, since the left and right fake patterns > are always going to be the same, it would be better to generate > them once for all before starting the big "for" loop. > > I didn't do any timing so please let us know if this is too slow. > > Cheers, > H. > >> >> Thanks again!, >> Erik >> >> >> On Mar 29, 2010, at 5:39 PM, Hervé Pagès wrote: >> >>> Hi Erik, >>> >>> To address your "gap problem", one solution is to replace the "-" in >>> one of the 2 strings to compare with a letter that is guaranteed not >>> to be in the other string, e.g. the "+" letter (yes, this is part of >>> the DNA alphabet, and if it was not, you could always work with >>> BStringSet objects). This can be done with chartr(): >>> >>> numF <- length(myDNAStringSet) >>> >>> distMatrix <- matrix(integer(numF*numF), nrow=numF) >>> >>> for (i in 1:(numF-1)) { >>> pattern <- chartr("-", "+", myDNAStringSet[[i]]) >>> # use IUPAC_CODE_MAP with fixed=FALSE >>> distMatrix[i, (i+1):numF] <- neditStartingAt( >>> pattern, >>> myDNAStringSet[(i+1):numF], >>> fixed=FALSE) >>> distMatrix[(i+1):numF, i] <- distMatrix[i, (i+1):numF] >>> } >>> >>> diag(distMatrix) <- 0L # I think you want 0 here, not 1 >>> >>> Takes about 15 sec. on my machine for your 500x8000 DNA rectangle. >>> >>> Note that I'm using an integer matrix which is twice smaller than a >>> numeric/double matrix. >>> >>> You could maybe get a small speed improvement by doing this replacement >>> once for all at the DNAStringSet level before you start the loop with: >>> >>> myDNAStringSet2 <- chartr("-", "+", myDNAStringSet) >>> >>> and by adapting the code in the loop but I don't expect this to make a >>> big difference. >>> >>> Cheers, >>> H. >>> >>> >>> erikwright at comcast.net wrote: >>>> Hi Patrick, Michael, Herv??, and Martin, Wow! Thanks very much >>>> everyone for putting so much effort into answering my question. >>>> While we wait for Patrick's update of stringDist to become >>>> available, I will describe my solution and a related problem I have >>>> run into. I am using this piece of code to do most of the work: numF >>>> <- length ( myDNAStringSet ) for ( i in 1 :( numF - 1 )) { # use >>>> IUPAC_CODE_MAP with fixed=FALSE distMatrix [ i ,( i + 1 ): numF ] <- >>>> neditStartingAt ( myDNAStringSet [[ i ]], myDNAStringSet [( i + 1 ): >>>> numF ], fixed = FALSE ) distMatrix [( i + 1 ): numF , i ] <- >>>> distMatrix [ i ,( i + 1 ): numF ] } diag ( distMatrix ) <- 1 This >>>> will populate a matrix with the number of differences between >>>> strings. The code works reasonably fast - about 15 seconds for a >>>> DNAStringSet with 500 sequences of length 8,000. A DNAStringSet with >>>> 8,000 sequences of length 8,000 is exponentially slower - about 45 >>>> minutes. Obviously, if Patrick's update of stringDist improves the spe > ed it will greatly help. In playing around with this I have run into a > related problem that you all might be able to help me with. As I stated > in my original email, I am trying to calculate a Distance Matrix with a > large aligned DNAStringSet. My DNAStringSet does not contain ideal data, > meaning there are many gap characters ("-"). If I find the edit > difference between two strings that have gaps in the same places then > the gaps are not counted towards the edit distance. This makes complete > sense because "-" == "-". My problem is that I want to include any gap > characters in my edit distance. For example: >>>>> x <- DNAString("--A") y <- DNAString("--A") neditStartingAt(x,y) >>>> [1] 0 The distance between these two strings for my distance >>>> measurement should be 2. That is because the gap character ("-") >>>> gives me no data, so I cannot say that the distance between the two >>>> strings is 0. Can anyone think of a good method of counting common >>>> gap characters toward the edit distance? Unfortunately I cannot >>>> simply count up the number of gap characters in my pattern and add >>>> it to my edit distance. For example: >>>>> x <- DNAString("AC--") y <- DNAString("ACC-") neditStartingAt(x,y) >>>> [1] 1 The distance between these two strings for my distance >>>> measurement should be 2. If I were to add the number of gap >>>> characters in the pattern to the edit distance I would get 2 + 1 = >>>> 3. So basically I want to count any gap character in the pattern or >>>> subject toward the edit distance because it gives me no information. >>>> Does this make sense, and does anyone have an idea for how to do >>>> this? Thanks again, Erik ----- Original Message ----- From: "Patrick >>>> Aboyoun" <paboyoun at="" fhcrc.org=""> To: "Michael Lawrence" >>>> <lawrence.michael at="" gene.com=""> Cc: "BioC list" >>>> <bioconductor at="" stat.math.ethz.ch=""> Sent: Saturday, March 27, 2010 >>>> 7:39:16 PM GMT -06:00 US/Canada Central Subject: Re: [BioC] Count >>>> differences between sequences I just added a Hamming distance metric >>>> to the stringDist function in BioC 2.6, which should be available on >>>> bioconductor.org in 36 hours. This uses the code underlying the >>>> neditStartingAt functions and loops over the lower triangle of the >>>> distance matrix in C so it is f > aster than the sapply method Herve provided. To calculate this newly > added distance measure, specify stringDist(x, method = "hamming") where > x is an XStringSet object containing equal length strings. > library(Biostrings) library(drosophila2probe) ch0 <- > drosophila2probe[seq_len(500*8000/25), "sequence"] ch <- > unlist(strsplit(ch0, "")) m0 <- matrix(ch, 500) dna <- > DNAStringSet(apply(m0, 1, paste, collapse="")) system.time(myDists <- > stringDist(dna, method = "hamming")) # user system elapsed # 5.459 0.029 > 5.541 sessionInfo() R version 2.11.0 Under development (unstable) > (2010-03-22 r51355) i386-apple-darwin9.8.0 locale: [1] > en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base > packages: [1] stats gra >>> phics grDevices utils datasets methods base other attached packages: >>> [1] drosophila2probe_2.5.0 AnnotationDbi_1.9.6 Biobase_2.7.5 [4] >>> Biostrings_2.15.26 IRanges_1.5.72 loaded via a namespace (and not >>> attached): [1] DBI_0.2-5 RSQLite_0.8-4 tools_2.11.0 Patrick On >>> 3/26/10 1:35 PM, Michael Lawrence wrote: > 2010/3/26 Herv??? Pag???s >>> > > >> Hi Erik, Patrick, >> >> >> Patrick Aboyoun wrote: >> >> >>> >>> Erik, >>> Judging from your data, I would gather that you are not >>> interested in >>> indels. Is that correct? You should look at the >>> neditStartingAt function. >>> Something like the following may meet >>> your needs: >>> >>> N<- length(myStrings) >>> myDists<- matrix(0, >>> nrow = N, ncol = N) >>> for (i in 1:(N-1)) >>> for (j in (i+1):N) >>> >>> myDists[i, j]<- myDists[j, i]<- neditStartingAt(myStrings[[i]], >>> >>> myStrings[[j]]) >>> >>> >>> >> Note that you can take advantage of >>> the fact that neditStartingAt() is >> vectorized with respect to its >>> second argument: >> >> N<- length(myStrings) >> > myD >>> ists<- sapply(1:N, >> function(i) neditStartingAt(myStrings[[i]], >>> myStrings))) >> >> That will make things hundred times faster with a >>> big "DNA rectangle" >> like yours (500x8000): >> >> >>> system.time( >>> >>> >> myDists<- sapply(1:N, >> function(i) >>> neditStartingAt(myStrings[[i]], myStrings))) >> user system elapsed >>> >> 12.053 0.000 12.723 >> >> >>> myDists[1:10, 1:10] >>> >> [,1] [,2] >>> [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] >> [1,] 0 5888 6055 5947 >>> 6152 6248 6038 6175 6268 6047 >> [2,] 5888 0 5849 6113 5926 6148 6117 >>> 5956 6167 6204 >> [3,] 6055 5849 0 6053 6184 5959 6137 6077 5997 6041 >>> >> [4,] 5947 6113 6053 0 6111 6167 5910 6096 6121 5959 >> [5,] 6152 >>> 5926 6184 6111 0 6038 6209 5906 6019 6194 >> [6,] 6248 6148 5959 6167 >>> 6038 0 6085 6112 5924 6137 >> [7,] 6038 6117 6137 5910 6209 6085 0 >>> 5961 6192 5947 >> [8,] 6175 5956 6077 6096 5906 6112 5961 0 5899 6183 >>> >> [9,] 6268 6167 5997 6121 6019 5924 6192 5899 0 5984 >> [10,] 6047 >>> 6204 6041 5959 6194 6137 5947 6183 5984 0 >> >> >> >> W >>> ow, nice. It sounds to me like this would be a common use case; how >>> hard > would it be to vectorize both arguments? > > Michael > > > >> >>> Cheers, >> H. >> >> >> >> >>> Patrick >>> >>> >>> On 3/25/10 2:57 PM, >>> erikwright at comcast.net wrote: >>> >>> >>>> I have 500 DNAStrings, all >>> of length 8000. I need the entire N x N >>>> distance matrix. >>>> >>> >>>> Thanks, >>>> Erik >>>> >>>> >>>> >>>> ----- Original Message >>> ----- >>>> From: "Patrick Aboyoun" >>>> To: erikwright at comcast.net >>> >>>> Cc: bioconductor at stat.math.ethz.ch >>>> Sent: Thursday, March >>> 25, 2010 4:45:29 PM GMT -06:00 US/Canada Central >>>> Subject: Re: >>> [BioC] Count differences between sequences >>>> >>>> Erik, >>>> Could >>> you provide more details on your data? How long are each of the >>>> >>> strings and how many strings do you have? Also, do you need the >>> entire N x N >>>> distance matrix for downstream analysis or are you >>> just looking for closest >>>> relatives? >>>> >>>> >>>> Patrick >>>> >>> >>>> >>>> >>>> On 3/25/10 2:29 PM, > eri >>> kwright at comcast.net wrote: >>>> >>>> >>>>> Hello all, >>>>> >>>>> >>> >>>>> I have a large DNAStringSet and I am trying to calculate its >>> >>>>> >>>>> >>>> distance matrix. My DNAStrings are equal width and >>> they are already >>>> aligned. >>>> >>>> >>>>> I have tried using the >>> stringDist() function, but it is very slow >>>>> >>>>> >>>> for large >>> DNAStringSets. Is there a way to quickly calculate the number >>>> of >>> differences between two DNAString instances? >>>> >>>> >>>>> For >>> example, let's say I have two DNAStrings: "ACAC" and "ACAG". I >>>>> >>> >>>>> >>>> would like to know if their is a function other than >>> stringDist() that >>>> will tell me the distance between them is 1. >>> >>>> >>>> >>>>> Thanks in advance for any help. >>>>> >>>>> >>>>> - >>> Erik >>>>> [[alternative HTML version deleted]] >>>>> >>>>> >>> _______________________________________________ >>>>> Bioconductor >>> mailing list >>>>> Bioconductor at stat.math.ethz.ch >>>>> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Se > arc >>> h the archives: >>>>> >>>>> >>>> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>>> >>>> >>> >>> [[alternative HTML version deleted]] >>> >>> >>> _______________________________________________ >>> Bioconductor >>> mailing list >>> Bioconductor at stat.math.ethz.ch >>> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the >>> archives: >>> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >>> >> -- >> Herv??? Pag???s >> >> Program in Computational Biology >>> >> Division of Public Health Sciences >> >> Fred Hutchinson Cancer >>> Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >> >>> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) >>> 667-5791 >> Fax: (206) 667-1319 >> >> >> >>> _______________________________________________ >> Bioconductor >>> mailing list >> Bioconductor at stat.math.ethz.ch >> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the >>> archives: >> http://news.gmane.org/gmane.science.biology.inform > ati >>> cs.conductor >> >> > [[alternative HTML version deleted]] > > > > > >>> _______________________________________________ > Bioconductor >>> mailing list > Bioconductor at stat.math.ethz.ch > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the >>> archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> [[alternative HTML version deleted]] >>>> _______________________________________________ Bioconductor mailing >>>> list Bioconductor at stat.math.ethz.ch >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the >>>> archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> [[alternative HTML version deleted]] >>>> ----------------------------------------------------------------- ------- >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at stat.math.ethz.ch >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> -- >>> Hervé Pagès >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M2-B876 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpages at fhcrc.org >>> Phone: (206) 667-5791 >>> Fax: (206) 667-1319 >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 14.8 years ago Hervé Pagès 16k

Login before adding your answer.