Hi Patrick, Michael, Hervé, and Martin,
Wow! Thanks very much everyone for putting so much effort into
answering my question. While we wait for Patrick's update of
stringDist to become available, I will describe my solution and a
related problem I have run into.
I am using this piece of code to do most of the work:
numF <- length ( myDNAStringSet )
for ( i in 1 :( numF - 1 )) {
# use IUPAC_CODE_MAP with fixed=FALSE
distMatrix [ i ,( i + 1 ): numF ] <- neditStartingAt (
myDNAStringSet [[ i ]],
myDNAStringSet [( i + 1 ): numF ],
fixed = FALSE )
distMatrix [( i + 1 ): numF , i ] <- distMatrix [ i ,( i + 1 ): numF ]
}
diag ( distMatrix ) <- 1
This will populate a matrix with the number of differences between
strings. The code works reasonably fast - about 15 seconds for a
DNAStringSet with 500 sequences of length 8,000. A DNAStringSet with
8,000 sequences of length 8,000 is exponentially slower - about 45
minutes. Obviously, if Patrick's update of stringDist improves the
speed it will greatly help.
In playing around with this I have run into a related problem that you
all might be able to help me with. As I stated in my original email, I
am trying to calculate a Distance Matrix with a large aligned
DNAStringSet. My DNAStringSet does not contain ideal data, meaning
there are many gap characters ("-"). If I find the edit difference
between two strings that have gaps in the same places then the gaps
are not counted towards the edit distance. This makes complete sense
because "-" == "-". My problem is that I want to include any gap
characters in my edit distance. For example:
> x <- DNAString("--A")
> y <- DNAString("--A")
> neditStartingAt(x,y)
[1] 0
The distance between these two strings for my distance measurement
should be 2. That is because the gap character ("-") gives me no data,
so I cannot say that the distance between the two strings is 0. Can
anyone think of a good method of counting common gap characters toward
the edit distance? Unfortunately I cannot simply count up the number
of gap characters in my pattern and add it to my edit distance. For
example:
> x <- DNAString("AC--")
> y <- DNAString("ACC-")
> neditStartingAt(x,y)
[1] 1
The distance between these two strings for my distance measurement
should be 2. If I were to add the number of gap characters in the
pattern to the edit distance I would get 2 + 1 = 3. So basically I
want to count any gap character in the pattern or subject toward the
edit distance because it gives me no information. Does this make
sense, and does anyone have an idea for how to do this?
Thanks again,
Erik
----- Original Message -----
From: "Patrick Aboyoun" <paboyoun@fhcrc.org>
To: "Michael Lawrence" <lawrence.michael@gene.com>
Cc: "BioC list" <bioconductor@stat.math.ethz.ch>
Sent: Saturday, March 27, 2010 7:39:16 PM GMT -06:00 US/Canada Central
Subject: Re: [BioC] Count differences between sequences
I just added a Hamming distance metric to the stringDist function in
BioC 2.6, which should be available on bioconductor.org in 36 hours.
This uses the code underlying the neditStartingAt functions and loops
over the lower triangle of the distance matrix in C so it is faster
than the sapply method Herve provided. To calculate this newly added
distance measure, specify stringDist(x, method = "hamming") where x is
an XStringSet object containing equal length strings.
library(Biostrings) library(drosophila2probe) ch0 <-
drosophila2probe[seq_len(500*8000/25), "sequence"] ch <-
unlist(strsplit(ch0, "")) m0 <- matrix(ch, 500) dna <-
DNAStringSet(apply(m0, 1, paste, collapse="")) system.time(myDists <-
stringDist(dna, method = "hamming")) # user system elapsed # 5.459
0.029 5.541 sessionInfo() R version 2.11.0 Under development
(unstable) (2010-03-22 r51355) i386-apple-darwin9.8.0 locale: [1]
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base
packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] drosophila2probe_2.5.0
AnnotationDbi_1.9.6 Biobase_2.7.5 [4] Biostrings_2.15.26
IRanges_1.5.72 loaded via a namespace (and not attached): [1]
DBI_0.2-5 RSQLite_0.8-4 tools_2.11.0 Patrick On 3/26/10 1:35 PM,
Michael Lawrence wrote: > 2010/3/26 Herv� Pag�s > > >> Hi Erik,
Patrick, >> >> >> Patrick Aboyoun wrote: >> >> >>> Erik, >>> Judging
from your data, I would gather that you are not interested in >>>
indels. Is that correct? You should look at the neditStartingAt
function. >>> Something like the following may meet your needs: >>>
>>> N<- length(myStrings) >>> myDists<- matrix(0, nrow = N, ncol = N)
>>> for (i in 1:(N-1)) >>> for (j in (i+1):N) >>> myDists[i, j]<-
myDists[j, i]<- neditStartingAt(myStrings[[i]], >>> myStrings[[j]])
>>> >>> >>> >> Note that you can take advantage of the fact that
neditStartingAt() is >> vectorized with respect to its second
argument: >> >> N<- length(myStrings) >> myDists<- sapply(1:N, >>
function(i) neditStartingAt(myStrings[[i]], myStrings))) >> >> That
will make things hundred times faster with a big "DNA rectangle" >>
like yours (500x8000): >> >> >>> system.time( >>> >> myDists<-
sapply(1:N, >> function(i) neditStartingAt(myStrings[[i]],
myStrings))) >> user system elapsed >> 12.053 0.000 12.723 >> >> >>>
myDists[1:10, 1:10] >>> >> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[,9] [,10] >> [1,] 0 5888 6055 5947 6152 6248 6038 6175 6268 6047 >>
[2,] 5888 0 5849 6113 5926 6148 6117 5956 6167 6204 >> [3,] 6055 5849
0 6053 6184 5959 6137 6077 5997 6041 >> [4,] 5947 6113 6053 0 6111
6167 5910 6096 6121 5959 >> [5,] 6152 5926 6184 6111 0 6038 6209 5906
6019 6194 >> [6,] 6248 6148 5959 6167 6038 0 6085 6112 5924 6137 >>
[7,] 6038 6117 6137 5910 6209 6085 0 5961 6192 5947 >> [8,] 6175 5956
6077 6096 5906 6112 5961 0 5899 6183 >> [9,] 6268 6167 5997 6121 6019
5924 6192 5899 0 5984 >> [10,] 6047 6204 6041 5959 6194 6137 5947 6183
5984 0 >> >> >> > Wow, nice. It sounds to me like this would be a
common use case; how hard > would it be to vectorize both arguments? >
> Michael > > > >> Cheers, >> H. >> >> >> >> >>> Patrick >>> >>> >>>
On 3/25/10 2:57 PM, erikwright@comcast.net wrote: >>> >>> >>>> I have
500 DNAStrings, all of length 8000. I need the entire N x N >>>>
distance matrix. >>>> >>>> Thanks, >>>> Erik >>>> >>>> >>>> >>>> -----
Original Message ----- >>>> From: "Patrick Aboyoun" >>>> To:
erikwright@comcast.net >>>> Cc: bioconductor@stat.math.ethz.ch >>>>
Sent: Thursday, March 25, 2010 4:45:29 PM GMT -06:00 US/Canada Central
>>>> Subject: Re: [BioC] Count differences between sequences >>>> >>>>
Erik, >>>> Could you provide more details on your data? How long are
each of the >>>> strings and how many strings do you have? Also, do
you need the entire N x N >>>> distance matrix for downstream analysis
or are you just looking for closest >>>> relatives? >>>> >>>> >>>>
Patrick >>>> >>>> >>>> >>>> On 3/25/10 2:29 PM, erikwright@comcast.net
wrote: >>>> >>>> >>>>> Hello all, >>>>> >>>>> >>>>> I have a large
DNAStringSet and I am trying to calculate its >>>>> >>>>> >>>>
distance matrix. My DNAStrings are equal width and they are already
>>>> aligned. >>>> >>>> >>>>> I have tried using the stringDist()
function, but it is very slow >>>>> >>>>> >>>> for large
DNAStringSets. Is there a way to quickly calculate the number >>>> of
differences between two DNAString instances? >>>> >>>> >>>>> For
example, let's say I have two DNAStrings: "ACAC" and "ACAG". I >>>>>
>>>>> >>>> would like to know if their is a function other than
stringDist() that >>>> will tell me the distance between them is 1.
>>>> >>>> >>>>> Thanks in advance for any help. >>>>> >>>>> >>>>> -
Erik >>>>> [[alternative HTML version deleted]] >>>>> >>>>>
_______________________________________________ >>>>> Bioconductor
mailing list >>>>> Bioconductor@stat.math.ethz.ch >>>>>
https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the
archives: >>>>> >>>>> >>>>
http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>
>>>> >>> >>> [[alternative HTML version deleted]] >>> >>>
_______________________________________________ >>> Bioconductor
mailing list >>> Bioconductor@stat.math.ethz.ch >>>
https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the
archives: >>>
http://news.gmane.org/gmane.science.biology.informatics.conductor >>>
>>> >> -- >> Herv� Pag�s >> >> Program in Computational Biology >>
Division of Public Health Sciences >> >> Fred Hutchinson Cancer
Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >>
Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org >> Phone: (206)
667-5791 >> Fax: (206) 667-1319 >> >> >>
_______________________________________________ >> Bioconductor
mailing list >> Bioconductor@stat.math.ethz.ch >>
https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the
archives: >>
http://news.gmane.org/gmane.science.biology.informatics.conductor >>
>> > [[alternative HTML version deleted]] > > > > >
_______________________________________________ > Bioconductor mailing
list > Bioconductor@stat.math.ethz.ch >
https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the
archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
[[alternative HTML version deleted]]
_______________________________________________ Bioconductor mailing
list Bioconductor@stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/bioconductor Search the
archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
[[alternative HTML version deleted]]