pairwise alignment and homology score

0

Entering edit mode

erikafalisi@tin.it ▴ 10

@erikafalisitinit-5040

Last seen 9.6 years ago

Dear List, we are trying to compare more than 600 amino acid sequences in clustal format (or FASTA format too) with BLOSUM62 matrix to achieve homology score. We used pairwise alignment but we aren't able to compare our sequences with BLOSUM matrix because pairwise command is used to match only two strings (and not a set of sequences) with the matrix. How can we iterate these command to perform all sequences alignments? We also performed Multiple Alignment but we obtain a cramped dendogram. Is it possible to modify dendogram's sizes to make it readable? Thanks, Dr Falisi [[alternative HTML version deleted]]

Alignment Alignment • 1.1k views

ADD COMMENT • link updated 12.3 years ago by Hervé Pagès 16k • written 12.3 years ago by erikafalisi@tin.it ▴ 10

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 1 day ago

Seattle, WA, United States

Hi, On 01/10/2012 07:02 AM, erikafalisi at tin.it wrote: > > Dear List, > we are trying to compare more than 600 amino acid sequences in clustal format (or FASTA format too) with BLOSUM62 matrix to achieve homology score. We used pairwise alignment but we aren't able to compare our sequences with BLOSUM matrix because pairwise command is used to match only two strings (and not a set of sequences) with the matrix. How can we iterate these command to perform all sequences alignments? Assuming you've managed to load your 600 sequences in a AAStringSet object 'myseqs', have you tried to do something along the lines of: scores <- matrix(integer(0), nrow=length(myseqs), ncol=length(myseqs)) for (i in 1:length(myseqs)) for (j in 1:i) scores[i, j] <- pairwiseAlignment(myseqs[i], myseqs[j], substitutionMatrix="BLOSUM62", scoreOnly=TRUE) One problem that arises is that the input of a clustering tool like hclust() is expected to be a 'dist' object i.e. a matrix of distances, not a matrix of scores. Since, unlike with the distance, the highest the score, the "closest" the sequences are, using scores as the input of hclust() is guaranteed to produce wrong results. I'm not sure what's the best way to transform pairwise scores into pairwise distances though (I'm not an expert on the topic) but it seems to me that the latter could be inferred from the former with something like: dist(x, y) <- 1 - score(x, y)^2 / (score(x, x) * score(y, y)) The distance defined above looks like it could be close enough to a true distance (in the mathematical sense) but that's pure speculation since I don't really have a proof I could show you (the tricky part being to prove it satisfies the triangle inequality). Anyway, as far as hclust() is concerned, I'm not sure it matters. According to the man page, the first argument to hclust() only needs to be "a dissimilarity structure as produced by ?dist?". Nothing is said about it being a true distance. So: dist <- 1 - scores^2 / (diag(scores) %*% t(diag(scores))) Finally, plot the dendogram with: plot(hclust(as.dist(dist))) > We also performed Multiple Alignment but we obtain a cramped dendogram. Is it possible to modify dendogram's sizes to make it readable? What's a "cramped" dendogram? Not sure about modifying the size of the dendogram. I suggest you look at the man page for hclust() and in particular at the plot method for class 'hclust' (it has a lot of args). Others on the list might have better suggestions about producing nice dendogram plots. Cheers, H. > Thanks, > Dr Falisi > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 12.3 years ago Hervé Pagès 16k

0

Entering edit mode

Hi again, On 01/11/2012 04:20 PM, Hervé Pagès wrote: > Hi, > > On 01/10/2012 07:02 AM, erikafalisi at tin.it wrote: >> >> Dear List, >> we are trying to compare more than 600 amino acid sequences in clustal >> format (or FASTA format too) with BLOSUM62 matrix to achieve homology >> score. We used pairwise alignment but we aren't able to compare our >> sequences with BLOSUM matrix because pairwise command is used to match >> only two strings (and not a set of sequences) with the matrix. How can >> we iterate these command to perform all sequences alignments? > > Assuming you've managed to load your 600 sequences in a AAStringSet > object 'myseqs', have you tried to do something along the lines of: > > scores <- matrix(integer(0), nrow=length(myseqs), ncol=length(myseqs)) > for (i in 1:length(myseqs)) > for (j in 1:i) > scores[i, j] <- pairwiseAlignment(myseqs[i], myseqs[j], > substitutionMatrix="BLOSUM62", > scoreOnly=TRUE) > > One problem that arises is that the input of a clustering tool like > hclust() is expected to be a 'dist' object i.e. a matrix of distances, > not a matrix of scores. Since, unlike with the distance, the highest > the score, the "closest" the sequences are, using scores as the input > of hclust() is guaranteed to produce wrong results. > > I'm not sure what's the best way to transform pairwise scores into > pairwise distances though (I'm not an expert on the topic) but it > seems to me that the latter could be inferred from the former with > something like: > > dist(x, y) <- 1 - score(x, y)^2 / (score(x, x) * score(y, y)) > > The distance defined above looks like it could be close enough to a > true distance (in the mathematical sense) but that's pure speculation > since I don't really have a proof I could show you (the tricky part > being to prove it satisfies the triangle inequality). Anyway, as far > as hclust() is concerned, I'm not sure it matters. According to the > man page, the first argument to hclust() only needs to be > "a dissimilarity structure as produced by ?dist?". Nothing is said > about it being a true distance. > > So: > > dist <- 1 - scores^2 / (diag(scores) %*% t(diag(scores))) An update on this. With the possibility of negative scores, the transformation above from pairwise scores to pairwise (pseudo)distances is not doing the right thing (giving the same result for a negative score and its opposite). Seems like a more accurate (and simpler) way to go is to just make the (pseudo)distances (or dissimilarity measures) be the opposite of the scores. On a concrete example: myseqs <- AAStringSet(c("PAHHEAE", # cluster 1 "MGBB", # cluster 2 "PAHEAE", # cluster 1 "PAWHEAE", # cluster 1 "VGGDN", # cluster 2 "PAYHEAE", # cluster 1 "MGGBB", # cluster 2 "VGGBB", # cluster 2 "KMCEKR")) # cluster 3 scores <- matrix(integer(0), nrow=length(myseqs), ncol=length(myseqs)) for (i in 1:length(myseqs)) for (j in 1:i) scores[i, j] <- pairwiseAlignment(myseqs[i], myseqs[j], substitutionMatrix="BLOSUM62", scoreOnly=TRUE) Comparing different approaches: a. dist <- 1 - scores^2 / (diag(scores) %*% t(diag(scores))) plot(hclust(as.dist(dist))) Not good. In particular seqs 2 and 6 look very similar on the plot (they are very close) but their pairwise score (scores[6, 2] is -25) indicates that they are not. b. dist = -scores plot(hclust(as.dist(dist))) Much better. The plot reflects the expected clustering. Note that a result similar to b. can be obtained by computing the pairwise "edit distances" (aka Levenshtein distances): c. dist <- stringDist(myseqs, method="levenshtein") plot(hclust(as.dist(dist))) The clustering obtained with this method looks a lot like the one obtained in b. but this method is slightly more efficient since stringDist() takes care of the looping at the C level. See ?stringDist for more details. Finally be aware that stringDist() should not be used with 'method' set to "quality" or "substitutionMatrix" since in that case the returned object will contain the pairwise scores (i.e. high values mean similar and low values mean dissimilar) and therefore is not suitable for clustering. I'm planning to fix this by populating the returned object with the opposites of the scores. In the mean time you can do: d. dist <- -stringDist(myseqs, method="substitutionMatrix", substitutionMatrix=BLOSUM62, gapOpening=-10, gapExtension=-4) plot(hclust(dist)) This produces *exactly* the same result as b. but with slightly more compact/efficient code. Cheers, H. > > Finally, plot the dendogram with: > > plot(hclust(as.dist(dist))) > >> We also performed Multiple Alignment but we obtain a cramped >> dendogram. Is it possible to modify dendogram's sizes to make it >> readable? > > What's a "cramped" dendogram? Not sure about modifying the size of the > dendogram. I suggest you look at the man page for hclust() and in > particular at the plot method for class 'hclust' (it has a lot of args). > Others on the list might have better suggestions about producing > nice dendogram plots. > > Cheers, > H. > >> Thanks, >> Dr Falisi >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 12.3 years ago Hervé Pagès 16k

Login before adding your answer.