Pairwise Alignment on Large Protein Sequence Data Set

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 9.7 years ago

Basically, I have a runtime/performance issue question. I wrote the following nested while-loop using the "Biostrings" package (biocLite) in order to link protein sequences from two species if they have a >90% alignment score. I input two protein genomes (15,000-85,000 protein sequences each), compare each protein's amino acid sequence in SeqData1 to each amino acid sequence from SeqData2, compute an alignment score, and if the score is >90% I concatenate a list of the protein names that match and the sequence of the SeqData2 protein. Is there a more efficient solution to my currently-projected runtime of 1 month? Maybe a solution involving a growing data.table? Im not sure what the most efficient implementation of it is in R (I am a python programmer by trade). Thank you in advance. -- output of sessionInfo(): SeqScore <- function() { source("http://bioconductor.org/biocLite.R") biocLite() require("Biostrings") data(BLOSUM100) SeqData1 <- readDNAStringSet("SeqData1.fasta") proNum1 = 84390 # number of proteins in Seq1 SeqData2 <- readDNAStringSet("SeqData2.fasta") proNum2 = 15194 # number of proteins in Seq #Create empty list to fill with percent scores and matching sequences: DList=NULL QueSeqList = NULL TotList = NULL #initiating the counters: i=1 j=1 c=0 #Perform alignment and generate percent identity scores: while(i<=proNum1) { while(j<=proNum2) { SeqAlign <- pairwiseAlignment(SeqData1[i], SeqData2[j], substitutionMatrix=BLOSUM100, gapOpening=0, gapExtension=-5) PercAlign <- pid(SeqAlign) if(PercAlign>=90) { DList = c(DList, names(SeqData1[i]), names(SeqData2[j])) QueSeqList = c(QueSeqList, toString(SeqData2[j])) c=c+1 } else{c=c+1; print(c) } j=j+1 } i=i+1 j=1 #to reset the inner while loop } unlist<-t(sapply(DList, unlist)); outputMatrix<-cbind(DList,QueSeqList) outputMatrix<-as.matrix(outputMatrix, ncol=3) write.csv(outputMatrix, "outputMatrix.csv") } -- Sent via the guest posting facility at bioconductor.org.

Alignment genomes Alignment genomes • 2.3k views

ADD COMMENT • link updated 10.4 years ago by Hervé Pagès 16k • written 10.4 years ago by Guest User ★ 13k

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 1 hour ago

Seattle, WA, United States

Hi Thomas, The projected runtime for your code is more likely to be 5 or 6 months (but I don't know the average length of your protein sequences so I can't really say). Let's start with a couple of easy things you can do to reduce it to 3 months :-) First, be aware that loops in R tend to very inefficient so one should try hard to avoid them. One way to avoid them is by taking advantage of vectorization. Lots of functions that expect vector-like input are vectorized. pairwiseAlignment() is one of them: if 'pattern' contains more than 1 pattern (i.e. is an AAStringSet object of length > 1), it will return a PairwiseAlignments object "parallel" to 'pattern' (i.e. of the same length as pattern). With 1 pattern: > pwa1 <- pairwiseAlignment(aa1[1], aa2[1]) > length(pwa1) [1] 1 > score(pwa1) [1] -3804.555 > pid(pwa1) [1] 20.9826 With 15 patterns: > pwa15 <- pairwiseAlignment(aa1[1:15], aa2[1]) > length(pwa15) [1] 15 > score(pwa15) [1] -3804.555 -4449.021 -4030.539 -3694.222 -3970.066 -4586.336 -3769.805 [8] -4644.705 -3674.032 -3742.815 -4550.244 -3827.098 -4003.654 -4186.814 [15] -3705.548 > pid(pwa15) [1] 20.982600 20.379147 20.679012 13.903743 20.707071 20.114943 4.797980 [8] 18.921476 14.270833 16.474292 21.348315 19.088937 20.777892 20.320320 [15] 8.839191 The score and pid vectors are the same as if you were doing something like: sapply(1:15, function(i) score(pairwiseAlignment(aa1[i], aa2[1]))) except that vectorization gives you an interesting speed-up. The 2nd thing that you should definitely avoid is to grow an object in a loop, starting with an empty object. This is very inefficient. In your case, you're going to write your results to a file at the end anyway so you don't need to keep them in memory. Instead you could call write.table() (or write.csv()) in the loop to grow a file. These functions have an 'append' argument that you could set to TRUE for that (it's FALSE by default). But I'm going to take a different approach here. I'll keep the results in memory but they will be presented as a list that is parallel to the 2nd set of proteins ('SeqData2'). Each list element will be a character vector containing the names of the proteins in the 1st set that are similar. No need to store the protein sequences in that object. The idea is to represent the results in an object that is as light as possible. Then, in a second step, we can use that object to make a data.frame and write that data.frame to a file. So this gives you something like this: ## Returns a list parallel to 'y'. Each list element is a character ## vector containing names from 'x'. findSimilarProteins <- function(x, y, min.pid=90, substitutionMatrix="BLOSUM100", gapOpening=-10, gapExtension=-4) { if (!is(x, "AAStringSet")) stop("'x' must be an AAStringSet object") if (!is(y, "AAStringSet")) stop("'y' must be an AAStringSet object") x_names <- names(x) if (is.null(x_names) || any(duplicated(x_names))) stop("'x' must have unique names") y_names <- names(y) if (is.null(y_names) || any(duplicated(y_names))) stop("'y' must have unique names") ## Only one level of looping. No easy way to avoid that one :-/ lapply(y, function(yi) { pwa <- pairwiseAlignment(x, yi, substitutionMatrix=substitutionMatrix, gapOpening=gapOpening, gapExtension=gapExtension) names(x)[pid(pwa) >= min.pid] }) } Then: hits <- findSimilarProteins(SeqData1, SeqData2, gapOpening=0, gapExtension=-5) This is still a long computation... about 3-4 months on your data! Once you have 'hits' it's easy (and fast) to put the results in a data.frame (including the sequences) and to send them to a file. Those steps take only a few seconds: id1 <- unlist(hits, use.names=FALSE) id2 <- rep.int(names(hits), elementLengths(hits)) df <- data.frame(id1=id1, id2=id2, seq1=SeqData1[id1], seq2=SeqData2[id2]) write.table(df, file="similar_proteins.txt", quote=FALSE, sep="\t") Finally you can reduce findSimilarProteins() running time by parallelizing it. For this, install and load the BiocParallel package and replace lapply() by bplapply() in findSimilarProteins(). If you have access to a cluster with hundreds of nodes, you might be able to run this in a couple of days only. Otherwise, maybe you could just use BLAST which is way faster than the Needleman-Wunsch and Smith-Waterman algos implemented by pairwiseAlignment(). Cheers, H. On 12/17/2013 03:22 PM, Thomas Atanasov [guest] wrote: > > Basically, I have a runtime/performance issue question. I wrote the following nested while-loop using the "Biostrings" package (biocLite) in order to link protein sequences from two species if they have a >90% alignment score. > > I input two protein genomes (15,000-85,000 protein sequences each), compare each protein's amino acid sequence in SeqData1 to each amino acid sequence from SeqData2, compute an alignment score, and if the score is >90% I concatenate a list of the protein names that match and the sequence of the SeqData2 protein. > > Is there a more efficient solution to my currently-projected runtime of 1 month? Maybe a solution involving a growing data.table? Im not sure what the most efficient implementation of it is in R (I am a python programmer by trade). > > Thank you in advance. > > -- output of sessionInfo(): > > SeqScore <- function() > { > > source("http://bioconductor.org/biocLite.R") > biocLite() > require("Biostrings") > data(BLOSUM100) > > > SeqData1 <- readDNAStringSet("SeqData1.fasta") > proNum1 = 84390 # number of proteins in Seq1 > > SeqData2 <- readDNAStringSet("SeqData2.fasta") > proNum2 = 15194 # number of proteins in Seq > > > #Create empty list to fill with percent scores and matching sequences: > DList=NULL > QueSeqList = NULL > TotList = NULL > > #initiating the counters: > i=1 > j=1 > c=0 > > #Perform alignment and generate percent identity scores: > while(i<=proNum1) > { > while(j<=proNum2) > { > SeqAlign <- pairwiseAlignment(SeqData1[i], SeqData2[j], substitutionMatrix=BLOSUM100, gapOpening=0, gapExtension=-5) > PercAlign <- pid(SeqAlign) > if(PercAlign>=90) > { > DList = c(DList, names(SeqData1[i]), names(SeqData2[j])) > QueSeqList = c(QueSeqList, toString(SeqData2[j])) > c=c+1 > } > else{c=c+1; print(c) > } > j=j+1 > } > i=i+1 > j=1 #to reset the inner while loop > } > unlist<-t(sapply(DList, unlist)); > outputMatrix<-cbind(DList,QueSeqList) > outputMatrix<-as.matrix(outputMatrix, ncol=3) > write.csv(outputMatrix, "outputMatrix.csv") > } > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 10.4 years ago Hervé Pagès 16k

Login before adding your answer.