Search
Question: How to create an output file with sequence ID's from multiple BLAST result using blastSequences in "Biostrings"
0
gravatar for Mathilde
2.3 years ago by
Mathilde0
Denmark/Germany
Mathilde0 wrote:

Hello all,

I don't have a problem with a specific command. My problem is that I am stuck.

What I wish to do is this:

I have a list of sequences (42 with a length of app. 300-500bp) in a fasta file. I want to BLAST these in NCBI (blastn) and get a list of top-hits corresponding to my list.

(not very advanced I am completely new in R and/or other programming language) 

I could (and am at the moment) doing this blast manually, however there will be many more sequences for me in the future and it feels so inefficient to do it manually when I know(!) a simple command could do this for me.

I started out following the description by Kevin Keenan here: http://rstudio-pubs-static.s3.amazonaws.com/12097_1352791b169f423f910d93222a4c2d85.html

which is very user-friendly, however he ends with the note: "For further information on how to proceed with this data structure, see the Biostringspackage in Bioconductor." 

- I have tried searching for what to do next for hours now... So I decided to ask for help here:

Has anyone done the same and can guide me through it?

Will be much appreciated.

All the best,

Mathilde

ADD COMMENTlink modified 2.3 years ago by Martin Morgan ♦♦ 20k • written 2.3 years ago by Mathilde0
1

What to do next depends on your specific research question and the purpose of your analysis. Regarding Biostrings, for learning what can be done and how to use it the best place to look would be the package web page here. Take a look at the vignettes, starting possibly with the one named "Biostrings Quick Overview".

ADD REPLYlink written 2.3 years ago by Diego Diez700

No but the result "the list of ID for best hit" is not produced at the end of K.Keenan's "guide".

I know what to do with it once I got the info :)

 

I have this now:

[[1]]
[[1]][[1]]
DNAMultipleAlignment with 2 rows and 324 columns
     aln
[1] CTGAATCTGCGAACGGCTCCGCACACCAGTTGCAAA...GATGGCAGCGTAAGGGACATGCTATGGTTACAACGG
[2] CTGAATCTGCGTACGGCTCCGCACACCAGTTGCAAA...GATGGCAGCGTAAGGGACATGCTATGG-TACAACGG

[[1]][[2]]
DNAMultipleAlignment with 2 rows and 324 columns
     aln
[1] CTGAATCTGCGAACGGCTCCGCACACCAGTTGCAAA...GATGGCAGCGTAAGGGACATGCTATGGTTACAACGG
[2] CTGAATCTGCGTACGGCTCCGCACACCAGTTGCAAA...GATGGCAGCGTAAGGGACATGCTATGG-TACAACGG

[[2]]
[[2]][[1]]
DNAMultipleAlignment with 2 rows and 307 columns
     aln
[1] TTCTCTCTGAATCTGCGAACGGCTCCGCAAACCAGT...ACCTATCAACTAGACGGCAGCGTAATGGACATGCTA
[2] TTCTCTCTGAATCTGCGAACGGCTCCGCAAACCAGT...ACCTATCAACTAGACGGCAGCGTAATGGACATGCTA

[[2]][[2]]
DNAMultipleAlignment with 2 rows and 307 columns
     aln
[1] TTCTCTCTGAATCTGCGAACGGCTCCGCAAACCAGT...ACCTATCAACTAGACGGCAGCGTAATGGACATGCTA
[2] TTCTCTCTGAATCTGCGAACGGCTCCGCAAACCAGT...ACCTATCAACTAGACGGCAGCGTAATGGACATGCTA

[[3]]
[[3]][[1]]
DNAMultipleAlignment with 2 rows and 519 columns
     aln
[1] TGATCCTGCCAGTAGTGTATGCTTCTCCTAAAGACT...CCCTTGGGCAATGCCCGAGGGCGTTAGGGG-ACATA
[2] TGATCCTGCCAGTAGTGTATGCTTCTCCTAAAGACT...CCCTTGGGCAATGCCCGAGGGCGTTAGGGGCACATA

[[3]][[2]]
DNAMultipleAlignment with 2 rows and 502 columns
     aln
[1] TATGCTTCTCCTAAAGACTAAGCCATGCATGCCTTC...CCCTTGGGCAATGCCCGAGGGCGTTAGGGG-ACATA
[2] TATGCTTCTCCTAAAGACTAAGCCATGCATGCCTTC...CCCTTGGGCAATGCCCGAGGGCGTTAGGGGCACATA

And so on....

 

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Mathilde0
1
gravatar for Marc Carlson
2.3 years ago by
Marc Carlson7.2k
United States
Marc Carlson7.2k wrote:

Hi Mathilde,

In your case it sounds like you might just want to use the argument to the blastSequences function called 'as' and set it to 'data.frame'.  This can return a lot of other data back to you about what exactly your sequence aligns to.  So just modifying the example from the man page here it could look like this:

 

library(annotate)

blastSequences(x = "GGCCTTCATTTACCCAAAATG", as = 'data.frame')

 

Hope that helps you,

 

 Marc

ADD COMMENTlink written 2.3 years ago by Marc Carlson7.2k

Thank you for you reply.

However I am still struggling.

the blastSequences() does not work for me.

estimated response time 33 seconds
elapsed time 35 seconds
timeout after 61 seconds; wait another 60 seconds? [y/n]

(I have pressed "y" many times and increased the timeout time to 320 sec)

A problem which I think is related to large query sequences - as pointed at by K. Keenan. which is why he has written an alternative function blastSeqKK(), adding the argument of attempts, this one (from here: http://rstudio-pubs-static.s3.amazonaws.com/12097_1352791b169f423f910d93222a4c2d85.html):

# function definition blastSeqKK <- function (x, database = "nr", hitListSize = "10", filter = "L", expect = "10", program = "blastn", attempts = 10) { baseUrl <- "http://www.ncbi.nlm.nih.gov/blast/Blast.cgi" query <- paste("QUERY=", as.character(x), "&DATABASE=", database, "&HITLIST_SIZE=", hitListSize, "&FILTER=", filter, "&EXPECT=", expect, "&PROGRAM=", program, sep = "") url0 <- sprintf("%s?%s&CMD=Put", baseUrl, query) results <- tempfile() Sys.sleep(5) require(XML) post <- htmlTreeParse(url0, useInternalNodes = TRUE) x <- post[["string(//comment()[contains(., \"QBlastInfoBegin\")])"]] rid <- sub(".*RID = ([[:alnum:]]+).*", "\\1", x) rtoe <- as.integer(sub(".*RTOE = ([[:digit:]]+).*", "\\1", x)) url1 <- sprintf("%s?RID=%s&FORMAT_TYPE=XML&CMD=Get", baseUrl, rid) Sys.sleep(rtoe) .tryParseResult <- function(url, attempts){ for (i in 1:(attempts+1)) { result <- tryCatch({ xmlTreeParse(url, useInternalNodes=TRUE, error = xmlErrorCumulator(immediate=FALSE)) }, error=function(err) NULL) if (!is.null(result)) return(result) Sys.sleep(10) } stop(paste("no results after ", attempts, " attempts; please try again later", sep = "")) } result <- .tryParseResult(url1, attempts) qseq <- xpathApply(result, "//Hsp_qseq", xmlValue) hseq <- xpathApply(result, "//Hsp_hseq", xmlValue) require(Biostrings) res <- list() for (i in seq_len(length(qseq))) { res[i] <- DNAMultipleAlignment(c(hseq[[i]], qseq[[i]]), rowmask = as(IRanges(), "NormalIRanges"), colmask = as(IRanges(), "NormalIRanges")) } res }

However, as I understand this there is no longer the choice/argument of getting the output as a "data.fame", like in the original function:

blastSequences(x, database, hitListSize, filter, expect, program,
      timeout=40, as=c("DNAMultipleAlignment", "data.frame", "XML"))

So this is where I am stuck...

I tried to simply (naïve) add the argument:

# function definition
blastSeqKK <- function (x, database = "nr", hitListSize = "10", 
                        filter = "L", expect = "10", program = "blastn",timeout=40,
 as=c("DNAMultipleAlignment", "data.frame", "XML"),
                        attempts = 10) 

But it was "ignored" and the output when running:

res <- lapply(mySeq, function(x){
  # collapse seq into string
  seqCollapse <- paste(toupper(as.character(x)), collapse = "")
  # run blast
  blastRes <- blastSeqKK(x = seqCollapse, hitListSize = 2, timeout=60, 
    attempts = 30, as=c("data.frame") )
  return(blastRes)

Was still this:

> head(res)
[[1]]
[[1]][[1]]
DNAMultipleAlignment with 2 rows and 324 columns
     aln
[1] CTGAATCTGCGAACGGCTCCGCACACCAGTTGCAAA...GATGGCAGCGTAAGGGACATGCTATGGTTACAACGG
[2] CTGAATCTGCGTACGGCTCCGCACACCAGTTGCAAA...GATGGCAGCGTAAGGGACATGCTATGG-TACAACGG

[[1]][[2]]
DNAMultipleAlignment with 2 rows and 324 columns
     aln
[1] CTGAATCTGCGAACGGCTCCGCACACCAGTTGCAAA...GATGGCAGCGTAAGGGACATGCTATGGTTACAACGG
[2] CTGAATCTGCGTACGGCTCCGCACACCAGTTGCAAA...GATGGCAGCGTAAGGGACATGCTATGG-TACAACGG

[[2]]
[[2]][[1]]
DNAMultipleAlignment with 2 rows and 307 columns
     aln

Sooo, I don't know how to proceed right now, so again any suggestions would be much appreciated.

/Mathilde

ADD REPLYlink written 2.3 years ago by Mathilde0
1
gravatar for Martin Morgan
2.3 years ago by
Martin Morgan ♦♦ 20k
United States
Martin Morgan ♦♦ 20k wrote:

I think you can change the final lines of blastSequenceKK() from

  result <- .tryParseResult(url1, attempts)
  qseq <- xpathApply(result, "//Hsp_qseq", xmlValue)
  hseq <- xpathApply(result, "//Hsp_hseq", xmlValue)
  require(Biostrings)
  res <- list()
  for (i in seq_len(length(qseq))) {
    res[i] <- DNAMultipleAlignment(c(hseq[[i]], qseq[[i]]), 
                 rowmask = as(IRanges(), "NormalIRanges"),
                 colmask = as(IRanges(), "NormalIRanges"))
  }
  res

to

  result <- .tryParseResult(url1, attempts)
  annotate:::.blastSequencesToDataFrame(result)

to return the data.frame that Marc mentioned. Also, to read the fasta file I'd do

library(Biostrings)
dna = readDNAStringSet("path/to.fasta")
lapply(as.character(dna), blastSequencesKK)

Actually, I'd just stick with blastSequences() and increase the timeout to whatever is tolerable. UPDATE: an earlier version included submitting multiple sequences as queries in parallel, but it's better to submit several sequences in a single query; this is supported in annotate version 1.47.3 and later.

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by Martin Morgan ♦♦ 20k

Thank you so much for the detailed answer.

Your first advice worked, and I can get the "data.fame" output - perfect :)

However the function blastSeqKK still only works when the fasta file contains only 2 sequences - more, and it gives the error of "no result after xx attempts" I have set the attempts up to 50.

I also tried your 2nd advice and let it run for 4h before I stopped it - this was a file containing only 10 sequences.
I think I will try to leave it overnight, just to see if it eventually comes to and end - However I don't understand why it is struggling so? - A manual blast in NCBI's webpage takes only a few sec for a file with all (42) seq. Hhm...

ADD REPLYlink written 2.3 years ago by Mathilde0
1

I looked at the code, which is quite old. I think you can submit multiple fasta sequences, but unfortunately the code for processing the results does not remember which alignment goes with which submission. So the work-around is to define the following function...

.blastSequencesToDataFrame <- function(xml) {
    if (xpathSApply(xml, "count(//Hit)") == 0L) {
        message("'blastSequences' returned 0 matches")
        return(data.frame())
    }

    iter <- xml["//Iteration"]
    iterdf <- xmlToDataFrame(iter, stringsAsFactors=FALSE)
    iterlen <- sapply(iter, xpathSApply, "count(.//Hsp)")

    hit <- xml["//Hit"]
    hitdf <- xmlToDataFrame(hit, stringsAsFactors=FALSE)
    hitdf <- hitdf[, names(hitdf) != "Hit_hsps", drop=FALSE]
    
    hitlen <- sapply(hit, xpathSApply, "count(.//Hsp)")
    hsp <- xmlToDataFrame(xml["//Hsp"] , stringsAsFactors=FALSE)

    df <- cbind(
        iterdf[rep(seq_len(nrow(iterdf)), iterlen),, drop=FALSE],
        hitdf[rep(seq_len(nrow(hitdf)), hitlen),, drop=FALSE],
        hsp)
    rownames(df) <- NULL
    df
}

Then submit multiple sequences, retrieve the results as XML, and parse them

dna1 = paste(sprintf(">%s\n%s", names(dna), as.character(dna)), collapse="\n")
res = blastSequences(dna1, as="XML")
df = .blastSequencesToDataFrame(res)

This will be updated in the 'devel' version 1.47.3 of annotate.

ADD REPLYlink written 2.3 years ago by Martin Morgan ♦♦ 20k

Again many thanks! - I am embarrassed to say I still can't get it to work... The error now is, when I try to run res = blastSequences(dna1, as="XML") I get:

Error: failed to load HTTP resource

I have tried to look up help here:

?htmlParse
?xmlToDataFrame

But did not find any obvious solutions...

 

ADD REPLYlink written 2.3 years ago by Mathilde0

Is this a transient network issue?

ADD REPLYlink written 2.3 years ago by Martin Morgan ♦♦ 20k

No, I dont think so.

I don't experience any problems with the network.

I tried to run it many times during the day.

Will be out of the office for a few days but will try again thursday.

ADD REPLYlink written 2.3 years ago by Mathilde0

I don't see the error, so you'll somehow have to make this reproducible, e.g., by simplifying dna1 or by making a saved version of it or the fasta file available (e.g., to me at mtmorgan at fredhutch dot org),

ADD REPLYlink written 2.3 years ago by Martin Morgan ♦♦ 20k

Hi again,

I tried to send it to you (on the 23rd of July) - but not sure if I send the files you asked for? - did you receive it? (it is not to rush you, just making sure if the email reached you)

 

Best,

Mathilde

ADD REPLYlink written 2.3 years ago by Mathilde0

Yes, I received it and looked at it over the weekend. The query string (fasta sequence and other options) can only be about 4000 characters total; I think you'll have success with, e.g., 10 records at a time. It might also be necessary to encode the query  string with utilis::URLencode(dna1).

It seems like it should be possible to simply upload the file, but I was not able to manage to do that through the blast REST API.

ADD REPLYlink written 2.3 years ago by Martin Morgan ♦♦ 20k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 123 users visited in the last hour