Question: How to create an output file with sequence ID's from multiple BLAST result using blastSequences in "Biostrings"
0
4.3 years ago by
Mathilde0
Denmark/Germany
Mathilde0 wrote:

Hello all,

I don't have a problem with a specific command. My problem is that I am stuck.

What I wish to do is this:

I have a list of sequences (42 with a length of app. 300-500bp) in a fasta file. I want to BLAST these in NCBI (blastn) and get a list of top-hits corresponding to my list.

(not very advanced I am completely new in R and/or other programming language)

I could (and am at the moment) doing this blast manually, however there will be many more sequences for me in the future and it feels so inefficient to do it manually when I know(!) a simple command could do this for me.

I started out following the description by Kevin Keenan here: http://rstudio-pubs-static.s3.amazonaws.com/12097_1352791b169f423f910d93222a4c2d85.html

which is very user-friendly, however he ends with the note: "For further information on how to proceed with this data structure, see the Biostringspackage in Bioconductor."

- I have tried searching for what to do next for hours now... So I decided to ask for help here:

Has anyone done the same and can guide me through it?

Will be much appreciated.

All the best,

Mathilde

annotation biostrings blast • 1.9k views
modified 4.3 years ago by Martin Morgan ♦♦ 24k • written 4.3 years ago by Mathilde0
1

What to do next depends on your specific research question and the purpose of your analysis. Regarding Biostrings, for learning what can be done and how to use it the best place to look would be the package web page here. Take a look at the vignettes, starting possibly with the one named "Biostrings Quick Overview".

No but the result "the list of ID for best hit" is not produced at the end of K.Keenan's "guide".

I know what to do with it once I got the info :)

I have this now:

[[1]]
[[1]][[1]]
DNAMultipleAlignment with 2 rows and 324 columns
aln
[1] CTGAATCTGCGAACGGCTCCGCACACCAGTTGCAAA...GATGGCAGCGTAAGGGACATGCTATGGTTACAACGG
[2] CTGAATCTGCGTACGGCTCCGCACACCAGTTGCAAA...GATGGCAGCGTAAGGGACATGCTATGG-TACAACGG

[[1]][[2]]
DNAMultipleAlignment with 2 rows and 324 columns
aln
[1] CTGAATCTGCGAACGGCTCCGCACACCAGTTGCAAA...GATGGCAGCGTAAGGGACATGCTATGGTTACAACGG
[2] CTGAATCTGCGTACGGCTCCGCACACCAGTTGCAAA...GATGGCAGCGTAAGGGACATGCTATGG-TACAACGG

[[2]]
[[2]][[1]]
DNAMultipleAlignment with 2 rows and 307 columns
aln
[1] TTCTCTCTGAATCTGCGAACGGCTCCGCAAACCAGT...ACCTATCAACTAGACGGCAGCGTAATGGACATGCTA
[2] TTCTCTCTGAATCTGCGAACGGCTCCGCAAACCAGT...ACCTATCAACTAGACGGCAGCGTAATGGACATGCTA

[[2]][[2]]
DNAMultipleAlignment with 2 rows and 307 columns
aln
[1] TTCTCTCTGAATCTGCGAACGGCTCCGCAAACCAGT...ACCTATCAACTAGACGGCAGCGTAATGGACATGCTA
[2] TTCTCTCTGAATCTGCGAACGGCTCCGCAAACCAGT...ACCTATCAACTAGACGGCAGCGTAATGGACATGCTA

[[3]]
[[3]][[1]]
DNAMultipleAlignment with 2 rows and 519 columns
aln
[1] TGATCCTGCCAGTAGTGTATGCTTCTCCTAAAGACT...CCCTTGGGCAATGCCCGAGGGCGTTAGGGG-ACATA
[2] TGATCCTGCCAGTAGTGTATGCTTCTCCTAAAGACT...CCCTTGGGCAATGCCCGAGGGCGTTAGGGGCACATA

[[3]][[2]]
DNAMultipleAlignment with 2 rows and 502 columns
aln
[1] TATGCTTCTCCTAAAGACTAAGCCATGCATGCCTTC...CCCTTGGGCAATGCCCGAGGGCGTTAGGGG-ACATA
[2] TATGCTTCTCCTAAAGACTAAGCCATGCATGCCTTC...CCCTTGGGCAATGCCCGAGGGCGTTAGGGGCACATA

And so on....

Answer: How to create an output file with sequence ID's from multiple BLAST result using
1
4.3 years ago by
Marc Carlson7.2k
United States
Marc Carlson7.2k wrote:

Hi Mathilde,

In your case it sounds like you might just want to use the argument to the blastSequences function called 'as' and set it to 'data.frame'.  This can return a lot of other data back to you about what exactly your sequence aligns to.  So just modifying the example from the man page here it could look like this:

library(annotate)

blastSequences(x = "GGCCTTCATTTACCCAAAATG", as = 'data.frame')

Hope that helps you,

Marc

However I am still struggling.

the blastSequences() does not work for me.

estimated response time 33 seconds
elapsed time 35 seconds
timeout after 61 seconds; wait another 60 seconds? [y/n]

(I have pressed "y" many times and increased the timeout time to 320 sec)

A problem which I think is related to large query sequences - as pointed at by K. Keenan. which is why he has written an alternative function blastSeqKK(), adding the argument of attempts, this one (from here: http://rstudio-pubs-static.s3.amazonaws.com/12097_1352791b169f423f910d93222a4c2d85.html):

# function definition blastSeqKK <- function (x, database = "nr", hitListSize = "10", filter = "L", expect = "10", program = "blastn", attempts = 10) { baseUrl <- "http://www.ncbi.nlm.nih.gov/blast/Blast.cgi" query <- paste("QUERY=", as.character(x), "&DATABASE=", database, "&HITLIST_SIZE=", hitListSize, "&FILTER=", filter, "&EXPECT=", expect, "&PROGRAM=", program, sep = "") url0 <- sprintf("%s?%s&CMD=Put", baseUrl, query) results <- tempfile() Sys.sleep(5) require(XML) post <- htmlTreeParse(url0, useInternalNodes = TRUE) x <- post[["string(//comment()[contains(., \"QBlastInfoBegin\")])"]] rid <- sub(".*RID = ([[:alnum:]]+).*", "\\1", x) rtoe <- as.integer(sub(".*RTOE = ([[:digit:]]+).*", "\\1", x)) url1 <- sprintf("%s?RID=%s&FORMAT_TYPE=XML&CMD=Get", baseUrl, rid) Sys.sleep(rtoe) .tryParseResult <- function(url, attempts){ for (i in 1:(attempts+1)) { result <- tryCatch({ xmlTreeParse(url, useInternalNodes=TRUE, error = xmlErrorCumulator(immediate=FALSE)) }, error=function(err) NULL) if (!is.null(result)) return(result) Sys.sleep(10) } stop(paste("no results after ", attempts, " attempts; please try again later", sep = "")) } result <- .tryParseResult(url1, attempts) qseq <- xpathApply(result, "//Hsp_qseq", xmlValue) hseq <- xpathApply(result, "//Hsp_hseq", xmlValue) require(Biostrings) res <- list() for (i in seq_len(length(qseq))) { res[i] <- DNAMultipleAlignment(c(hseq[[i]], qseq[[i]]), rowmask = as(IRanges(), "NormalIRanges"), colmask = as(IRanges(), "NormalIRanges")) } res }

However, as I understand this there is no longer the choice/argument of getting the output as a "data.fame", like in the original function:

blastSequences(x, database, hitListSize, filter, expect, program,
timeout=40, as=c("DNAMultipleAlignment", "data.frame", "XML"))

So this is where I am stuck...

I tried to simply (naïve) add the argument:

# function definition
blastSeqKK <- function (x, database = "nr", hitListSize = "10",
filter = "L", expect = "10", program = "blastn",timeout=40,
as=c("DNAMultipleAlignment", "data.frame", "XML"),
attempts = 10) 

But it was "ignored" and the output when running:

res <- lapply(mySeq, function(x){
# collapse seq into string
seqCollapse <- paste(toupper(as.character(x)), collapse = "")
# run blast
blastRes <- blastSeqKK(x = seqCollapse, hitListSize = 2, timeout=60,
attempts = 30, as=c("data.frame") )
return(blastRes)

Was still this:

[[1]]
[[1]][[1]]
DNAMultipleAlignment with 2 rows and 324 columns
aln
[1] CTGAATCTGCGAACGGCTCCGCACACCAGTTGCAAA...GATGGCAGCGTAAGGGACATGCTATGGTTACAACGG
[2] CTGAATCTGCGTACGGCTCCGCACACCAGTTGCAAA...GATGGCAGCGTAAGGGACATGCTATGG-TACAACGG

[[1]][[2]]
DNAMultipleAlignment with 2 rows and 324 columns
aln
[1] CTGAATCTGCGAACGGCTCCGCACACCAGTTGCAAA...GATGGCAGCGTAAGGGACATGCTATGGTTACAACGG
[2] CTGAATCTGCGTACGGCTCCGCACACCAGTTGCAAA...GATGGCAGCGTAAGGGACATGCTATGG-TACAACGG

[[2]]
[[2]][[1]]
DNAMultipleAlignment with 2 rows and 307 columns
aln


Sooo, I don't know how to proceed right now, so again any suggestions would be much appreciated.

/Mathilde

Answer: How to create an output file with sequence ID's from multiple BLAST result using
1
4.3 years ago by
Martin Morgan ♦♦ 24k
United States
Martin Morgan ♦♦ 24k wrote:

I think you can change the final lines of blastSequenceKK() from

  result <- .tryParseResult(url1, attempts)
qseq <- xpathApply(result, "//Hsp_qseq", xmlValue)
hseq <- xpathApply(result, "//Hsp_hseq", xmlValue)
require(Biostrings)
res <- list()
for (i in seq_len(length(qseq))) {
res[i] <- DNAMultipleAlignment(c(hseq[[i]], qseq[[i]]),
}
res


to

  result <- .tryParseResult(url1, attempts)
annotate:::.blastSequencesToDataFrame(result)

to return the data.frame that Marc mentioned. Also, to read the fasta file I'd do

library(Biostrings)
lapply(as.character(dna), blastSequencesKK)

Actually, I'd just stick with blastSequences() and increase the timeout to whatever is tolerable. UPDATE: an earlier version included submitting multiple sequences as queries in parallel, but it's better to submit several sequences in a single query; this is supported in annotate version 1.47.3 and later.

Thank you so much for the detailed answer.

Your first advice worked, and I can get the "data.fame" output - perfect :)

However the function blastSeqKK still only works when the fasta file contains only 2 sequences - more, and it gives the error of "no result after xx attempts" I have set the attempts up to 50.

I also tried your 2nd advice and let it run for 4h before I stopped it - this was a file containing only 10 sequences.
I think I will try to leave it overnight, just to see if it eventually comes to and end - However I don't understand why it is struggling so? - A manual blast in NCBI's webpage takes only a few sec for a file with all (42) seq. Hhm...

1

I looked at the code, which is quite old. I think you can submit multiple fasta sequences, but unfortunately the code for processing the results does not remember which alignment goes with which submission. So the work-around is to define the following function...

.blastSequencesToDataFrame <- function(xml) {
if (xpathSApply(xml, "count(//Hit)") == 0L) {
message("'blastSequences' returned 0 matches")
return(data.frame())
}

iter <- xml["//Iteration"]
iterdf <- xmlToDataFrame(iter, stringsAsFactors=FALSE)
iterlen <- sapply(iter, xpathSApply, "count(.//Hsp)")

hit <- xml["//Hit"]
hitdf <- xmlToDataFrame(hit, stringsAsFactors=FALSE)
hitdf <- hitdf[, names(hitdf) != "Hit_hsps", drop=FALSE]

hitlen <- sapply(hit, xpathSApply, "count(.//Hsp)")
hsp <- xmlToDataFrame(xml["//Hsp"] , stringsAsFactors=FALSE)

df <- cbind(
iterdf[rep(seq_len(nrow(iterdf)), iterlen),, drop=FALSE],
hitdf[rep(seq_len(nrow(hitdf)), hitlen),, drop=FALSE],
hsp)
rownames(df) <- NULL
df
}

Then submit multiple sequences, retrieve the results as XML, and parse them

dna1 = paste(sprintf(">%s\n%s", names(dna), as.character(dna)), collapse="\n")
res = blastSequences(dna1, as="XML")
df = .blastSequencesToDataFrame(res)

This will be updated in the 'devel' version 1.47.3 of annotate.

Again many thanks! - I am embarrassed to say I still can't get it to work... The error now is, when I try to run res = blastSequences(dna1, as="XML") I get:

Error: failed to load HTTP resource

I have tried to look up help here:

?htmlParse
?xmlToDataFrame

But did not find any obvious solutions...

Is this a transient network issue?

No, I dont think so.

I don't experience any problems with the network.

I tried to run it many times during the day.

Will be out of the office for a few days but will try again thursday.

I don't see the error, so you'll somehow have to make this reproducible, e.g., by simplifying dna1 or by making a saved version of it or the fasta file available (e.g., to me at mtmorgan at fredhutch dot org),

Hi again,

I tried to send it to you (on the 23rd of July) - but not sure if I send the files you asked for? - did you receive it? (it is not to rush you, just making sure if the email reached you)

Best,

Mathilde

Yes, I received it and looked at it over the weekend. The query string (fasta sequence and other options) can only be about 4000 characters total; I think you'll have success with, e.g., 10 records at a time. It might also be necessary to encode the query  string with utilis::URLencode(dna1).

It seems like it should be possible to simply upload the file, but I was not able to manage to do that through the blast REST API.