Question

How to create an output file with sequence ID's from multiple BLAST result using blastSequences in "Biostrings"

0

Entering edit mode

Mathilde • 0

@mathilde-8405

Last seen 3.2 years ago

Denmark/Germany

Hello all,

I don't have a problem with a specific command. My problem is that I am stuck.

What I wish to do is this:

I have a list of sequences (42 with a length of app. 300-500bp) in a fasta file. I want to BLAST these in NCBI (blastn) and get a list of top-hits corresponding to my list.

(not very advanced I am completely new in R and/or other programming language)

I could (and am at the moment) doing this blast manually, however there will be many more sequences for me in the future and it feels so inefficient to do it manually when I know(!) a simple command could do this for me.

I started out following the description by Kevin Keenan here: http://rstudio-pubs-static.s3.amazonaws.com/12097_1352791b169f423f910d93222a4c2d85.html

which is very user-friendly, however he ends with the note: "For further information on how to proceed with this data structure, see the Biostringspackage in Bioconductor."

- I have tried searching for what to do next for hours now... So I decided to ask for help here:

Has anyone done the same and can guide me through it?

Will be much appreciated.

All the best,

Mathilde

annotation blast biostrings • 4.3k views

ADD COMMENT • link updated 8.8 years ago by Martin Morgan 25k • written 8.8 years ago by Mathilde • 0

1

Entering edit mode

What to do next depends on your specific research question and the purpose of your analysis. Regarding Biostrings, for learning what can be done and how to use it the best place to look would be the package web page here. Take a look at the vignettes, starting possibly with the one named "Biostrings Quick Overview".

ADD REPLY • link 8.8 years ago Diego Diez ▴ 760

0

Entering edit mode

No but the result "the list of ID for best hit" is not produced at the end of K.Keenan's "guide".

I know what to do with it once I got the info :)

I have this now:

[[1]]
[[1]][[1]]
DNAMultipleAlignment with 2 rows and 324 columns
aln
[1] CTGAATCTGCGAACGGCTCCGCACACCAGTTGCAAA...GATGGCAGCGTAAGGGACATGCTATGGTTACAACGG
[2] CTGAATCTGCGTACGGCTCCGCACACCAGTTGCAAA...GATGGCAGCGTAAGGGACATGCTATGG-TACAACGG

[[1]][[2]]
DNAMultipleAlignment with 2 rows and 324 columns
aln
[1] CTGAATCTGCGAACGGCTCCGCACACCAGTTGCAAA...GATGGCAGCGTAAGGGACATGCTATGGTTACAACGG
[2] CTGAATCTGCGTACGGCTCCGCACACCAGTTGCAAA...GATGGCAGCGTAAGGGACATGCTATGG-TACAACGG

[[2]]
[[2]][[1]]
DNAMultipleAlignment with 2 rows and 307 columns
aln
[1] TTCTCTCTGAATCTGCGAACGGCTCCGCAAACCAGT...ACCTATCAACTAGACGGCAGCGTAATGGACATGCTA
[2] TTCTCTCTGAATCTGCGAACGGCTCCGCAAACCAGT...ACCTATCAACTAGACGGCAGCGTAATGGACATGCTA

[[2]][[2]]
DNAMultipleAlignment with 2 rows and 307 columns
aln
[1] TTCTCTCTGAATCTGCGAACGGCTCCGCAAACCAGT...ACCTATCAACTAGACGGCAGCGTAATGGACATGCTA
[2] TTCTCTCTGAATCTGCGAACGGCTCCGCAAACCAGT...ACCTATCAACTAGACGGCAGCGTAATGGACATGCTA

[[3]]
[[3]][[1]]
DNAMultipleAlignment with 2 rows and 519 columns
aln
[1] TGATCCTGCCAGTAGTGTATGCTTCTCCTAAAGACT...CCCTTGGGCAATGCCCGAGGGCGTTAGGGG-ACATA
[2] TGATCCTGCCAGTAGTGTATGCTTCTCCTAAAGACT...CCCTTGGGCAATGCCCGAGGGCGTTAGGGGCACATA

[[3]][[2]]
DNAMultipleAlignment with 2 rows and 502 columns
aln
[1] TATGCTTCTCCTAAAGACTAAGCCATGCATGCCTTC...CCCTTGGGCAATGCCCGAGGGCGTTAGGGG-ACATA
[2] TATGCTTCTCCTAAAGACTAAGCCATGCATGCCTTC...CCCTTGGGCAATGCCCGAGGGCGTTAGGGGCACATA

And so on....

ADD REPLY • link 8.8 years ago Mathilde • 0

score 1 · Answer 1 · 2015-07-16

1

Entering edit mode

Marc Carlson ★ 7.2k

@marc-carlson-2264

Last seen 7.7 years ago

United States

Hi Mathilde,

In your case it sounds like you might just want to use the argument to the blastSequences function called 'as' and set it to 'data.frame'. This can return a lot of other data back to you about what exactly your sequence aligns to. So just modifying the example from the man page here it could look like this:

library(annotate)

blastSequences(x = "GGCCTTCATTTACCCAAAATG", as = 'data.frame')

Hope that helps you,

Marc

ADD COMMENT • link 8.8 years ago Marc Carlson ★ 7.2k

0

Entering edit mode

Thank you for you reply.

However I am still struggling.

the blastSequences() does not work for me.

estimated response time 33 seconds
elapsed time 35 seconds
timeout after 61 seconds; wait another 60 seconds? [y/n]

(I have pressed "y" many times and increased the timeout time to 320 sec)

A problem which I think is related to large query sequences - as pointed at by K. Keenan. which is why he has written an alternative function blastSeqKK(), adding the argument of attempts, this one (from here: http://rstudio-pubs-static.s3.amazonaws.com/12097_1352791b169f423f910d93222a4c2d85.html):

# function definition blastSeqKK <- function (x, database = "nr", hitListSize = "10", filter = "L", expect = "10", program = "blastn", attempts = 10) { baseUrl <- "http://www.ncbi.nlm.nih.gov/blast/Blast.cgi" query <- paste("QUERY=", as.character(x), "&DATABASE=", database, "&HITLIST_SIZE=", hitListSize, "&FILTER=", filter, "&EXPECT=", expect, "&PROGRAM=", program, sep = "") url0 <- sprintf("%s?%s&CMD=Put", baseUrl, query) results <- tempfile() Sys.sleep(5) require(XML) post <- htmlTreeParse(url0, useInternalNodes = TRUE) x <- post[["string(//comment()[contains(., \"QBlastInfoBegin\")])"]] rid <- sub(".*RID = ([[:alnum:]]+).*", "\\1", x) rtoe <- as.integer(sub(".*RTOE = ([[:digit:]]+).*", "\\1", x)) url1 <- sprintf("%s?RID=%s&FORMAT_TYPE=XML&CMD=Get", baseUrl, rid) Sys.sleep(rtoe) .tryParseResult <- function(url, attempts){ for (i in 1:(attempts+1)) { result <- tryCatch({ xmlTreeParse(url, useInternalNodes=TRUE, error = xmlErrorCumulator(immediate=FALSE)) }, error=function(err) NULL) if (!is.null(result)) return(result) Sys.sleep(10) } stop(paste("no results after ", attempts, " attempts; please try again later", sep = "")) } result <- .tryParseResult(url1, attempts) qseq <- xpathApply(result, "//Hsp_qseq", xmlValue) hseq <- xpathApply(result, "//Hsp_hseq", xmlValue) require(Biostrings) res <- list() for (i in seq_len(length(qseq))) { res[i] <- DNAMultipleAlignment(c(hseq[[i]], qseq[[i]]), rowmask = as(IRanges(), "NormalIRanges"), colmask = as(IRanges(), "NormalIRanges")) } res }

However, as I understand this there is no longer the choice/argument of getting the output as a "data.fame", like in the original function:

blastSequences(x, database, hitListSize, filter, expect, program,
      timeout=40, as=c("DNAMultipleAlignment", "data.frame", "XML"))

So this is where I am stuck...

I tried to simply (naïve) add the argument:

# function definition
blastSeqKK <- function (x, database = "nr", hitListSize = "10", 
                        filter = "L", expect = "10", program = "blastn",timeout=40,
 as=c("DNAMultipleAlignment", "data.frame", "XML"),
                        attempts = 10)

But it was "ignored" and the output when running:

res <- lapply(mySeq, function(x){
  # collapse seq into string
  seqCollapse <- paste(toupper(as.character(x)), collapse = "")
  # run blast
  blastRes <- blastSeqKK(x = seqCollapse, hitListSize = 2, timeout=60, 
    attempts = 30, as=c("data.frame") )
  return(blastRes)

Was still this:

> head(res)
[[1]]
[[1]][[1]]
DNAMultipleAlignment with 2 rows and 324 columns
aln
[1] CTGAATCTGCGAACGGCTCCGCACACCAGTTGCAAA...GATGGCAGCGTAAGGGACATGCTATGGTTACAACGG
[2] CTGAATCTGCGTACGGCTCCGCACACCAGTTGCAAA...GATGGCAGCGTAAGGGACATGCTATGG-TACAACGG

[[1]][[2]]
DNAMultipleAlignment with 2 rows and 324 columns
aln
[1] CTGAATCTGCGAACGGCTCCGCACACCAGTTGCAAA...GATGGCAGCGTAAGGGACATGCTATGGTTACAACGG
[2] CTGAATCTGCGTACGGCTCCGCACACCAGTTGCAAA...GATGGCAGCGTAAGGGACATGCTATGG-TACAACGG

[[2]]
[[2]][[1]]
DNAMultipleAlignment with 2 rows and 307 columns
     aln

Sooo, I don't know how to proceed right now, so again any suggestions would be much appreciated.

/Mathilde

ADD REPLY • link 8.8 years ago Mathilde • 0

score 1 · Answer 2 · 2015-07-18

1

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 17 hours ago

United States

I think you can change the final lines of blastSequenceKK() from

  result <- .tryParseResult(url1, attempts)
  qseq <- xpathApply(result, "//Hsp_qseq", xmlValue)
  hseq <- xpathApply(result, "//Hsp_hseq", xmlValue)
  require(Biostrings)
  res <- list()
  for (i in seq_len(length(qseq))) {
    res[i] <- DNAMultipleAlignment(c(hseq[[i]], qseq[[i]]), 
                 rowmask = as(IRanges(), "NormalIRanges"),
                 colmask = as(IRanges(), "NormalIRanges"))
  }
  res

to

  result <- .tryParseResult(url1, attempts)
  annotate:::.blastSequencesToDataFrame(result)

to return the data.frame that Marc mentioned. Also, to read the fasta file I'd do

library(Biostrings)
dna = readDNAStringSet("path/to.fasta")
lapply(as.character(dna), blastSequencesKK)

Actually, I'd just stick with blastSequences() and increase the timeout to whatever is tolerable. UPDATE: an earlier version included submitting multiple sequences as queries in parallel, but it's better to submit several sequences in a single query; this is supported in annotate version 1.47.3 and later.

ADD COMMENT • link 8.8 years ago • updated 8.7 years ago Martin Morgan 25k

0

Entering edit mode

Thank you so much for the detailed answer.

Your first advice worked, and I can get the "data.fame" output - perfect :)

However the function blastSeqKK still only works when the fasta file contains only 2 sequences - more, and it gives the error of "no result after xx attempts" I have set the attempts up to 50.

I also tried your 2nd advice and let it run for 4h before I stopped it - this was a file containing only 10 sequences.
I think I will try to leave it overnight, just to see if it eventually comes to and end - However I don't understand why it is struggling so? - A manual blast in NCBI's webpage takes only a few sec for a file with all (42) seq. Hhm...

ADD REPLY • link 8.7 years ago Mathilde • 0

1

Entering edit mode

I looked at the code, which is quite old. I think you can submit multiple fasta sequences, but unfortunately the code for processing the results does not remember which alignment goes with which submission. So the work-around is to define the following function...

.blastSequencesToDataFrame <- function(xml) {
    if (xpathSApply(xml, "count(//Hit)") == 0L) {
        message("'blastSequences' returned 0 matches")
        return(data.frame())
    }

    iter <- xml["//Iteration"]
    iterdf <- xmlToDataFrame(iter, stringsAsFactors=FALSE)
    iterlen <- sapply(iter, xpathSApply, "count(.//Hsp)")

    hit <- xml["//Hit"]
    hitdf <- xmlToDataFrame(hit, stringsAsFactors=FALSE)
    hitdf <- hitdf[, names(hitdf) != "Hit_hsps", drop=FALSE]
    
    hitlen <- sapply(hit, xpathSApply, "count(.//Hsp)")
    hsp <- xmlToDataFrame(xml["//Hsp"] , stringsAsFactors=FALSE)

    df <- cbind(
        iterdf[rep(seq_len(nrow(iterdf)), iterlen),, drop=FALSE],
        hitdf[rep(seq_len(nrow(hitdf)), hitlen),, drop=FALSE],
        hsp)
    rownames(df) <- NULL
    df
}

Then submit multiple sequences, retrieve the results as XML, and parse them

dna1 = paste(sprintf(">%s\n%s", names(dna), as.character(dna)), collapse="\n")
res = blastSequences(dna1, as="XML")
df = .blastSequencesToDataFrame(res)

This will be updated in the 'devel' version 1.47.3 of annotate.

ADD REPLY • link 8.7 years ago Martin Morgan 25k

0

Entering edit mode

Again many thanks! - I am embarrassed to say I still can't get it to work... The error now is, when I try to run res = blastSequences(dna1, as="XML") I get:

Error: failed to load HTTP resource

I have tried to look up help here:

?htmlParse
?xmlToDataFrame

But did not find any obvious solutions...

ADD REPLY • link 8.7 years ago Mathilde • 0

0

Entering edit mode

Is this a transient network issue?

ADD REPLY • link 8.7 years ago Martin Morgan 25k

0

Entering edit mode

No, I dont think so.

I don't experience any problems with the network.

I tried to run it many times during the day.

Will be out of the office for a few days but will try again thursday.

ADD REPLY • link 8.7 years ago Mathilde • 0

0

Entering edit mode

I don't see the error, so you'll somehow have to make this reproducible, e.g., by simplifying dna1 or by making a saved version of it or the fasta file available (e.g., to me at mtmorgan at fredhutch dot org),

ADD REPLY • link 8.7 years ago Martin Morgan 25k

0

Entering edit mode

Hi again,

I tried to send it to you (on the 23rd of July) - but not sure if I send the files you asked for? - did you receive it? (it is not to rush you, just making sure if the email reached you)

Best,

Mathilde

ADD REPLY • link 8.7 years ago Mathilde • 0

0

Entering edit mode

Yes, I received it and looked at it over the weekend. The query string (fasta sequence and other options) can only be about 4000 characters total; I think you'll have success with, e.g., 10 records at a time. It might also be necessary to encode the query string with utilis::URLencode(dna1).

It seems like it should be possible to simply upload the file, but I was not able to manage to do that through the blast REST API.

ADD REPLY • link 8.7 years ago Martin Morgan 25k