BLAST search sequence for species ID from R?

1

Entering edit mode

jos matejus ▴ 40

@jos-matejus-3848

Last seen 9.6 years ago

Dear list members, A colleague has asked whether I can help him with a bioinformatics problem he has as he knows I use R (although I don't usually use R for this type of problem) and I was hoping someone might be able to point me in the right direction. I have searched the mailing list archives and also Googled this particular query, but without success. I ask forgiveness in advance if the question is not appropriate for this forum. Anyway, the background is that my colleague has a sample collected from the field containing many species of related insects (same genus) which he has obtained lots of sequence information (from 454). The sequences are saved in a single fasta file. What he wants to do is to query Genbank to match each sequence from the fasta file to particular species (A nucleotide blast search I believe) and return the top ranked match for each sequence. He can do this manually via the web page, but he will have a lot of these files in the future and was looking for some way of automating the process (hence using R). He ultimately wants to be able to restrict the Blast search to a list of preselected Accession numbers or within genus. As I am not familiar with this field I was wondering whether anyone knows of an existing function (or functions) that can do the job. I am looking at the package seqinr at the moment to see whether this would fit the bill and also whether the Biostrings package would be appropriate. However, the learning curve looks a little steep and I wanted to make sure I was going down the right road before investing lots of time. Also, is there a package that I can use to access the Genbank database directly from within R to do the Blast searches? Many many thanks in advance Jos

Biostrings PROcess Biostrings PROcess • 2.8k views

ADD COMMENT • link updated 14.4 years ago by Michael Dondrup ▴ 550 • written 14.4 years ago by jos matejus ▴ 40

0

Entering edit mode

Sean Davis 21k

@sean-davis-490

Last seen 3 months ago

United States

On Mon, Dec 14, 2009 at 6:02 AM, jos matejus <matejus106 at="" googlemail.com=""> wrote: > Dear list members, > > A colleague has asked whether I can help him with a bioinformatics > problem he has as he knows I use R (although I don't usually use R for > this type of problem) and I was hoping someone might be able to point > me in the right direction. I have searched the mailing list archives > and also Googled this particular query, but without success. I ask > forgiveness in advance if the question is not appropriate for this > forum. > > Anyway, the background is that my colleague has a sample collected > from the field containing many species of related insects (same genus) > which he has obtained lots of sequence information (from 454). The > sequences are saved in a single fasta file. What he wants to do is to > query Genbank to match each sequence from the fasta file to particular > species (A nucleotide blast search I believe) and return the top > ranked match for each sequence. He can do this manually via the web > page, but he will have a lot of these files in the future and was > looking for some way of automating the process (hence using R). He > ultimately wants to be able to restrict the Blast search to a list of > preselected ?Accession numbers or within genus. > > As I am not familiar with this field I was wondering whether anyone > knows of an existing function (or functions) that can do the job. I am > looking at the package seqinr at the moment to see whether this would > fit the bill and also whether the Biostrings package would be > appropriate. However, the learning curve looks a little steep and I > wanted to make sure I was going down the right road before investing > lots of time. > > Also, is there a package that I can use to access the Genbank database > directly from within R to do the Blast searches? Hi, Jos. R/Bioconductor has the tools that you could use to rewrite blast if you like, but for cross-species comparisons, blast is a pretty good tool. You might consider downloading the blast source or a convenient binary from NCBI and running it locally. Of course, a command-line utility like blast can be easily scripted from R. As for getting records from GenBank, consider whether you need programmatic control over the process at all. It may be that your colleague already has a way to get the sequences he/she wants directly and just wants help with running blast. However, if it does appear that programmatic access is necessary, have a look at the EUtilities web services. With EUtilities, it is pretty straightforward to construct a URL, use something like getURL() from the Rcurl package, and get the results. Hope that helps. Sean

ADD COMMENT • link 14.4 years ago Sean Davis 21k

0

Entering edit mode

Sean Davis wrote: > On Mon, Dec 14, 2009 at 6:02 AM, jos matejus <matejus106 at="" googlemail.com=""> wrote: >> Dear list members, >> >> A colleague has asked whether I can help him with a bioinformatics >> problem he has as he knows I use R (although I don't usually use R for >> this type of problem) and I was hoping someone might be able to point >> me in the right direction. I have searched the mailing list archives >> and also Googled this particular query, but without success. I ask >> forgiveness in advance if the question is not appropriate for this >> forum. >> >> Anyway, the background is that my colleague has a sample collected >> from the field containing many species of related insects (same genus) >> which he has obtained lots of sequence information (from 454). The >> sequences are saved in a single fasta file. What he wants to do is to >> query Genbank to match each sequence from the fasta file to particular >> species (A nucleotide blast search I believe) and return the top >> ranked match for each sequence. He can do this manually via the web >> page, but he will have a lot of these files in the future and was >> looking for some way of automating the process (hence using R). He >> ultimately wants to be able to restrict the Blast search to a list of >> preselected Accession numbers or within genus. >> >> As I am not familiar with this field I was wondering whether anyone >> knows of an existing function (or functions) that can do the job. I am >> looking at the package seqinr at the moment to see whether this would >> fit the bill and also whether the Biostrings package would be >> appropriate. However, the learning curve looks a little steep and I >> wanted to make sure I was going down the right road before investing >> lots of time. To comment, in some ways I think your colleague needs more than a blast search from you. The 454 reads have sequencing errors and (presumably) amplification biases, and their variation in length means that blast matches are not equivalently powered (maybe the expect value accomodates this? it's not an area I'm particularly familiar with). BLASTing all reads is probably hugely inefficient (e.g., because many reads are likely duplicated). Also maybe worth thinking about how you'll help your colleague in the down-stream analysis -- they will have count data for each taxonomic unit, but presumably this is in the context of an experimental design requiring analysis, including appropriate error models for the count data. And perhaps there is ambiguity about 'species', both in terms of entities not represented in the blast data base and the phylogenetic question about delineating taxonomic units. Martin >> >> Also, is there a package that I can use to access the Genbank database >> directly from within R to do the Blast searches? > > Hi, Jos. > > R/Bioconductor has the tools that you could use to rewrite blast if > you like, but for cross-species comparisons, blast is a pretty good > tool. You might consider downloading the blast source or a convenient > binary from NCBI and running it locally. Of course, a command-line > utility like blast can be easily scripted from R. As for getting > records from GenBank, consider whether you need programmatic control > over the process at all. It may be that your colleague already has a > way to get the sequences he/she wants directly and just wants help > with running blast. However, if it does appear that programmatic > access is necessary, have a look at the EUtilities web services. With > EUtilities, it is pretty straightforward to construct a URL, use > something like getURL() from the Rcurl package, and get the results. > > Hope that helps. > > Sean > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 14.4 years ago Martin Morgan 25k

0

Entering edit mode

hi, seeing now Martins' comment, i recall the following related forthcoming PSB10 papers that may be addressing what you ask for: Clemente, Jansson and Valiente Accurate Taxonomic Assignment of Short Pyrosequencing Reads Pacific Symposium on Biocomputing 15:3-9(2010) http://psb.stanford.edu/psb-online/proceedings/psb10/clemente.pdf Essinger and Rosen Benchmarking BLAST Accuracy of Genus/Phyla Classification of Metagenomic Reads Pacific Symposium on Biocomputing 15:10-20(2010) http://psb.stanford.edu/psb-online/proceedings/psb10/essinger.pdf although i do not see any comment in their abstracts about providing a package for R and Bioconductor and so you'll have to contact those authors to request the software. cheers, robert. On Mon, 2009-12-14 at 09:08 -0800, Martin Morgan wrote: > Sean Davis wrote: > > On Mon, Dec 14, 2009 at 6:02 AM, jos matejus <matejus106 at="" googlemail.com=""> wrote: > >> Dear list members, > >> > >> A colleague has asked whether I can help him with a bioinformatics > >> problem he has as he knows I use R (although I don't usually use R for > >> this type of problem) and I was hoping someone might be able to point > >> me in the right direction. I have searched the mailing list archives > >> and also Googled this particular query, but without success. I ask > >> forgiveness in advance if the question is not appropriate for this > >> forum. > >> > >> Anyway, the background is that my colleague has a sample collected > >> from the field containing many species of related insects (same genus) > >> which he has obtained lots of sequence information (from 454). The > >> sequences are saved in a single fasta file. What he wants to do is to > >> query Genbank to match each sequence from the fasta file to particular > >> species (A nucleotide blast search I believe) and return the top > >> ranked match for each sequence. He can do this manually via the web > >> page, but he will have a lot of these files in the future and was > >> looking for some way of automating the process (hence using R). He > >> ultimately wants to be able to restrict the Blast search to a list of > >> preselected Accession numbers or within genus. > >> > >> As I am not familiar with this field I was wondering whether anyone > >> knows of an existing function (or functions) that can do the job. I am > >> looking at the package seqinr at the moment to see whether this would > >> fit the bill and also whether the Biostrings package would be > >> appropriate. However, the learning curve looks a little steep and I > >> wanted to make sure I was going down the right road before investing > >> lots of time. > > To comment, in some ways I think your colleague needs more than a blast > search from you. The 454 reads have sequencing errors and (presumably) > amplification biases, and their variation in length means that blast > matches are not equivalently powered (maybe the expect value accomodates > this? it's not an area I'm particularly familiar with). BLASTing all > reads is probably hugely inefficient (e.g., because many reads are > likely duplicated). Also maybe worth thinking about how you'll help your > colleague in the down-stream analysis -- they will have count data for > each taxonomic unit, but presumably this is in the context of an > experimental design requiring analysis, including appropriate error > models for the count data. And perhaps there is ambiguity about > 'species', both in terms of entities not represented in the blast data > base and the phylogenetic question about delineating taxonomic units. > > Martin > > >> > >> Also, is there a package that I can use to access the Genbank database > >> directly from within R to do the Blast searches? > > > > Hi, Jos. > > > > R/Bioconductor has the tools that you could use to rewrite blast if > > you like, but for cross-species comparisons, blast is a pretty good > > tool. You might consider downloading the blast source or a convenient > > binary from NCBI and running it locally. Of course, a command- line > > utility like blast can be easily scripted from R. As for getting > > records from GenBank, consider whether you need programmatic control > > over the process at all. It may be that your colleague already has a > > way to get the sequences he/she wants directly and just wants help > > with running blast. However, if it does appear that programmatic > > access is necessary, have a look at the EUtilities web services. With > > EUtilities, it is pretty straightforward to construct a URL, use > > something like getURL() from the Rcurl package, and get the results. > > > > Hope that helps. > > > > Sean > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at stat.math.ethz.ch > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > Martin Morgan > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 14.4 years ago Robert Castelo ★ 3.3k

0

Entering edit mode

Dear all, Many thanks for all the advice, it is greatly appreciated. I am always impressed by the amount of time and effort people are willing to give. I'll let the list know how I get on. Seasons greetings to you all Jos On Monday, December 14, 2009, Sean Davis <seandavi at="" gmail.com=""> wrote: > On Mon, Dec 14, 2009 at 6:02 AM, jos matejus <matejus106 at="" googlemail.com=""> wrote: >> Dear list members, >> >> A colleague has asked whether I can help him with a bioinformatics >> problem he has as he knows I use R (although I don't usually use R for >> this type of problem) and I was hoping someone might be able to point >> me in the right direction. I have searched the mailing list archives >> and also Googled this particular query, but without success. I ask >> forgiveness in advance if the question is not appropriate for this >> forum. >> >> Anyway, the background is that my colleague has a sample collected >> from the field containing many species of related insects (same genus) >> which he has obtained lots of sequence information (from 454). The >> sequences are saved in a single fasta file. What he wants to do is to >> query Genbank to match each sequence from the fasta file to particular >> species (A nucleotide blast search I believe) and return the top >> ranked match for each sequence. He can do this manually via the web >> page, but he will have a lot of these files in the future and was >> looking for some way of automating the process (hence using R). He >> ultimately wants to be able to restrict the Blast search to a list of >> preselected ?Accession numbers or within genus. >> >> As I am not familiar with this field I was wondering whether anyone >> knows of an existing function (or functions) that can do the job. I am >> looking at the package seqinr at the moment to see whether this would >> fit the bill and also whether the Biostrings package would be >> appropriate. However, the learning curve looks a little steep and I >> wanted to make sure I was going down the right road before investing >> lots of time. >> >> Also, is there a package that I can use to access the Genbank database >> directly from within R to do the Blast searches? > > Hi, Jos. > > R/Bioconductor has the tools that you could use to rewrite blast if > you like, but for cross-species comparisons, blast is a pretty good > tool. ?You might consider downloading the blast source or a convenient > binary from NCBI and running it locally. ?Of course, a command-line > utility like blast can be easily scripted from R. ?As for getting > records from GenBank, consider whether you need programmatic control > over the process at all. ?It may be that your colleague already has a > way to get the sequences he/she wants directly and just wants help > with running blast. ?However, if it does appear that programmatic > access is necessary, have a look at the EUtilities web services. ?With > EUtilities, it is pretty straightforward to construct a URL, use > something like getURL() from the Rcurl package, and get the results. > > Hope that helps. > > Sean >

ADD REPLY • link 14.4 years ago jos matejus ▴ 40

0

Entering edit mode

Michael Dondrup ▴ 550

@michael-dondrup-3849

Last seen 9.6 years ago

Hi, if your search involves > thousands of sequences, it might be a good idea to download and install the NCBI blast software and the nucleotide (nt) database locally and perform the searches in a program call on a fast machine. (So this is not an R task) You can then use R to further process the output files, e.g. counting top-hits to single species. If the blast output is in tabular form ( blast option for this is "-m 8" ) this should be the easiest way of getting the data into R. A different story will be to tune the blast parameters to work effectively in this short-read 454 setting. I would like to propose you have a look at other metagenomics approaches, e.g. the MEGAN software: http://www-ab.informatik.uni-tuebingen.de/software/megan This might also be a good tool to try. It does not use R though, but you can get some ideas about how to set blast parameters like decrease word size and turn of low complexity filtering. Best Michael Am Dec 14, 2009 um 12:02 PM schrieb jos matejus: > Dear list members, > > A colleague has asked whether I can help him with a bioinformatics > problem he has as he knows I use R (although I don't usually use R for > this type of problem) and I was hoping someone might be able to point > me in the right direction. I have searched the mailing list archives > and also Googled this particular query, but without success. I ask > forgiveness in advance if the question is not appropriate for this > forum. > > Anyway, the background is that my colleague has a sample collected > from the field containing many species of related insects (same genus) > which he has obtained lots of sequence information (from 454). The > sequences are saved in a single fasta file. What he wants to do is to > query Genbank to match each sequence from the fasta file to particular > species (A nucleotide blast search I believe) and return the top > ranked match for each sequence. He can do this manually via the web > page, but he will have a lot of these files in the future and was > looking for some way of automating the process (hence using R). He > ultimately wants to be able to restrict the Blast search to a list of > preselected Accession numbers or within genus. > > As I am not familiar with this field I was wondering whether anyone > knows of an existing function (or functions) that can do the job. I am > looking at the package seqinr at the moment to see whether this would > fit the bill and also whether the Biostrings package would be > appropriate. However, the learning curve looks a little steep and I > wanted to make sure I was going down the right road before investing > lots of time. > > Also, is there a package that I can use to access the Genbank database > directly from within R to do the Blast searches? > > Many many thanks in advance > Jos > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 14.4 years ago Michael Dondrup ▴ 550

0

Entering edit mode

Dear Jos, Michael. To add to Michael's good advice. Roche 454, as I understand it, generates ~450bp reads, so there should be no problem with the default BLAST word size parameter. Once you downloaded your nucleotide database files and your copy of NCBI blastall program, do something like: formatdb -i nucleotide_file -o T -p F -n nucleotide_db This generates a database to search; then to search your 454 sequences, do: blastall -a 2 -p blastn -i 454_file -d nucleotide_db -e 0.01 -F "m S" -m 8 -o output_file -F "m S" means it will ignore simple sequences in look up phase of BLAST but use them in extending HSPs (I usually use it with protein searches), you can also try -F F or -F T if it runs slow etc. Type: blastall with no arguments to see full list of what the arguments mean, also formatdb to see those too. You could also try; for species sequences, I usually find ftp://ftp.ncbi.nih.gov/refseq/release download site quite useful, though for bacterial species I don't know. Also sounds like some Perl programming magic would be very useful to you here. Good luck. Kind regards, John. -----Original Message----- From: bioconductor-bounces@stat.math.ethz.ch [mailto:bioconductor- bounces@stat.math.ethz.ch] On Behalf Of Michael Dondrup Sent: 14 December 2009 12:34 To: jos matejus Cc: bioconductor at stat.math.ethz.ch Subject: Re: [BioC] BLAST search sequence for species ID from R? Hi, if your search involves > thousands of sequences, it might be a good idea to download and install the NCBI blast software and the nucleotide (nt) database locally and perform the searches in a program call on a fast machine. (So this is not an R task) You can then use R to further process the output files, e.g. counting top-hits to single species. If the blast output is in tabular form ( blast option for this is "-m 8" ) this should be the easiest way of getting the data into R. A different story will be to tune the blast parameters to work effectively in this short-read 454 setting. I would like to propose you have a look at other metagenomics approaches, e.g. the MEGAN software: http://www-ab.informatik.uni-tuebingen.de/software/megan This might also be a good tool to try. It does not use R though, but you can get some ideas about how to set blast parameters like decrease word size and turn of low complexity filtering. Best Michael Am Dec 14, 2009 um 12:02 PM schrieb jos matejus: > Dear list members, > > A colleague has asked whether I can help him with a bioinformatics > problem he has as he knows I use R (although I don't usually use R for > this type of problem) and I was hoping someone might be able to point > me in the right direction. I have searched the mailing list archives > and also Googled this particular query, but without success. I ask > forgiveness in advance if the question is not appropriate for this > forum. > > Anyway, the background is that my colleague has a sample collected > from the field containing many species of related insects (same genus) > which he has obtained lots of sequence information (from 454). The > sequences are saved in a single fasta file. What he wants to do is to > query Genbank to match each sequence from the fasta file to particular > species (A nucleotide blast search I believe) and return the top > ranked match for each sequence. He can do this manually via the web > page, but he will have a lot of these files in the future and was > looking for some way of automating the process (hence using R). He > ultimately wants to be able to restrict the Blast search to a list of > preselected Accession numbers or within genus. > > As I am not familiar with this field I was wondering whether anyone > knows of an existing function (or functions) that can do the job. I am > looking at the package seqinr at the moment to see whether this would > fit the bill and also whether the Biostrings package would be > appropriate. However, the learning curve looks a little steep and I > wanted to make sure I was going down the right road before investing > lots of time. > > Also, is there a package that I can use to access the Genbank database > directly from within R to do the Blast searches? > > Many many thanks in advance > Jos > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor _______________________________________________ Bioconductor mailing list Bioconductor at stat.math.ethz.ch https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 14.4 years ago John Herbert ▴ 90

0

Entering edit mode

John Herbert wrote: > Dear Jos, Michael. > To add to Michael's good advice. > > Roche 454, as I understand it, generates ~450bp reads, so there should be no problem with the default BLAST word size parameter. > > Once you downloaded your nucleotide database files and your copy of NCBI blastall program, do something like: > > formatdb -i nucleotide_file -o T -p F -n nucleotide_db > > This generates a database to search; then to search your 454 sequences, do: > > blastall -a 2 -p blastn -i 454_file -d nucleotide_db -e 0.01 -F "m S" -m 8 -o output_file > > -F "m S" means it will ignore simple sequences in look up phase of BLAST but use them in extending HSPs (I usually use it with protein searches), you can also try -F F or -F T if it runs slow etc. > Type: blastall with no arguments to see full list of what the arguments mean, also formatdb to see those too. > > You could also try; for species sequences, I usually find ftp://ftp.ncbi.nih.gov/refseq/release download site quite useful, though for bacterial species I don't know. > > Also sounds like some Perl programming magic would be very useful to you here. > > The Bioconductor infrastructure here is really powerful (especially if you're an R user and not a perl user ;) -- library(ShortRead); fq = readFastq("/some/file.fastq") gets the 454 data in as a DNAStringSet. There are really great string manipulation / pattern matching tools in IRanges / Biostrings / ShortRead to operate on this data in just a few lines, .e.g., selecting those reads >250 and < 400 nucleotides (fq = fq[width(fq) > 250 & width(fq) < 400]), and then removing the trailing 10 (?) bp (narrow(fq, start=1, end=width(fq)-10)) which tend to have lower quality -- easy to check with q1 = narrow(quality(fq), start=width(fq)-10, end=width(fq)); plot(colMeans(as(q1, "matrix"))) -- and save to a fastq file again (with writeFastq). Since these sorts of data often involve a PCR reaction during sample prep, trimLRPatterns is useful for removing primer remnants. The tables() function in ShortRead can be useful for summarizing the reads that you do have, including on narrowed portions (e.g., first 10 bp) where sequence / sample prep artifacts are likely. And 454 data tends to be 'small' so the memory requirements are often reasonable. Perl is also useful as a scripting language, but to launch blast form within R is just a call to system(). Martin > Good luck. > > Kind regards, > > John. > > -----Original Message----- > From: bioconductor-bounces at stat.math.ethz.ch [mailto :bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Michael Dondrup > Sent: 14 December 2009 12:34 > To: jos matejus > Cc: bioconductor at stat.math.ethz.ch > Subject: Re: [BioC] BLAST search sequence for species ID from R? > > Hi, > > if your search involves > thousands of sequences, it might be a good idea to download and install the NCBI blast software > and the nucleotide (nt) database locally and perform the searches in a program call on a fast machine. (So this is not an R task) > > You can then use R to further process the output files, e.g. counting top-hits to single species. If the blast output is in tabular form > ( blast option for this is "-m 8" ) > this should be the easiest way of getting the data into R. > > A different story will be to tune the blast parameters to work effectively in this short-read 454 setting. I would like to > propose you have a look at other metagenomics approaches, e.g. the MEGAN software: > http://www-ab.informatik.uni-tuebingen.de/software/megan This might also be a good tool to try. > It does not use R though, but you can get some ideas about how to set blast parameters like decrease word size and turn of low > complexity filtering. > > Best > Michael > > > Am Dec 14, 2009 um 12:02 PM schrieb jos matejus: > > >> Dear list members, >> >> A colleague has asked whether I can help him with a bioinformatics >> problem he has as he knows I use R (although I don't usually use R for >> this type of problem) and I was hoping someone might be able to point >> me in the right direction. I have searched the mailing list archives >> and also Googled this particular query, but without success. I ask >> forgiveness in advance if the question is not appropriate for this >> forum. >> >> Anyway, the background is that my colleague has a sample collected >> from the field containing many species of related insects (same genus) >> which he has obtained lots of sequence information (from 454). The >> sequences are saved in a single fasta file. What he wants to do is to >> query Genbank to match each sequence from the fasta file to particular >> species (A nucleotide blast search I believe) and return the top >> ranked match for each sequence. He can do this manually via the web >> page, but he will have a lot of these files in the future and was >> looking for some way of automating the process (hence using R). He >> ultimately wants to be able to restrict the Blast search to a list of >> preselected Accession numbers or within genus. >> >> As I am not familiar with this field I was wondering whether anyone >> knows of an existing function (or functions) that can do the job. I am >> looking at the package seqinr at the moment to see whether this would >> fit the bill and also whether the Biostrings package would be >> appropriate. However, the learning curve looks a little steep and I >> wanted to make sure I was going down the right road before investing >> lots of time. >> >> Also, is there a package that I can use to access the Genbank database >> directly from within R to do the Blast searches? >> >> Many many thanks in advance >> Jos >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 14.4 years ago Martin Morgan 25k

0

Entering edit mode

Martin Morgan wrote: > John Herbert wrote: >> Dear Jos, Michael. >> To add to Michael's good advice. >> Roche 454, as I understand it, generates ~450bp reads, so there >> should be no problem with the default BLAST word size parameter. >> >> Once you downloaded your nucleotide database files and your copy of >> NCBI blastall program, do something like: >> >> formatdb -i nucleotide_file -o T -p F -n nucleotide_db >> >> This generates a database to search; then to search your 454 >> sequences, do: >> >> blastall -a 2 -p blastn -i 454_file -d nucleotide_db -e 0.01 -F "m S" >> -m 8 -o output_file >> >> -F "m S" means it will ignore simple sequences in look up phase of >> BLAST but use them in extending HSPs (I usually use it with protein >> searches), you can also try -F F or -F T if it runs slow etc. >> Type: blastall with no arguments to see full list of what the >> arguments mean, also formatdb to see those too. >> You could also try; for species sequences, I usually find >> ftp://ftp.ncbi.nih.gov/refseq/release download site quite useful, >> though for bacterial species I don't know. >> >> Also sounds like some Perl programming magic would be very useful to >> you here. >> >> > The Bioconductor infrastructure here is really powerful (especially if > you're an R user and not a perl user ;) -- library(ShortRead); fq = > readFastq("/some/file.fastq") gets the 454 data in as a DNAStringSet. Oops, that's assuming it's in fastq format; maybe it's just fasta (Biostrings::read.DNAStringSet) or in separate files (ShortRead's read454) or even in columns (ShortRead's readXStringColumns). Martin > There are really great string manipulation / pattern matching tools in > IRanges / Biostrings / ShortRead to operate on this data in just a few > lines, .e.g., selecting those reads >250 and < 400 nucleotides (fq = > fq[width(fq) > 250 & width(fq) < 400]), and then removing the trailing > 10 (?) bp (narrow(fq, start=1, end=width(fq)-10)) which tend to have > lower quality -- easy to check with q1 = narrow(quality(fq), > start=width(fq)-10, end=width(fq)); plot(colMeans(as(q1, "matrix"))) > -- and save to a fastq file again (with writeFastq). Since these sorts > of data often involve a PCR reaction during sample prep, > trimLRPatterns is useful for removing primer remnants. The tables() > function in ShortRead can be useful for summarizing the reads that you > do have, including on narrowed portions (e.g., first 10 bp) where > sequence / sample prep artifacts are likely. And 454 data tends to be > 'small' so the memory requirements are often reasonable. > > Perl is also useful as a scripting language, but to launch blast form > within R is just a call to system(). > > Martin > >> Good luck. >> Kind regards, >> >> John. >> -----Original Message----- >> From: bioconductor-bounces at stat.math.ethz.ch >> [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Michael >> Dondrup >> Sent: 14 December 2009 12:34 >> To: jos matejus >> Cc: bioconductor at stat.math.ethz.ch >> Subject: Re: [BioC] BLAST search sequence for species ID from R? >> >> Hi, >> >> if your search involves > thousands of sequences, it might be a good >> idea to download and install the NCBI blast software and the >> nucleotide (nt) database locally and perform the searches in a >> program call on a fast machine. (So this is not an R task) >> >> You can then use R to further process the output files, e.g. counting >> top-hits to single species. If the blast output is in tabular form ( >> blast option for this is "-m 8" ) >> this should be the easiest way of getting the data into R. >> >> A different story will be to tune the blast parameters to work >> effectively in this short-read 454 setting. I would like to propose >> you have a look at other metagenomics approaches, e.g. the MEGAN >> software: >> http://www-ab.informatik.uni-tuebingen.de/software/megan This might >> also be a good tool to try. >> It does not use R though, but you can get some ideas about how to set >> blast parameters like decrease word size and turn of low >> complexity filtering. >> >> Best >> Michael >> >> >> Am Dec 14, 2009 um 12:02 PM schrieb jos matejus: >> >> >>> Dear list members, >>> >>> A colleague has asked whether I can help him with a bioinformatics >>> problem he has as he knows I use R (although I don't usually use R for >>> this type of problem) and I was hoping someone might be able to point >>> me in the right direction. I have searched the mailing list archives >>> and also Googled this particular query, but without success. I ask >>> forgiveness in advance if the question is not appropriate for this >>> forum. >>> >>> Anyway, the background is that my colleague has a sample collected >>> from the field containing many species of related insects (same genus) >>> which he has obtained lots of sequence information (from 454). The >>> sequences are saved in a single fasta file. What he wants to do is to >>> query Genbank to match each sequence from the fasta file to particular >>> species (A nucleotide blast search I believe) and return the top >>> ranked match for each sequence. He can do this manually via the web >>> page, but he will have a lot of these files in the future and was >>> looking for some way of automating the process (hence using R). He >>> ultimately wants to be able to restrict the Blast search to a list of >>> preselected Accession numbers or within genus. >>> >>> As I am not familiar with this field I was wondering whether anyone >>> knows of an existing function (or functions) that can do the job. I am >>> looking at the package seqinr at the moment to see whether this would >>> fit the bill and also whether the Biostrings package would be >>> appropriate. However, the learning curve looks a little steep and I >>> wanted to make sure I was going down the right road before investing >>> lots of time. >>> >>> Also, is there a package that I can use to access the Genbank database >>> directly from within R to do the Blast searches? >>> >>> Many many thanks in advance >>> Jos >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 14.4 years ago Martin Morgan 25k

0

Entering edit mode

Michael Dondrup ▴ 550

@michael-dondrup-3849

Last seen 9.6 years ago

Hi, if your search involves > thousands of sequences, it might be a good idea to download and install the NCBI blast software and the nucleotide (nt) database locally and perform the searches in a program call on a fast machine. (So this is not an R task) You can then use R to further process the output files, e.g. counting top-hits to single species. If the blast output is in tabular form ( blast option for this is "-m 8" ) this should be the easiest way of getting the data into R. A different story will be to tune the blast parameters to work effectively in this short-read 454 setting. I would like to propose you have a look at other metagenomics approaches, e.g. the MEGAN software: http://www-ab.informatik.uni-tuebingen.de/software/megan This might also be a good tool to try. It does not use R though, but you can get some ideas about how to set blast parameters like decrease word size and turn of low complexity filtering. Best Michael Am Dec 14, 2009 um 12:02 PM schrieb jos matejus: > Dear list members, > > A colleague has asked whether I can help him with a bioinformatics > problem he has as he knows I use R (although I don't usually use R for > this type of problem) and I was hoping someone might be able to point > me in the right direction. I have searched the mailing list archives > and also Googled this particular query, but without success. I ask > forgiveness in advance if the question is not appropriate for this > forum. > > Anyway, the background is that my colleague has a sample collected > from the field containing many species of related insects (same genus) > which he has obtained lots of sequence information (from 454). The > sequences are saved in a single fasta file. What he wants to do is to > query Genbank to match each sequence from the fasta file to particular > species (A nucleotide blast search I believe) and return the top > ranked match for each sequence. He can do this manually via the web > page, but he will have a lot of these files in the future and was > looking for some way of automating the process (hence using R). He > ultimately wants to be able to restrict the Blast search to a list of > preselected Accession numbers or within genus. > > As I am not familiar with this field I was wondering whether anyone > knows of an existing function (or functions) that can do the job. I am > looking at the package seqinr at the moment to see whether this would > fit the bill and also whether the Biostrings package would be > appropriate. However, the learning curve looks a little steep and I > wanted to make sure I was going down the right road before investing > lots of time. > > Also, is there a package that I can use to access the Genbank database > directly from within R to do the Blast searches? > > Many many thanks in advance > Jos > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 14.4 years ago Michael Dondrup ▴ 550

Login before adding your answer.