libraries or commands to help with parsing or handling web based database queries
ALAN SMITH ▴ 40
@alan-smith-1941
Last seen 9.6 years ago
Hello Bioconductors,

I am having a very hard time figuring out how to turn web-based database query results into a nice neat table (if such a thing is possible in R). I constantly search the metabolite database METLIN by copying and pasting addresses. I have to search this database with several hundred entries, often, and would like to automate the process to remove the HUGE amount of time I spend on this carpal-tunnel-inducing routine. I have found several ways to get a page's source, like:

library(RCurl)
test <- getURL("http://metlin.scripps.edu/metabo_list.php?mass_min=112.04885&mass_max=112.0555")
# OR
url.show("http://metlin.scripps.edu/metabo_list.php?mass_min=112.04885&mass_max=112.0555")

Once I get the URL info I notice that the data I am interested in is between and .

Are there any packages or methods in R to extract the information I am interested in? I am having problems manipulating strings in R, like selecting all of the text between two strings. I am not a programmer.

Thanks,
Alan

Note: I am able to use KEGGSOAP without any trouble.
PROcess KEGGSOAP • 907 views
Thomas Girke ★ 1.7k
@thomas-girke-993
Last seen 7 days ago
United States
Alan,

You will need some basic knowledge of how to use regular expressions within R's grep() and gsub() functions. Additional useful functions are paste() and Sys.sleep(). RCurl also provides some useful utilities for this approach.

Below is a short example on a similar problem: obtaining peptide MW information from the Expasy site (http://ca.expasy.org/tools/pi_tool.html).

###################################################################
myentries <- c("MKWVTFISLLFLFSSAYS", "MWVTFISLL", "MFISLLFLFSSAYS")
myresult <- NULL
for(i in myentries) {
    myurl <- paste("http://ca.expasy.org/cgi-bin/pi_tool?protein=", i,
                   "&resolution=monoisotopic", sep="")
    x <- url(myurl)
    res <- readLines(x)
    close(x)
    mylines <- res[grep('Theoretical pI/Mw:', res)]
    myresult <- c(myresult, as.numeric(gsub('.*/ ', '', mylines)))
    print(myresult)
    Sys.sleep(1) # halt for one second to give the database a break
}
final <- data.frame(Pep=myentries, MW=myresult)
cat("\n The MW values for my peptides are:\n")
print(final)
###################################################################

Thomas

--
Thomas Girke, Ph.D.
1008 Noel T. Keen Hall
Center for Plant Cell Biology (CEPCEB)
University of California
Riverside, CA 92521
E-mail: thomas.girke at ucr.edu
Website: http://faculty.ucr.edu/~tgirke
Ph: 951-827-2469 Fax: 951-827-4437
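A follow-up note on the specific problem of selecting all of the text between two strings: this can be done with base R alone, using regexpr() and substr(). A minimal sketch — the delimiter strings in the example are placeholders (the actual markers were lost from the original post), so substitute whatever brackets the data in the METLIN page source:

###################################################################
extract_between <- function(txt, left, right) {
    # locate the left delimiter (fixed = TRUE: match it literally, not as a regex)
    start <- regexpr(left, txt, fixed = TRUE)
    if (start == -1) return(NA)
    rest <- substr(txt, start + attr(start, "match.length"), nchar(txt))
    # locate the right delimiter in the remainder
    end <- regexpr(right, rest, fixed = TRUE)
    if (end == -1) return(NA)
    substr(rest, 1, end - 1)
}

# example: pull the table body out of a page-source string
page <- "header junk<table>MID MASS Name Formula</table>footer junk"
extract_between(page, "<table>", "</table>")  # "MID MASS Name Formula"
###################################################################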
Hi Alan,

Which parts are you interested in exactly? Looking at the page, there are MID, MASS, Name, and Formula fields, which seem fairly easy to extract from the code. However, the structure seems a little bit trickier to me.

Regards,
Benjamin
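Benjamin's observation that the MID, MASS, Name and Formula fields are the easy part suggests a grep()/gsub() approach along the lines Thomas showed. A rough sketch, with the caveat that the "metabo_info" grep pattern is only a guess at how METLIN marks its result rows — the real page source is not shown in this thread, so the pattern must be adjusted after inspecting the source by eye:

###################################################################
library(RCurl)
src <- getURL("http://metlin.scripps.edu/metabo_list.php?mass_min=112.04885&mass_max=112.0555")
pagelines <- unlist(strsplit(src, "\n"))
# keep only the lines that look like result rows (pattern is an assumption)
hits <- grep("metabo_info", pagelines, value = TRUE)
# strip HTML tags, leaving the cell text separated by spaces
fields <- gsub("<[^>]+>", " ", hits)
print(fields)
###################################################################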
Alan,

Have you looked at using the XML package? Depending on how malformed the HTML is, it may be useful, as it is designed to parse these types of data.

Sean
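Expanding on Sean's suggestion: the XML package's htmlTreeParse() is tolerant of malformed HTML, and XPath queries can then pull out whole tables without hand-written regexes. A sketch under the assumption that the METLIN hits live in an HTML table (untested against the live site; which table holds the hits must be checked by eye):

###################################################################
library(RCurl)
library(XML)
src <- getURL("http://metlin.scripps.edu/metabo_list.php?mass_min=112.04885&mass_max=112.0555")
doc <- htmlTreeParse(src, asText = TRUE, useInternalNodes = TRUE)
# list every table node in the page
tables <- getNodeSet(doc, "//table")
# cell text of the first table, one string per row
rows <- xpathSApply(doc, "//table[1]//tr", xmlValue)
print(rows)
###################################################################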
