biomaRt: connection stopping
@jdelasherasedacuk-1189
Last seen 9.3 years ago
United Kingdom
Hi,

I suspect this is something to do purely with my connection, but I thought I'd ask, just in case:

I have a list of refseq ids (NM_xxxxx), 18028 of them. I wanted to get the gene symbols for those genes, so I used biomaRt on the whole list. What I got was a single-column data frame longer than 18028, as I get multiple results with some of these refseq ids. There doesn't seem to be an easy way to regroup them together, so I do the following instead:

#create an empty list of the right length
A<-vector(mode="list", length=18028)
#now loop filling elements of the list from the biomaRt queries
for (i in 1:18028){
  K<-i
  A[[i]]<-getBM(attributes=c("hgnc_symbol"),mart=mart,filters="refseq_dna",values=c(RS[i]))
}
print(K)

RS is a vector containing the 18028 refseq ids. The K value is only so that I know where it breaks... because that's what happens... after a while, it breaks with an error message:

Error in postForm(paste(mart@host, "?", sep = ""), query = xmlQuery) :
  couldn't connect to host

This doesn't happen if I send the whole query in ONE go, in a vector... but if I do it element by element it breaks after 3-4000 queries.

Any ideas to do this in a simpler/better way? Or at least one that doesn't have me coming back to re-start the loop at the position of the last break?

thanks!

Jose

--
Dr. Jose I. de las Heras
Email: J.delasHeras at ed.ac.uk
The Wellcome Trust Centre for Cell Biology
Institute for Cell & Molecular Biology
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK
Phone: +44 (0)131 6513374
Fax: +44 (0)131 6507360
@james-w-macdonald-5106
Last seen 14 hours ago
United States
J.delasHeras at ed.ac.uk wrote:
> I have a list of refseq ids (NM_xxxxx), 18028 of them.
> I wanted to get the gene symbols for those genes, so I used biomaRt on
> the whole list. What I got was a single column data frame longer than
> 18028, as I get multiple results with some of these refseq ids. There
> doesn't seem to be an easy way to regroup them together, so I do the
> following instead:

Using the RCurl interface for a big query like that isn't ideal. You would be better off installing RMySQL and using the MySQL interface (note: you can get RMySQL using biocLite(), thanks to the fine folks in Seattle). Also, you can have getBM() put things in a list, so any duplicated gene symbols will be grouped together.

A <- getBM("hgnc_symbol", "refseq_dna", RS, mart = mart, output = "list", mysql = TRUE)

Should do the trick.

HTH,

Jim

--
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623
Hi,

I would like to add that biomaRt in RCurl mode can handle big queries but will break when you use it in a big loop. An alternative to what Jim suggests could be to do the query for all ids at once:

A<-getBM(attributes=c("hgnc_symbol","refseq_dna"),mart=mart,filters="refseq_dna",values=RS)

By adding refseq_dna as an attribute, HUGO symbols and RefSeq identifiers will be automatically matched up in A. If needed, you can loop over the result in A, and because you avoid doing 18000+ separate database queries it will be faster.

best,
Steffen

--
Steffen Durinck, Ph.D.
Oncogenomics Section
Pediatric Oncology Branch
National Cancer Institute, National Institutes of Health
URL: http://home.ccr.cancer.gov/oncology/oncogenomics/
Phone: 301-402-8103
Address: Advanced Technology Center, 8717 Grovemont Circle, Gaithersburg, MD 20877
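The regrouping asked about in the original question can be done locally once the result of a single query contains both columns. The following is only a minimal sketch in base R, assuming the result has the columns `hgnc_symbol` and `refseq_dna`; the small data frame and the ids in it are made-up stand-ins, not real biomaRt output.

```r
# Stand-in for the data frame returned by one getBM() call with both
# attributes (hypothetical values, for illustration only).
A <- data.frame(
  hgnc_symbol = c("SYM1", "SYM2", "SYM3", ""),
  refseq_dna  = c("NM_000001", "NM_000001", "NM_000002", "NM_000003"),
  stringsAsFactors = FALSE
)

# The ids that were queried; here NM_000004 returned no rows at all.
RS <- c("NM_000001", "NM_000002", "NM_000003", "NM_000004")

# Treat empty symbols as missing, then group the symbols by RefSeq id.
A$hgnc_symbol[A$hgnc_symbol == ""] <- NA
by_id <- split(A$hgnc_symbol, A$refseq_dna)

# Re-index by the original query vector so ids without a hit are kept
# (their entry is NULL) and the order matches RS.
symbols <- setNames(by_id[RS], RS)
```

The resulting `symbols` list has one element per queried id, which is the shape the element-by-element loop was trying to build, without sending one request per id.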
Great suggestions! Thanks!

Adding refseq as an attribute too, why didn't I think of that? :-)

Jose

Quoting Steffen Durinck <durincks at mail.nih.gov>:
> An alternative to what Jim suggests could be to do the query for all ids
> at once:
>
> A<-getBM(attributes=c("hgnc_symbol","refseq_dna"),mart=mart,filters="refseq_dna",values=RS)
>
> By adding refseq_dna as an attribute, HUGO symbols and RefSeq
> identifiers will be automatically matched up in A.
Quoting "James W. MacDonald" <jmacdon at med.umich.edu>:
> Using the RCurl interface for a big query like that isn't ideal. You
> would be better off installing RMySQL and using the MySQL interface
> (note: you can get RMySQL using biocLite(), thanks to the fine folks in
> Seattle). Also, you can have getBM() put things in a list, so any
> duplicated gene symbols will be grouped together.
>
> A <- getBM("hgnc_symbol", "refseq_dna", RS, mart = mart, output =
> "list", mysql = TRUE)
>
> Should do the trick.

ah, so simple... :-)

thanks a lot Jim, I totally overlooked the different output styles.

As for the MySQL interface... you're probably right. We have *a* bioinformatician here and he was trying to convince me not long ago that I should take a look at the wonders of working with MySQL...

Jose
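For the part of the question about not having to restart the loop by hand after a dropped connection, one possible middle ground between 18000+ single queries and one huge query is to send the ids in chunks and retry a chunk when it fails. This is only a sketch: `query_in_chunks`, `query_fun`, the chunk size, and the retry settings are all hypothetical names and values, and `fake_query` stands in for a real `getBM()` call so the example runs offline.

```r
# Run `query_fun` over `ids` in chunks, retrying a failed chunk a few times.
# `query_fun` is a hypothetical stand-in for something like
#   function(x) getBM(attributes = c("hgnc_symbol", "refseq_dna"),
#                     filters = "refseq_dna", values = x, mart = mart)
# and is expected to return a data frame.
query_in_chunks <- function(ids, query_fun, chunk_size = 500,
                            max_tries = 3, wait = 5) {
  chunks <- split(ids, ceiling(seq_along(ids) / chunk_size))
  results <- vector("list", length(chunks))
  for (i in seq_along(chunks)) {
    for (attempt in seq_len(max_tries)) {
      # Capture an error as a value instead of aborting the whole loop.
      res <- tryCatch(query_fun(chunks[[i]]), error = function(e) e)
      if (!inherits(res, "error")) break
      if (attempt == max_tries) {
        stop("chunk ", i, " failed after ", max_tries, " tries: ",
             conditionMessage(res))
      }
      Sys.sleep(wait)  # pause before retrying
    }
    results[[i]] <- res
  }
  do.call(rbind, results)
}

# Offline demonstration: a fake query function that fails on its second call.
calls <- 0
fake_query <- function(x) {
  calls <<- calls + 1
  if (calls == 2) stop("couldn't connect to host")
  data.frame(hgnc_symbol = paste0("SYM_", x), refseq_dna = x,
             stringsAsFactors = FALSE)
}

ids <- sprintf("NM_%06d", 1:10)
out <- query_in_chunks(ids, fake_query, chunk_size = 4, wait = 0)
```

In the demonstration the second call fails once and is retried, so all ten ids still end up in `out`. Whether chunking is needed at all depends on the server; as noted in the thread, a single query with all ids worked fine.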