Question

biomaRt: connection stopping

J.delasHeras@ed.ac.uk ★ 1.9k

@jdelasherasedacuk-1189

Last seen 8.7 years ago

United Kingdom

Hi,

I suspect this is something to do purely with my connection, but I thought I'd ask, just in case:

I have a list of refseq ids (NM_xxxxx), 18028 of them. I wanted to get the gene symbols for those genes, so I used biomaRt on the whole list. What I got was a single-column data frame longer than 18028 rows, as I get multiple results for some of these refseq ids. There doesn't seem to be an easy way to regroup them together, so I do the following instead:

    # create an empty list of the right length
    A <- vector(mode = "list", length = 18028)
    # now loop, filling elements of the list from the biomaRt queries
    for (i in 1:18028) {
      K <- i
      A[[i]] <- getBM(attributes = c("hgnc_symbol"), mart = mart,
                      filters = "refseq_dna", values = c(RS[i]))
    }
    print(K)

RS is a vector containing the 18028 refseq ids. The K value is only there so that I know where it breaks... because that's what happens: after a while, it breaks with an error message:

    Error in postForm(paste(mart@host, "?", sep = ""), query = xmlQuery) :
      couldn't connect to host

This doesn't happen if I send the whole query in ONE go, as a vector... but if I do it element by element it breaks after 3-4000 queries.

Any ideas to do this in a simpler/better way? Or at least one that doesn't have me coming back to restart the loop at the position of the last break?

thanks!

Jose

--
Dr. Jose I. de las Heras
The Wellcome Trust Centre for Cell Biology
Institute for Cell & Molecular Biology
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR, UK
Email: J.delasHeras at ed.ac.uk
Phone: +44 (0)131 6513374
Fax: +44 (0)131 6507360

GO biomaRt • 801 views

updated 17.7 years ago by James W. MacDonald 65k • written 17.7 years ago by J.delasHeras@ed.ac.uk ★ 1.9k

score 0 · Answer 1 · 2006-09-13

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 3 days ago

United States

Using the RCurl interface for a big query like that isn't ideal. You would be better off installing RMySQL and using the MySQL interface (note: you can get RMySQL using biocLite(), thanks to the fine folks in Seattle). Also, you can have getBM() put things in a list, so any duplicated gene symbols will be grouped together.

    A <- getBM("hgnc_symbol", "refseq_dna", RS, mart = mart,
               output = "list", mysql = TRUE)

Should do the trick.

HTH,

Jim

--
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623
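The grouping that the list output is meant to provide can also be sketched in base R when the result is a plain two-column data frame. This is only an illustration under stated assumptions: the data frame `res` below is a made-up stand-in for a getBM() result, and the IDs and symbols in it are invented, not real annotations.

```r
# Hypothetical stand-in for a query result with one row per
# (RefSeq ID, symbol) pair; IDs and symbols are made up.
res <- data.frame(
  refseq_dna  = c("NM_000001", "NM_000002", "NM_000002"),
  hgnc_symbol = c("GENEA", "GENEB", "GENEC"),
  stringsAsFactors = FALSE
)

# The IDs that were queried; the last one has no hit in `res`.
RS <- c("NM_000001", "NM_000002", "NM_000003")

# Group the symbols by RefSeq ID. Using RS as the factor levels keeps
# the original order and gives IDs without hits an empty entry.
A <- split(res$hgnc_symbol, factor(res$refseq_dna, levels = RS))

str(A)
```

Each element of `A` is named after a RefSeq ID, so `A[["NM_000002"]]` returns every symbol found for that ID.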

17.7 years ago by James W. MacDonald 65k

Hi,

I would like to add that biomaRt in RCurl mode can handle big queries but will break when you use it in a big loop. An alternative to what Jim suggests could be to do the query for all ids at once:

    A <- getBM(attributes = c("hgnc_symbol", "refseq_dna"), mart = mart,
               filters = "refseq_dna", values = RS)

By adding refseq_dna as an attribute, HUGO symbols and RefSeq identifiers will be automatically matched up in A. If needed, you can loop over the result in A, and you avoid doing 18000+ separate database queries, so it will be faster.

best,
Steffen

--
Steffen Durinck, Ph.D.
Oncogenomics Section
Pediatric Oncology Branch
National Cancer Institute, National Institutes of Health
URL: http://home.ccr.cancer.gov/oncology/oncogenomics/
Phone: 301-402-8103
Address: Advanced Technology Center, 8717 Grovemont Circle, Gaithersburg, MD 20877
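If a single query for all ids is not practical, a middle ground between one big query and 18000+ single ones is to send the ids in chunks and retry a chunk when the connection fails. The following is a minimal sketch, not part of biomaRt: `query_in_chunks()` and its arguments are hypothetical names, and `query_fun` stands for whatever function performs the real query (for example a small wrapper around getBM() that requests both refseq_dna and hgnc_symbol). A stub that fails once is used here so the example runs without a network connection.

```r
# Run `query_fun` over `ids` in chunks, retrying a chunk when it errors.
# `query_fun` is a placeholder: it takes a vector of ids and returns a
# data frame, e.g. a wrapper around getBM() in real use.
query_in_chunks <- function(ids, query_fun, chunk_size = 500,
                            max_tries = 3, wait = 0) {
  chunks <- split(ids, ceiling(seq_along(ids) / chunk_size))
  results <- vector("list", length(chunks))
  for (i in seq_along(chunks)) {
    for (attempt in seq_len(max_tries)) {
      out <- tryCatch(query_fun(chunks[[i]]), error = function(e) e)
      if (!inherits(out, "error")) break
      if (attempt == max_tries) {
        stop("chunk ", i, " failed: ", conditionMessage(out))
      }
      Sys.sleep(wait)  # pause before retrying; use a few seconds in real use
    }
    results[[i]] <- out
  }
  do.call(rbind, results)
}

# Stub query that fails on its first call, to exercise the retry logic.
calls <- 0
flaky <- function(x) {
  calls <<- calls + 1
  if (calls == 1) stop("couldn't connect to host")
  data.frame(refseq_dna = x, hgnc_symbol = paste0("SYM_", x),
             stringsAsFactors = FALSE)
}

ids <- paste0("NM_", 1:7)
res <- query_in_chunks(ids, flaky, chunk_size = 3)
```

Because each chunk's result carries the refseq_dna column, the pieces can be bound together with rbind() without losing track of which symbol belongs to which id.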

17.7 years ago by Steffen Durinck ▴ 580

Great suggestions! Thanks!

Adding refseq as an attribute too, why didn't I think of that? :-)

Jose

17.7 years ago by J.delasHeras@ed.ac.uk ★ 1.9k

Ah, so simple... :-)

Thanks a lot Jim, I totally overlooked the different output styles.

As for the MySQL interface... you're probably right. We have *a* bioinformatician here and he was trying to convince me not long ago that I should take a look at the wonders of working with MySQL...

Jose

17.7 years ago by J.delasHeras@ed.ac.uk ★ 1.9k