slow query


Guest User ★ 13k

@guest-user-4897

Last seen 9.6 years ago

The query to org.Hs.eg.db is very slow. I submitted the following query:

    cols=cols(org.Hs.eg.db)
    gns="BRCA1"
    BRCA1.info=select(org.Hs.eg.db, cols=cols, keys=gns, keytype="SYMBOL")

It takes forever to return a result. Does anyone know why? Please help me. Thank you.

-- output of sessionInfo():

(no result yet)

--
Sent via the guest posting facility at bioconductor.org.

updated 11.1 years ago by Marc Carlson ★ 7.2k • written 11.1 years ago by Guest User ★ 13k


Marc Carlson ★ 7.2k

@marc-carlson-2264

Last seen 7.7 years ago

United States

Sorry about the slew of responses. It was only supposed to be one response, but for some reason, my client sent a reply every time I hit save. I hope my explanation helps you.

Marc

On 03/22/2013 01:49 PM, Sean Wang [guest] wrote:
> The query to org.Hs.eg.db is very slow.
>
> I submit the following query,
>
> cols=cols(org.Hs.eg.db)
> gns="BRCA1"
> BRCA1.info=select(org.Hs.eg.db, cols=cols, keys=gns, keytype="SYMBOL")
>
> It takes forever to wait for the result.
>
> Anyone knows why and please help me.
>
> Thank you.
>
> -- output of sessionInfo():
>
> (no result yet)

11.1 years ago • Marc Carlson ★ 7.2k


Marc Carlson ★ 7.2k

@marc-carlson-2264

Last seen 7.7 years ago

United States

Hi Sean,

It's because you just asked for everything associated with that one gene, multiplied by everything else. Many of the things that are going to be associated with BRCA1 (such as PubMed IDs) have a many-to-one relationship with the initial key. That means that when you add several of these kinds of cols to your query (and you asked for ALL of them), the number of rows returned is multiplied out by all the many-to-one relationships.

For example, suppose you had only asked for PubMed IDs (PMID) and ENSEMBL IDs. And let's also suppose that there are only 4 PubMed IDs associated with your gene, and 2 ENSEMBL IDs. How many rows would that be? In that case the result should be 4 x 2 = 8 rows long. Now what happens if you also ask for something like UNIPROT (and let's assume there are 5 of those)? Now your result is suddenly 8 x 5 = FORTY rows long.

See the problem? Because the answer is returned as a data.frame, and because there are multiple many-to-one relationships, you can end up generating a really huge result. A single gene can end up amounting to millions of rows. That is just how the math works out when you store data that has complicated relationships in simple flat data.frame objects. Getting around the problem of all this wasted row-space is part of why relational databases were invented in the first place, and here you are calling select, which will attempt to flatten such information for you (because it is easier for humans to look at it that way). But as you can see, there are good reasons why we don't actually store it that way in the background.

So if you feel like it's taking too long, I would recommend being a little more selective about what you ask for. You can probably get the same data with a couple of separate requests, wait a lot less, and end up with much more manageable data.frames.
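To make the arithmetic above concrete, here is a minimal sketch in R. The first part uses only base R and invented placeholder IDs (4 PMIDs, 2 ENSEMBL IDs and 5 UNIPROT IDs, the hypothetical counts from the example, not the real numbers for BRCA1). The second part only runs if org.Hs.eg.db is installed, and it assumes a current AnnotationDbi, where the argument is called columns rather than the cols used in the original 2013 query.

```r
# Part 1: base-R illustration of the row multiplication.
# The IDs below are invented placeholders, not real annotations.
pmid    <- data.frame(SYMBOL = "BRCA1", PMID    = paste0("pmid_", 1:4))
ensembl <- data.frame(SYMBOL = "BRCA1", ENSEMBL = paste0("ens_", 1:2))
uniprot <- data.frame(SYMBOL = "BRCA1", UNIPROT = paste0("uni_", 1:5))

# Flattening into one data.frame pairs every value with every other value.
two_cols   <- merge(pmid, ensembl, by = "SYMBOL")
three_cols <- merge(two_cols, uniprot, by = "SYMBOL")
nrow(two_cols)    # 4 * 2 = 8
nrow(three_cols)  # 4 * 2 * 5 = 40

# Kept as separate tables, the same information needs 4 + 2 + 5 rows.
separate_rows <- nrow(pmid) + nrow(ensembl) + nrow(uniprot)

# Part 2: the same idea against the real annotation package.
# Skipped when org.Hs.eg.db is not installed.
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  library(org.Hs.eg.db)
  gns <- "BRCA1"
  # One many-valued column per request keeps each result small.
  brca1_pmid <- AnnotationDbi::select(org.Hs.eg.db, keys = gns,
                                      columns = "PMID", keytype = "SYMBOL")
  brca1_ens  <- AnnotationDbi::select(org.Hs.eg.db, keys = gns,
                                      columns = "ENSEMBL", keytype = "SYMBOL")
}
```

Each separate result grows only with the number of values in its own column, so the totals add up instead of multiplying.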
Marc

11.1 years ago • Marc Carlson ★ 7.2k