GEOquery and parsing SOFT files


Wacek Kusnierczyk ▴ 180

@wacek-kusnierczyk-88

Last seen 9.6 years ago

Hello,

The getGEO function from GEOquery parses GEO SOFT files. With a particular GSE file (GSE13638), it took over 15 minutes on my not-so-crappy machine to parse the file (a local file, download time excluded). I've written a simple parser in Perl, and parsing the same file and storing the data in a nested hash/array structure takes ca. 2 seconds. I'm pretty sure there is more essential processing done by getGEO to organize the data into a GSE object, but still, there seems to be an incredibly inefficient implementation underneath.

I haven't looked at the source code yet, but here's a question: what is the likely reason getGEO is so slow? Is it the parsing itself, or rather wrapping the data into the appropriate structure? Where should I start to look for code to be improved?

vQ
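For readers unfamiliar with the format, here is a minimal R sketch of the kind of single-pass SOFT parsing described above. It is neither the Perl parser from the post nor GEOquery's implementation. parse_soft_lines is a hypothetical helper and the sample lines are made up. The sketch assumes the usual SOFT conventions: entity lines start with "^", attribute lines start with "!", and data tables sit between "..._table_begin" and "..._table_end" marker lines.

```r
# Hypothetical helper (not part of GEOquery): turn the lines of a SOFT file
# into a nested list with one element per entity.
parse_soft_lines <- function(lines) {
  starts <- grep("^\\^", lines)                   # entity headers, e.g. "^SAMPLE = GSM1"
  ends <- c(starts[-1] - 1L, length(lines))       # each entity runs up to the next header
  entities <- vector("list", length(starts))      # preallocate instead of growing
  for (i in seq_along(starts)) {
    block <- lines[starts[i]:ends[i]]
    header <- regmatches(block[1],
                         regexec("^\\^([^ ]+) = (.*)$", block[1]))[[1]]

    # Optional data table between the begin/end marker lines.
    tb <- grep("_table_begin$", block)
    te <- grep("_table_end$", block)
    tab <- NULL
    if (length(tb) == 1 && length(te) == 1 && te > tb + 1) {
      tab <- read.delim(text = block[(tb + 1):(te - 1)],
                        stringsAsFactors = FALSE)
    }

    # Attribute lines look like "!Key = value"; repeated keys are collected.
    attr_lines <- grep("^![^=]+ = ", block, value = TRUE)
    keys <- sub("^!([^=]+?) = .*$", "\\1", attr_lines, perl = TRUE)
    values <- sub("^![^=]+? = ", "", attr_lines, perl = TRUE)

    entities[[i]] <- list(type = header[2], id = header[3],
                          attributes = split(values, keys), table = tab)
  }
  names(entities) <- vapply(entities, function(e) e$id, character(1))
  entities
}

# Made-up sample input, only to exercise the function.
soft <- c(
  "^SAMPLE = GSM1",
  "!Sample_title = liver, replicate 1",
  "!Sample_platform_id = GPL96",
  "#ID_REF = probe identifier",
  "!sample_table_begin",
  "ID_REF\tVALUE",
  "1007_s_at\t10.5",
  "1053_at\t7.25",
  "!sample_table_end",
  "^SAMPLE = GSM2",
  "!Sample_title = liver, replicate 2"
)
parsed <- parse_soft_lines(soft)
str(parsed$GSM1$attributes)
```

For a real file, the lines could be read with readLines(gzfile(filename)) first. Timing this against getGEO on the same file would only be a rough comparison, since getGEO also builds GSE, GSM and GPL objects.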

GEOquery GEOquery • 1.9k views

updated 14.9 years ago by Wolfgang Huber ★ 13k • written 14.9 years ago by Wacek Kusnierczyk ▴ 180


Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 10 days ago

EMBL European Molecular Biology Laborat…

Dear Wacek,

thank you for the feedback and for pointing this out. Two general remarks:

1. Please include a reproducible example (R script) so that others can reproduce your experience, followed by the output of sessionInfo().

2. Robert Gentleman's book "R Programming for Bioinformatics" (as well as many free sources on the web) describes how to profile R code in order to see in which functions the CPU time is spent. Based on this, you can investigate where to invest developer time for improving the code.

Best wishes
Wolfgang

------------------------------------------------
Wolfgang Huber, EMBL, http://www.ebi.ac.uk/huber
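The profiling advice in point 2 can be followed with R's built-in sampling profiler, Rprof, together with summaryRprof (both in the utils package). The sketch below wraps them in a hypothetical helper, profile_call, and runs it on a deliberately slow toy function. The toy function is only there to make the example self-contained; it is not a claim about what getGEO does internally.

```r
# Hypothetical helper: run fun(...) under the sampling profiler and return
# its result together with the per-function self-time table.
profile_call <- function(fun, ..., interval = 0.01) {
  prof_file <- tempfile(fileext = ".prof")
  on.exit({
    Rprof(NULL)          # make sure profiling is off even if fun() fails
    unlink(prof_file)
  })
  Rprof(prof_file, interval = interval)
  result <- fun(...)
  Rprof(NULL)
  list(result = result, by_self = summaryRprof(prof_file)$by.self)
}

# Toy workload (an assumption for illustration): grow a data frame row by row.
grow_rows <- function(n) {
  out <- data.frame()
  for (i in seq_len(n)) {
    out <- rbind(out, data.frame(id = i, value = i^2))
  }
  out
}

prof <- profile_call(grow_rows, 3000)
head(prof$by_self)       # functions ordered by time spent in their own code

# With GEOquery installed and a local SOFT file, the same helper could be
# applied to the slow call from this thread:
# prof <- profile_call(GEOquery::getGEO, filename = filename)
```

The by.self table ranks functions by the time spent in their own code, which is usually the quickest way to see whether the cost sits in reading lines, in string handling, or in building the result objects.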

written 14.9 years ago by Wolfgang Huber ★ 13k


Wolfgang Huber wrote:
> Dear Wacek,
>
> thank you for the feedback and pointing this out. Two general remarks:

thank you for the response.

> 1. Please include a reproducible example (R script) for others to
> reproduce your experience, and subsequently the output of sessionInfo().

a reproducible (on my machine, at least) example is as follows:

    library(GEOquery)
    repeat {
        filename = getGEOfile('GSE13638')
        if (!is.null(filename)) break
    }
    system.time({gse13638 = getGEO(filename=filename)})
    # elapsed: 620

(so it was 'faster' this time, only 10 minutes...)

session info is as follows:

    R version 2.10.0 Under development (unstable) (2009-05-25 r48607)
    i686-pc-linux-gnu

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] GEOquery_2.9.0 RCurl_0.94-1 Biobase_2.5.2

> 2. Robert Gentleman's book "R Programming for Bioinformatics" (as well
> as many free sources on the web) describes how to profile R code in
> order to see in which functions the CPU time is spent. Based on this,
> you can investigate where to invest developer time for improving the
> code.

yes, i know; but i thought someone could already have done this or have a hint anyway, as this does not seem likely to be just my local problem. before i spend too much time on a potentially well-known issue, it's surely fine to ask others, including the developer.

best regards,
vQ

written 14.9 years ago by Wacek Kusnierczyk ▴ 180


On Mon, May 25, 2009 at 5:00 PM, Wacek Kusnierczyk <Waclaw.Marcin.Kusnierczyk@idi.ntnu.no> wrote:
> yes, i know; but i thought someone could already have done this or have
> a hint anyway, as this does not seem likely to be just my local
> problem. before i spend too much time on a potentially well-known
> issue, it's surely fine to ask others, including the developer.

Thanks, Wacek, for the report. This is a known issue and GEOquery could probably be smarter than it currently is; I have not changed that code in a while since GEO began using GSE Matrix files. You could try using GSEMatrix=TRUE, as the documentation suggests. You will find that this deals with your speed issues.

Sean
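A minimal sketch of the GSEMatrix route suggested above, assuming GEOquery and Biobase are installed and GEO is reachable. With GSEMatrix = TRUE, getGEO fetches the series matrix file(s) instead of the full SOFT file and returns a list of ExpressionSet objects. fetch_expression is a hypothetical wrapper name used only for this illustration.

```r
# Hypothetical wrapper: download a series via its GSE Matrix file(s) and
# return the expression matrices, one per ExpressionSet in the result.
fetch_expression <- function(accession) {
  if (!requireNamespace("GEOquery", quietly = TRUE)) {
    stop("the GEOquery package is required for this sketch")
  }
  esets <- GEOquery::getGEO(accession, GSEMatrix = TRUE)
  lapply(esets, Biobase::exprs)
}

# Not run here, since it needs network access:
# expr_list <- fetch_expression("GSE13638")
# lapply(expr_list, dim)
```

This route gives the expression values and sample annotation carried by the matrix file. Anything available only in the full SOFT file would still require the slower parse.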

written 14.9 years ago by Sean Davis 21k


Sean Davis wrote:
[...]
> Thanks, Wacek, for the report. This is a known issue and GEOquery could
> probably be smarter than it currently is; I have not changed that code in a
> while since GEO began using GSE Matrix files. You could try using
> GSEMatrix=TRUE, as the documentation suggests. You will find that will deal
> with your speed issues.

thanks. if parsing soft files is no longer essential, it makes little sense for me to delve into the implementation details; otherwise, i could try to find a spot for improvement. what's your opinion?

vQ

written 14.9 years ago by Wacek Kusnierczyk ▴ 180


On Tue, May 26, 2009 at 3:21 AM, Wacek Kusnierczyk <Waclaw.Marcin.Kusnierczyk@idi.ntnu.no> wrote:
> thanks. if parsing soft files is no longer essential, it makes little
> sense for me to delve into the implementation details; otherwise, i
> could try to find a spot for improvement. what's your opinion?

If all you need are the "expression" values from GSE entities, then the GSE Matrix will be faster, no matter the implementation, because it is smaller. But I really do not know what your needs are, so I cannot answer your question directly.

Sean

written 14.9 years ago by Sean Davis 21k


Sean Davis wrote:
> If all you need are the "expression" values from GSE entities, then the GSE
> Matrix will be faster, no matter the implementation, because it is smaller.
> But I really do not know what your needs are, so I cannot answer your
> question directly.

i didn't mean "essential for my task", but rather "useful given that gse matrix files are available". that is, if there are good reasons to download and parse soft files, the package may need a fix -- irrespective of what my own study is.

vQ

written 14.9 years ago by Wacek Kusnierczyk ▴ 180
