GEOmetadb and gse organism type

0

Entering edit mode

Wacek Kusnierczyk ▴ 180

@wacek-kusnierczyk-88

Last seen 11.4 years ago

hello Jack, Sean, and others, i have an observation and a question related to GEOmetadb. the queries below are made directly in sqlite, but could have easily be done within R, for example. the gds (data sets) table contains two fields related to organisms: sqlite GEOmetadb.sqlite 'pragma table_info(gds)' | grep organism # 7|platform_organism|TEXT|0||0 # 10|sample_organism|TEXT|0||0 the gse (data series) table does not contain any such field: sqlite GEOmetadb.sqlite 'pragma table_info(gse)' | grep organism the gpl (platform) table contains one such field: sqlite GEOmetadb.sqlite 'pragma table_info(gpl)' | grep organism # 8|organism|TEXT|0||0 (and likewise for the gsm (sample) table). i would like to retrieve all data *series* (not data sets) where gene expression in a specific organism was investigated. in the vignette to the GEOmetadb Bioconductor package (as well as the GEOtools Matlab one), you give the following example: "we would like to find all the human breast cancer-related Affymetrix gene expression GEO series." and then you use the gse_gpl table to retrieve the *platform* organism -- which doesn't seem quite right, because you can hybridize non-human samples to human arrays, as well as human samples to non-human arrays: sqlite GEOmetadb.sqlite ' select gds, sample_organism as so, platform_organism as po from gds where (so like "%homo%" and po not like "%homo%") or (so not like "%homo%" and po like "%homo%")' (and likewise for data series, but the query would be a little bit more involved.) the comment: you may want to update the documentation, unless i misunderstood your intentions. the question: is there no direct way to check the sample organism for data series entries in GEO? it would be very useful to filter gse by sample organism, without having to join the table with gse_gsm and gsm. you already have it for gds, it would be pretty simple to add an analogous field to gse. (and likewise for platform organism). surely, a gse may include more than one organism in the platforms or samples, but that's what we have in gds already: sqlite GEOmetadb.sqlite ' select platform_organism as po, sample_organism as so from gds where po like "%,%" or so like "%,%"' so i would like to be able to sqlite GEOmetadb.sqlite ' select gds, sample_organism as so, platform_organism as po from gds where so like "%homo%" and po like "%homo%"' rather than sqlite GEOmetadb.sqlite ' select distinct gse.gse, gsm. from gse join gse_gsm on gse.gse = gse_gsm.gse join gsm on gsm.gsm = gse_gsm.gsm join gse_gpl on gse.gse = gse_gpl.gse join gpl on gpl.gpl = gse_gpl.gpl where (gsm.organism_ch1 like "%homo%" and gsm.organism_ch2 like "%homo%") and gpl.organism like "%homo%"' after all, gds are backed by gse, so if gds entries can have sample_organism and platform_organism, why could not gse have them too? regards, vQ -- ---------------------------------------------------------------------- --------- Wacek Kusnierczyk, MD PhD Email: waku at idi.ntnu.no Phone: +47 73591875, +47 72574609 Department of Computer and Information Science (IDI) Faculty of Information Technology, Mathematics and Electrical Engineering (IME) Norwegian University of Science and Technology (NTNU) Sem Saelands vei 7, 7491 Trondheim, Norway Room itv303 Bioinformatics & Gene Regulation Group Department of Cancer Research and Molecular Medicine (IKM) Faculty of Medicine (DMF) Norwegian University of Science and Technology (NTNU) Laboratory Center, Erling Skjalgsons gt. 1, 7030 Trondheim, Norway Room 231.05.060

Cancer Breast Organism GEOmetadb Cancer Breast Organism GEOmetadb • 4.2k views

ADD COMMENT • link updated 16.8 years ago by Sean Davis 21k • written 16.8 years ago by Wacek Kusnierczyk ▴ 180

0

Entering edit mode

Sean Davis 21k

@sean-davis-490

Last seen 6 days ago

United States

On Thu, Mar 19, 2009 at 11:08 AM, Wacek Kusnierczyk < Waclaw.Marcin.Kusnierczyk@idi.ntnu.no> wrote: > hello Jack, Sean, and others, > > i have an observation and a question related to GEOmetadb. the queries > below are made directly in sqlite, but could have easily be done within > R, for example. > > the gds (data sets) table contains two fields related to organisms: > > sqlite GEOmetadb.sqlite 'pragma table_info(gds)' | grep organism > # 7|platform_organism|TEXT|0||0 > # 10|sample_organism|TEXT|0||0 > > the gse (data series) table does not contain any such field: > > sqlite GEOmetadb.sqlite 'pragma table_info(gse)' | grep organism > > the gpl (platform) table contains one such field: > > sqlite GEOmetadb.sqlite 'pragma table_info(gpl)' | grep organism > # 8|organism|TEXT|0||0 > > (and likewise for the gsm (sample) table). i would like to retrieve all > data *series* (not data sets) where gene expression in a specific > organism was investigated. in the vignette to the GEOmetadb > Bioconductor package (as well as the GEOtools Matlab one), you give the > following example: > > "we would like to find all the human breast cancer-related > Affymetrix gene expression GEO series." > > and then you use the gse_gpl table to retrieve the *platform* organism > -- which doesn't seem quite right, because you can hybridize non- human > samples to human arrays, as well as human samples to non-human arrays: > > sqlite GEOmetadb.sqlite ' > select gds, sample_organism as so, platform_organism as po > from gds > where (so like "%homo%" and po not like "%homo%") > or (so not like "%homo%" and po like "%homo%")' > > (and likewise for data series, but the query would be a little bit more > involved.) > > the comment: you may want to update the documentation, unless i > misunderstood your intentions. > Good point. Technically, you can hyb samples from one organism to a platform from another organism. We should add a sentence to the vignette to that effect. Thanks. > > the question: is there no direct way to check the sample organism for > data series entries in GEO? it would be very useful to filter gse by > sample organism, without having to join the table with gse_gsm and gsm. > you already have it for gds, it would be pretty simple to add an > analogous field to gse. (and likewise for platform organism). surely, > a gse may include more than one organism in the platforms or samples, > but that's what we have in gds already: > > sqlite GEOmetadb.sqlite ' > select platform_organism as po, sample_organism as so > from gds > where po like "%,%" or so like "%,%"' > > so i would like to be able to > > sqlite GEOmetadb.sqlite ' > select gds, sample_organism as so, platform_organism as po from gds > where so like "%homo%" > and po like "%homo%"' > > rather than > > sqlite GEOmetadb.sqlite ' > select distinct gse.gse, gsm. > from gse > join gse_gsm on gse.gse = gse_gsm.gse > join gsm on gsm.gsm = gse_gsm.gsm > join gse_gpl on gse.gse = gse_gpl.gse > join gpl on gpl.gpl = gse_gpl.gpl > where (gsm.organism_ch1 like "%homo%" and gsm.organism_ch2 > like "%homo%") > and gpl.organism like "%homo%"' > > after all, gds are backed by gse, so if gds entries can have > sample_organism and platform_organism, why could not gse have them too? > A GSE can have samples from multiple platforms and can, therefore, represent multiple organisms. That is why doing the join to platform (if you are willing to trust that the platform was used for the specified organism) or a join to samples is necessary. A GDS, on the other hand, does not have that characteristic and represents a set of samples from a single platform. Sean [[alternative HTML version deleted]]

ADD COMMENT • link 16.8 years ago Sean Davis 21k

0

Entering edit mode

Sean Davis wrote: > On Thu, Mar 19, 2009 at 11:08 AM, Wacek Kusnierczyk < > Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote: > > >> the question: is there no direct way to check the sample organism for >> data series entries in GEO? it would be very useful to filter gse by >> sample organism, without having to join the table with gse_gsm and gsm. >> you already have it for gds, it would be pretty simple to add an >> analogous field to gse. (and likewise for platform organism). surely, >> a gse may include more than one organism in the platforms or samples, >> but that's what we have in gds already: >> >> sqlite GEOmetadb.sqlite ' >> select platform_organism as po, sample_organism as so >> from gds >> where po like "%,%" or so like "%,%"' >> >> so i would like to be able to >> >> sqlite GEOmetadb.sqlite ' >> select gds, sample_organism as so, platform_organism as po from gds >> where so like "%homo%" >> and po like "%homo%"' >> >> rather than >> >> sqlite GEOmetadb.sqlite ' >> select distinct gse.gse, gsm. >> from gse >> join gse_gsm on gse.gse = gse_gsm.gse >> join gsm on gsm.gsm = gse_gsm.gsm >> join gse_gpl on gse.gse = gse_gpl.gse >> join gpl on gpl.gpl = gse_gpl.gpl >> where (gsm.organism_ch1 like "%homo%" and gsm.organism_ch2 >> like "%homo%") >> and gpl.organism like "%homo%"' >> >> after all, gds are backed by gse, so if gds entries can have >> sample_organism and platform_organism, why could not gse have them too? >> >> > > A GSE can have samples from multiple platforms and can, therefore, represent > multiple organisms. That is why doing the join to platform (if you are > willing to trust that the platform was used for the specified organism) or a > join to samples is necessary. A GDS, on the other hand, does not have that > characteristic and represents a set of samples from a single platform. > from a single platform, but then these: select gds, platform_organism from gds where platform_organism like "%,%"; # 7 data sets indicate mixed-organism platforms, and it is organisms i was originally interested in. and, likewise, select gds, sample_organism from gds where sample_organism like "%,%"; # 8 data sets some data sets seem to contain data from more than one organism. if a data set can have multiple organisms in the platform_organism or sample_organism fields, what is the point in data series not having organism fields at all? just like gds2302 has several organisms listed in the sample_organism field, and gds2349 has several organisms listed in the platform_organism field, so could data series have. it wouldn't have to mean that all samples and all platforms involved in a data series include material from all of the species listed in the respective fields of the data series record. such an update to the database is a matter of a fairly simple statement, as far as i can see. best, vQ

ADD REPLY • link 16.8 years ago Wacek Kusnierczyk ▴ 180

0

Entering edit mode

On Thu, Mar 19, 2009 at 4:59 PM, Wacek Kusnierczyk < Waclaw.Marcin.Kusnierczyk@idi.ntnu.no> wrote: > Sean Davis wrote: > > On Thu, Mar 19, 2009 at 11:08 AM, Wacek Kusnierczyk < > > Waclaw.Marcin.Kusnierczyk@idi.ntnu.no> wrote: > > > > > >> the question: is there no direct way to check the sample organism for > >> data series entries in GEO? it would be very useful to filter gse by > >> sample organism, without having to join the table with gse_gsm and gsm. > >> you already have it for gds, it would be pretty simple to add an > >> analogous field to gse. (and likewise for platform organism). surely, > >> a gse may include more than one organism in the platforms or samples, > >> but that's what we have in gds already: > >> > >> sqlite GEOmetadb.sqlite ' > >> select platform_organism as po, sample_organism as so > >> from gds > >> where po like "%,%" or so like "%,%"' > >> > >> so i would like to be able to > >> > >> sqlite GEOmetadb.sqlite ' > >> select gds, sample_organism as so, platform_organism as po from > gds > >> where so like "%homo%" > >> and po like "%homo%"' > >> > >> rather than > >> > >> sqlite GEOmetadb.sqlite ' > >> select distinct gse.gse, gsm. > >> from gse > >> join gse_gsm on gse.gse = gse_gsm.gse > >> join gsm on gsm.gsm = gse_gsm.gsm > >> join gse_gpl on gse.gse = gse_gpl.gse > >> join gpl on gpl.gpl = gse_gpl.gpl > >> where (gsm.organism_ch1 like "%homo%" and gsm.organism_ch2 > >> like "%homo%") > >> and gpl.organism like "%homo%"' > >> > >> after all, gds are backed by gse, so if gds entries can have > >> sample_organism and platform_organism, why could not gse have them too? > >> > >> > > > > A GSE can have samples from multiple platforms and can, therefore, > represent > > multiple organisms. That is why doing the join to platform (if you are > > willing to trust that the platform was used for the specified organism) > or a > > join to samples is necessary. A GDS, on the other hand, does not have > that > > characteristic and represents a set of samples from a single platform. > > > > from a single platform, but then these: > > select gds, platform_organism from gds where platform_organism like > "%,%"; > # 7 data sets > > indicate mixed-organism platforms, and it is organisms i was originally > interested in. > > and, likewise, > > select gds, sample_organism from gds where sample_organism like "%,%"; > # 8 data sets > > some data sets seem to contain data from more than one organism. > You have a point here, yes. GEO includes this information as a comma-separated list and we parse that. Our goal is to accurately reproduce GEO metadata in SQL. > > if a data set can have multiple organisms in the platform_organism or > sample_organism fields, what is the point in data series not having > organism fields at all? just like gds2302 has several organisms listed > in the sample_organism field, and gds2349 has several organisms listed > in the platform_organism field, so could data series have. it wouldn't > have to mean that all samples and all platforms involved in a data > series include material from all of the species listed in the respective > fields of the data series record. > > such an update to the database is a matter of a fairly simple statement, > as far as i can see. > There are a couple of reasons for not doing so: 1) GEO does not provide such a field, so we do not either. As for GDS and sample_organism and platform_organism, we have parsed those directly from GEO. 2) Doing so violates a rule of database design (it denormalizes the database). Of course, the power of SQL is that you can manipulate the data in your local instance as you like, so you do not have to stick to our ideas of what the best database design is. Sean [[alternative HTML version deleted]]

ADD REPLY • link 16.8 years ago Sean Davis 21k

0

Entering edit mode

On Thu, Mar 19, 2009 at 5:30 PM, Sean Davis <sdavis2@mail.nih.gov> wrote: > > > On Thu, Mar 19, 2009 at 4:59 PM, Wacek Kusnierczyk < > Waclaw.Marcin.Kusnierczyk@idi.ntnu.no> wrote: > >> Sean Davis wrote: >> > On Thu, Mar 19, 2009 at 11:08 AM, Wacek Kusnierczyk < >> > Waclaw.Marcin.Kusnierczyk@idi.ntnu.no> wrote: >> > >> > >> >> the question: is there no direct way to check the sample organism for >> >> data series entries in GEO? it would be very useful to filter gse by >> >> sample organism, without having to join the table with gse_gsm and gsm. >> >> you already have it for gds, it would be pretty simple to add an >> >> analogous field to gse. (and likewise for platform organism). surely, >> >> a gse may include more than one organism in the platforms or samples, >> >> but that's what we have in gds already: >> >> >> >> sqlite GEOmetadb.sqlite ' >> >> select platform_organism as po, sample_organism as so >> >> from gds >> >> where po like "%,%" or so like "%,%"' >> >> >> >> so i would like to be able to >> >> >> >> sqlite GEOmetadb.sqlite ' >> >> select gds, sample_organism as so, platform_organism as po from >> gds >> >> where so like "%homo%" >> >> and po like "%homo%"' >> >> >> >> rather than >> >> >> >> sqlite GEOmetadb.sqlite ' >> >> select distinct gse.gse, gsm. >> >> from gse >> >> join gse_gsm on gse.gse = gse_gsm.gse >> >> join gsm on gsm.gsm = gse_gsm.gsm >> >> join gse_gpl on gse.gse = gse_gpl.gse >> >> join gpl on gpl.gpl = gse_gpl.gpl >> >> where (gsm.organism_ch1 like "%homo%" and gsm.organism_ch2 >> >> like "%homo%") >> >> and gpl.organism like "%homo%"' >> >> >> >> after all, gds are backed by gse, so if gds entries can have >> >> sample_organism and platform_organism, why could not gse have them too? >> >> >> >> >> > >> > A GSE can have samples from multiple platforms and can, therefore, >> represent >> > multiple organisms. That is why doing the join to platform (if you are >> > willing to trust that the platform was used for the specified organism) >> or a >> > join to samples is necessary. A GDS, on the other hand, does not have >> that >> > characteristic and represents a set of samples from a single platform. >> > >> >> from a single platform, but then these: >> >> select gds, platform_organism from gds where platform_organism like >> "%,%"; >> # 7 data sets >> >> indicate mixed-organism platforms, and it is organisms i was originally >> interested in. >> >> and, likewise, >> >> select gds, sample_organism from gds where sample_organism like "%,%"; >> # 8 data sets >> >> some data sets seem to contain data from more than one organism. >> > > You have a point here, yes. GEO includes this information as a > comma-separated list and we parse that. Our goal is to accurately reproduce > GEO metadata in SQL. > > >> >> if a data set can have multiple organisms in the platform_organism or >> sample_organism fields, what is the point in data series not having >> organism fields at all? just like gds2302 has several organisms listed >> in the sample_organism field, and gds2349 has several organisms listed >> in the platform_organism field, so could data series have. it wouldn't >> have to mean that all samples and all platforms involved in a data >> series include material from all of the species listed in the respective >> fields of the data series record. >> >> such an update to the database is a matter of a fairly simple statement, >> as far as i can see. >> > > There are a couple of reasons for not doing so: > > 1) GEO does not provide such a field, so we do not either. As for GDS and > sample_organism and platform_organism, we have parsed those directly from > GEO. > > 2) Doing so violates a rule of database design (it denormalizes the > database). > > Of course, the power of SQL is that you can manipulate the data in your > local instance as you like, so you do not have to stick to our ideas of what > the best database design is. > Just to follow up, something like this: select gse.gse,platform_organism,sample_organism from (select gse.gse,group_concat(gpl.organism,',') as platform_organism from gse join gse_gpl gg on gg.gse=gse.gse join gpl on gg.gpl=gpl.gpl group by gse.gse) as o1 join gse on o1.gse=gse.gse join (select gse.gse,group_concat(gsm.organism_ch1,',') || ',' || group_concat(gsm.organism_ch2,',') as sample_organism from gse join gse_gsm gs on gs.gse=gse.gse join gsm on gsm.gsm=gs.gsm group by gse.gse) as o2 on gse.gse=o2.gse; could be used to create a table or view similar to what you are asking for. Sean [[alternative HTML version deleted]]

ADD REPLY • link 16.8 years ago Sean Davis 21k

0

Entering edit mode

Sean Davis wrote: > On Thu, Mar 19, 2009 at 5:30 PM, Sean Davis <sdavis2 at="" mail.nih.gov=""> wrote: > > >> On Thu, Mar 19, 2009 at 4:59 PM, Wacek Kusnierczyk < >> Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote: >> >> >>> if a data set can have multiple organisms in the platform_organism or >>> sample_organism fields, what is the point in data series not having >>> organism fields at all? just like gds2302 has several organisms listed >>> in the sample_organism field, and gds2349 has several organisms listed >>> in the platform_organism field, so could data series have. it wouldn't >>> have to mean that all samples and all platforms involved in a data >>> series include material from all of the species listed in the respective >>> fields of the data series record. >>> >>> such an update to the database is a matter of a fairly simple statement, >>> as far as i can see. >>> >>> >> There are a couple of reasons for not doing so: >> >> 1) GEO does not provide such a field, so we do not either. As for GDS and >> sample_organism and platform_organism, we have parsed those directly from >> GEO. >> >> 2) Doing so violates a rule of database design (it denormalizes the >> database). >> >> Of course, the power of SQL is that you can manipulate the data in your >> local instance as you like, so you do not have to stick to our ideas of what >> the best database design is. >> >> > > Just to follow up, something like this: > > select gse.gse,platform_organism,sample_organism > from > (select gse.gse,group_concat(gpl.organism,',') as platform_organism > from gse > join gse_gpl gg on gg.gse=gse.gse > join gpl on gg.gpl=gpl.gpl > group by gse.gse) as o1 > join gse on o1.gse=gse.gse (select gse.gse,group_concat(gsm.organism_ch1,',') || ',' || > group_concat(gsm.organism_ch2,',') as sample_organism > from gse join gse_gsm gs on gs.gse=gse.gse > join gsm on gsm.gsm=gs.gsm group by gse.gse) as o2 on gse.gse=o2.gse; > > could be used to create a table or view similar to what you are asking for. > this will give redundant replicates for each series, which is not necessarily bad, but could be avoided with a few modifications. also, i'm not sure if you're aware that some series do not have corresponding samples: select count(*) from (select distinct gsm from gse_gsm); -- 288939 select count(*) from (select distinct gsm from gse_gsm natural join gsm); -- 268106 which looks like some 20000 entries in gse_gsm do not have a counterpart in gsm (are these typos in gse_gsm?) and likewise for platforms: select count(*) from (select distinct gpl from gpl); -- 5322 select count(*) from (select distinct gpl from gse_gpl natural join gpl); -- 4505 which looks like some 800 entries in gse_gpl do not have a counterpart in gpl. if anyone is interested, the following is a statement that will, as far as i can see, construct a table of the shape i need -- with each gse entry being a comination of entries from gse and three additional fields, cumulating the organisms from the paltforms and samples involved in the series: create table xgse as select gse.*, p.porg as platform_organism, s1.sorg1 as sample_organism_ch1, s2.sorg2 as sample_organism_ch2 from gse natural join ( select gse, group_concat(porg) as porg from ( select distinct gse.gse as gse, gpl.organism as porg from gse left join gse_gpl on gse.gse = gse_gpl.gse left join gpl on gse_gpl.gpl = gpl.gpl) group by gse) as p natural join ( select gse, group_concat(sorg) as sorg1 from ( select distinct gse.gse as gse, gsm.organism_ch1 as sorg from gse left join gse_gsm on gse.gse = gse_gsm.gse left join gsm on gse_gsm.gsm = gsm.gsm) group by gse) as s1 natural join ( select gse, group_concat(sorg) as sorg2 from ( select distinct gse.gse as gse, gsm.organism_ch2 as sorg from gse left join gse_gsm on gse.gse = gse_gsm.gse left join gsm on gse_gsm.gsm = gsm.gsm) group by gse) as s2; and now it's easy to query for series on a platform or with a sample from a specific organism, with no joins. vQ

ADD REPLY • link 16.8 years ago Wacek Kusnierczyk ▴ 180

0

Entering edit mode

Sean Davis wrote: > >> if a data set can have multiple organisms in the platform_organism or >> sample_organism fields, what is the point in data series not having >> organism fields at all? just like gds2302 has several organisms listed >> in the sample_organism field, and gds2349 has several organisms listed >> in the platform_organism field, so could data series have. it wouldn't >> have to mean that all samples and all platforms involved in a data >> series include material from all of the species listed in the respective >> fields of the data series record. >> >> such an update to the database is a matter of a fairly simple statement, >> as far as i can see. >> >> > > There are a couple of reasons for not doing so: > > 1) GEO does not provide such a field, so we do not either. As for GDS and > sample_organism and platform_organism, we have parsed those directly from > GEO. > > 2) Doing so violates a rule of database design (it denormalizes the > database). > > Of course, the power of SQL is that you can manipulate the data in your > local instance as you like, so you do not have to stick to our ideas of what > the best database design is. > i agree with that having multiple comma-separated organism entries for a single data set or data series denormalizes the database. but enforcing normalization is good only if it helps, not hinders using the data. i think that having such an organism summary for a data series is desirable for the purpose of filtering the data by organism, and while a complex sql query can achieve that, it's easier and more efficient to check a field directly in the gse table. i can of course do the preprocessing locally, but i think it could be useful for others to have that done in the master copy, too. take it as a hint, not as a critique -- i appreciate your work woth GEOmetadb. vQ

ADD REPLY • link 16.8 years ago Wacek Kusnierczyk ▴ 180

0

Entering edit mode

Wacek Kusnierczyk wrote: > Sean Davis wrote: >>> if a data set can have multiple organisms in the platform_organism or >>> sample_organism fields, what is the point in data series not having >>> organism fields at all? just like gds2302 has several organisms listed >>> in the sample_organism field, and gds2349 has several organisms listed >>> in the platform_organism field, so could data series have. it wouldn't >>> have to mean that all samples and all platforms involved in a data >>> series include material from all of the species listed in the respective >>> fields of the data series record. >>> >>> such an update to the database is a matter of a fairly simple statement, >>> as far as i can see. >>> >>> >> There are a couple of reasons for not doing so: >> >> 1) GEO does not provide such a field, so we do not either. As for GDS and >> sample_organism and platform_organism, we have parsed those directly from >> GEO. >> >> 2) Doing so violates a rule of database design (it denormalizes the >> database). >> >> Of course, the power of SQL is that you can manipulate the data in your >> local instance as you like, so you do not have to stick to our ideas of what >> the best database design is. >> > > i agree with that having multiple comma-separated organism entries for a > single data set or data series denormalizes the database. but enforcing > normalization is good only if it helps, not hinders using the data. i > think that having such an organism summary for a data series is > desirable for the purpose of filtering the data by organism, and while a > complex sql query can achieve that, it's easier and more efficient to > check a field directly in the gse table. I'd be on Sean's side regarding the choices made. The habit to represent one-to-many or many-to-many associations with comma-separated strings is an unfortunate one, I think. It will be far more efficient (in terms of usage of computing ressources) to filter data series with a query. When it comes to the ease with which one can can achieve it, a function performing the SQL query and returning the result can always be written and distributed for all to use. L. > > i can of course do the preprocessing locally, but i think it could be > useful for others to have that done in the master copy, too. > > take it as a hint, not as a critique -- i appreciate your work woth > GEOmetadb. > > vQ > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 16.8 years ago Laurent Gautier ★ 2.3k

0

Entering edit mode

Laurent Gautier wrote: > Wacek Kusnierczyk wrote: >> >> i agree with that having multiple comma-separated organism entries for a >> single data set or data series denormalizes the database. but enforcing >> normalization is good only if it helps, not hinders using the data. i >> think that having such an organism summary for a data series is >> desirable for the purpose of filtering the data by organism, and while a >> complex sql query can achieve that, it's easier and more efficient to >> check a field directly in the gse table. > > I'd be on Sean's side regarding the choices made. > > The habit to represent one-to-many or many-to-many associations with > comma-separated strings is an unfortunate one, I think. yes, but only as long as it is disturbing rather than helpful. > It will be far more efficient (in terms of usage of computing > ressources) to filter data series with a query. not sure what you mean by 'far more efficient'. where the data does not change too often, denormalizing preprocessing often leads to considerable speedup of subsequent queries -- it all depends on the shape of data. here, you have some 250000 rows in the gsm table, but only a few samples per data series, which means that an on-the-fly join may actually be slower than using cached result from a previous join, stored as a comma-separated multiitem field. in an earlier post, i provided an sql statement that creates an additional table, xgse (for 'extended gse'). the following are timings for retrieving all data series that involve at least one human sample: time sqlite GEOmetadb.sqlite ' select distinct gse from gse natural join gse_gsm natural join gsm where organism_ch1 like "%homo%" or organism_ch2 like "%homo%" ' > /dev/null # 3.353 s time sqlite GEOmetadb.sqlite ' select distinct gse from xgse where sample_organism_ch1 like "%homo%" or sample_organism_ch2 "%homo%" ' > /dev/null # 0.031 the 'far more efficient' query is actually 2 orders of magnitude slower. (or did you mean something different?) > When it comes to the ease with which one can can achieve it, a > function performing the SQL query and returning the result can always > be written and distributed for all to use. sure. another issue: as far as i can see, the web interface to geometadb allows one to filter data series by organism, though it's not clear to me whether this means platform organism, sample organism, both, or any. it might be good to clarify this. but the web interface won't display the organism in the output table (tell me i'm wrong) -- which means, a user can filter by organism, but cannot see the organism of data series retrieved with whatever criteria. it might be useful to extend the interface. vQ

ADD REPLY • link 16.8 years ago Wacek Kusnierczyk ▴ 180

0

Entering edit mode

Wacek Kusnierczyk wrote: > Laurent Gautier wrote: >> Wacek Kusnierczyk wrote: >>> i agree with that having multiple comma-separated organism entries for a >>> single data set or data series denormalizes the database. but enforcing >>> normalization is good only if it helps, not hinders using the data. i >>> think that having such an organism summary for a data series is >>> desirable for the purpose of filtering the data by organism, and while a >>> complex sql query can achieve that, it's easier and more efficient to >>> check a field directly in the gse table. >> I'd be on Sean's side regarding the choices made. >> >> The habit to represent one-to-many or many-to-many associations with >> comma-separated strings is an unfortunate one, I think. > > yes, but only as long as it is disturbing rather than helpful. > >> It will be far more efficient (in terms of usage of computing >> ressources) to filter data series with a query. > > not sure what you mean by 'far more efficient'. where the data does not > change too often, denormalizing preprocessing often leads to > considerable speedup of subsequent queries -- it all depends on the > shape of data. here, you have some 250000 rows in the gsm table, but > only a few samples per data series, which means that an on-the-fly join > may actually be slower than using cached result from a previous join, > stored as a comma-separated multiitem field. > > in an earlier post, i provided an sql statement that creates an > additional table, xgse (for 'extended gse'). the following are timings > for retrieving all data series that involve at least one human sample: > > time sqlite GEOmetadb.sqlite ' > select distinct gse > from gse > natural join gse_gsm > natural join gsm > where organism_ch1 like "%homo%" or organism_ch2 like "%homo%" > ' > /dev/null > # 3.353 s > > time sqlite GEOmetadb.sqlite ' > select distinct gse > from xgse > where sample_organism_ch1 like "%homo%" or sample_organism_ch2 > "%homo%" > ' > /dev/null > # 0.031 > > the 'far more efficient' query is actually 2 orders of magnitude > slower. (or did you mean something different?) No. The comment is general and about a commonly found attraction for fields in relational databases populated with comma-separated values. Note also that a frequent use-case is to query for an exact term, such as 'Homo sapiens'. When the field is indexed, this makes the searches faster. For a particular query, I would also use optimization hints before going for such denormalization (use EXPLAIN QUERY PLAN with SQLite). Otherwise, should optimization through a static representation be needed, a materialized view can be used. I am not certain this is possible with SQLite, but a cheap workaround would be to store a table with IDs from the join. This would keep the possibility to perform exact matches. We might now be entering less trivial aspects of querying, but this is where canned queries (like in a function) can help make it easily accessible to all users. >> When it comes to the ease with which one can can achieve it, a >> function performing the SQL query and returning the result can always >> be written and distributed for all to use. > > > sure. > > > another issue: as far as i can see, the web interface to geometadb > allows one to filter data series by organism, though it's not clear to > me whether this means platform organism, sample organism, both, or any. > it might be good to clarify this. > > but the web interface won't display the organism in the output table > (tell me i'm wrong) -- which means, a user can filter by organism, but > cannot see the organism of data series retrieved with whatever > criteria. it might be useful to extend the interface. > > vQ > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 16.8 years ago Laurent Gautier ★ 2.3k

0

Entering edit mode

Hi Wacek, Please see my answer to the web interface issue below. Please let me know if you have any questions. Jack On 3/20/09 3:42 PM, "Wacek Kusnierczyk" <waclaw.marcin.kusnierczyk at="" idi.ntnu.no=""> wrote: > Laurent Gautier wrote: >> Wacek Kusnierczyk wrote: >>> >>> i agree with that having multiple comma-separated organism entries for a >>> single data set or data series denormalizes the database. but enforcing >>> normalization is good only if it helps, not hinders using the data. i >>> think that having such an organism summary for a data series is >>> desirable for the purpose of filtering the data by organism, and while a >>> complex sql query can achieve that, it's easier and more efficient to >>> check a field directly in the gse table. >> >> I'd be on Sean's side regarding the choices made. >> >> The habit to represent one-to-many or many-to-many associations with >> comma-separated strings is an unfortunate one, I think. > > yes, but only as long as it is disturbing rather than helpful. > >> It will be far more efficient (in terms of usage of computing >> ressources) to filter data series with a query. > > not sure what you mean by 'far more efficient'. where the data does not > change too often, denormalizing preprocessing often leads to > considerable speedup of subsequent queries -- it all depends on the > shape of data. here, you have some 250000 rows in the gsm table, but > only a few samples per data series, which means that an on-the-fly join > may actually be slower than using cached result from a previous join, > stored as a comma-separated multiitem field. > > in an earlier post, i provided an sql statement that creates an > additional table, xgse (for 'extended gse'). the following are timings > for retrieving all data series that involve at least one human sample: > > time sqlite GEOmetadb.sqlite ' > select distinct gse > from gse > natural join gse_gsm > natural join gsm > where organism_ch1 like "%homo%" or organism_ch2 like "%homo%" > ' > /dev/null > # 3.353 s > > time sqlite GEOmetadb.sqlite ' > select distinct gse > from xgse > where sample_organism_ch1 like "%homo%" or sample_organism_ch2 > "%homo%" > ' > /dev/null > # 0.031 > > the 'far more efficient' query is actually 2 orders of magnitude > slower. (or did you mean something different?) > >> When it comes to the ease with which one can can achieve it, a >> function performing the SQL query and returning the result can always >> be written and distributed for all to use. > > > sure. > > > another issue: as far as i can see, the web interface to geometadb > allows one to filter data series by organism, though it's not clear to > me whether this means platform organism, sample organism, both, or any. > it might be good to clarify this. As you can see, on the web interface of the GEOmetadb application ( http://gbnci.abcc.ncifcrf.gov/geo/geo_search.php ) you can search for anyone of the three GEO entities (GPL, GSE and GSM) with any combination of terms from any of the three. The design provides an effective way, in my opinion, to do crossing GEO type search, which usually needs a complex SQL join command to achieve from command line. It allows you search GSE with an organism name. It also includes MySQL feature, Fulltext serach, in it which allows Boolean mode. Behind the scene is a complex SQL join expression, the same as Wacek wrote. > > but the web interface won't display the organism in the output table > (tell me i'm wrong) -- which means, a user can filter by organism, but > cannot see the organism of data series retrieved with whatever > criteria. it might be useful to extend the interface. > > vQ > >

ADD REPLY • link 16.8 years ago Jack Zhu ▴ 170

0

Entering edit mode

Jack Zhu wrote: > Hi Wacek, > > Please see my answer to the web interface issue below. Please let me know > if you have any questions. > > > >> another issue: as far as i can see, the web interface to geometadb >> allows one to filter data series by organism, though it's not clear to >> me whether this means platform organism, sample organism, both, or any. >> it might be good to clarify this. >> > > As you can see, on the web interface of the GEOmetadb application ( > http://gbnci.abcc.ncifcrf.gov/geo/geo_search.php ) you can search for anyone > of the three GEO entities (GPL, GSE and GSM) with any combination of terms > from any of the three. The design provides an effective way, in my opinion, > to do crossing GEO type search, which usually needs a complex SQL join > command to achieve from command line. It allows you search GSE with an > organism name. thanks for the explanation. yes, i know this search page. my point was, when i choose, say, 'homo sapiens' in the 'organism' field while searching for gse entries, what does it mean, exactly? which organism fields are examined, and what's the boolean combination? maybe it's explained there on the page, i just haven't found it? > It also includes MySQL feature, Fulltext serach, in it which > allows Boolean mode. hmm, is this working properly? the following query: drosha is assumed to return all entries with 'drosha' -- there is just one (with organism = homo sapiens). that's fine; the following query: mirna is supposed to return all entries with 'mirna' -- there are 17. the following query: mirna drosha is supposed to return all entries with 'drosha', 'mirna', or both -- there are 18 such entries (no overlap). the following query: mirna +drosha is supposed to return all entries with or without 'mirna', but necessarily with 'drosha' (the mysql doc page you're referring to says "A leading plus sign indicates that this word /must/ be present in each row that is returned" and "By default (when neither |+| nor |-| is specified) the word is optional, but the rows that contain it are rated higher"). but this query returns 18 entries, while only one of them contains 'drosha'! as far as i can see, the same happens with any selection of keywords prefixed with '+', they work in the optional mode irrespectively of the prefix. (but the exclusion prefix '-' seems to work.) > Behind the scene is a complex SQL join expression, > the same as Wacek wrote. > you mean i reproduced the very same query you're using? interesting. vQ

ADD REPLY • link 16.8 years ago Wacek Kusnierczyk ▴ 180

0

Entering edit mode

Hi Wacek, Thanks for your observation and suggestion. Regarding adding a organism field in the gse table, actually it was considered when the package was designed. Since one gse could have multiple organisms and that would end up with delimited strings. We did not do it because: 1. delimited strings in a field will need to be parsed when you want to do something like counting or convert to other types; 2. Keeping the things in different tables is the purpose of a relational database; 3. GEO does not have this field and we intend to keep data as the same as in the GEO as possible. The join table sql command does not look very pretty: sqlite GEOmetadb.sqlite ' >> select distinct gse.gse, gsm. >> from gse >> join gse_gsm on gse.gse = gse_gsm.gse >> join gsm on gsm.gsm = gse_gsm.gsm >> join gse_gpl on gse.gse = gse_gpl.gse >> join gpl on gpl.gpl = gse_gpl.gpl >> where (gsm.organism_ch1 like "%homo%" and gsm.organism_ch2 >> like "%homo%") and gpl.organism like "%homo%"' But we could do this by steps (in R): > ho_gpl.gpl = sqliteQuickSQL(con,'SELECT gpl FROM gpl WHERE gpl.organism LIKE "%homo%"') > ho_gsm.gpl = sqliteQuickSQL(con,'SELECT DISTINCT gpl FROM gsm WHERE gsm.organism_ch1 like "%homo%" and gsm.organism_ch2 like "%homo%"') > ho_gpl = intersect(ho_gpl.gpl$gpl, ho_gsm.gpl$gpl) > ho_gse = geoConvert(ho_gpl, 'gse') Jack On 3/19/09 11:44 AM, "Sean Davis" <sdavis2@mail.nih.gov> wrote: > > > On Thu, Mar 19, 2009 at 11:08 AM, Wacek Kusnierczyk > <waclaw.marcin.kusnierczyk@idi.ntnu.no> wrote: >> hello Jack, Sean, and others, >> >> i have an observation and a question related to GEOmetadb. the queries >> below are made directly in sqlite, but could have easily be done within >> R, for example. >> >> the gds (data sets) table contains two fields related to organisms: >> >> sqlite GEOmetadb.sqlite 'pragma table_info(gds)' | grep organism >> # 7|platform_organism|TEXT|0||0 >> # 10|sample_organism|TEXT|0||0 >> >> the gse (data series) table does not contain any such field: >> >> sqlite GEOmetadb.sqlite 'pragma table_info(gse)' | grep organism >> >> the gpl (platform) table contains one such field: >> >> sqlite GEOmetadb.sqlite 'pragma table_info(gpl)' | grep organism >> # 8|organism|TEXT|0||0 >> >> (and likewise for the gsm (sample) table). i would like to retrieve all >> data *series* (not data sets) where gene expression in a specific >> organism was investigated. in the vignette to the GEOmetadb >> Bioconductor package (as well as the GEOtools Matlab one), you give the >> following example: >> >> "we would like to find all the human breast cancer-related >> Affymetrix gene expression GEO series." >> >> and then you use the gse_gpl table to retrieve the *platform* organism >> -- which doesn't seem quite right, because you can hybridize non- human >> samples to human arrays, as well as human samples to non-human arrays: >> >> sqlite GEOmetadb.sqlite ' >> select gds, sample_organism as so, platform_organism as po >> from gds >> where (so like "%homo%" and po not like "%homo%") >> or (so not like "%homo%" and po like "%homo%")' >> >> (and likewise for data series, but the query would be a little bit more >> involved.) >> >> the comment: you may want to update the documentation, unless i >> misunderstood your intentions. > > Good point. Technically, you can hyb samples from one organism to a platform > from another organism. We should add a sentence to the vignette to that > effect. Thanks. > >> >> the question: is there no direct way to check the sample organism for >> data series entries in GEO? it would be very useful to filter gse by >> sample organism, without having to join the table with gse_gsm and gsm. >> you already have it for gds, it would be pretty simple to add an >> analogous field to gse. (and likewise for platform organism). surely, >> a gse may include more than one organism in the platforms or samples, >> but that's what we have in gds already: >> >> sqlite GEOmetadb.sqlite ' >> select platform_organism as po, sample_organism as so >> from gds >> where po like "%,%" or so like "%,%"' >> >> so i would like to be able to >> >> sqlite GEOmetadb.sqlite ' >> select gds, sample_organism as so, platform_organism as po from gds >> where so like "%homo%" >> and po like "%homo%"' >> >> rather than >> >> sqlite GEOmetadb.sqlite ' >> select distinct gse.gse, gsm. >> from gse >> join gse_gsm on gse.gse = gse_gsm.gse >> join gsm on gsm.gsm = gse_gsm.gsm >> join gse_gpl on gse.gse = gse_gpl.gse >> join gpl on gpl.gpl = gse_gpl.gpl >> where (gsm.organism_ch1 like "%homo%" and gsm.organism_ch2 >> like "%homo%") >> and gpl.organism like "%homo%"' >> >> after all, gds are backed by gse, so if gds entries can have >> sample_organism and platform_organism, why could not gse have them too? > > A GSE can have samples from multiple platforms and can, therefore, represent > multiple organisms. That is why doing the join to platform (if you are > willing to trust that the platform was used for the specified organism) or a > join to samples is necessary. A GDS, on the other hand, does not have that > characteristic and represents a set of samples from a single platform. > > Sean > > > -- Yuelin (Jack) Zhu NIH/NCI/CCR/Genetics Branch Tel: (301)496-4527 FAX: (301) 402-3241 E-mail: zhujack@mail.nih.gov Web: http://meltzerlab.nci.nih.gov/ [[alternative HTML version deleted]]

ADD REPLY • link 16.8 years ago Jack Zhu ▴ 170

0

Entering edit mode

Jack Zhu wrote: > Hi Wacek, > > Thanks for your observation and suggestion. Regarding adding a organism > field in the gse table, actually it was considered when the package was > designed. Since one gse could have multiple organisms and that would end up > with delimited strings. We did not do it because: 1. delimited strings in a > field will need to be parsed when you want to do something like counting or > convert to other types; you can always use complex queries to count the original data > 2. Keeping the things in different tables is the > purpose of a relational database; yes, but the most important purpose of a database is to serve its users; databases do not exist just to acknowledge the relational theory ;) > 3. GEO does not have this field and we > intend to keep data as the same as in the GEO as possible. > that's reasonable. however, geo is already incoherent here, since some entries in gds.platform_organism and gds.sample_organism are already denormalized, as are some entries in the organism fields in the gsm and gpl tables. if you wanted to insist on 1. and 2. above, you'd have to parse and separate these. and actually, to avoid redundance, the database would best have a separate column with organism names, linked to the other tables via an organism is (possibly the NCBI taxon id). since there is already a little mess here (not your fault) in this respect, i don't think adding organism columns to gse would make things worse wrt. 1 and 2 above. but this is only my opinion, i can obviously have things my way locally. > The join table sql command does not look very pretty: > it could be simplified a bit using natural joins, but you see that the simple task of filtering gse by organism does add some ugly complexity with the current shape of things. > sqlite GEOmetadb.sqlite ' > >>> select distinct gse.gse, gsm. >>> from gse >>> join gse_gsm on gse.gse = gse_gsm.gse >>> join gsm on gsm.gsm = gse_gsm.gsm >>> join gse_gpl on gse.gse = gse_gpl.gse >>> join gpl on gpl.gpl = gse_gpl.gpl >>> where (gsm.organism_ch1 like "%homo%" and gsm.organism_ch2 >>> like "%homo%") >>> > and gpl.organism like "%homo%"' > > But we could do this by steps (in R): yes, and you can do things by steps with sql as well. but r adds some performance penalty, without really simplifying the matters. thanks for your efforts. vQ

ADD REPLY • link 16.8 years ago Wacek Kusnierczyk ▴ 180

Login before adding your answer.