getGEO - getting the .CEL files from GEO
張 語恬 ▴ 10
Hi,

I've downloaded the GSE CEL files from GEO, but I am having trouble adding the individual sample characteristics (tumor site, age, gender, and so on) to the CEL data.

I've read the mailing-list thread "[BioC] getGEO - getting the .CEL files from GEO", but I still don't understand it.

Could you use GSE4045 as an example to demonstrate how to use exprs() (I found the instruction in the mailing list) to replace the data in GSE4045.SOFT with the raw microarray data from the CEL files while keeping the sample characteristics?

Thanks,
greengarden
@vincent-j-carey-jr-4
Do you really want to put sample-characteristics data in a CEL file?

The sample characteristics are available as follows:

    library(GEOquery)
    ff = getGEO("GSE4045")
    > table(pData(ff[[1]])$descr)

    conventional colorectal tumor, mucinous, Dukes Stage c, MSS, no cancer in the family, male, Distal Location , Tumor Grade 2
        1
    conventional colorectal tumor, non-mucinous, Dukes Stage b, MSS, no cancer in the family, female, Distal Location , Tumor Grade 2
        1
    conventional colorectal tumor, non-mucinous, Dukes Stage c, MSI, no cancer in the family, female, Proximal Location , Tumor Grade 3
        1
    ....

and you will have to parse that 'description' field to extract stage and other relevant information. For example,

    de = as.character(ff[[1]]$desc)
    gr = gsub(".*, Tumor Grade.(.)$", "\\1", de)

gives you a single-character string for grade, except for sample 14, where my regexp doesn't do as much as it should.

Such activities would be used to populate an annotated data frame, which could then serve as the phenoData component of an AffyBatch instance. An AffyBatch is a typical container for CEL-based intensity data, to be propagated downstream through background correction, normalization, and so forth. The experimentData element should also be suitably populated, as early in the workflow as possible. If we look closely enough, we can find that the ExpressionSet returned by getGEO has quantifications generated by MAS 5.0.
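The parsing step described above can be sketched in base R. This is only an illustration, not the method used in the thread: the three strings are copied from the table() output above, extract_field() is a hypothetical helper, and the column names are my own choice. Unlike gsub(), it returns NA when a pattern does not match, which makes irregular records (such as sample 14) easy to spot.

```r
# Example description strings, copied from the table() output in this thread.
de <- c(
  "conventional colorectal tumor, mucinous, Dukes Stage c, MSS, no cancer in the family, male, Distal Location , Tumor Grade 2",
  "conventional colorectal tumor, non-mucinous, Dukes Stage b, MSS, no cancer in the family, female, Distal Location , Tumor Grade 2",
  "conventional colorectal tumor, non-mucinous, Dukes Stage c, MSI, no cancer in the family, female, Proximal Location , Tumor Grade 3"
)

# Hypothetical helper: return the first capture group of `pattern`,
# or NA when the pattern does not match at all.
extract_field <- function(x, pattern) {
  hits <- regmatches(x, regexec(pattern, x))
  vapply(hits,
         function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1))
}

# One column per characteristic; the column names are illustrative only.
pheno <- data.frame(
  stage    = extract_field(de, "Dukes Stage ([A-Za-z])"),
  msi      = extract_field(de, ", (MSS|MSI),"),
  gender   = extract_field(de, ", (male|female),"),
  location = extract_field(de, "(Proximal|Distal) Location"),
  grade    = extract_field(de, "Tumor Grade ([0-9])"),
  stringsAsFactors = FALSE
)
print(pheno)
```

With real data, de would come from the description column of pData(ff[[1]]), and any row containing NA should be inspected by hand before the table is used as phenoData.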
2010/3/17 Vincent Carey <stvjc at channing.harvard.edu>:
> do you really want to put sample-characteristics data in a CEL file?
> [...]

There are a couple of tricks here that can sometimes be useful to get better annotation. In this case, they are not a big improvement.

The GEO GSE data entity contains information as supplied by the submitters. The GDS data entity contains data taken from GSE records that have been further curated by GEO staff. Often, that leads to more useful annotation than comma-separated lists (although the information is usually the same or similar, at least). To give an example of how one might learn of the existence of such a GDS given a GSE, one can use the GEOmetadb package:

    library(GEOmetadb)
    # Next command will take a minute....
    sqlfile = getSQLiteFile()
    # Check to see if the GSE record has a corresponding
    # GDS record
    geoConvert('GSE4045','gds')

This series of commands will result in the following:

    $gds
      from_acc  to_acc
    1  GSE4045 GDS2201

So, GSE4045 has been curated by NCBI GEO staff, and the accession of the curated data is GDS2201. We can check to see what subsets (phenotypic variables) are available using GEOmetadb, but we have to resort to writing SQL to do so:

    # make a connection to the database
    conn = dbConnect(SQLite(), sqlfile)
    dbGetQuery(conn, "select gds_subset.gds, gds_subset.description, gds_subset.type from gds_subset where gds='GDS2201'")

If writing SQL is necessary, one can use the columnDescriptions() function to get a data.frame of columns, tables, and descriptions. The query above returns this small data.frame:

          gds                       description          type
    1 GDS2201     serrated colerectal carcinoma disease state
    2 GDS2201 conventional colorectal carcinoma disease state

So, unfortunately, the GEO staff has annotated only the two different types of colorectal carcinoma and not the other clinical variables. If this is all you wanted, then you can use getGEO('GDS2201') to get the annotations and attach those to the ExpressionSet that you create by normalizing the .CEL files of your choosing. If not, then Vince's method is the way to go.

Sean
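Whichever source the annotation comes from, attaching it to CEL-derived data means putting the annotation rows in the same order as the CEL files. A minimal base R sketch of that alignment follows. The file names, GSM accessions, and disease_state values are made up for illustration, and it assumes each CEL file name starts with its GSM accession.

```r
# Hypothetical CEL file names (deliberately not in accession order).
cel_files <- c("raw/GSM0000003.CEL.gz",
               "raw/GSM0000001.CEL.gz",
               "raw/GSM0000002.CEL.gz")

# Hypothetical annotation table keyed by GSM accession.
annot <- data.frame(
  disease_state = c("serrated", "conventional", "serrated"),
  row.names = c("GSM0000001", "GSM0000002", "GSM0000003"),
  stringsAsFactors = FALSE
)

# Recover the GSM accession from each file name.
gsm <- sub("^(GSM[0-9]+).*$", "\\1", basename(cel_files))

# Fail early if any CEL file has no annotation row.
missing_gsm <- setdiff(gsm, rownames(annot))
if (length(missing_gsm) > 0) {
  stop("No annotation for: ", paste(missing_gsm, collapse = ", "))
}

# Reorder the annotation to match the CEL files and use the file names
# as row names, which is what ReadAffy() uses as sample names by default.
aligned <- annot[gsm, , drop = FALSE]
rownames(aligned) <- basename(cel_files)
print(aligned)

# Untested sketch of the remaining steps, assuming affy and Biobase are installed:
#   library(affy)
#   abatch <- ReadAffy(filenames = cel_files,
#                      phenoData = AnnotatedDataFrame(aligned))
#   eset <- rma(abatch)   # pData(eset) then carries the annotation
```

Indexing by accession rather than relying on file order avoids silently pairing a sample with the wrong annotation row.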
On 17 March 2010 16:51, Sean Davis <seandavi at gmail.com> wrote:
> There are a couple of tricks here that can sometimes be useful to get
> better annotation. In this case, they are not a big improvement.
> [...]

It's also worth noting that ArrayExpress has imported much of the data from common Affymetrix platforms (and some other platforms) from GEO. These imported data sets have generally been put through a basic curation step, which does improve the computability of the annotation somewhat. The general rule is that for a GEO series GSENNNN, the ArrayExpress entry is E-GEOD-NNNN:

    library(ArrayExpress)
    abatch <- ArrayExpress('E-GEOD-4045')

Not that it makes a huge difference in this case, but this is a pretty good workaround when a GDS set is not available in GEO.

Cheers,

Tim

--
(former AE curator)
Bioinformatician, Smith Lab
CIMR, University of Cambridge
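The accession rule above is easy to apply programmatically. A small base R sketch, where gse_to_arrayexpress() is a hypothetical helper name and not part of the ArrayExpress package:

```r
# Hypothetical helper: map GEO series accessions (GSENNNN) to the
# corresponding ArrayExpress accessions (E-GEOD-NNNN).
gse_to_arrayexpress <- function(gse) {
  # Refuse anything that does not look like a GSE accession.
  stopifnot(all(grepl("^GSE[0-9]+$", gse)))
  sub("^GSE", "E-GEOD-", gse)
}

gse_to_arrayexpress("GSE4045")
```

The returned string can be passed to ArrayExpress() as in the example above. The helper only rewrites the accession; it does not check that ArrayExpress has actually imported the series.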