Hi, Alex. The typical process is to use getGEO to fetch a GSE or
GSEMatrix file and parse it into R. The data in these files are taken
directly from submitters to GEO and so could have been processed by
RMA, MAS5, or any of several other methods. One will often need to
refer to the protocol information in GEO or to the associated paper to
determine the exact methods. As Saroj pointed out, in many cases there
is a link to supplementary files in the GSE file or online on the
summary page. For Affy, the files behind this link usually include at
least the .CEL files. One can then use the getGEO function to get the
processed data and annotation, then get the raw .CEL files and process
them however necessary, and replace the values that come from GEO with
the ones derived locally.
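
As a rough sketch of the steps just described (not a definitive
recipe), the workflow might look like this in R. It assumes the
GEOquery, affy, and Biobase packages are installed, that the series
uses a single Affymetrix platform, and that GEO provides the raw data
as a *_RAW.tar archive. The names reprocess_series and cel_to_gsm are
hypothetical helpers, and RMA stands in for whatever processing you
prefer.

```r
# Hypothetical helper: reduce a CEL file path to its GSM accession,
# e.g. "dir/GSM72287.CEL.gz" becomes "GSM72287".
cel_to_gsm <- function(paths) {
  sub("^(GSM[0-9]+).*$", "\\1", basename(paths))
}

# Hypothetical helper: fetch processed and raw data for one series and
# swap in locally derived expression values. Needs network access.
reprocess_series <- function(gse_id, workdir = tempdir()) {
  # 1. Processed values and annotation, as submitted to GEO.
  eset <- GEOquery::getGEO(gse_id, GSEMatrix = TRUE)[[1]]

  # 2. Supplementary files (assumed here to include a *_RAW.tar archive).
  supp <- GEOquery::getGEOSuppFiles(gse_id, baseDir = workdir)
  tarfile <- grep("_RAW\\.tar$", rownames(supp), value = TRUE)[1]
  if (is.na(tarfile)) stop("no *_RAW.tar archive found for ", gse_id)
  celdir <- file.path(workdir, gse_id, "cel")
  utils::untar(tarfile, exdir = celdir)
  cels <- list.files(celdir, pattern = "\\.cel(\\.gz)?$",
                     ignore.case = TRUE, full.names = TRUE)

  # 3. Process the raw data locally; RMA is only one possible choice.
  local <- affy::rma(affy::ReadAffy(filenames = cels))
  Biobase::sampleNames(local) <- cel_to_gsm(Biobase::sampleNames(local))

  # 4. Replace the GEO values with the locally derived ones, matching
  #    samples by GSM accession and probesets by ID.
  common_s <- intersect(Biobase::sampleNames(eset),
                        Biobase::sampleNames(local))
  common_f <- intersect(Biobase::featureNames(eset),
                        Biobase::featureNames(local))
  eset <- eset[common_f, common_s]
  Biobase::exprs(eset) <- Biobase::exprs(local)[common_f, common_s]
  eset
}
```

A call would look like reprocess_series("GSE3218"). For a series that
spans several platforms, the CEL files would first need to be split by
chip type, since they cannot be read into one AffyBatch together.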
Sean
Alex Tsoi wrote:
> Thanks all of you for the information.
>
> However, as I mentioned in my previous emails, some GEO data (e.g.
> GSM72287) has both the .CEL file and .EXP file, and I looked up their
> paper: http://www.ncbi.nlm.nih.gov/sites/entrez
> and the authors mentioned that they did put the processed data as
> .CEL and the raw as .EXP.
The .CEL files are, by definition, raw files. If a manuscript says
otherwise, then I think you should probably contact the author to
clarify the situation.
> I understand that I could first download the supplementary files
> manually from the GEO website, then read them into R. But
> unfortunately, I am doing a meta-analysis on cancer microarrays, so I
> would have to download 20+ datasets manually to get the raw data.
> So I just wonder, in case the raw data is available in GEO, is
> there any way I could parse that directly into R? (Some of those
> have both processed and raw data, but once parsed using getGEO, only
> the processed data is shown.)
The link for the supplementary files is embedded in the GSE header
information, if available. You can certainly use R to download those
files and uncompress them. You will still need to make some decisions
about how you would like to treat these raw data after they are
downloaded. Since you are setting up to do a meta-analysis, presumably
you have thought a good deal about how to go about processing the raw
data and analyzing the results across datasets.
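
For 20+ datasets, the download step can be scripted. The following is
only a sketch: fetch_raw is a hypothetical helper, the "geo_raw"
directory name is an assumption, and the default fetch function relies
on getGEOSuppFiles from GEOquery (which needs network access). The
fetch argument exists so the loop can be tried out without contacting
GEO.

```r
# Hypothetical helper: download supplementary files for several series,
# continuing past any series that fails.
fetch_raw <- function(gse_ids, base_dir = "geo_raw",
                      fetch = function(id, dir) {
                        GEOquery::getGEOSuppFiles(id, baseDir = dir)
                      }) {
  dir.create(base_dir, showWarnings = FALSE, recursive = TRUE)
  results <- lapply(gse_ids, function(id) {
    tryCatch(
      fetch(id, base_dir),
      error = function(e) {
        # Report the failure and record NULL for this series.
        message("Skipping ", id, ": ", conditionMessage(e))
        NULL
      }
    )
  })
  names(results) <- gse_ids
  results
}
```

Something like fetch_raw(c("GSE3218")) would then leave one
subdirectory per series under geo_raw. Series that come back as NULL
are the ones to check by hand on the GEO website.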
Sean
There are links to the .CEL files (I guess this would be "raw" files)
at GEO.
E.g., GSM72287 is part of the series GSE3218. At the bottom of the
page (below) there is a link under 'Supplementary files'.
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE3218
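
The same information can be read from the parsed GSM object in R. This
is a small sketch, assuming the header list returned by Meta() in
GEOquery carries the data_processing and supplementary_file fields
(as in the GSM72287 output quoted below); describe_gsm_meta is a
hypothetical helper name.

```r
# Hypothetical helper: summarize how a sample was processed and which
# supplementary files look like raw CEL files.
describe_gsm_meta <- function(meta) {
  list(
    processing = meta$data_processing,
    raw_files  = grep("\\.cel(\\.gz)?$", meta$supplementary_file,
                      ignore.case = TRUE, value = TRUE)
  )
}

# Usage with GEOquery (needs network access):
# gsm <- GEOquery::getGEO("GSM72287")
# describe_gsm_meta(GEOquery::Meta(gsm))
```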
HTH
Saroj
Alex Tsoi wrote:
>I figured out that those are the RMA-processed data, so my question
>should be: how could I get the raw data?
>
>
>On 7/3/07, Alex Tsoi <tsoi.teen at gmail.com> wrote:
>
>
>>Dear all,
>>
>>I use the function getGEO from GEOquery to retrieve different cancer
>>data sets from GEO to do a meta-analysis.
>>
>>However, I am not quite sure whether the data I downloaded have
>>already been processed (e.g. RMA or MAS) or not. Is it true that all
>>the .CEL files might be processed while all the .EXP files are raw?
>>
>>Also, if I assign as:
>>
>>
>>
>>> rawdata <- getGEO("GSM72287")
>>>
>>>
>>"rawdata" has the data table with column names ID_REF and VALUE:
>>
>>but are those processed or raw data values ?
>>
>>My main goal is to get the raw data values from each sample so I
>>could do a meta-analysis by applying my own processing methods.
>>
>>The rawdata object is shown below.
>>
>>I greatly appreciate any help.
>>
>>
>>
>>An object of class "GSM"
>>channel_count
>>[1] "1"
>>characteristics_ch1
>>[1] "mixed GCT (Embryonal Carcinoma, Seminoma)"
>>contact_address
>>[1] "1275 York Ave"
>>contact_city
>>[1] "New York"
>>contact_country
>>[1] "USA"
>>contact_department
>>[1] "Cell Biology"
>>contact_email
>>[1] " korkolaj at mskcc.org"
>>contact_institute
>>[1] "Memorial Sloan-Kettering"
>>contact_laboratory
>>[1] "Chaganti"
>>contact_name
>>[1] "James,,Korkola"
>>contact_phone
>>[1] "212-639-8281"
>>contact_state
>>[1] "NY"
>>contact_zip/postal_code
>>[1] "10021"
>>data_processing
>>[1] "RMA (robust multi-array)"
>>data_row_count
>>[1] "22645"
>>description
>>[1] "Adult Male Germ Cell Tumor"
>>extract_protocol_ch1
>>[1] "Frozen tissue from a germ cell tumor was minced and homogenized in RLT buffer (Qiagen).Total RNA was extracted from the tissue lysate using an RNeasy kit (Qiagen)."
>>geo_accession
>>[1] "GSM72287"
>>hyb_protocol
>>[1] "standard Affymetrix procedures"
>>label_ch1
>>[1] "biotin"
>>label_protocol_ch1
>>[1] "Approximately 12 ug of total RNA was processed to produce biotinylated cRNA targets."
>>last_update_date
>>[1] "Oct 12 2005"
>>molecule_ch1
>>[1] "total RNA"
>>organism_ch1
>>[1] "Homo sapiens"
>>platform_id
>>[1] "GPL97"
>>scan_protocol
>>[1] "standard Affymetrix procedures"
>>series_id
>>[1] "GSE3218"
>>source_name_ch1
>>[1] "germ cell tumor"
>>status
>>[1] "Public on Nov 10 2005"
>>submission_date
>>[1] "Aug 29 2005"
>>supplementary_file
>>[1] "file:///samples/GSM72287/GSM72287.CEL.gz"
>>[2] "file:///samples/GSM72287/GSM72287.EXP.gz"
>>title
>>[1] "germ cell tumors (GCT) and normal controls 052B 1"
>>type
>>[1] "RNA"
>>An object of class "GEODataTable"
>>****** Column Descriptions ******
>> Column Description
>>1 ID_REF \t
>>2 VALUE RMA-calculated Signal intensity
>>****** Data Table ******
>> ID_REF VALUE
>>1 200000_s_at 9.913362
>>2 200001_at 9.822533
>>3 200002_at 11.318111
>>4 200003_s_at 12.280321
>>5 200004_at 11.068576
>>22640 more rows ...
>>
>>
>>
>>--
>>Lam C. Tsoi (Alex)
>>Medical University of South Carolina
>>
>>
>
>
>
>
>
>
CEL files contain the probe-level data, so by definition they contain
'raw' data (no background correction, normalization or
summarization). So CEL files never contain processed data...
Cheers,
Jenny
At 02:39 PM 7/3/2007, Saroj Mohapatra wrote:
>There are links to the .CEL files (I guess this would be "raw" files)
>at GEO.
>
>E.g., GSM72287 is part of the series GSE3218. At the bottom of the
>page (below) there is a link under 'Supplementary files'.
>
>http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE3218
>
>HTH
>
>Saroj
Jenny Drnevich, Ph.D.
Functional Genomics Bioinformatics Specialist
W.M. Keck Center for Comparative and Functional Genomics
Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign
330 ERML
1201 W. Gregory Dr.
Urbana, IL 61801
USA
ph: 217-244-7355
fax: 217-265-5066
e-mail: drnevich at uiuc.edu