R: Re: Thank you and small question - R: R: How to use GEOquery to extract more than the default information from a GSE
@manca-marco-path-3578
Dear Sean,

using the command

    > eset <- getGEO('GSE9820')[[1]]

inside the slot eset@phenoData@data I could only find information concerning the experimental facility (namely: Cardiology, Academic Medical Center, Meibergdreef 9, Amsterdam 1105AZ, Netherlands) and the contact details of the researcher, while nowhere in the file tree was I able to find relevant information such as the (coded) identity of the patients, the type of tissue each GSM was obtained from, and the case/control status. Now, following the procedure suggested by James, I have everything I need.

I could work around the missing information in limma by producing a classical design matrix, but it was giving me major headaches when trying to work with siggenes.

Thanks to the procedure suggested by James I have been able to include these data, but I was wondering if there is something I could study to get a better insight into the problem because, for instance, following James's instructions I ended up with an empty annotation slot (which I filled simply with gse@annotation <- "GPL6255") and an almost empty featureData slot, which looked like:

    > gse@featureData
    An object of class "AnnotatedDataFrame"
      featureNames: ILMN_10000, ILMN_10001, ..., ILMN_9997 (20588 total)
      varLabels and varMetadata description: none
    > gse@featureData@data
    data frame with 0 columns and 20588 rows
    > gse@featureData@data@.Data
    list()

...and being unable to understand how to add these data during the process, I created a "Frankenstein" ExpressionSet by merging the one obtained with

    > eset <- getGEO('GSE9820')[[1]]

and the handcrafted one via

    > gse@featureData <- eset@featureData

...which really hurts, I know... but I had no other ideas, and the files are organized in exactly the same order...

That's why I would like to have something to study, to gain further insight into how to handle GEO objects and their ExpressionSets.

Thank you for your kind and prompt reply, and for your precious support again.

My best regards,
Marco

--
Marco Manca, MD
University of Maastricht
Faculty of Health, Medicine and Life Sciences (FHML)
Cardiovascular Research Institute (CARIM)
E-mail: m.manca at path.unimaas.nl
Mobile: +31626441205
Twitter: @markomanka

________________________________________
From: seandavi at gmail.com on behalf of Sean Davis [sdavis2 at mail.nih.gov]
Sent: Wednesday, 29 July 2009, 10:32
To: Manca Marco (PATH)
Cc: James F. Reid; bioconductor mailing list
Subject: Re: [BioC] Thank you and small question - R: R: How to use GEOquery to extract more than the default information from a GSE

On Wed, Jul 29, 2009 at 4:24 AM, Manca Marco (PATH) <m.manca at path.unimaas.nl> wrote:

> Dear James, dear Sean, and dear Bioconductors,
>
> good morning.
>
> Thank you for your help up to now, I really appreciate it.
>
> I am probably a bit thickheaded, and I apologize for this, but I am still missing something from the picture. The instructions from James worked excellently in my case, and I am sincerely grateful for the patience and support I have received.
>
> I am nevertheless wondering how you gained all this insight into the GSE structure and its handling...
>
> I have read the following documents:
>
> - An Introduction to Bioconductor's ExpressionSet Class ( http://www.bioconductor.org/packages/2.5/bioc/vignettes/Biobase/inst/doc/ExpressionSetIntroduction.pdf )
> - GEOquery ( http://watson.nci.nih.gov/bioc_mirror/2.4/bioc/manuals/GEOquery/man/GEOquery.pdf )
> - Using the GEOquery package ( http://www.bioconductor.org/packages/1.8/bioc/vignettes/GEOquery/inst/doc/GEOquery.pdf )
>
> ...and yet I am afraid that I would have terrible headaches trying to do what James (and Sean) guided me to on a new dataset, all on my own.
>
> Is there any source of information on these topics that I am missing? Or is it just the experience gathered during a painful process of attempts, failures and successes?

Hi, Marco.

What data is not included in the ExpressionSet that is returned by:

    eset <- getGEO('GSE9820')[[1]]

We can only help if you can be specific about what you want to do.

Sean

________________________________________
From: James F. Reid [james.reid at ifom-ieo-campus.it]
Sent: Monday, 27 July 2009, 13:26
To: Manca Marco (PATH)
Cc: sdavis2 at mail.nih.gov; bioconductor mailing list
Subject: Re: [BioC] R: How to use GEOquery to extract more than the default information from a GSE

Hi Marco,

if you set GSEMatrix=FALSE and pick what you want, you will have to create an ExpressionSet de novo.
For extracting particular annotations of the samples, for example 'characteristics_ch1' and 'source_name_ch1' as you mention, you will want to include these in an annotated phenoData data.frame, which in turn will be included in an ExpressionSet.
Here's a way of producing a reduced phenoData:

    library("GEOquery")
    gse <- getGEO('GSE9820', GSEMatrix=FALSE)

    pD1 <- sapply(names(GSMList(gse)), function(gsm)
                  GSMList(gse)[[gsm]]@header$characteristics_ch1)
    pD2 <- sapply(names(GSMList(gse)), function(gsm)
                  GSMList(gse)[[gsm]]@header$source_name_ch1)

    pD1[,1]
    ##[1] "patient"  "patient ID_REF: A10"  "age:58"  "sex:M"
    pD2[1]
    ##    GSM247703
    ##"macrophages"

    ## now put things together
    pD <- data.frame(type = pD1[1, ],
                     patientID = sub("patient ID_REF: ", "", pD1[2, ]),
                     age = sub("age:", "", pD1[3, ]),
                     sex = sub("sex:", "", pD1[4, ]),
                     cell = pD2)

    phenoD <- new('AnnotatedDataFrame',
                  data = pD,
                  varMetadata = data.frame(labelDescription = colnames(pD)))

When you create the 'exprs' slot in the ExpressionSet, make sure that the columns match the rows of this phenoData object.

HTH,
J.

Manca Marco (PATH) wrote:
> Dear James,
>
> thank you for your prompt and kind reply.
>
> I was doing the following and I wasn't able to see my annotation associated with the files:
>
>     library("GEOquery")
>     gse <- getGEO("GSE9820")
>     gse
>
> ...following your suggestion I get exactly the same output as you.
>
> Nevertheless I would love to be able to build my own ExprSet from a GSE using GEOquery with the option GSEMatrix=FALSE and then selecting the variables I want to import/include. In GEOquery's vignette there is an example of this, but I am not able to find a document listing the options and the language/naming I should use to personalize the final file (the vignette only mentions that personalizing everything is quite difficult, but possible anyway).
>
> Thank you.
>
> Best regards,
> Marco
> ________________________________________
> From: James F. Reid [james.reid at ifom-ieo-campus.it]
> Sent: Friday, 24 July 2009, 15:38
> To: Manca Marco (PATH)
> Cc: sdavis2 at mail.nih.gov; bioconductor mailing list
> Subject: Re: [BioC] How to use GEOquery to extract more than the default information from a GSE
>
> Hi Marco,
>
> I'm not sure what you mean by 'more than default information'.
>
> Using GEOquery can be a bit complicated if the GEO series (GSE) contains multiple platforms, but in your case you're fine because there is only one.
>
> You can get a complete ExpressionSet, which stores sample annotation, platform annotation and expression values, by doing:
>
>     library("GEOquery")
>     gse <- getGEO("GSE9820")
>     names(gse)
>     ##[1] "GSE9820_series_matrix.txt.gz"
>     gse[[1]]
>
> which prints out:
>
>     ExpressionSet (storageMode: lockedEnvironment)
>     assayData: 20589 features, 153 samples
>       element names: exprs
>     phenoData
>       sampleNames: GSM247703, GSM247704, ..., GSM247855 (153 total)
>       varLabels and varMetadata description:
>         title: NA
>         geo_accession: NA
>         ...: ...
>         data_row_count: NA
>         (33 total)
>     featureData
>       featureNames: ILMN_10000, ILMN_10001, ..., ILMN_9999 (20589 total)
>       fvarLabels and fvarMetadata description:
>         ID: NA
>         GB_ACC: NA
>         ...: ...
>         SYNONYM: NA
>         (6 total)
>       additional fvarMetadata: Column, Description
>     experimentData: use 'experimentData(object)'
>     Annotation: GPL6255
>
>     fvarLabels(gse[[1]])
>     [1] "ID"         "GB_ACC"     "SYMBOL"     "DEFINITION" "ONTOLOGY"
>     [6] "SYNONYM"
>
> The featureData contains all the information for the platform, varLabels will give you the labels of the sample information, and you can get to the expression values by means of exprs(gse[[1]]).
>
> HTH,
> J.
>
> Manca Marco (PATH) wrote:
>> Dear Sean and dear bioconductors,
>>
>> I am writing to ask for a source of inspiration (code pieces, notes, references, whatever you might think appropriate) for importing array annotation and other data from the GSE I am trying to work with (namely GSE9820) into my eset.
>>
>> I have read in GEOquery's vignette that this is actually possible, despite being a bit tricky:
>>
>> "So, using a combination of lapply on the GSMList, one can extract as many columns of interest as necessary to build the data structure of choice. Because the GSM data from the GEO website are fully downloaded and included in the GSE object, one can extract foreground and background as well as quality for two-channel arrays, for example. Getting array annotation is also a bit more complicated, but by replacing "platform" in the lapply call to get platform information for each array, one can get other information associated with each array. Future work with this package will likely focus on better tools for manipulating GSE data"
>> From http://www.bioconductor.org/packages/2.4/bioc/vignettes/GEOquery/inst/doc/GEOquery.pdf, page 22 of 22
>>
>> ...but I can't find any hint anywhere.
>>
>> Thank you in advance for your patience and support.
>>
>> My best regards,
>> Marco
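For readers following James's route, here is a minimal sketch of the final assembly step, in which the featureData and annotation slots are filled at construction time rather than patched afterwards. It assumes Biobase is installed; the toy matrix, the sample and probe names, and the gene symbols are hypothetical stand-ins for what would be extracted from the GSM and GPL records, and only the platform ID GPL6255 comes from the thread.

```r
library(Biobase)

# Toy expression matrix (hypothetical values): rows are probes, columns are samples
exprs_mat <- matrix(
  c(7.1, 8.3, 6.9, 7.4, 9.0, 8.8),
  nrow = 3,
  dimnames = list(c("probe1", "probe2", "probe3"),
                  c("sample1", "sample2"))
)

# Sample annotation: one row per sample, row names equal to the matrix column names
pD <- data.frame(
  type = c("patient", "patient"),
  cell = c("cellTypeA", "cellTypeB"),   # hypothetical labels
  row.names = colnames(exprs_mat)
)

# Feature annotation: one row per probe, row names equal to the matrix row names
fD <- data.frame(
  ID     = rownames(exprs_mat),
  SYMBOL = c("geneA", "geneB", "geneC"),  # hypothetical symbols
  row.names = rownames(exprs_mat)
)

# Check the alignment explicitly before building the object
stopifnot(identical(colnames(exprs_mat), rownames(pD)),
          identical(rownames(exprs_mat), rownames(fD)))

eset <- ExpressionSet(
  assayData   = exprs_mat,
  phenoData   = AnnotatedDataFrame(pD),
  featureData = AnnotatedDataFrame(fD),
  annotation  = "GPL6255"
)

annotation(eset)   # platform ID stored with the object
fvarLabels(eset)   # column names of the feature annotation
```

Checking the row and column names with identical() before construction is a deliberate choice here: it guards against relying on two tables merely appearing to be in the same order.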
Annotation limma siggenes GEOquery
@sean-davis-490
On Wed, Jul 29, 2009 at 4:57 AM, Manca Marco (PATH) <m.manca@path.unimaas.nl> wrote:

> Dear Sean,
>
> using the command
>
>     > eset <- getGEO('GSE9820')[[1]]
>
> inside the slot eset@phenoData@data I could only find information concerning the experimental facility (namely: Cardiology, Academic Medical Center, Meibergdreef 9, Amsterdam 1105AZ, Netherlands) and the contact details of the researcher, while nowhere in the file tree was I able to find relevant information such as the (coded) identity of the patients, the type of tissue each GSM was obtained from, and the case/control status. Now, following the procedure suggested by James, I have everything I need.
>
> I could work around the missing information in limma by producing a classical design matrix, but it was giving me major headaches when trying to work with siggenes.
>
> Thanks to the procedure suggested by James I have been able to include these data, but I was wondering if there is something I could study to get a better insight into the problem because, for instance, following James's instructions I ended up with an empty annotation slot (which I filled simply with gse@annotation <- "GPL6255") and an almost empty featureData slot, which looked like:
>
>     > gse@featureData
>     An object of class "AnnotatedDataFrame"
>       featureNames: ILMN_10000, ILMN_10001, ..., ILMN_9997 (20588 total)
>       varLabels and varMetadata description: none
>     > gse@featureData@data
>     data frame with 0 columns and 20588 rows
>     > gse@featureData@data@.Data
>     list()

Hi, Marco. Try these.

    gse <- getGEO('GSE9820')[[1]]
    # gse is an ExpressionSet
    varLabels(gse)
    pData(gse)[1:5,]
    fData(gse)[1:5,]

Generally, if you are using the "@" notation, you are doing something wrong. You should always be using the accessors like I have above. pData(gse) gives the sample information--there are 33 columns per sample, in this case. fData(gse) gives the feature data--there are 6 columns, including gene symbol. This should be all you need to use limma, if you like.

Sean
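As a rough illustration of Sean's closing remark, the sketch below turns a sample-information table into a design matrix of the kind limma expects. It is an assumption-laden toy: the data frame `pheno` stands in for what pData(gse) would return, and the column name `status` and its values are hypothetical, since the thread does not say which of the 33 columns carries the case/control label. Only base R is used.

```r
# Toy stand-in for pData(gse); column name and values are hypothetical
pheno <- data.frame(
  status = c("case", "control", "case", "control"),
  row.names = paste0("sample", 1:4)
)

# Make the reference level explicit so the coefficient reads as case vs control
pheno$status <- factor(pheno$status, levels = c("control", "case"))

# One row per sample, an intercept column plus an indicator for "case"
design <- model.matrix(~ status, data = pheno)

colnames(design)
design[, "statuscase"]
```

With real data the same call would take `data = pData(gse)`, and the resulting matrix could be passed, together with exprs(gse), to limma's lmFit(); the row names of the design then line up with the sample names of the ExpressionSet.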
Dear Sean,

thank you. This absolutely works and is, indeed, the simplest/fastest way to go.

My best regards,
Marco