Meta data for gene from GEO
Guest User ★ 13k
@guest-user-4897
Last seen 10.3 years ago
I would like to get data for all genes, as gene symbols or gene IDs mapped to GPL/GSE/GSM/GDS metadata. I have used the GEOmetadb package to get this metadata, but I cannot find a way to extract it mapped to genes.

Is there any way the GEOquery Bioconductor package can be used for this?

Thanks,
Rohan

-- output of sessionInfo():
library(GEOquery)
library(GEOmetadb)

-- Sent via the guest posting facility at bioconductor.org.
GEOquery GEOmetadb • 2.4k views
@sean-davis-490
Last seen 4 months ago
United States
On Fri, Nov 22, 2013 at 4:53 PM, Rohan [guest] wrote:

> Is there any way the GEOquery Bioconductor package can be used for this?

Good question. It has a long-winded answer.

The GEO platform (GPL) is the only GEO entity that stores any information about gene identity. Other entities (GSM, GSE, GDS) are linked to GPL rows by an ID column. So, to get information about the genes represented by an experiment, we need to look at GPL records.

GPL records come in two flavors: the submitter-supplied flavor and the so-called "Annotation" GPL that has been curated by NCBI GEO. You'll need to focus on the Annotation GPLs, since those are the ones that all share a standard "Gene ID" column. The Annotation GPLs are only generated for data sets that have been curated by NCBI GEO, namely the GDS records. So, we need to get the distinct GPL records associated with GDS records, and these will be the entire set of Annotation GPLs. Using GEOmetadb (assuming you have already made a connection, etc.):

    annotgpl = dbGetQuery(con,"select distinct GPL from gds")

Now, annotgpl contains the accession numbers (GPL IDs) for all the Annotation GPLs. You can use these GPL IDs to relate each GPL to GSM, GDS, and GSE records.

How do you get the information about what genes are on each GPL, though? You'll need to use GEOquery for that.

    gpl = getGEO(annotgpl[1,1],AnnotGPL=TRUE)

gpl is now a GPL object, and we can use the Table method to get a data frame and grab the Gene ID column (which holds Entrez Gene IDs):

    geneids = Table(gpl)[,'Gene ID']

Now you have the Entrez Gene IDs for all features on the platform, and you can associate those with all the GSM, GDS, and GSE records attached to the GPL. If you loop over all the GPLs in the annotgpl data frame, you'll have the information you want, I think.

Unfortunately, this is not a complete answer, because it does not include the submitter-supplied GPLs that do not have any Annotation GPL available (since NCBI GEO does not curate everything). The submitter-supplied GPLs do not have a standard vocabulary for what is included in the columns of the GPL, so there is no easy way to automate processing as above.

Hope that helps.

Sean
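The final step described above, associating the gene IDs on a platform with the records attached to that platform, amounts to a join on the GPL accession. A minimal sketch in base R follows, using small made-up data frames in place of real query results. The accessions and gene IDs are placeholders, and the data frame and column names (`gds_gpl`, `gpl_gene`, `gene_id`) are assumptions chosen for illustration, not anything defined by GEOmetadb or GEOquery.

```r
# Placeholder for the result of a GEOmetadb query such as
#   dbGetQuery(con, "select gds, gpl from gds")
# (the values below are invented for illustration).
gds_gpl <- data.frame(
  gds = c("GDS1", "GDS2", "GDS3"),
  gpl = c("GPL13", "GPL13", "GPL14"),
  stringsAsFactors = FALSE
)

# Placeholder for gene IDs collected per platform with
#   Table(getGEO(<GPL accession>, AnnotGPL = TRUE))[, "Gene ID"]
gpl_gene <- data.frame(
  gpl     = c("GPL13", "GPL13", "GPL14"),
  gene_id = c("geneA", "geneB", "geneC"),
  stringsAsFactors = FALSE
)

# Join on the platform accession: one row per (GDS, gene) pair.
gds_gene <- merge(gds_gpl, gpl_gene, by = "gpl")
print(gds_gene)
```

The same pattern should apply to GSE or GSM records, given a data frame that pairs those accessions with a `gpl` column.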
Hi Sean,

The answer was really helpful. There are 438 GPLs in the annotgpl object, so I am using this to get the annotations for all of them:

    gpllist <- sapply(annotgpl[1:438,1],getGEO,AnnotGPL=TRUE)

But the result is a list object, and I am unable to access it with the Table function. Basically, I would like a table that maps each GPL to all of its gene IDs:

    GPL13 geneid1
    GPL13 geneid2
    GPL13 geneid3
    GPL14 geneid1
    GPL14 geneid2
    GPL14 geneid3

Thanks,
Rohan
On Mon, Nov 25, 2013 at 2:22 PM, rohan bareja wrote:

> gpllist <- sapply(annotgpl[1:438,1],getGEO,AnnotGPL=TRUE)
>
> But this is a list object, and I am unable to access this using a Table function.

Hi, Rohan. Yes, you'll need to apply the Table() method to each member of the gpllist; another sapply or an lapply would be useful here.

Sean
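The suggestion to apply Table() to each list member can be sketched as follows. The platform tables here are small hand-made data frames standing in for `Table(gpl)` results, so the example runs without downloading anything. `gene_id_map` is a hypothetical helper name, and the `as.character()` call is a defensive assumption in case the "Gene ID" column comes back as a factor.

```r
# Stand-ins for Table(gpl) output on two platforms; real tables have
# many more columns. The IDs are placeholders, not real annotations.
tables <- list(
  GPL13 = data.frame(ID = c("p1", "p2", "p3"),
                     `Gene ID` = c("geneA", "geneB", "geneC"),
                     check.names = FALSE, stringsAsFactors = FALSE),
  GPL14 = data.frame(ID = c("q1", "q2"),
                     `Gene ID` = c("geneD", "geneE"),
                     check.names = FALSE, stringsAsFactors = FALSE)
)

# Hypothetical helper: one row per (platform, gene ID) pair.
gene_id_map <- function(tables) {
  per_gpl <- lapply(names(tables), function(gpl) {
    ids <- as.character(tables[[gpl]][, "Gene ID"])
    data.frame(gpl = rep(gpl, length(ids)), gene_id = ids,
               stringsAsFactors = FALSE)
  })
  # Stack the per-platform pieces into a single long table.
  do.call(rbind, per_gpl)
}

gpl_gene <- gene_id_map(tables)
print(gpl_gene)
```

With real data, `tables` would be something like `lapply(gpllist, Table)`, assuming `gpllist` is a list of GPL objects named by accession as in the thread. Using `lapply` rather than `sapply` keeps the result a plain list, since `sapply` may simplify it to a matrix when all elements happen to have the same length.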
Hi Sean,

Thanks for your help! I am trying to get the gene IDs for the first 3 objects in gpllist, using this:

i)
    genes_new <- unlist(sapply(gpllist[1:3], function(a) {Table(a)[,'Gene ID']}))

These are the first 3 gene IDs reported for the first GPL object (GPL13) by the command above:

    "1"  "GPL13"
    "13" "GPL13"
    "26" "GPL13"

However, these gene IDs are completely different from the ones I get when I process each GPL object separately. Only when I run them separately do I get the correct gene IDs, the ones I can see in the gpllist object. The number of genes is the same in both cases (together vs. separately).

ii)
    genes_new <- unlist(sapply(gpllist[1], function(a) {Table(a)[,'Gene ID']}))

    "818888" "GPL13"
    "821523" "GPL13"
    "824405" "GPL13"

    genes_new <- unlist(sapply(gpllist[2], function(a) {Table(a)[,'Gene ID']}))
    genes_new <- unlist(sapply(gpllist[3], function(a) {Table(a)[,'Gene ID']}))

Do you know what the problem might be here?

Thanks,
Rohan
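The thread ends without a diagnosis. One possible cause, which is an assumption and is not confirmed anywhere in the thread, is that the "Gene ID" column is stored as a factor and that somewhere along the way its integer level codes were used in place of its labels. Small numbers such as "1", "13", and "26" would be consistent with level codes. The sketch below only illustrates that general R behavior on a made-up factor; it does not reproduce the exact commands above.

```r
# Made-up example: three gene IDs stored as a factor.
ids <- c("818888", "821523", "824405")
f <- factor(ids)

# Coercing a factor to integer yields the level codes, not the labels.
codes <- as.integer(f)

# Converting to character first recovers the original labels.
labels <- as.character(f)

print(codes)
print(labels)
```

If this is what happened, wrapping the column in `as.character()` before combining platforms, as in `as.character(Table(a)[,'Gene ID'])`, would be one way to avoid it.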