Question

GEOquery and different types of GPL annotation files

Peter ▴ 170

Does anyone know what the difference is between these two GEO GPL files?

GPL199.annot (540kb)
GPL199.soft (2166kb)

I downloaded the GPL199.annot file here; as far as I can tell it is the only file available on the FTP site with GPL199 in the name:

ftp://ftp.ncbi.nih.gov/pub/geo/data/geo/by_platform/annot/GPL199.annot.gz

If you browse a GDS file using this platform, e.g. GDS680, click on the download icon and select "Annotation SOFT file", you also end up at the same GPL199.annot.gz file:

http://www.ncbi.nlm.nih.gov/geo/gds/gds_browse.cgi?gds=680

On the other hand, GEOquery downloaded GPL199.soft from here:

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=self&acc=GPL199&form=text&view=full

It seems that GEOquery will load GPL199.soft happily, but fails on GPL199.annot (and other GPL*.annot files) like so:

> gpl <- getGEO(filename="c:/temp/geo/GPL199.annot")
Error in switch(as.character(first.entity[1]), sample = { :
        argument is missing, with no default

Thanks

Peter

P.S. Sean - Thanks for fixing the problem with missing geneNames in GEOquery_1.5.3

ADD COMMENT • link updated 18.3 years ago by Sean Davis 21k • written 18.3 years ago by Peter ▴ 170

score 0 · Answer 1 · 2006-01-20

Sean Davis 21k

On 1/20/06 9:27 AM, "Peter" <bioconductor-mailinglist at maubp.freeserve.co.uk> wrote:

> Does anyone know what the difference is between these two GEO GPL files?
>
> GPL199.annot (540kb)
> GPL199.soft (2166kb)

The Annotation Soft files are built by GEO staff when they build a GEO dataset. They use whatever public identifier they can in the submitted GPL to do lookups on their own of what the features on the array represent. They are NOT available for every GPL, only those that are attached to a GDS. They do not necessarily agree with the original submitted GPL. They are not currently handled by GEOquery.

However, as you noted in another post, Peter, the original GPLs as submitted by users are often larger than those built by GEO, so I haven't found a strong reason to work with the Annotation Soft files. In fact, I typically use the GPL information only for lookup of some primary key (GenBank accession, Affy ID, or something like that) and then build the annotation myself (or use a Bioconductor annotation package), as the methods used to generate annotation can be quite varied and the time since last update (in the case of GPLs, never updated) is important.

Hope that helps clarify things a bit.

Sean
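The "primary key" approach described above can be sketched in base R. This is only an illustration under stated assumptions: both data frames, the column name GB_ACC, and every probe ID, accession and gene symbol below are invented for the example and do not come from any real GPL or annotation package.

```r
# Hypothetical platform table: only the probe ID and one primary key are kept.
gpl_table <- data.frame(
  ID     = c("probe_1", "probe_2", "probe_3"),
  GB_ACC = c("ACC0001", "ACC0002", NA),
  stringsAsFactors = FALSE
)

# Hypothetical, separately maintained annotation keyed on the same accession.
my_annotation <- data.frame(
  GB_ACC = c("ACC0001", "ACC0002"),
  symbol = c("geneA", "geneB"),
  stringsAsFactors = FALSE
)

# Left join: keep every probe and attach annotation where the key matches.
annotated <- merge(gpl_table, my_annotation, by = "GB_ACC", all.x = TRUE)

# Restore a probe-oriented column order and sort by probe ID.
annotated <- annotated[order(annotated$ID), c("ID", "GB_ACC", "symbol")]
print(annotated)
```

Probes without a usable key (probe_3 here) stay in the result with a missing symbol, which makes gaps in the annotation easy to spot.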

ADD COMMENT • link 18.3 years ago Sean Davis 21k

Sean Davis wrote:

> The Annotation Soft files are built by GEO staff when they build a GEO
> dataset. They use whatever public identifier they can in the submitted GPL
> to do lookups on their own of what the features on the array represent.
> They are NOT available for every GPL, only those that are attached to a GDS.
> They do not necessarily agree with the original submitted GPL. They are not
> currently handled by GEOquery. However, as you noted in another post,
> Peter, the original GPLs as submitted by users are often larger than those
> built by GEO, so I haven't found a strong reason to work with the Annotation
> Soft files. In fact, I typically use the GPL information only for lookup of
> some primary key (genbank accession, affy id, or something like that) and
> then build the annotation myself (or use a bioconductor annotation package),
> as the methods used to generate annotation can be quite varied and the time
> since last update (in the case of GPLs, never updated) is important.
>
> Hope that helps clarify things a bit.

A bit - but with two different GPL files it's a little tricky following which one you mean at each point. Does the GEO team have some official terms for the two types?

I'm a little unclear which are the "Annotation Soft files are built by GEO staff" and which are the "original GPLs as submitted by users", but I think I have worked it out:

The GPL96 via the website (the 12MB file) does have GO terms, plus a list of experiments using the platform (GSM and GSE references):

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=self&acc=GPL96&form=text&view=full

I'm guessing this is the one built by GEO staff (due to the GSM and GSE references).

In the case of GPL96, the smaller 3MB file from the FTP site (GPL96.annot.gz) seems to have a lot of useful cross references (but no GO terms):

ftp://ftp.ncbi.nih.gov/pub/geo/data/geo/by_platform/annot/GPL96.annot.gz

Is this therefore based on the data Affy submitted to describe their human chip?

For my basic exploration of GEO files and microarray analysis, this smaller file is actually more useful - but it's not supported by GEOquery. Is adding support for this style of GPL file likely to be a "big job", do you think?

I do take your point that using a Bioconductor annotation package may be less hassle (I don't care to build my own annotation yet).

Thank you

Peter
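As a stop-gap for reading the data table out of such a file without GEOquery, something like the following base R sketch might do. It rests on an assumption that should be checked against a real download: that the table is tab-delimited, has a header row, and sits between "!platform_table_begin" and "!platform_table_end" marker lines. The function name read_soft_table is hypothetical, and the demo file contents are invented.

```r
# Read the data table from a SOFT-style platform file using base R only.
# ASSUMPTION: the table lies between "!platform_table_begin" and
# "!platform_table_end" and is tab-delimited with a header row.
read_soft_table <- function(path) {
  lines <- readLines(path)
  begin <- grep("^!platform_table_begin", lines, ignore.case = TRUE)
  end   <- grep("^!platform_table_end", lines, ignore.case = TRUE)
  if (length(begin) != 1 || length(end) != 1 || end <= begin + 1) {
    stop("could not locate a platform table in ", path)
  }
  body <- lines[(begin + 1):(end - 1)]
  # check.names = FALSE keeps column names such as "Gene symbol" unchanged.
  read.delim(text = paste(body, collapse = "\n"), header = TRUE, quote = "",
             comment.char = "", stringsAsFactors = FALSE, check.names = FALSE)
}

# Invented miniature file, used only to exercise the function.
demo <- tempfile(fileext = ".annot")
writeLines(c(
  "^Annotation",
  "!Annotation_platform = GPL_DEMO",
  "#ID = Platform reference identifier",
  "#Gene symbol = Gene name field extracted from Entrez Gene",
  "!platform_table_begin",
  "ID\tGene symbol",
  "probe_1\tgeneA",
  "probe_2\tgeneB",
  "!platform_table_end"
), demo)

tab <- read_soft_table(demo)
print(tab)
```

The sketch ignores the "^", "!" and "#" header lines entirely, so it returns only a plain data frame, not a GEOquery object.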

ADD REPLY • link 18.2 years ago Peter ▴ 170

score 0 · Answer 2 · 2006-01-20

I wrote:

> Does anyone know what the difference is between these two GEO GPL files?

It looks like the different files contain rather different annotation information (with very little overlap), i.e. one is not just a subset of the other. I suspect different users will have different preferences...

----------------------------------------------------------------------

Looking at the E. coli chip,

> GPL199.annot (540kb)
> GPL199.soft (2166kb)

The larger (.soft) file includes a list of all GSM and GSE references using the platform, and the following columns:

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=self&acc=GPL199&form=text&view=full

#ID = Affymetrix Probe Set ID ...
#ORF =
#Species Scientific Name = The genus and species of the ...
#Annotation Date = The date that the annotations for this ...
#SPOT_ID = Sequence Type: Indicates whether the sequence is ...
#Sequence Source = The database from which the sequence used ...
#Transcript ID(Array Design) =
#Representative Public ID = The accession number of a ...
#Alignments = Position of the alignment of the target sequence ...
#Gene Title = Title of Gene represented by the probe set.
#Gene Symbol = A gene symbol, when one is available (from UniGene).

The smaller (.annot) file includes the following columns:

ftp://ftp.ncbi.nih.gov/pub/geo/data/geo/by_platform/annot/GPL199.annot.gz

#ID = Platform reference identifier
#Gene = Description field extracted from Entrez Gene
#Unigene = Cluster ID extracted from Entrez UniGene
#UniGene title = UniGene title extracted from Entrez UniGene
#Nucleotide = Title extracted from Entrez Nucleotide
#Protein = Title extracted from Entrez Protein
#GI = GenBank identifier(s)
#GenBank Accession = GenBank accession(s)
#Gene symbol = Gene name field extracted from Entrez Gene
#Platform_CLONEID = CLONE_ID column from GEO Platform data table
#Platform_ORF = ORF column from GEO Platform data table
#Platform_SPOTID = SPOT_ID column from GEO Platform data table
#Platform_SPACC = SP_ACC column from GEO Platform data table
#Platform_PTACC = PT_ACC column from GEO Platform data table

----------------------------------------------------------------------

In the case of the HG-U133A human chip, the file size difference is much more significant (in terms of load times):

GPL96.annot (3115kb)
GPL96.soft (11979kb)

The larger (.soft) file includes a list of all GSM and GSE references using the platform, and the following columns:

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=self&acc=GPL96&form=text&view=full

#ID = Affymetrix Probe Set ID ...
#Species Scientific Name = The genus and species of the ...
#Annotation Date = The date that the annotations for ...
#GB_LIST = GenBank Accession Number ...
#SPOT_ID = Sequence Type: Indicates whether the sequence is ...
#Sequence Source = The database from which the sequence used ...
#Representative Public ID = The accession number of a ...
#Gene Title = Title of Gene represented by the probe set.
#Gene Symbol = A gene symbol, when one is available (from UniGene).
#Entrez Gene = Entrez Gene database UID ...
#RefSeq Transcript ID = References to multiple sequences in RefSeq. ...
#Gene Ontology Biological Process = ...
#Gene Ontology Cellular Component = ...
#Gene Ontology Molecular Function = ...

The smaller (.annot) file has the following different columns:

ftp://ftp.ncbi.nih.gov/pub/geo/data/geo/by_platform/annot/GPL96.annot.gz

#ID = Platform reference identifier
#Gene = Description field extracted from Entrez Gene
#Unigene = Cluster ID extracted from Entrez UniGene
#UniGene title = UniGene title extracted from Entrez UniGene
#Nucleotide = Title extracted from Entrez Nucleotide
#Protein = Title extracted from Entrez Protein
#GI = GenBank identifier(s)
#GenBank Accession = GenBank accession(s)
#Gene symbol = Gene name field extracted from Entrez Gene
#Platform_CLONEID = CLONE_ID column from GEO Platform data table
#Platform_ORF = ORF column from GEO Platform data table
#Platform_SPOTID = SPOT_ID column from GEO Platform data table
#Platform_SPACC = SP_ACC column from GEO Platform data table
#Platform_PTACC = PT_ACC column from GEO Platform data table
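The column comparison above can also be done mechanically from the "#Name = description" header lines. The following base R sketch shows one way, using a shortened subset of the GPL199 header lines listed in this post, with descriptions truncated. The helper name soft_columns is hypothetical, and comparing names case-insensitively is a choice made here, not something GEO prescribes.

```r
# Extract column names from SOFT-style "#Name = description" header lines.
soft_columns <- function(lines) {
  header <- grep("^#", lines, value = TRUE)
  # Keep the text between the leading "#" and the first "=".
  trimws(sub("^#([^=]*)=.*$", "\\1", header))
}

# Abbreviated subset of the GPL199.soft header (descriptions truncated).
soft_header <- c(
  "#ID = Affymetrix Probe Set ID",
  "#ORF =",
  "#SPOT_ID = Sequence Type",
  "#Gene Title = Title of Gene represented by the probe set.",
  "#Gene Symbol = A gene symbol, when one is available (from UniGene)."
)

# Abbreviated subset of the GPL199.annot header.
annot_header <- c(
  "#ID = Platform reference identifier",
  "#Gene = Description field extracted from Entrez Gene",
  "#Gene symbol = Gene name field extracted from Entrez Gene",
  "#Platform_ORF = ORF column from GEO Platform data table",
  "#Platform_SPOTID = SPOT_ID column from GEO Platform data table"
)

soft_cols  <- tolower(soft_columns(soft_header))
annot_cols <- tolower(soft_columns(annot_header))

shared     <- intersect(soft_cols, annot_cols)  # columns named in both files
soft_only  <- setdiff(soft_cols, annot_cols)    # only in the .soft file
annot_only <- setdiff(annot_cols, soft_cols)    # only in the .annot file

print(list(shared = shared, soft_only = soft_only, annot_only = annot_only))
```

A shared column name says nothing about shared content: the two files may fill a same-named column from different sources.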