Question

Annotation from Submitter GPL in GSE dataset

0

Entering edit mode

PyPer ▴ 20

@pyper-6819

Last seen 9.5 years ago

Australia

I've recently started trying to process data from the GEO datasets that are publically available.
Unfortunately my R skills are still quite not up to par.

I've been working with a specific set in particular GSE3933
the code is simply to extract the expression for each sample then match it with the appropriate gene symbol/annotation.

gse3933 <- getGEO('GSE3933',GSEMatrix=TRUE)
gpl <- annotation(gse3933[[3]])
platf <- getGEO(gpl, AnnotGPL=TRUE)
anot <- data.frame(attr(dataTable(platf), "table"))
Gmicro <- exprs(gse3933[[3]])

My problem arises in two parts:

Part 1
I've accessed the user submitted annotation which is GPL3289 - this has ten variables. However there is no variable which is "gene symbol". My guess is that the GB_List variable column can be converted to Gene symbols/ Gene ID's - however I have not been able to figure out how to do this.

Part 2
I then want to match the Gene Symbols/Probes with the appropriate expression values from the matrix 'Gmicro' as per the code. Is this simply matching the ID's from the matrix generated by using the exprs function to those generated using dataTable?

annotation geoquery • 5.5k views

ADD COMMENT • link updated 9.6 years ago by James W. MacDonald 65k • written 9.6 years ago by PyPer ▴ 20

score 3 · Accepted Answer · 2014-10-06

3

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 12 hours ago

United States

The GB_list column contains comma-separated GenBank IDs, which you can convert to symbol and Gene IDs using the org.Hs.eg.db package. Unfortunately there are multiple IDs per row, so you will have to decide how you want to resolve that. But in general, you use the select() function.

And note the annotation already comes with the GSE that you downloaded, and can be found in the featureData slot. But I think you will be sorely disappointed. Those GenBank IDs look like they are all IMAGE clones that will probably take quite a bit of digging on your part to annotate. As an example:

> gbs1 <- pData(featureData(gse3933[[1]]))$GB_LIST
> ind <- rep(1:length(gbs1), sapply(strsplit(as.character(gbs1), ","), length))
> gbs1 <- unlist(strsplit(as.character(gbs1), ","))
> mapped <- select(org.Hs.eg.db, gbs1, c("ENTREZID","SYMBOL"), "ACCNUM")
> apply(mapped[,2:3], 2, function(x) sum(!is.na(x)))
ENTREZID   SYMBOL
     546      546

I created the 'ind' vector so I could map things back to the original data (accounting for the replicates), but no matter; most of these are ESTs and they don't appear to have either symbols or Entrez Gene IDs (there are over 70k IDs that I tried to map, and only 546 were successful).

So that is how you would annotate, given the infrastructure in Bioconductor.

Best,

Jim

ADD COMMENT • link 9.6 years ago James W. MacDonald 65k

0

Entering edit mode

Thanks for the answer! very helpful. Just a couple of follow up questions

From the code you've provided, you've matched the GB_LIST to the known gene names. I'm still a bit confused about the GB_LIST. If these are known Genebank accession numbers, shouldn't they easily match? In other words, why do certain GB_LIST not 'exist' in the database.

My other question as per before is simply:

Does the ID column in featureData correspond with that found in AssayData? so e.g. 550 matches 550? (seems obvious, but want to check for surety)

Thank you so much again for the answer and your time.

ADD REPLY • link 9.6 years ago PyPer ▴ 20

0

Entering edit mode

Just because a sequence is found in GenBank doesn't imply that there will be a gene. As I mentioned, these are ESTs, which IIRC can be roughly translated as 'A sequence some person found in a tissue or cell culture this one time so we put it in the database.'

Or more formally, from NCBI:

"The Nucleotide, Genome Survey Sequence (GSS), and Expressed Sequence Tag (EST) database all contain nucleic acid sequences. The data in GSS and EST are from two large bulk sequence divisions of GenBank. GSS and EST data are typically uncharacterized, short genomic (GSS) or cDNA (EST) sequences."

As an example, one of the first GenBank IDs is AI251547, which if you do a Google search, should come up with this as one of the top hits (it's first for me):

http://www.ncbi.nlm.nih.gov/nucest/AI251574,AI792844,AI733932

And if you follow the links, you will see that this is a clone that was pulled out of an ovary cancer cell line by a CGAP researcher way back in 1998. It has a GenBank accession number and a GI number, but that's it. You can take the sequence and blat it at UCSC and find out that it doesn't really align anywhere.

ADD REPLY • link 9.6 years ago James W. MacDonald 65k