Question

How to identify the microarray annotation library automatically from a GEO study ID?

lijin.abc • 0

@lijinabc-10648

Last seen 8.0 years ago

Hi all,

From a GEO study ID, I can find the ID of the microarray platform used, and from that identify the annotation library that maps the probe IDs in the CEL files to gene symbols. For example, for GEO study GSE6731,

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6731

The platform used is `[HG_U95Av2] Affymetrix Human Genome U95 Version 2 Array`, and in Bioconductor I can then find the annotation library `hgu95av2.db`.

I would like to find the annotation library automatically, given only the GEO study ID. Do you know how I could do this?

Best regards,

Jin

limma microarray annotation bioconductor annotate • 1.5k views

ADD COMMENT • link updated 8.0 years ago by James W. MacDonald 65k • written 8.0 years ago by lijin.abc • 0

score 0 · Answer 1 · 2016-05-06

I'm not sure there is a universally workable method to do this. Assuming that GEO is strict about the format of the platform title (the array's short name in square brackets, followed by the full name), you could hypothetically do something like

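library(GEOquery)

# Guess the annotation package name from the title of a GEO platform (GPL)
# object, e.g. "[HG_U95Av2] Affymetrix ..." gives "hgu95av2.db"
fun <- function(gplobj){
    fulltitle <- Meta(gplobj)$title
    # keep the text between the square brackets
    title <- strsplit(fulltitle, "\\[|\\]")[[1]][2]
    # lowercase, drop "_", "-" and spaces, then append ".db"
    title <- paste0(gsub("_|-| ", "", tolower(title)), ".db")
    title
}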

And then

> eset <- getGEO("GSE6731")[[1]]
> fun(getGEO(annotation(eset)))
[1] "hgu95av2.db"
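
A slightly more defensive variant of the same idea, as a rough sketch only: it works on the title string directly, returns `NA` when the title has no bracketed short name, and only trusts the guess if a package with that name is actually installed. The helper name `guess_annotation_pkg` is made up for illustration and is not part of GEOquery or any other package.

```r
# Hypothetical helper (not from any package): guess an annotation package
# name from a GEO platform title string.
guess_annotation_pkg <- function(fulltitle) {
    # No bracketed short name in the title: nothing to guess from
    if (!grepl("\\[[^\\]]+\\]", fulltitle, perl = TRUE)) {
        return(NA_character_)
    }
    # Extract the text inside the first pair of square brackets
    short <- sub("^[^\\[]*\\[([^\\]]+)\\].*$", "\\1", fulltitle, perl = TRUE)
    # Lowercase, drop "_", "-" and spaces, then append ".db"
    paste0(gsub("[_ -]", "", tolower(short)), ".db")
}

pkg <- guess_annotation_pkg("[HG_U95Av2] Affymetrix Human Genome U95 Version 2 Array")

# Only trust the guess if a package with that name is installed locally
if (!is.na(pkg) && requireNamespace(pkg, quietly = TRUE)) {
    message("Using annotation package: ", pkg)
} else {
    message("No installed package matches the guess: ", pkg)
}
```

With a real platform object, the title string would come from `Meta(getGEO(annotation(eset)))$title`, as in the function above.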

But this title-based approach is likely to be really fragile, and it will not work for any non-Affy array, so YMMV. It might be easier to just go to GEO, see what array was used, and infer the correct annotation package. Alternatively, use the annotations that come automatically in the featureData slot of the resulting ExpressionSet. For example,

> head(pData(featureData(eset))[,c(1,2,10,11,12,13)])
                 ID GB_ACC
1000_at     1000_at X60188
1001_at     1001_at X60957
1002_f_at 1002_f_at X65962
1003_s_at 1003_s_at X68149
1004_at     1004_at X68149
1005_at     1005_at X68277
                                                               Gene Title
1000_at                                mitogen-activated protein kinase 3
1001_at   tyrosine kinase with immunoglobulin-like and EGF-like domains 1
1002_f_at          cytochrome P450, family 2, subfamily C, polypeptide 19
1003_s_at                              chemokine (C-X-C motif) receptor 5
1004_at                                chemokine (C-X-C motif) receptor 5
1005_at                                    dual specificity phosphatase 1
          Gene Symbol ENTREZ_GENE_ID
1000_at         MAPK3           5595
1001_at          TIE1           7075
1002_f_at     CYP2C19           1557
1003_s_at       CXCR5            643
1004_at         CXCR5            643
1005_at         DUSP1           1843
                                 RefSeq Transcript ID
1000_at   NM_001040056 /// NM_001109891 /// NM_002746
1001_at                                     NM_005424
1002_f_at                                   NM_000769
1003_s_at                     NM_001716 /// NM_032966
1004_at                       NM_001716 /// NM_032966
1005_at                                     NM_004417

You can modify these annotations, and tools like limma will then use them automatically to annotate your results. The downside of this approach is that GEO seems to just upload the Affy CSV annotation file as is, with multiple entries packed into one field and separated by `///` (see the RefSeq Transcript ID column), which makes the annotations more difficult to work with.
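
One way to tame the `///`-separated fields, as a minimal sketch: either keep only the first entry or split each field into all of its entries. The helper names `first_entry` and `all_entries` are made up for illustration, and keeping the first entry is an arbitrary choice that may not suit every analysis.

```r
# Hypothetical helpers (not from any package) for "///"-separated fields.

# Keep only the first entry of each field
first_entry <- function(x) {
    trimws(sub("///.*$", "", as.character(x)))
}

# Split each field into a character vector of all its entries
all_entries <- function(x) {
    lapply(strsplit(as.character(x), "///", fixed = TRUE), trimws)
}

refseq <- c("NM_001040056 /// NM_001109891 /// NM_002746", "NM_005424")
first_entry(refseq)
all_entries(refseq)[[1]]

# Applied to the ExpressionSet above, assuming the column is named as printed:
# fData(eset)[["RefSeq Transcript ID"]] <- first_entry(fData(eset)[["RefSeq Transcript ID"]])
```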