Question

GEO MicroArray Download, Annotation, and Main Probe Filtering

0

Entering edit mode

anthonycolombo60 • 0

@anthonycolombo60-8475

Last seen 2.7 years ago

United States

Hi.

This is a similiar issue reported here : annotating microarray data with mogene10stv1

So I'm annotating a microarray .CEL data pulled from GEO by an accession number using GEOquery, and am unable to annotate the resultant expression set using BiomaRt (using the respective hg_u133a_2.db). The error is that I receive all NAs.

The first issue is that I'm unable to use affycoretools to filter the main probes.

qus<-getGEO("GSE30550")[[1]]

getMainProbes(qus)
Error: The file GPL9188 does not appear to have been processed using the oligo package.
In this case the argument to this function should be the name of the correct pd.info package (e.g., pd.hugene.1.0.st.v1.

traceback()
2: stop(paste("The file", pdinfo, "does not appear to have been processed using",
       "the oligo package.\nIn this case the argument to this function should",
       "be the name of the correct pd.info package (e.g., pd.hugene.1.0.st.v1.\n"),
       call. = FALSE)
1: getMainProbes(qus)

for an attempt with biomaRt

here the affyids are the featureNames(qus)

maRt<-getBM(attributes=c('affy_hg_u133a_2','entrezgene'),filters='affy_hg_u133a_2',values=affyids,mart=ensembl)

maRt[affyids %in% maRt[,1],]

   affy_hg_u133a   entrezgene
NA             <NA>         NA
NA.1           <NA>         NA
NA.2           <NA>         NA
NA.3           <NA>         NA
NA.4           <NA>         NA
NA.5           <NA>         NA
NA.6           <NA>         NA
NA.7           <NA>         NA
NA.8           <NA>         NA
NA.9           <NA>         NA
NA.10          <NA>         NA
NA.11          <NA>         NA
NA.12          <NA>         NA
NA.13          <NA>         NA
NA.14          <NA>         NA
NA.15          <NA>         NA
NA.16          <NA>         NA
NA.17          <NA>         NA
NA.18          <NA>         NA
NA.19          <NA>         NA
NA.20          <NA>         NA

so i'm getting a bunch of control probes returned without any gene names. and am not able to use the affycoretools

qus
ExpressionSet (storageMode: lockedEnvironment)
assayData: 11961 features, 268 samples
element names: exprs
protocolData: none
phenoData
sampleNames: GSM757898 GSM757899 ... GSM758165 (268 total)
varLabels: title geo_accession ... data_row_count (35 total)
varMetadata: labelDescription
featureData
featureNames: 10000_at 10001_at ... 9_at (11961 total)
fvarLabels: ID Gene_ID Description
fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
Annotation: GPL9188
>

I've checked that the annotation of GPL9188 is corresponding to hgu133a version 2 chip. Yet I'm having trouble annotating the correct probes.

The alternate solution is to build the expression set using the affy package, and oligio package, however I was hoping the GEOquery package could build it faster. any suggestions greatly appreciated.

here's my info

sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] tools stats4 parallel stats graphics grDevices utils
[8] datasets methods base

other attached packages:
[1] GEOquery_2.36.0            biomaRt_2.26.1
[3] artemisData_0.21           matrixStats_0.50.1
[5] artemis_0.42.5             beeswarm_0.2.1
[7] SummarizedExperiment_1.0.2 TxDbLite_1.9.100
[9] stringdist_0.9.4.1         knitr_1.12
[11] rtracklayer_1.30.1         ensembldb_1.2.1
[13] GenomicFeatures_1.22.8     AnnotationDbi_1.32.3
[15] GenomicRanges_1.22.3       RSQLite_1.0.0
[17] DBI_0.3.1                  Biobase_2.30.0
[19] qusage_2.2.0               limma_3.26.5
[21] GenomeInfoDb_1.6.1         IRanges_2.4.6
[23] S4Vectors_0.8.7            BiocGenerics_0.16.1
[25] jsonlite_0.9.19

loaded via a namespace (and not attached):
[1] TH.data_1.0-6                colorspace_1.2-6
[3] hwriter_1.3.2                estimability_1.1-1
[5] qvalue_2.2.2                 futile.logger_1.4.1
[7] XVector_0.10.0               interactiveDisplayBase_1.8.0
[9] mvtnorm_1.0-4                codetools_0.2-14
[11] splines_3.2.3                R.methodsS3_1.7.0
[13] DESeq_1.22.1                 geneplotter_1.48.0
[15] pathview_1.10.1              Rsamtools_1.22.0
[17] annotate_1.48.0              png_0.1-7
[19] R.oo_1.19.0                  graph_1.48.0
[21] shiny_0.13.0                 httr_1.0.0
[23] lsmeans_2.21-1               Matrix_1.2-3
[25] htmltools_0.3                coda_0.18-1
[27] gtable_0.1.2                 reshape2_1.4.1
[29] ShortRead_1.28.0             Rcpp_0.12.3
[31] Biostrings_2.38.3            gdata_2.17.0
[33] nlme_3.1-123                 stringr_1.0.0
[35] mime_0.4                     gtools_3.5.0
[37] XML_3.98-1.3                 erccdashboard_1.4.0
[39] AnnotationHub_2.2.3          edgeR_3.12.0
[41] zlibbioc_1.16.0              MASS_7.3-45
[43] zoo_1.7-12                   scales_0.3.0
[45] aroma.light_3.0.0            BiocInstaller_1.20.1
[47] RBGL_1.46.0                  KEGGgraph_1.28.0
[49] sandwich_2.3-4               rhdf5_2.14.0
[51] QuasiSeq_1.0-8               lambda.r_1.1.7

microarray affy biomart • 1.5k views

ADD COMMENT • link 8.2 years ago anthonycolombo60 • 0

0

Entering edit mode

anthonycolombo60 • 0

@anthonycolombo60-8475

Last seen 2.7 years ago

United States

okay thank you

I was able to solve using the oligo package as well.

celFiles<- list.celfiles(getwd(),listGzipped=TRUE)
affyRaw<-read.celfiles(celFiles)
library(pd.hg.u133a.2)
eset<-rma(affyRaw)

Annot<-data.frame(ACCNUM=sapply(contents(hgu133a2ACCNUM),
paste,
collapse=", "),
SYMBOL=sapply(contents(hgu133a2SYMBOL),paste,
collapse=", "))

my_frame<-data.frame(exprs(eset))

all<-merge(Annot,my_frame,by.x=0,by.y=0,all=T)

heres all the genes

head(all[,1:4])
Row.names ACCNUM SYMBOL GSM757898.CEL.gz
1 1007_s_at U48705     NA         8.057068
2   1053_at M87338   RFC2         7.490671
3    117_at X51757 HSPA6        10.315676
4    121_at X69699   PAX8         8.606068
5 1255_g_at L36861 GUCA1A         4.249521
6   1294_at L13852     NA         9.783643

ADD COMMENT • link 8.2 years ago anthonycolombo60 • 0

0

Entering edit mode

I'm not sure where you came up with the idea to use contents(), but that is a sub-optimal way to proceed. The data in a ChipDb package haven't been in environments for years now, and although they are backwards-compatible with environment based functions, it is still better to use the modern API. One of the distinct disadvantages is that the environment based API returns NA in the face of one-to-many mappings. As an example,

> annot <- as.data.frame(lapply(c("ENTREZID","ACCNUM","REFSEQ","SYMBOL"), function(x) mapIds(hgu133a2.db, keys(hgu133a2.db), x, "PROBEID")))
> names(annot) <- c("ENTREZID","ACCNUM","REFSEQ","SYMBOL")
> head(annot)
          ENTREZID   ACCNUM       REFSEQ SYMBOL
1007_s_at      780 AAA02866 NM_001202521   DDR1
1053_at       5982 AAB09786 NM_001278791   RFC2
117_at        3310 AAH04279    NM_002155  HSPA6
121_at        7849 AAA03539    NM_003466   PAX8
1255_g_at     2978 AAA60541    NM_000409 GUCA1A
1294_at       7318 AAA75388    NM_003335   UBA7

ADD REPLY • link 8.2 years ago James W. MacDonald 65k

score 2 · Accepted Answer · 2016-01-19

2

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 6 hours ago

United States

There are no 'main' probes for this array. That is a concept for the WT array series only. In addition, there is no need to use biomaRt to annotate this array. You can use the hgu133a2.db package instead. If you are already using affycoretools, something like

annotation(qus) <- "hgu133a2.db"
library(BiocInstaller)
library(affycoretools)
biocLite("hgu133a2.db")
library(hgu133a2.db)
qus <- annotateEset(qus)

Will populate the featureData slot of the ExpressionSet with Entrez Gene, HUGO symbol, and gene name. And then if you use limma to fit the model, your topTable will get annotated automatically.

ADD COMMENT • link 8.2 years ago James W. MacDonald 65k

0

Entering edit mode

I just remembered that annoateEset is in the devel version of affycoretools, so unless you want to use devel tools, you will have to use select() or mapIds() to generate a data.frame of annotations that you can then use.

ADD REPLY • link 8.2 years ago James W. MacDonald 65k

0

Entering edit mode

If I use the select() method the symbols are NA

test<-select(hgu133a2.db,affyids,"SYMBOL","PROBEID")

head(test)
   PROBEID SYMBOL
1 10000_at   <NA>
2 10001_at   <NA>
3 10002_at   <NA>
4 10003_at   <NA>
5 10004_at   <NA>
6 10005_at   <NA>
>

the amount of NA is so many.

ADD REPLY • link 8.2 years ago anthonycolombo60 • 0

0

Entering edit mode

I solved it using this

library(affy)
library(oligo)

celFiles<- list.celfiles(getwd(),listGzipped=TRUE)
affyRaw<-read.celfiles(celFiles)
#library(pd.hg.u133a.2)
eset<-rma(affyRaw)
library(hgu133a2.db)
Annot<-data.frame(ACCNUM=sapply(contents(hgu133a2ACCNUM),
paste,
collapse=", "),
SYMBOL=sapply(contents(hgu133a2SYMBOL),paste,
collapse=", "))

indx<-featureNames(eset) %in% rownames(Annot)

qus<-eset[!duplicated(Annot[indx,2]),]
featureNames(qus)<-Annot[!duplicated(Annot[indx,2]),2]
qus<-qus[featureNames(qus[which(featureNames(qus)!="NA"),]),]

there was issues with rownames not being unique. not sure if this is a good idea to squash out non-unique genenames' probe ID.

open to suggestions

ADD REPLY • link 8.2 years ago anthonycolombo60 • 0

0

Entering edit mode

The GEO entry you are analyzing is based on the NCBI re-mapped CDF, where they re-mapped the data to Entrez Gene. So all of the probesets are of the form <entrezgeneid>_at.

> annot <- as.data.frame(lapply(c("ACCNUM","SYMBOL"), function(x) mapIds(hgu133a2.db, gsub("_at","",featureNames(gse)), x, "ENTREZID")))
> names(annot) <- c("ACCNUM","SYMBOL")
> head(annot)
        ACCNUM   SYMBOL
10000 AAD24196     AKT3
10001 AAB84363     MED6
10002 AAD28301    NR2E3
10003 AAH27594  NAALAD2
10004 AAB87645 NAALADL1
10005 AAB71665    ACOT8

ADD REPLY • link 8.2 years ago James W. MacDonald 65k