Dear Community,
As part of a project validating mutational signatures, I am working with a processed microarray expression dataset from GEO, in order to evaluate the expression of specific genes in different human cell lines of a specific cancer type. For a quick assessment, I tried to download and analyze the processed data: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE73160
library(GEOquery)
library(huex10sttranscriptcluster.db)
library(affycoretools)
gse <- getGEO("GSE73160", GSEMatrix = TRUE)
gg <- gse[[1]]
gg
ExpressionSet (storageMode: lockedEnvironment)
assayData: 284258 features, 76 samples
element names: exprs
protocolData: none
phenoData
sampleNames: GSM1887886 GSM1887887 ... GSM1887961 (76 total)
varLabels: title geo_accession ... cell type:ch1 (31 total)
varMetadata: labelDescription
featureData
featureNames: 2315252 2315253 ... 4054807 (284258 total)
fvarLabels: ID transcript_cluster_id ... SPOT_ID (12 total)
fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
pubMedIds: 27247353
Annotation: GPL11028
pData(gg)$data_processing[[1]]
[1] Probe set summarization and normalization with the core annotated probes was performed by robust multi-array averaging (RMA) and summarized at the gene level using AROMA.
Levels: Probe set summarization and normalization with the core annotated probes was performed by robust multi-array averaging (RMA) and summarized at the gene level using AROMA.
head(fData(gg))
ID transcript_cluster_id gene_symbol cytoband mRNA_accession GB_ACC chromosome RANGE_GB RANGE_STRAND RANGE_START RANGE_STOP SPOT_ID
2315252 2315252 2315251 NA NA 2315252
2315253 2315253 2315251 NA NA 2315253
2315374 2315374 2315373 NA NA 2315374
2315375 2315375 2315373 NA NA 2315375
2315376 2315376 2315373 NA NA 2315376
2315377 2315377 2315373 NA NA 2315377
However, when I tried to add gene symbol annotation with affycoretools, the first rows came back without any annotation:
eset.rma <- annotateEset(gg, huex10sttranscriptcluster.db)
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
head(fData(eset.rma))
PROBEID ENTREZID SYMBOL GENENAME
2315252 2315252 <NA> <NA> <NA>
2315253 2315253 <NA> <NA> <NA>
2315374 2315374 <NA> <NA> <NA>
2315375 2315375 <NA> <NA> <NA>
2315376 2315376 <NA> <NA> <NA>
2315377 2315377 <NA> <NA> <NA>
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] huex10sttranscriptcluster.db_8.7.0 org.Hs.eg.db_3.8.2 AnnotationDbi_1.46.1 IRanges_2.18.3
[5] S4Vectors_0.22.1 affycoretools_1.56.0 GEOquery_2.52.0 Biobase_2.44.0
[9] BiocGenerics_0.30.0
So what might be the issue here? Should I instead start from the raw data and preprocess it with oligo?
Thank you in advance,
Efstathios
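One way to narrow this down is to check whether the feature IDs are valid keys in the annotation package at all. The sketch below is only an illustration under stated assumptions, not a confirmed diagnosis: `as_id`, `key_overlap`, and `check_eset_keys` are hypothetical helper names, and it assumes the package is queried with the `PROBEID` key type and that the `transcript_cluster_id` column shown by `head(fData(gg))` is the relevant alternative ID. Note that in that output the `ID` and `transcript_cluster_id` values differ (for example 2315252 versus 2315251).

```r
# Convert IDs to character without scientific notation (e.g. 3000000, not "3e+06").
as_id <- function(x) {
  if (is.numeric(x)) format(x, scientific = FALSE, trim = TRUE) else as.character(x)
}

# Fraction of `ids` that occur in `valid_keys`. Works on plain vectors,
# so it can be tried without any Bioconductor package installed.
key_overlap <- function(ids, valid_keys) {
  mean(as_id(ids) %in% as_id(valid_keys))
}

# Toy illustration only: the "key set" here is made up from the two
# transcript_cluster_id values printed in the post, not taken from the package.
key_overlap(c(2315252, 2315253), c("2315251", "2315373"))

# Hypothetical wrapper for the real objects. Assumes `eset` is an ExpressionSet
# with a `transcript_cluster_id` feature column and `db` is a ChipDb annotation
# package such as huex10sttranscriptcluster.db.
check_eset_keys <- function(eset, db) {
  pkg_keys <- AnnotationDbi::keys(db, keytype = "PROBEID")
  c(
    featureNames          = key_overlap(Biobase::featureNames(eset), pkg_keys),
    transcript_cluster_id = key_overlap(Biobase::fData(eset)$transcript_cluster_id, pkg_keys)
  )
}
```

Calling `check_eset_keys(gg, huex10sttranscriptcluster.db)` would return two fractions. If the first is close to 0 while the second is close to 1, the feature names are probably not the kind of ID this package expects, which would explain the all-NA result without any need to reprocess the raw data.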
Have you looked at the annotations for some other probe IDs? Thousands of probe IDs are "controls" and so will have no annotation. Something like
sum(!is.na(fData(eset.rma)$ENTREZID))
might be helpful to see.
Dear Sean,
thank you for your suggestion. Unfortunately, the count came back as 0. I also tried the AnnotGPL option when downloading with getGEO, but that returned an error as well.
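Since the exact error text is not shown, a small sketch for capturing it may help when following up. `capture_error` and `fetch_with_gpl_annotation` are hypothetical helper names; the only GEOquery call used is `getGEO` with its `GSEMatrix` and `AnnotGPL` arguments. As far as I know, annotation GPL files are not available for every platform, so an error here would not necessarily point to a problem with the data itself.

```r
# Evaluate an expression; return its value, or the error message as a string
# if it fails. Relies on lazy evaluation, so `expr` runs inside tryCatch().
capture_error <- function(expr) {
  tryCatch(expr, error = function(e) conditionMessage(e))
}

# Hypothetical wrapper: request the series with the annotation GPL and report
# either the downloaded object or the error message. Needs GEOquery and
# network access when actually called.
fetch_with_gpl_annotation <- function(accession) {
  capture_error(GEOquery::getGEO(accession, GSEMatrix = TRUE, AnnotGPL = TRUE))
}
```

Running `fetch_with_gpl_annotation("GSE73160")` and pasting the returned message into the thread would make the AnnotGPL failure easier to diagnose.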