Annotation issue while analyzing processed Affymetrix Human Exon 1.0 ST Arrays from GEOquery
svlachavas ▴ 780
@svlachavas-7225
Germany/Heidelberg/German Cancer Resear…

Dear Community,

As part of a project validating mutational signatures, I want to evaluate the expression of specific genes in different human cell lines of a specific cancer type, using a microarray expression dataset from GEO. For a quick assessment, I tried to download and analyze the processed data: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE73160

library(GEOquery)
library(huex10sttranscriptcluster.db)
library(affycoretools)

gse <- getGEO("GSE73160", GSEMatrix = TRUE)

gg <- gse[[1]]

gg
ExpressionSet (storageMode: lockedEnvironment)
assayData: 284258 features, 76 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM1887886 GSM1887887 ... GSM1887961 (76 total)
  varLabels: title geo_accession ... cell type:ch1 (31 total)
  varMetadata: labelDescription
featureData
  featureNames: 2315252 2315253 ... 4054807 (284258 total)
  fvarLabels: ID transcript_cluster_id ... SPOT_ID (12 total)
  fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
  pubMedIds: 27247353 
Annotation: GPL11028

pData(gg)$data_processing[[1]]
[1] Probe set summarization and normalization with the core annotated probes was performed by robust multi-array averaging (RMA) and summarized at the gene level using AROMA.
Levels: Probe set summarization and normalization with the core annotated probes was performed by robust multi-array averaging (RMA) and summarized at the gene level using AROMA.

head(fData(gg))
             ID transcript_cluster_id gene_symbol cytoband mRNA_accession GB_ACC chromosome RANGE_GB RANGE_STRAND RANGE_START RANGE_STOP SPOT_ID
2315252 2315252               2315251                                                                                      NA         NA 2315252
2315253 2315253               2315251                                                                                      NA         NA 2315253
2315374 2315374               2315373                                                                                      NA         NA 2315374
2315375 2315375               2315373                                                                                      NA         NA 2315375
2315376 2315376               2315373                                                                                      NA         NA 2315376
2315377 2315377               2315373                                                                                      NA         NA 2315377

But when I tried to use affycoretools to add gene symbol annotation, every annotation column came back as NA:

eset.rma <- annotateEset(gg, huex10sttranscriptcluster.db)
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns

head(fData(eset.rma))
        PROBEID ENTREZID SYMBOL GENENAME
2315252 2315252     <NA>   <NA>     <NA>
2315253 2315253     <NA>   <NA>     <NA>
2315374 2315374     <NA>   <NA>     <NA>
2315375 2315375     <NA>   <NA>     <NA>
2315376 2315376     <NA>   <NA>     <NA>
2315377 2315377     <NA>   <NA>     <NA>

sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] huex10sttranscriptcluster.db_8.7.0 org.Hs.eg.db_3.8.2                 AnnotationDbi_1.46.1               IRanges_2.18.3                    
[5] S4Vectors_0.22.1                   affycoretools_1.56.0               GEOquery_2.52.0                    Biobase_2.44.0                    
[9] BiocGenerics_0.30.0

What might be the issue here? Should I instead start from the raw data and preprocess it with oligo?

Thank you in advance,

Efstathios

affycoretools huex10sttranscriptcluster.db affymetrix Geoquery • 254 views

Have you taken a look at some other probeid annotations? Thousands of ProbeIDs are "controls" so will have no annotation. Something like sum(!is.na(fData(eset.rma)$ENTREZID)) might be helpful to see.
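As a minimal sketch of that check, the snippet below counts annotated features on a toy data frame standing in for fData(eset.rma). The rows, including the `hypothetical_control` ID, are made up for illustration. A count of exactly zero, as opposed to a partial count, would suggest that the feature IDs do not match the annotation package at all, which is a different situation from a subset of control probes lacking annotation.

```r
# Toy stand-in for fData(eset.rma); these rows are made up for illustration.
fd <- data.frame(
  PROBEID  = c("2315252", "2315253", "2315374", "hypothetical_control"),
  ENTREZID = c("729759", "729759", "400728", NA),
  stringsAsFactors = FALSE
)

# Number of features that received an Entrez ID
n_annotated <- sum(!is.na(fd$ENTREZID))

# The same information expressed as a fraction of all features
frac_annotated <- mean(!is.na(fd$ENTREZID))

n_annotated
frac_annotated
```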


Dear Sean,

thank you for your suggestion. Unfortunately, it returned 0. I also tried the AnnotGPL option when downloading with getGEO, but it returned an error.

@james-w-macdonald-5106
United States

The GEO entry indicates those data have been summarized at the 'exon' level, not the transcript level.

> gse <- annotateEset(gse, huex10stprobeset.db)
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
> apply(fData(gse), 2, function(x) sum(!is.na(x))/length(x))
  PROBEID  ENTREZID    SYMBOL  GENENAME 
1.0000000 0.9896186 0.9896186 0.9896186 
> head(fData(gse))
        PROBEID ENTREZID SYMBOL
2315252 2315252   729759 OR4F29
2315253 2315253   729759 OR4F29
2315374 2315374   400728 FAM87B
2315375 2315375   400728 FAM87B
2315376 2315376   400728 FAM87B
2315377 2315377   400728 FAM87B
                                                 GENENAME
2315252 olfactory receptor family 4 subfamily F member 29
2315253 olfactory receptor family 4 subfamily F member 29
2315374       family with sequence similarity 87 member B
2315375       family with sequence similarity 87 member B
2315376       family with sequence similarity 87 member B
2315377       family with sequence similarity 87 member B
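One way to check which annotation package matches the feature IDs before annotating is to see what fraction of the IDs appear among each package's keys. The sketch below uses toy vectors whose values are borrowed from the fData output in the question (the ID column for probesets, the transcript_cluster_id column for transcript clusters). The helper `fraction_matched` is hypothetical, and with the real objects the inputs would presumably come from `featureNames(gg)` and `keys()` called on each annotation package; that usage is an assumption, not something shown in this thread.

```r
# Hypothetical helper: fraction of feature IDs found among a set of keys.
fraction_matched <- function(feature_ids, package_keys) {
  mean(as.character(feature_ids) %in% as.character(package_keys))
}

# Toy stand-ins. With the real objects these would presumably be
#   feature_ids     <- featureNames(gg)
#   probeset_keys   <- keys(huex10stprobeset.db)
#   transcript_keys <- keys(huex10sttranscriptcluster.db)
feature_ids     <- c("2315252", "2315253", "2315374", "2315375")
probeset_keys   <- c("2315252", "2315253", "2315374", "2315375", "2315376")
transcript_keys <- c("2315251", "2315373")

# A value near 1 suggests the IDs belong to that package's ID space.
match_probeset   <- fraction_matched(feature_ids, probeset_keys)
match_transcript <- fraction_matched(feature_ids, transcript_keys)

match_probeset
match_transcript
```

In this toy case the IDs all match the probeset keys and none match the transcript cluster keys, which mirrors the all-NA result seen with huex10sttranscriptcluster.db.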


Dear James,

thank you very much for your valuable comments. Indeed, I was a little confused about the huex10stprobeset.db and huex10sttranscriptcluster.db databases when I read above that "Probe set summarization and normalization with the core annotated probes was performed by robust multi-array averaging (RMA) and summarized at the gene level using AROMA".

1) Above, you meant the gg object and not gse, since the initial gse is a list, right?

2) For my downstream purpose of checking the expression of specific genes, would something like the following suffice to keep unique probesets/genes?

xx <- fData(gg)
xx2 <- xx[!duplicated(xx$SYMBOL),]
xx2 <- xx2[!is.na(xx2$SYMBOL),]
eset.sel <- gg[rownames(xx2),]
rownames(eset.sel) <- fData(eset.sel)$SYMBOL

What you are doing seems fine.
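The snippet in the question keeps whichever probeset happens to come first for each symbol. As one possible variation (not something recommended in this thread), the sketch below keeps the probeset with the highest mean expression per symbol instead. It runs on a made-up toy matrix; with the real objects the inputs would presumably be `exprs(gg)` and `fData(gg)$SYMBOL`. Because these data are exon-level probesets, either choice represents a gene by a single probeset rather than by a true gene-level summary.

```r
# Toy expression matrix (rows = probesets, columns = samples); values are made up.
expr <- matrix(
  c(5, 6,
    7, 8,
    3, 4,
    2, 2),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("p1", "p2", "p3", "p4"), c("s1", "s2"))
)
# Toy symbol annotation, one entry per row of expr; NA means unannotated.
symbol <- c("OR4F29", "OR4F29", "FAM87B", NA)

keep <- !is.na(symbol)                        # drop unannotated probesets
avg  <- rowMeans(expr[keep, , drop = FALSE])  # mean expression per probeset
ord  <- order(avg, decreasing = TRUE)         # highest-expressed probesets first

# After sorting, the first occurrence of each symbol is its top probeset.
sel <- names(avg)[ord][!duplicated(symbol[keep][ord])]

expr_gene <- expr[sel, , drop = FALSE]
rownames(expr_gene) <- symbol[match(sel, rownames(expr))]
expr_gene
```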
