Question

id matching in select function within hugene10stprobeset.db

sarose989 • 0

I'm trying to match IDs from a GEOquery data set to annotation information in the hugene10stprobeset.db package.

When I use the select function, the probe IDs do not seem to match correctly.

To test, I ran:

> ids <- head(keys(hugene10stprobeset.db, keytype="PROBEID"))
> select(hugene10stprobeset.db, keys=ids, cols=c("SYMBOL","UNIGENE"),keytype="PROBEID")
[1] PROBEID SYMBOL  UNIGENE
<0 rows> (or 0-length row.names)

I get 0 matches even though I took the key names from the db itself.

The same test works with a different annotation package:

ids <- head(keys(hgu95av2.db, keytype="PROBEID"))
select(hgu95av2.db, keys=ids, cols = c("SYMBOL","UNIGENE"),keytype="PROBEID")

    PROBEID  SYMBOL   UNIGENE
1   1000_at   MAPK3    Hs.861
2   1001_at    TIE1  Hs.78824
3 1002_f_at CYP2C19 Hs.282409
4 1003_s_at   CXCR5 Hs.113916
5   1004_at   CXCR5 Hs.113916
6   1005_at   DUSP1 Hs.171695

Based on this, I think the select function is not behaving properly for hugene10stprobeset.db.

> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] annotate_1.32.3             hgu95av2.db_2.6.3          
 [3] hugene10stprobeset.db_8.0.1 hgu133a.db_2.6.3           
 [5] org.Hs.eg.db_2.6.4          RSQLite_0.11.4             
 [7] DBI_0.2-5                   AnnotationDbi_1.16.19      
 [9] BiocInstaller_1.2.1         Biobase_2.14.0             

loaded via a namespace (and not attached):
[1] IRanges_1.12.6 tools_2.14.0   xtable_1.7-1


updated 9.6 years ago by James W. MacDonald • written 9.6 years ago by sarose989

score 1 · Answer 1 · 2014-10-03

By taking just the first few probesets, you have inadvertently chosen control probesets, which don't have any Entrez or UniGene IDs:

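library(hugene10stprobeset.db)
library(pd.hugene.1.0.st.v1)

# Open the SQLite connection behind the platform design package
con <- db(pd.hugene.1.0.st.v1)
ids <- head(keys(hugene10stprobeset.db, keytype="PROBEID"))

# Look up the probeset type of each ID; sep = "" keeps stray spaces out of the quoted IDs
dbGetQuery(con, paste("select fsetid, type from featureSet where fsetid in ('", paste(ids, collapse = "','"), "');", sep = ""))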
   fsetid type
1 7892501    6
2 7892502    7
3 7892503    7
4 7892504    7
5 7892505    7
6 7892506    7
dbGetQuery(con, "select * from type_dict;")
   type                   type_id
1     1                      main
2     2             control->affx
3     3             control->chip
4     4 control->bgp->antigenomic
5     5     control->bgp->genomic
6     6            normgene->exon
7     7          normgene->intron
8     8  rescue->FLmRNA->unmapped
9     9  control->affx->bac_spike
10   10            oligo_spike_in
11   11           r1_bac_spike_at
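The two query results above can be joined so that each probeset carries its readable type label. The sketch below uses only base R and data frames typed in by hand from the output shown here; the object names (probe_types, type_dict, labelled) are made up for illustration and are not part of any package.

```r
# Probeset types returned by the first query above (copied from its output)
probe_types <- data.frame(
  fsetid = c(7892501L, 7892502L, 7892503L, 7892504L, 7892505L, 7892506L),
  type   = c(6L, 7L, 7L, 7L, 7L, 7L)
)

# The type_dict table shown above, typed in by hand
type_dict <- data.frame(
  type    = 1:11,
  type_id = c("main", "control->affx", "control->chip",
              "control->bgp->antigenomic", "control->bgp->genomic",
              "normgene->exon", "normgene->intron",
              "rescue->FLmRNA->unmapped", "control->affx->bac_spike",
              "oligo_spike_in", "r1_bac_spike_at"),
  stringsAsFactors = FALSE
)

# Attach the readable label to each probeset and restore the original order
labelled <- merge(probe_types, type_dict, by = "type", all.x = TRUE)
labelled <- labelled[order(labelled$fsetid), c("fsetid", "type", "type_id")]
print(labelled)

# Check whether any of the six probesets is of type 'main'
has_main <- any(labelled$type_id == "main")
print(has_main)
```

With a live connection, the same join could presumably be done in SQL directly, but doing it in R keeps the two query results visible side by side.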

So the probesets you have chosen are all either normgene->exon or normgene->intron controls. If you choose 'main' probesets, you get the expected results:

> ids2 <- as.character(dbGetQuery(con, "select fsetid from featureSet where type='1' limit 10;")[,1])
> ids2
 [1] "7896737" "7896739" "7896741" "7896743" "7896745" "7896747" "7896749"
 [8] "7896751" "7896753" "7896755"
> select(hugene10stprobeset.db, ids2, c("ENTREZID","UNIGENE"))
   PROBEID  ENTREZID   UNIGENE
1  7896737      <NA>      <NA>
2  7896739      <NA>      <NA>
3  7896741     81099 Hs.572591
4  7896741     26682 Hs.554420
5  7896743      <NA>      <NA>
6  7896745     81399 Hs.632360
7  7896745    729759 Hs.632360
8  7896745    729759 Hs.722724
9  7896745    441308 Hs.690459
10 7896747      <NA>      <NA>
11 7896749      <NA>      <NA>
12 7896751      <NA>      <NA>
13 7896753      <NA>      <NA>
14 7896755 100287934 Hs.745567
15 7896755 100287497      <NA>
Warning message:
In .generateExtraRows(tab, keys, jointype) :
  'select' resulted in 1:many mapping between keys and return rows
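The warning means that some probesets map to more than one gene, so the result has more rows than keys. If one row per probeset is needed downstream, the result has to be reduced somehow. The base R sketch below shows two possible ways, using the ENTREZID column typed in by hand from the output above; collapse_ids is a made-up helper name, and which option is appropriate (if either) depends on the analysis.

```r
# Result of select() copied from the output above; "<NA>" there is a real NA
res <- data.frame(
  PROBEID  = c("7896737", "7896739", "7896741", "7896741", "7896743",
               "7896745", "7896745", "7896745", "7896745", "7896747",
               "7896749", "7896751", "7896753", "7896755", "7896755"),
  ENTREZID = c(NA, NA, "81099", "26682", NA,
               "81399", "729759", "729759", "441308", NA,
               NA, NA, NA, "100287934", "100287497"),
  stringsAsFactors = FALSE
)

# Option 1: keep only the first row returned for each probeset
first_hit <- res[!duplicated(res$PROBEID), ]

# Option 2: keep every distinct Entrez ID, joined into one string per probeset
# (collapse_ids is a hypothetical helper, not an AnnotationDbi function)
collapse_ids <- function(x) {
  x <- unique(x[!is.na(x)])
  if (length(x) == 0) NA_character_ else paste(x, collapse = ";")
}
collapsed <- tapply(res$ENTREZID, res$PROBEID, collapse_ids)

print(first_hit)
print(collapsed)
```

Option 1 silently discards the other matches, while option 2 keeps them but produces values that no longer look like single Entrez IDs, so neither is a drop-in default.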

Please note two other things. First, the Gene ST arrays can be summarized at two levels: the probeset level (roughly exon) and the transcript level. For several reasons it doesn't make much sense to summarize at the probeset level, and the vast majority of the data you will find on GEO has been summarized at the transcript level. So you need the hugene10sttranscriptcluster.db package instead of the package you are using.

Second, the Gene ST arrays, being 'cut down' versions of the Exon ST arrays, have lots of speculative content. They also have non-translated content (miRNAs, lincRNAs, etc.), which will often not have an Entrez Gene or UniGene ID.
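As a rough sketch of the first point, the same kind of lookup against the transcript-level package might look like the code below. It assumes hugene10sttranscriptcluster.db is installed from Bioconductor and that a recent AnnotationDbi is in use, where the argument is spelled columns (the cols spelling in the question belongs to older releases, as far as I know). The sample size of 50 and the object names are arbitrary choices for illustration.

```r
# Assumes hugene10sttranscriptcluster.db is installed; the guard lets the
# script run (and leave res as NULL) when it is not.
res <- NULL
if (requireNamespace("hugene10sttranscriptcluster.db", quietly = TRUE)) {
  library(hugene10sttranscriptcluster.db)

  # Take a small sample of transcript cluster IDs (may include controls)
  ids <- head(keys(hugene10sttranscriptcluster.db, keytype = "PROBEID"), 50)

  res <- select(hugene10sttranscriptcluster.db, keys = ids,
                columns = c("SYMBOL", "ENTREZID"), keytype = "PROBEID")

  # Fraction of the sampled transcript clusters with no Entrez Gene ID at all
  unannotated <- tapply(is.na(res$ENTREZID), res$PROBEID, all)
  print(mean(unannotated))
}
```

Because of the second point, a non-trivial fraction of unannotated IDs is to be expected here as well, so NA rows in the result do not by themselves indicate a problem with select.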

Best,

Jim