id matching in select function within hugene10stprobeset.db
sarose989 • 0
@sarose989-6810
Last seen 9.6 years ago
United States

I'm trying to match IDs from a GEOquery data set to annotation information within the package:

hugene10stprobeset.db

When I use the select function, the probe IDs do not seem to match correctly.

To test, I ran:

> ids <- head(keys(hugene10stprobeset.db, keytype="PROBEID"))
> select(hugene10stprobeset.db, keys=ids, cols=c("SYMBOL","UNIGENE"),keytype="PROBEID")
[1] PROBEID SYMBOL  UNIGENE
<0 rows> (or 0-length row.names)

I get 0 matches even though I took the key names from the db itself. 

When I do this for an alternate db it works:

ids <- head(keys(hgu95av2.db, keytype="PROBEID"))
select(hgu95av2.db, keys=ids, cols = c("SYMBOL","UNIGENE"),keytype="PROBEID")

    PROBEID  SYMBOL   UNIGENE
1   1000_at   MAPK3    Hs.861
2   1001_at    TIE1  Hs.78824
3 1002_f_at CYP2C19 Hs.282409
4 1003_s_at   CXCR5 Hs.113916
5   1004_at   CXCR5 Hs.113916
6   1005_at   DUSP1 Hs.171695

Based on this I think the select function within hugene10stprobeset.db is behaving improperly. 

 

> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] annotate_1.32.3             hgu95av2.db_2.6.3          
 [3] hugene10stprobeset.db_8.0.1 hgu133a.db_2.6.3           
 [5] org.Hs.eg.db_2.6.4          RSQLite_0.11.4             
 [7] DBI_0.2-5                   AnnotationDbi_1.16.19      
 [9] BiocInstaller_1.2.1         Biobase_2.14.0             

loaded via a namespace (and not attached):
[1] IRanges_1.12.6 tools_2.14.0   xtable_1.7-1  
annotation affy annotationdbi • 1.1k views
@james-w-macdonald-5106
Last seen 1 hour ago
United States

By taking just the first few probesets, you have inadvertently chosen control probesets, which don't have any Entrez or UniGene IDs:

library(hugene10stprobeset.db)
library(pd.hugene.1.0.st.v1)
con <- db(pd.hugene.1.0.st.v1)
ids <- head(keys(hugene10stprobeset.db, keytype="PROBEID"))
dbGetQuery(con, paste0("select fsetid, type from featureSet where fsetid in ('", paste(ids, collapse = "','"), "');"))
   fsetid type
1 7892501    6
2 7892502    7
3 7892503    7
4 7892504    7
5 7892505    7
6 7892506    7
dbGetQuery(con, "select * from type_dict;")
   type                   type_id
1     1                      main
2     2             control->affx
3     3             control->chip
4     4 control->bgp->antigenomic
5     5     control->bgp->genomic
6     6            normgene->exon
7     7          normgene->intron
8     8  rescue->FLmRNA->unmapped
9     9  control->affx->bac_spike
10   10            oligo_spike_in
11   11           r1_bac_spike_at
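
The two query results above can be joined so that each probeset ID carries its readable type label. A minimal sketch in base R, with the values copied from the output above rather than read from the database (the names `fsets`, `type_dict`, `labelled` and `is_main` are only illustrative):

```r
# Values copied from the two dbGetQuery() results shown above
fsets <- data.frame(
  fsetid = c(7892501L, 7892502L, 7892503L, 7892504L, 7892505L, 7892506L),
  type   = c(6L, 7L, 7L, 7L, 7L, 7L)
)
type_dict <- data.frame(
  type    = 1:11,
  type_id = c("main", "control->affx", "control->chip",
              "control->bgp->antigenomic", "control->bgp->genomic",
              "normgene->exon", "normgene->intron",
              "rescue->FLmRNA->unmapped", "control->affx->bac_spike",
              "oligo_spike_in", "r1_bac_spike_at"),
  stringsAsFactors = FALSE
)

# Attach the readable label to each probeset ID
labelled <- merge(fsets, type_dict, by = "type")
labelled <- labelled[order(labelled$fsetid), c("fsetid", "type", "type_id")]
print(labelled)

# TRUE only for probesets of type 'main'
is_main <- labelled$type_id == "main"
```

None of the six IDs is flagged as 'main' here, which is why the annotation lookup has nothing to return for them.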

So the probesets you have chosen are either normgene->exon or normgene->intron controls. If you choose 'main' probesets, you get the expected results:

> ids2 <- as.character(dbGetQuery(con, "select fsetid from featureSet where type='1' limit 10;")[,1])
> ids2
 [1] "7896737" "7896739" "7896741" "7896743" "7896745" "7896747" "7896749"
 [8] "7896751" "7896753" "7896755"
> select(hugene10stprobeset.db, ids2, c("ENTREZID","UNIGENE"))
   PROBEID  ENTREZID   UNIGENE
1  7896737      <NA>      <NA>
2  7896739      <NA>      <NA>
3  7896741     81099 Hs.572591
4  7896741     26682 Hs.554420
5  7896743      <NA>      <NA>
6  7896745     81399 Hs.632360
7  7896745    729759 Hs.632360
8  7896745    729759 Hs.722724
9  7896745    441308 Hs.690459
10 7896747      <NA>      <NA>
11 7896749      <NA>      <NA>
12 7896751      <NA>      <NA>
13 7896753      <NA>      <NA>
14 7896755 100287934 Hs.745567
15 7896755 100287497      <NA>
Warning message:
In .generateExtraRows(tab, keys, jointype) :
  'select' resulted in 1:many mapping between keys and return rows
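
If one row per probeset is more convenient than the 1:many result above, the IDs can be collapsed afterwards. A minimal sketch in base R, using the PROBEID and ENTREZID values copied from the output above; the helper `collapse_ids` and the ";" separator are my own choices, not part of AnnotationDbi:

```r
# Rows copied from the select() output above (NA marks a missing ID)
res <- data.frame(
  PROBEID  = c("7896737", "7896739", "7896741", "7896741", "7896743",
               "7896745", "7896745", "7896745", "7896745", "7896747",
               "7896749", "7896751", "7896753", "7896755", "7896755"),
  ENTREZID = c(NA, NA, "81099", "26682", NA,
               "81399", "729759", "729759", "441308", NA,
               NA, NA, NA, "100287934", "100287497"),
  stringsAsFactors = FALSE
)

# Paste the distinct non-missing IDs of one probeset into a single string
collapse_ids <- function(x) {
  x <- unique(x[!is.na(x)])
  if (length(x) == 0) NA_character_ else paste(x, collapse = ";")
}

# One element per probeset, in sorted PROBEID order
per_probe <- tapply(res$ENTREZID, res$PROBEID, collapse_ids)
collapsed <- data.frame(PROBEID  = names(per_probe),
                        ENTREZID = as.vector(per_probe),
                        stringsAsFactors = FALSE)
print(collapsed)
```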

Please note two other things. First, the Gene ST arrays can be summarized at two levels: the probeset level (roughly exon) and the transcript level. For several reasons it doesn't make much sense to summarize at the probeset level, and the vast majority of the data you will find on GEO has been summarized at the transcript level. So you need the hugene10sttranscriptcluster.db package instead of the package you are using.

Second, the Gene ST arrays, being 'cut down' versions of the Exon ST arrays, have lots of speculative content. They also have non-translated content (miRNAs, lincRNAs, etc.), which will often not have an Entrez Gene or UniGene ID.
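
For transcript-level data the lookup would look much the same, just against the transcript-cluster package. A rough sketch, assuming hugene10sttranscriptcluster.db is installed and an AnnotationDbi version that accepts the `columns` argument; the helper name `annotate_ids` is hypothetical, and taking the first 50 keys is only for illustration:

```r
# Look up symbols and Entrez Gene IDs for a vector of array IDs, then drop
# the IDs that have no Entrez Gene ID (controls, speculative content, etc.).
annotate_ids <- function(db, ids) {
  res <- AnnotationDbi::select(db, keys = ids,
                               columns = c("SYMBOL", "ENTREZID"),
                               keytype = "PROBEID")
  res[!is.na(res$ENTREZID), , drop = FALSE]
}

# Only runs when the annotation package is actually installed
if (requireNamespace("hugene10sttranscriptcluster.db", quietly = TRUE)) {
  db  <- hugene10sttranscriptcluster.db::hugene10sttranscriptcluster.db
  ids <- head(AnnotationDbi::keys(db, keytype = "PROBEID"), 50)
  print(head(annotate_ids(db, ids)))
}
```

In practice the `ids` would come from the feature names of the GEO data set rather than from keys().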

Best,

Jim

 

I thought the select() contract is to return at least 1:1 mappings, sometimes 1:many, so the first query should return a 6x3 data.frame?

Maybe the OP has a borked package. I get this:

> ids <- head(keys(hugene10stprobeset.db, keytype="PROBEID"))
> select(hugene10stprobeset.db, keys=ids, cols=c("SYMBOL","UNIGENE"),keytype="PROBEID")
  PROBEID SYMBOL UNIGENE
1 7892501   <NA>    <NA>
2 7892502   <NA>    <NA>
3 7892503   <NA>    <NA>
4 7892504   <NA>    <NA>
5 7892505   <NA>    <NA>
6 7892506   <NA>    <NA>
Warning message:
In .colsArgumentWarning() :
  The 'cols' argument has been deprecated and replaced by 'columns' for
  versions of Bioc that are higher than 2.13.  Please use the 'columns'
  argument anywhere that you previously used 'cols'

But then I am not using R-2.14.0 either.
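
With a result like the one above, where every key comes back as a row of NAs, the unannotated IDs can be picked out directly. A small sketch in base R on the values copied from that output (the variable names are illustrative):

```r
# Rows copied from the select() output above: all annotation columns are NA
res <- data.frame(
  PROBEID = c("7892501", "7892502", "7892503",
              "7892504", "7892505", "7892506"),
  SYMBOL  = NA_character_,
  UNIGENE = NA_character_,
  stringsAsFactors = FALSE
)

# A probeset is unannotated when every column other than PROBEID is NA
annot_cols  <- setdiff(names(res), "PROBEID")
unannotated <- rowSums(!is.na(res[annot_cols])) == 0

# IDs worth checking against the probeset type table
print(res$PROBEID[unannotated])
```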
