Question

Annotation to probes in mogene20sttranscriptcluster.db returns "NA"

haroon.maximas • 0

Dear All,

I am using Affymetrix Mouse Gene ST 2.0 arrays and, after normalization, want to annotate my ExpressionSet. I loaded the required library as follows:

> library("mogene20sttranscriptcluster.db")
Loading required package: org.Mm.eg.db

Warning message:
In rsqlite_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries

which gave a warning. Next, I built the annotation table as follows:

> Annot <- data.frame(ACCNUM=sapply(contents(mogene20sttranscriptclusterACCNUM), paste, collapse=", "), SYMBOL=sapply(contents(mogene20sttranscriptclusterSYMBOL), paste, collapse=", "), DESC=sapply(contents(mogene20sttranscriptclusterGENENAME), paste, collapse=", "))

but it returned NA for all probes:

> head(Annot)
         ACCNUM SYMBOL DESC
17200001     NA     NA   NA
17200003     NA     NA   NA
17200005     NA     NA   NA
17200007     NA     NA   NA
17200009     NA     NA   NA
17200011     NA     NA   NA

Any hints as to what is going wrong?

Here is the sessionInfo() output:

R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0

locale:
 [1] LC_CTYPE=en_IN.UTF-8       LC_NUMERIC=C               LC_TIME=en_IN.UTF-8        LC_COLLATE=en_IN.UTF-8    
 [5] LC_MONETARY=en_IN.UTF-8    LC_MESSAGES=en_IN.UTF-8    LC_PAPER=en_IN.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_IN.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] mogene20sttranscriptcluster.db_8.4.0 org.Mm.eg.db_3.4.1                   pd.mogene.2.0.st_3.14.1             
 [4] DBI_0.7                              RSQLite_2.0                          genefilter_1.54.2                   
 [7] annotate_1.50.1                      XML_3.98-1.9                         AnnotationDbi_1.38.1                
[10] limma_3.28.21                        oligo_1.36.1                         Biostrings_2.40.2                   
[13] XVector_0.12.1                       IRanges_2.6.1                        S4Vectors_0.10.3                    
[16] oligoClasses_1.34.0                  affy_1.50.0                          Biobase_2.32.0                      
[19] BiocGenerics_0.18.0                 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12               BiocInstaller_1.22.3       compiler_3.4.0             GenomeInfoDb_1.8.7        
 [5] bitops_1.0-6               iterators_1.0.8            tools_3.4.0                zlibbioc_1.18.0           
 [9] digest_0.6.12              bit_1.1-12                 lattice_0.20-35            memoise_1.1.0             
[13] preprocessCore_1.34.0      tibble_1.3.3               ff_2.2-13                  pkgconfig_2.0.1           
[17] rlang_0.1.1                Matrix_1.2-10              foreach_1.4.3              affxparser_1.44.0         
[21] grid_3.4.0                 bit64_0.9-7                survival_2.41-3            blob_1.1.0                
[25] codetools_0.2-15           GenomicRanges_1.24.3       splines_3.4.0              SummarizedExperiment_1.2.3
[29] xtable_1.8-2               RCurl_1.95-4.8             affyio_1.42.0

mogene20sttranscriptcluster.db microarray • 1.7k views

score 0 · Answer 1 · 2017-07-27

There are two answers to your question. First, you are using an older annotation approach, which will return NA for any ID that has a one-to-many mapping, so there will be lots of NA values where annotations really do exist. You should use the more modern methods, such as select or mapIds, to get the annotations. Alternatively, you could use annotateEset in my affycoretools package, which annotates your data and puts the results in the featureData slot of your ExpressionSet, where the limma package will find and use them when you make comparisons.
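As a rough sketch of the mapIds route for the mouse array in the question (assuming AnnotationDbi and mogene20sttranscriptcluster.db are installed; the names ids, first_match and annot are only illustrative), something along these lines should work:

```r
library(AnnotationDbi)
library(mogene20sttranscriptcluster.db)

# All transcript cluster (probeset) IDs known to the annotation package
ids <- keys(mogene20sttranscriptcluster.db)

# Illustrative helper: return one value per ID. multiVals = "first" keeps
# only the first match when an ID maps to several values.
first_match <- function(column) {
  mapIds(mogene20sttranscriptcluster.db, keys = ids, column = column,
         keytype = "PROBEID", multiVals = "first")
}

annot <- data.frame(
  PROBEID  = ids,
  SYMBOL   = first_match("SYMBOL"),
  ENTREZID = first_match("ENTREZID"),
  GENENAME = first_match("GENENAME"),
  stringsAsFactors = FALSE
)

# Look only at probesets that actually have a gene symbol
head(annot[!is.na(annot$SYMBOL), ])
```

Probesets with no annotation (controls, for instance) will still show NA here, so filtering on a non-missing column is a quick way to check that the mapping itself works.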

Second, not every probeset on the HuGene 2.0 array is actually annotated. For example, the first six probesets are controls that have no Entrez Gene ID, HUGO symbol, or RefSeq ID.

> z <- head(keys(hugene20sttranscriptcluster.db))
> select(hugene20sttranscriptcluster.db, z, c("ENTREZID","SYMBOL","ACCNUM"))

'select()' returned 1:1 mapping between keys and columns
   PROBEID ENTREZID SYMBOL ACCNUM
1 16650001     <NA>   <NA>   <NA>
2 16650003     <NA>   <NA>   <NA>
3 16650005     <NA>   <NA>   <NA>
4 16650007     <NA>   <NA>   <NA>
5 16650009     <NA>   <NA>   <NA>
6 16650011     <NA>   <NA>   <NA>

We can see that by querying the pd.hugene.2.0.st package:

> library(pd.hugene.2.0.st)

> con <- db(pd.hugene.2.0.st)

> dbGetQuery(con, paste("select transcript_cluster_id, type from featureSet where transcript_cluster_id in ('", paste(z, collapse = "','"), "');"))
  transcript_cluster_id type
1              16650001   10
2              16650003   10
3              16650005   10
4              16650007   10
5              16650009   10
6              16650011   10
> dbGetQuery(con, "select * from type_dict;")
   type                    type_id
1     1                       main
2     2            main->junctions
3     3                 main->psrs
4     4               main->rescue
5     5              control->affx
6     6              control->chip
7     7  control->bgp->antigenomic
8     8      control->bgp->genomic
9     9             normgene->exon
10   10           normgene->intron
11   11   rescue->FLmRNA->unmapped
12   12   control->affx->bac_spike
13   13             oligo_spike_in
14   14            r1_bac_spike_at
15   15 control->affx->polya_spike
16   16        control->affx->ercc
17   17  control->affx->ercc->step

So the first six probesets for this array are intronic control probes.
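To see how many transcript clusters fall into each category, rather than just the first six, the two tables can be joined. This is only a sketch for the mouse array in the question, and it assumes pd.mogene.2.0.st has the same featureSet and type_dict tables as the human package queried above; type_counts is an illustrative name:

```r
library(pd.mogene.2.0.st)
library(DBI)

# Connection to the SQLite database inside the platform design package
con <- db(pd.mogene.2.0.st)

# Count distinct transcript clusters per probeset category
type_counts <- dbGetQuery(con, paste(
  "select d.type_id as category,",
  "count(distinct f.transcript_cluster_id) as n_clusters",
  "from featureSet f join type_dict d on f.type = d.type",
  "group by d.type_id order by n_clusters desc;"
))
type_counts
```

Only the "main" category would be expected to carry gene-level annotation, so the other rows give a rough idea of how many IDs will legitimately stay NA.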

score 0 · Answer 2 · 2017-07-28

Dear James,
Thanks for the response.

I am aware of the select method, but it still doesn't give the desired mapping.

> unitkey <- keys(mogene20sttranscriptcluster.db)
> select(mogene20sttranscriptcluster.db, column = c("SYMBOL", "ENTREZID", "UNIGENE"), keys = unitkey)

       PROBEID         SYMBOL  ENTREZID   UNIGENE
1     17200001           <NA>      <NA>      <NA>
2     17200003           <NA>      <NA>      <NA>
3     17200005           <NA>      <NA>      <NA>
4     17200007           <NA>      <NA>      <NA>
5     17200009           <NA>      <NA>      <NA>
6     17200011           <NA>      <NA>      <NA>
7     17200013           <NA>      <NA>      <NA>
8     17200015           <NA>      <NA>      <NA>
9     17200017           <NA>      <NA>      <NA>
10    17200019           <NA>      <NA>      <NA>
11    17200021           <NA>      <NA>      <NA>
12    17200023           <NA>      <NA>      <NA>
13    17200025           <NA>      <NA>      <NA>
14    17200027           <NA>      <NA>      <NA>
15    17200029           <NA>      <NA>      <NA>
16    17200031           <NA>      <NA>      <NA>
17    17200033           <NA>      <NA>      <NA>
18    17200035           <NA>      <NA>      <NA>
19    17200037           <NA>      <NA>      <NA>
20    17200039           <NA>      <NA>      <NA>

It doesn't work even for hugene20sttranscriptcluster.db (human).

And when I query the pd.mogene.2.0.st package, it returns:

> con <- db(pd.mogene.2.0.st)
> dbGetQuery(con, paste("select transcript_cluster_id, type from featureSet where transcript_cluster_id in ('", paste(z, collapse = "','"), "');"))
[1] transcript_cluster_id type                 
<0 rows> (or 0-length row.names)

I understand that many probes will not have mappings to all IDs:

> mogene20sttranscriptcluster()
Quality control information for mogene20sttranscriptcluster:

This package has the following mappings:

mogene20sttranscriptclusterACCNUM has 27673 mapped keys (of 41345 keys)
mogene20sttranscriptclusterALIAS2PROBE has 82164 mapped keys (of 139192 keys)
mogene20sttranscriptclusterCHR has 25290 mapped keys (of 41345 keys)
mogene20sttranscriptclusterCHRLENGTHS has 66 mapped keys (of 66 keys)
mogene20sttranscriptclusterCHRLOC has 22709 mapped keys (of 41345 keys)
mogene20sttranscriptclusterCHRLOCEND has 22709 mapped keys (of 41345 keys)
mogene20sttranscriptclusterENSEMBL has 21799 mapped keys (of 41345 keys)
mogene20sttranscriptclusterENSEMBL2PROBE has 21440 mapped keys (of 24518 keys)
mogene20sttranscriptclusterENTREZID has 25293 mapped keys (of 41345 keys)
mogene20sttranscriptclusterENZYME has 2216 mapped keys (of 41345 keys)
mogene20sttranscriptclusterENZYME2PROBE has 954 mapped keys (of 971 keys)
mogene20sttranscriptclusterGENENAME has 25293 mapped keys (of 41345 keys)
mogene20sttranscriptclusterGO has 21626 mapped keys (of 41345 keys)
mogene20sttranscriptclusterGO2ALLPROBES has 20551 mapped keys (of 21803 keys)
mogene20sttranscriptclusterGO2PROBE has 16303 mapped keys (of 17209 keys)
mogene20sttranscriptclusterMGI has 25077 mapped keys (of 41345 keys)
mogene20sttranscriptclusterMGI2PROBE has 24374 mapped keys (of 65999 keys)
mogene20sttranscriptclusterPATH has 6332 mapped keys (of 41345 keys)
mogene20sttranscriptclusterPATH2PROBE has 225 mapped keys (of 225 keys)
mogene20sttranscriptclusterPMID has 24873 mapped keys (of 41345 keys)
mogene20sttranscriptclusterPMID2PROBE has 238646 mapped keys (of 269467 keys)
mogene20sttranscriptclusterREFSEQ has 24861 mapped keys (of 41345 keys)
mogene20sttranscriptclusterSYMBOL has 25293 mapped keys (of 41345 keys)
mogene20sttranscriptclusterUNIGENE has 23923 mapped keys (of 41345 keys)
mogene20sttranscriptclusterUNIPROT has 19860 mapped keys (of 41345 keys)

but it is still unclear to me where I am going wrong.