Annotation to probes in mogene20sttranscriptcluster.db returns "NA"
2
0
Entering edit mode
@haroonmaximas-8197
Last seen 2.7 years ago
United States

Dear All,

I am using Affymetrix Mouse Gene ST 2.0 arrays and after normalization wanted to annotate my ExpressionSet.  I loaded the required library as

> library("mogene20sttranscriptcluster.db")
Loading required package: org.Mm.eg.db

Warning message:
In rsqlite_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries

which gave a warning. Next I saved the annotation as :

> Annot <- data.frame(ACCNUM=sapply(contents(mogene20sttranscriptclusterACCNUM), paste, collapse=", "), SYMBOL=sapply(contents(mogene20sttranscriptclusterSYMBOL), paste, collapse=", "), DESC=sapply(contents(mogene20sttranscriptclusterGENENAME), paste, collapse=", "))

but it returned all probes as NA

> head(Annot)
         ACCNUM SYMBOL DESC
17200001     NA     NA   NA
17200003     NA     NA   NA
17200005     NA     NA   NA
17200007     NA     NA   NA
17200009     NA     NA   NA
17200011     NA     NA   NA

Any hint what is going wrong ??

 

Here is the Sessioninfo:

R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0

locale:
 [1] LC_CTYPE=en_IN.UTF-8       LC_NUMERIC=C               LC_TIME=en_IN.UTF-8        LC_COLLATE=en_IN.UTF-8    
 [5] LC_MONETARY=en_IN.UTF-8    LC_MESSAGES=en_IN.UTF-8    LC_PAPER=en_IN.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_IN.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] mogene20sttranscriptcluster.db_8.4.0 org.Mm.eg.db_3.4.1                   pd.mogene.2.0.st_3.14.1             
 [4] DBI_0.7                              RSQLite_2.0                          genefilter_1.54.2                   
 [7] annotate_1.50.1                      XML_3.98-1.9                         AnnotationDbi_1.38.1                
[10] limma_3.28.21                        oligo_1.36.1                         Biostrings_2.40.2                   
[13] XVector_0.12.1                       IRanges_2.6.1                        S4Vectors_0.10.3                    
[16] oligoClasses_1.34.0                  affy_1.50.0                          Biobase_2.32.0                      
[19] BiocGenerics_0.18.0                 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12               BiocInstaller_1.22.3       compiler_3.4.0             GenomeInfoDb_1.8.7        
 [5] bitops_1.0-6               iterators_1.0.8            tools_3.4.0                zlibbioc_1.18.0           
 [9] digest_0.6.12              bit_1.1-12                 lattice_0.20-35            memoise_1.1.0             
[13] preprocessCore_1.34.0      tibble_1.3.3               ff_2.2-13                  pkgconfig_2.0.1           
[17] rlang_0.1.1                Matrix_1.2-10              foreach_1.4.3              affxparser_1.44.0         
[21] grid_3.4.0                 bit64_0.9-7                survival_2.41-3            blob_1.1.0                
[25] codetools_0.2-15           GenomicRanges_1.24.3       splines_3.4.0              SummarizedExperiment_1.2.3
[29] xtable_1.8-2               RCurl_1.95-4.8             affyio_1.42.0     
mogene20sttranscriptcluster.db microarray • 1.7k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 7 hours ago
United States

There are two answers to your question. First, you are using old ways to annotate things, which will return NA for any ID that has a one-to-many mapping. So there will be lots of NA values where there really are annotations. You should be using more modern methods, like e.g. select or mapIds to get the annotations. Alternatively you could use annotateEset in my affycoretools package, which will annotate your data and put the results in the featureData slot of your ExpressionSet, which will then be discovered and used by the limma package when you make comparisons.

Second, not all of the things on the HuGene 2.0 array are actually annotated. For example, the first six probes are control probes that don't have an Entrez Gene ID, nor a HUGO symbol or RefSeq ID.

> z <- head(keys(hugene20sttranscriptcluster.db))
> select(hugene20sttranscriptcluster.db, z, c("ENTREZID","SYMBOL","ACCNUM"))
'select()' returned 1:1 mapping between keys and columns
   PROBEID ENTREZID SYMBOL ACCNUM
1 16650001     <NA>   <NA>   <NA>
2 16650003     <NA>   <NA>   <NA>
3 16650005     <NA>   <NA>   <NA>
4 16650007     <NA>   <NA>   <NA>
5 16650009     <NA>   <NA>   <NA>
6 16650011     <NA>   <NA>   <NA>

We can see that by querying the pd.hugene.2.0.st package:

> library(pd.hugene.2.0.st)

> con <- db(pd.hugene.2.0.st)

> dbGetQuery(con, paste("select transcript_cluster_id, type from featureSet where transcript_cluster_id in ('", paste(z, collapse = "','"), "');"))
  transcript_cluster_id type
1              16650001   10
2              16650003   10
3              16650005   10
4              16650007   10
5              16650009   10
6              16650011   10
> dbGetQuery(con, "select * from type_dict;")
   type                    type_id
1     1                       main
2     2            main->junctions
3     3                 main->psrs
4     4               main->rescue
5     5              control->affx
6     6              control->chip
7     7  control->bgp->antigenomic
8     8      control->bgp->genomic
9     9             normgene->exon
10   10           normgene->intron
11   11   rescue->FLmRNA->unmapped
12   12   control->affx->bac_spike
13   13             oligo_spike_in
14   14            r1_bac_spike_at
15   15 control->affx->polya_spike
16   16        control->affx->ercc
17   17  control->affx->ercc->step

So the first six probesets for this array are intronic control probes.

ADD COMMENT
0
Entering edit mode
@haroonmaximas-8197
Last seen 2.7 years ago
United States
Dear James,
Thanks for the response.

I am aware of select method, but still it doesn't give the desired mapping.

> unitkey <- keys(mogene20sttranscriptcluster.db)
> select(mogene20sttranscriptcluster.db, column = c("SYMBOL", "ENTREZID", "UNIGENE"), keys = unitkey)

       PROBEID         SYMBOL  ENTREZID   UNIGENE
1     17200001           <NA>      <NA>      <NA>
2     17200003           <NA>      <NA>      <NA>
3     17200005           <NA>      <NA>      <NA>
4     17200007           <NA>      <NA>      <NA>
5     17200009           <NA>      <NA>      <NA>
6     17200011           <NA>      <NA>      <NA>
7     17200013           <NA>      <NA>      <NA>
8     17200015           <NA>      <NA>      <NA>
9     17200017           <NA>      <NA>      <NA>
10    17200019           <NA>      <NA>      <NA>
11    17200021           <NA>      <NA>      <NA>
12    17200023           <NA>      <NA>      <NA>
13    17200025           <NA>      <NA>      <NA>
14    17200027           <NA>      <NA>      <NA>
15    17200029           <NA>      <NA>      <NA>
16    17200031           <NA>      <NA>      <NA>
17    17200033           <NA>      <NA>      <NA>
18    17200035           <NA>      <NA>      <NA>
19    17200037           <NA>      <NA>      <NA>
20    17200039           <NA>      <NA>      <NA>

It doesn't work even for hugene20sttranscriptcluster.db (human).

And when I query pd.mogene.2.0.st package, it returns

> con <- db(pd.mogene.2.0.st)
> dbGetQuery(con, paste("select transcript_cluster_id, type from featureSet where transcript_cluster_id in ('", paste(z, collapse = "','"), "');"))
[1] transcript_cluster_id type                 
<0 rows> (or 0-length row.names)

I understand that many probes will not have mapping to all ids: 

> mogene20sttranscriptcluster()
Quality control information for mogene20sttranscriptcluster:

This package has the following mappings:

mogene20sttranscriptclusterACCNUM has 27673 mapped keys (of 41345 keys)
mogene20sttranscriptclusterALIAS2PROBE has 82164 mapped keys (of 139192 keys)
mogene20sttranscriptclusterCHR has 25290 mapped keys (of 41345 keys)
mogene20sttranscriptclusterCHRLENGTHS has 66 mapped keys (of 66 keys)
mogene20sttranscriptclusterCHRLOC has 22709 mapped keys (of 41345 keys)
mogene20sttranscriptclusterCHRLOCEND has 22709 mapped keys (of 41345 keys)
mogene20sttranscriptclusterENSEMBL has 21799 mapped keys (of 41345 keys)
mogene20sttranscriptclusterENSEMBL2PROBE has 21440 mapped keys (of 24518 keys)
mogene20sttranscriptclusterENTREZID has 25293 mapped keys (of 41345 keys)
mogene20sttranscriptclusterENZYME has 2216 mapped keys (of 41345 keys)
mogene20sttranscriptclusterENZYME2PROBE has 954 mapped keys (of 971 keys)
mogene20sttranscriptclusterGENENAME has 25293 mapped keys (of 41345 keys)
mogene20sttranscriptclusterGO has 21626 mapped keys (of 41345 keys)
mogene20sttranscriptclusterGO2ALLPROBES has 20551 mapped keys (of 21803 keys)
mogene20sttranscriptclusterGO2PROBE has 16303 mapped keys (of 17209 keys)
mogene20sttranscriptclusterMGI has 25077 mapped keys (of 41345 keys)
mogene20sttranscriptclusterMGI2PROBE has 24374 mapped keys (of 65999 keys)
mogene20sttranscriptclusterPATH has 6332 mapped keys (of 41345 keys)
mogene20sttranscriptclusterPATH2PROBE has 225 mapped keys (of 225 keys)
mogene20sttranscriptclusterPMID has 24873 mapped keys (of 41345 keys)
mogene20sttranscriptclusterPMID2PROBE has 238646 mapped keys (of 269467 keys)
mogene20sttranscriptclusterREFSEQ has 24861 mapped keys (of 41345 keys)
mogene20sttranscriptclusterSYMBOL has 25293 mapped keys (of 41345 keys)
mogene20sttranscriptclusterUNIGENE has 23923 mapped keys (of 41345 keys)
mogene20sttranscriptclusterUNIPROT has 19860 mapped keys (of 41345 keys)

but still its unclear to me where I am going wrong ?

 

 

ADD COMMENT
0
Entering edit mode

Please don't use the Add your answer box to add a comment. Instead, use the ADD COMMENT or ADD REPLY link.

As I already noted in my answer, not all of the probes on the array measure a 'gene'! You are looking at the first few probesets on the array, which are intronic controls, and are then saying that the package doesn't work. If you ask 'what gene does this thing measure?' and it's an intronic control, the answer is 'It doesn't measure a gene, it's an intronic control'. I can't make it any clearer than that.

> z <- select(hugene20sttranscriptcluster.db, keys(hugene20sttranscriptcluster.db), c("SYMBOL","ENTREZID"))
'select()' returned 1:many mapping between keys and columns
> apply(z, 2, function(x) sum(is.na(x)))
 PROBEID   SYMBOL ENTREZID 
       0    21563    21563 
> dim(z)
[1] 59281     3
> head(z)
   PROBEID SYMBOL ENTREZID
1 16650001   <NA>     <NA>
2 16650003   <NA>     <NA>
3 16650005   <NA>     <NA>
4 16650007   <NA>     <NA>
5 16650009   <NA>     <NA>
6 16650011   <NA>     <NA>

> z[5000:5010,]
      PROBEID       SYMBOL  ENTREZID
5000 16670270      PPIAL4E    730262
5001 16670270      PPIAL4D    645142
5002 16670270      PPIAL4F    728945
5003 16670270      PPIAL4C    653598
5004 16670270      PPIAL4A    653505
5005 16670270      PPIAL4G    644591
5006 16670270 LOC105371242 105371242
5007 16670272 LOC105371196 105371196
5008 16670272 LOC105371214 105371214
5009 16670276    LOC645166    645166
5010 16670276 LOC100996732 100996732
> 
ADD REPLY

Login before adding your answer.

Traffic: 492 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6