Question

Different microarrays annotations provided by toTable and annotateEset functions


eleonoregravier ▴ 70


Hi BioC community,

I work on transcriptomic data from the GeneChip Human Gene 2.1 ST Array (Affymetrix).

Until now, I have used the toTable function from the AnnotationDbi package to get the Entrez ID of each probe ID:

library(AnnotationDbi)
library(hugene21sttranscriptcluster.db)

entrezidHugene <- toTable(hugene21sttranscriptclusterENTREZID)

The data frame entrezidHugene contains 29548 unique probe IDs with non-missing Entrez IDs.

I recently tested another way to annotate my dataset, using the annotateEset function from the affycoretools package:

library(oligo)
library(affycoretools)
library(hugene21sttranscriptcluster.db)

# celPath is the directory that contains the CEL files
celFiles <- list.celfiles(celPath, full.names = TRUE)
rawData <- read.celfiles(celFiles)
normSet <- rma(rawData, target = "core")

val <- annotateEset(normSet, x = hugene21sttranscriptcluster.db)

# Rows are the probesets with a non-missing Entrez ID
dim(fData(val)[!is.na(fData(val)$ENTREZID), ])

This function provides non-missing Entrez IDs for 32054 unique probe IDs.

The 29548 unique probe IDs with non-missing Entrez IDs returned by toTable are also returned by annotateEset, and for these probesets the Entrez IDs are identical. But there are 2506 probesets annotated by annotateEset that are not annotated by toTable. I thought the results would be the same, because annotateEset uses the mapIds function from the AnnotationDbi package.

Could you help me understand why the two approaches differ so much? Which approach is recommended?

Thanks a lot for your help

Eléonore

sessionInfo()

R version 3.4.3 (2017-11-30)

Platform: x86_64-w64-mingw32/x64 (64-bit)

Running under: Windows >= 8 x64 (build 9200)


Matrix products: default


locale:

[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252    LC_MONETARY=French_France.1252

[4] LC_NUMERIC=C                   LC_TIME=French_France.1252   


attached base packages:

[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base    


other attached packages:

 [1] oligo_1.40.2                         Biostrings_2.44.2                  

 [3] XVector_0.16.0                       oligoClasses_1.38.0                

 [5] affycoretools_1.48.0                 hugene21sttranscriptcluster.db_8.6.0

 [7] org.Hs.eg.db_3.4.1                   AnnotationDbi_1.38.2               

 [9] IRanges_2.10.5                       S4Vectors_0.14.7                   

[11] Biobase_2.36.2                       BiocGenerics_0.22.0                


loaded via a namespace (and not attached):

  [1] colorspace_1.3-2              hwriter_1.3.2                 biovizBase_1.24.0           

  [4] htmlTable_1.9                 GenomicRanges_1.28.6          base64enc_0.1-3             

  [7] dichromat_2.0-0               affyio_1.46.0                 bit64_0.9-7                 

 [10] interactiveDisplayBase_1.14.0 codetools_0.2-15              splines_3.4.3               

 [13] R.methodsS3_1.7.1             ggbio_1.24.1                  geneplotter_1.54.0          

 [16] knitr_1.17                    Formula_1.2-2                 Rsamtools_1.28.0            

 [19] annotate_1.54.0               cluster_2.0.6                 GO.db_3.4.1                 

 [22] R.oo_1.21.0                   graph_1.54.0                  shiny_1.0.5                 

 [25] compiler_3.4.3                httr_1.3.1                    GOstats_2.42.0              

 [28] backports_1.1.1               Matrix_1.2-12                 lazyeval_0.2.1              

 [31] limma_3.32.10                 acepack_1.4.1                 htmltools_0.3.6             

 [34] tools_3.4.3                   gtable_0.2.0                  GenomeInfoDbData_0.99.0     

 [37] affy_1.54.0                   Category_2.42.1               affxparser_1.48.0           

 [40] reshape2_1.4.2                Rcpp_0.12.13                  gdata_2.18.0                

 [43] preprocessCore_1.38.1         rtracklayer_1.36.6            iterators_1.0.8             

 [46] stringr_1.2.0                 mime_0.5                      ensembldb_2.0.4             

 [49] gtools_3.5.0                  XML_3.98-1.9                  AnnotationHub_2.8.3         

 [52] edgeR_3.18.1                  zlibbioc_1.22.0               scales_0.5.0                

 [55] BSgenome_1.44.2               VariantAnnotation_1.22.3      BiocInstaller_1.26.1        

 [58] ProtGenerics_1.8.0            SummarizedExperiment_1.6.5    RBGL_1.52.0                 

 [61] AnnotationFilter_1.0.0        RColorBrewer_1.1-2            yaml_2.1.14                 

 [64] curl_3.0                      memoise_1.1.0                 gridExtra_2.3               

 [67] ggplot2_2.2.1                 biomaRt_2.32.1                rpart_4.1-11                

 [70] gcrma_2.48.0                  reshape_0.8.7                 latticeExtra_0.6-28         

 [73] stringi_1.1.5                 RSQLite_2.0                   genefilter_1.58.1           

 [76] foreach_1.4.3                 checkmate_1.8.5               caTools_1.17.1              

 [79] GenomicFeatures_1.28.5        BiocParallel_1.10.1           GenomeInfoDb_1.12.3         

 [82] ReportingTools_2.16.0         rlang_0.1.2                   pkgconfig_2.0.1             

 [85] matrixStats_0.52.2            bitops_1.0-6                  lattice_0.20-35             

 [88] GenomicAlignments_1.12.2      htmlwidgets_0.9               bit_1.1-12                  

 [91] GSEABase_1.38.2               AnnotationForge_1.18.2        GGally_1.3.2                

 [94] plyr_1.8.4                    magrittr_1.5                  DESeq2_1.16.1               

 [97] R6_2.2.2                      gplots_3.0.1                  Hmisc_4.0-3                 

[100] DelayedArray_0.2.7            DBI_0.7                       foreign_0.8-69              

[103] survival_2.41-3               RCurl_1.95-4.8                nnet_7.3-12                 

[106] tibble_1.3.4                  KernSmooth_2.23-15            OrganismDbi_1.18.1          

[109] PFAM.db_3.4.1                 locfit_1.5-9.1                grid_3.4.3                  

[112] data.table_1.10.4-3           blob_1.1.0                    digest_0.6.12               

[115] xtable_1.8-2                  ff_2.2-13                     httpuv_1.3.5                

[118] R.utils_2.6.0                 munsell_0.4.3

microarray annotation annotationdbi affycoretools annotationtools • 1.8k views

Updated 8.0 years ago by James W. MacDonald 68k • written 8.0 years ago by eleonoregravier ▴ 70

score 1 · Answer 1 · 2018-01-12


James W. MacDonald 68k


The toTable function is based on an older paradigm in which we said any one-to-many mapping should return NA. So any probeset that maps to more than one Entrez Gene ID was set to NA because it wasn't clear which gene was being measured.

The current paradigm is to return all one-to-many mappings, along with a message saying we have done so. However, as you note, annotateEset uses mapIds, which by default returns just the first mapped value of any one-to-many mapping.

So if you use toTable, you get just those probesets that have a one-to-one mapping to an Entrez Gene ID, and if you use annotateEset you get the probesets with one-to-many mappings as well, but by default only the first value is returned. The multivals argument for annotateEset affects what is returned. From ?annotateEset:

multivals: For ChipDb method; this is passed to 'mapIds' to control how
          1:many mappings are handled. The default is 'first', which
          takes just the first result. Other valid values are 'list'
          and 'CharacterList', which return all mapped results.
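
If you want to check this on your own installation, a rough sketch along these lines should work. It calls mapIds directly rather than going through annotateEset; count_mapped is a hypothetical helper name, and the guarded part assumes the hugene21sttranscriptcluster.db package is installed. The exact counts depend on the annotation package version.

```r
# Hypothetical helper: count the probesets that get a non-missing
# Entrez Gene ID under a given multiVals policy of AnnotationDbi::mapIds().
count_mapped <- function(db, multiVals) {
  ids <- AnnotationDbi::mapIds(
    db,
    keys      = AnnotationDbi::keys(db, keytype = "PROBEID"),
    column    = "ENTREZID",
    keytype   = "PROBEID",
    multiVals = multiVals
  )
  sum(!is.na(ids))
}

# Only runs when the annotation package is available (an assumption here).
if (requireNamespace("hugene21sttranscriptcluster.db", quietly = TRUE)) {
  suppressPackageStartupMessages(library(hugene21sttranscriptcluster.db))
  db <- hugene21sttranscriptcluster.db

  # "first": one-to-many probesets keep their first Entrez Gene ID
  n_first <- count_mapped(db, "first")
  # "asNA": one-to-many probesets are set to NA
  n_asNA <- count_mapped(db, "asNA")

  cat("first:", n_first, "\n")
  cat("asNA:", n_asNA, "\n")
  cat("probesets with one-to-many mappings:", n_first - n_asNA, "\n")
}
```

If the explanation above applies to your setup, the last number should account for the extra probesets you see with annotateEset.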

Also you have a bunch of old package versions. You should fix that by doing

library(BiocInstaller)

biocValid()

and then probably

biocLite()

James W. MacDonald 68k • 8.0 years ago


Thanks a lot James for your very clear answer.

I understand that the current way to annotate microarrays is to return all one-to-many mappings (or just the first mapped value). However, I don't fully understand why returning one-to-many mappings is recommended over what toTable did. Do you think these probesets with multiple Entrez Gene IDs are reliable? Are you confident in the annotation given by annotateEset for these probesets, since it is not clear which gene is being measured?

Thanks for the advice on updating the Bioconductor packages.

eleonoregravier ▴ 70 • 8.0 years ago


You misunderstand. There is no recommendation here, nor am I making any statements about reliability of any mappings.

We are simply providing the data that we get from Affymetrix in a form that is easier for people to deal with. We make no claims as to the reliability of their data, nor what someone should do with a one-to-many mapping.

The issue at hand is really what we should use as the default for one-to-many mappings. One extreme is to either return NA for them or exclude them altogether (which, depending on how you extracted the data, is what you got in the past). The other extreme is to simply return everything, including the one-to-many mappings, and expect that our end users will figure things out for themselves.

In my opinion (and others have disagreed with me on this) we should take some middle ground that works OK for most people, without being overly paternalistic, which is what the defaults (IMO, again) currently do. Please note that you can suppress the one-to-many mappings by using multivals = "asNA".
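
To make the three options concrete, here is a toy illustration in base R. It does not use any annotation package; the probeset and gene IDs are invented, and the code only mimics the idea behind the "asNA", "first" and "list" choices.

```r
# Toy long-format mapping table; all IDs are invented for illustration.
map <- data.frame(
  PROBEID  = c("p1", "p2", "p2", "p3"),
  ENTREZID = c("10", "20", "21", NA),
  stringsAsFactors = FALSE
)

# Collect the gene IDs of each probeset
by_probe <- split(map$ENTREZID, map$PROBEID)

# One extreme: any one-to-many mapping becomes NA
as_na <- vapply(by_probe,
                function(x) if (length(x) == 1) x else NA_character_,
                character(1))

# Middle ground: keep only the first mapped value
first <- vapply(by_probe, function(x) x[1], character(1))

# Other extreme: keep everything and let the user decide
all_ids <- by_probe

print(as_na)
print(first)
print(all_ids)
```

Here p2 maps to two genes, so it is NA under the first option, gets a single ID under the second, and keeps both IDs under the third. p3 has no mapping at all and is NA either way.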

James W. MacDonald 68k • 8.0 years ago