Hi BioC community,
I work on transcriptomic data from the GeneChip Human Gene 2.1 ST Array (Affymetrix).
Up to now, I have used the toTable function from the AnnotationDbi package to get the Entrez ID of each probe ID:
library(AnnotationDbi)
library(hugene21sttranscriptcluster.db)
entrezidHugene <- toTable(hugene21sttranscriptclusterENTREZID)
The data frame entrezidHugene contains 29548 unique probe IDs with non-missing Entrez IDs.
I recently tested another way to annotate my dataset, using the annotateEset function from the affycoretools package:
library(oligo)
library(affycoretools)
library(hugene21sttranscriptcluster.db)
# celPath is the directory that contains the CEL files
celFiles <- list.celfiles(celPath, full.names = TRUE)
rawData <- read.celfiles(celFiles)
normSet <- rma(rawData, target = "core")
val <- annotateEset(normSet, x = hugene21sttranscriptcluster.db)
# Rows (probesets) with a non-missing Entrez ID
dim(fData(val)[!is.na(fData(val)$ENTREZID), ])
This function returns a non-missing Entrez ID for 32054 unique probe IDs.
The 29548 probe IDs annotated by toTable are also annotated by annotateEset, and for these probesets the Entrez IDs are identical. But 2506 probesets are annotated by annotateEset and not by toTable. I thought the results would be the same, because annotateEset uses the mapIds function from the AnnotationDbi package.
Could you please help me understand why there is such a difference between the two approaches? Which approach is recommended?
Thanks a lot for your help.
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 LC_MONETARY=French_France.1252
[4] LC_NUMERIC=C LC_TIME=French_France.1252

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] oligo_1.40.2 Biostrings_2.44.2
[3] XVector_0.16.0 oligoClasses_1.38.0
[5] affycoretools_1.48.0 hugene21sttranscriptcluster.db_8.6.0
[7] org.Hs.eg.db_3.4.1 AnnotationDbi_1.38.2
[9] IRanges_2.10.5 S4Vectors_0.14.7
[11] Biobase_2.36.2 BiocGenerics_0.22.0

loaded via a namespace (and not attached):
[1] colorspace_1.3-2 hwriter_1.3.2 biovizBase_1.24.0
[4] htmlTable_1.9 GenomicRanges_1.28.6 base64enc_0.1-3
[7] dichromat_2.0-0 affyio_1.46.0 bit64_0.9-7
[10] interactiveDisplayBase_1.14.0 codetools_0.2-15 splines_3.4.3
[13] R.methodsS3_1.7.1 ggbio_1.24.1 geneplotter_1.54.0
[16] knitr_1.17 Formula_1.2-2 Rsamtools_1.28.0
[19] annotate_1.54.0 cluster_2.0.6 GO.db_3.4.1
[22] R.oo_1.21.0 graph_1.54.0 shiny_1.0.5
[25] compiler_3.4.3 httr_1.3.1 GOstats_2.42.0
[28] backports_1.1.1 Matrix_1.2-12 lazyeval_0.2.1
[31] limma_3.32.10 acepack_1.4.1 htmltools_0.3.6
[34] tools_3.4.3 gtable_0.2.0 GenomeInfoDbData_0.99.0
[37] affy_1.54.0 Category_2.42.1 affxparser_1.48.0
[40] reshape2_1.4.2 Rcpp_0.12.13 gdata_2.18.0
[43] preprocessCore_1.38.1 rtracklayer_1.36.6 iterators_1.0.8
[46] stringr_1.2.0 mime_0.5 ensembldb_2.0.4
[49] gtools_3.5.0 XML_3.98-1.9 AnnotationHub_2.8.3
[52] edgeR_3.18.1 zlibbioc_1.22.0 scales_0.5.0
[55] BSgenome_1.44.2 VariantAnnotation_1.22.3 BiocInstaller_1.26.1
[58] ProtGenerics_1.8.0 SummarizedExperiment_1.6.5 RBGL_1.52.0
[61] AnnotationFilter_1.0.0 RColorBrewer_1.1-2 yaml_2.1.14
[64] curl_3.0 memoise_1.1.0 gridExtra_2.3
[67] ggplot2_2.2.1 biomaRt_2.32.1 rpart_4.1-11
[70] gcrma_2.48.0 reshape_0.8.7 latticeExtra_0.6-28
[73] stringi_1.1.5 RSQLite_2.0 genefilter_1.58.1
[76] foreach_1.4.3 checkmate_1.8.5 caTools_1.17.1
[79] GenomicFeatures_1.28.5 BiocParallel_1.10.1 GenomeInfoDb_1.12.3
[82] ReportingTools_2.16.0 rlang_0.1.2 pkgconfig_2.0.1
[85] matrixStats_0.52.2 bitops_1.0-6 lattice_0.20-35
[88] GenomicAlignments_1.12.2 htmlwidgets_0.9 bit_1.1-12
[91] GSEABase_1.38.2 AnnotationForge_1.18.2 GGally_1.3.2
[94] plyr_1.8.4 magrittr_1.5 DESeq2_1.16.1
[97] R6_2.2.2 gplots_3.0.1 Hmisc_4.0-3
[100] DelayedArray_0.2.7 DBI_0.7 foreign_0.8-69
[103] survival_2.41-3 RCurl_1.95-4.8 nnet_7.3-12
[106] tibble_1.3.4 KernSmooth_2.23-15 OrganismDbi_1.18.1
[109] PFAM.db_3.4.1 locfit_1.5-9.1 grid_3.4.3
[112] data.table_1.10.4-3 blob_1.1.0 digest_0.6.12
[115] xtable_1.8-2 ff_2.2-13 httpuv_1.3.5
[118] R.utils_2.6.0 munsell_0.4.3
Thanks a lot James for your very clear answer.
I understand that the current way to annotate microarrays is to keep the probesets with one-to-many mappings (returning just the first mapped value). However, I don't quite understand why it is recommended to keep these one-to-many mappings, compared to what toTable did. Do you think these probesets with multiple Entrez Gene IDs are reliable? Are you confident in the annotation given by annotateEset for these probesets, since it is not clear which gene was being measured?
Thanks for the update of the Bioconductor packages.
You misunderstand. There is no recommendation here, nor am I making any statements about reliability of any mappings.
We are simply providing the data that we get from Affymetrix in a form that is easier for people to deal with. We make no claims as to the reliability of their data, nor what someone should do with a one-to-many mapping.
The issue at hand is really what we should use as the default for one-to-many mappings. One extreme is to return NA for those probesets or exclude them altogether (which, depending on how you extracted the data, is what you got in the past). The other extreme is to simply return everything, including the one-to-many mappings, and expect that our end users will figure things out for themselves.
In my opinion (and others have disagreed with me on this) we should take some middle ground that works OK for most people, without being overly paternalistic, which is what the defaults (IMO, again) currently do. Please note that you can suppress the one-to-many mappings by using multivals = "asNA".
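If it helps to see which probesets are affected, here is a minimal sketch that calls mapIds from AnnotationDbi directly, once with multiVals = "first" and once with multiVals = "asNA". It assumes the discrepancy discussed in this thread comes from one-to-many mappings; the object names (probes, first_hit, unique_only, one_to_many) are just illustrative, and the counts will depend on the version of the annotation package installed.

```r
library(AnnotationDbi)
library(hugene21sttranscriptcluster.db)

# All probeset IDs known to the annotation package
probes <- keys(hugene21sttranscriptcluster.db, keytype = "PROBEID")

# One-to-many probesets keep the first mapped Entrez ID
first_hit <- mapIds(hugene21sttranscriptcluster.db, keys = probes,
                    column = "ENTREZID", keytype = "PROBEID",
                    multiVals = "first")

# One-to-many probesets are set to NA
unique_only <- mapIds(hugene21sttranscriptcluster.db, keys = probes,
                      column = "ENTREZID", keytype = "PROBEID",
                      multiVals = "asNA")

# Number of annotated probesets under each policy
sum(!is.na(first_hit))
sum(!is.na(unique_only))

# Probesets annotated only because the first of several IDs was kept
one_to_many <- names(first_hit)[!is.na(first_hit) & is.na(unique_only)]
length(one_to_many)

# Inspect every Entrez ID attached to a few of those probesets
mapIds(hugene21sttranscriptcluster.db, keys = head(one_to_many),
       column = "ENTREZID", keytype = "PROBEID", multiVals = "list")
```

Whether to keep, drop, or further inspect the probesets in one_to_many is then a decision for the analyst, as noted above.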