Question

biomart - Human Transcriptome Array 2.0 - Affymetrix - ID to Symbol Problem

0

Entering edit mode

Sergio • 0

@sergio-24055

Last seen 3.4 years ago

Hi everyone!

So, I have experienced a problem retrieving annotation for prob IDs to obtain the Symbol names from Affymetrix HTA2.0 using biomaRt.

The ID's from my data appears in this format: "TC05002592.hg.1", but in biomart, the ID's do not contain the ".1": "TC05002592.hg.1"

The data I am interested in has the GEO identifier: "GSE137574". [Platform: Platforms (1) GPL17586 [HTA-2_0] Affymetrix Human Transcriptome Array 2.0 [transcript (gene) version]]

I did a work around, but I am not sure if this is correct or if I missing something.

Do you have any suggestions?

Thank you very much in advance!

Here it is my code:


> ensembl <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")
> mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
> listAttributes(mart)[grep("affy", listAttributes(mart)[,1]),]
> G_list <- getBM(filters= "with_affy_hta_2_0", attributes= c("affy_hta_2_0", "ensembl_gene_id",
                                                          "hgnc_symbol"),values=TRUE,mart= mart)

> head(G_list)
   affy_hta_2_0 ensembl_gene_id hgnc_symbol
1 TC0M000002.hg ENSG00000210049       MT-TF
2 TC07000959.hg ENSG00000210049       MT-TF
3 TC11001412.hg ENSG00000210049       MT-TF
4 TC0M000002.hg ENSG00000211459     MT-RNR1
5 TC07000959.hg ENSG00000211459     MT-RNR1
6 TC11001412.hg ENSG00000211459     MT-RNR1

> head(limma_res_table) # my data
               ID     logFC   AveExpr         t      P.Value    adj.P.Val        B
1 TC01006074.hg.1 10.387105 10.598539  64.98987 1.087062e-26 7.666286e-22 49.02568
2 TC09000652.hg.1 -6.714061  7.605793 -58.89460 9.425114e-26 3.245665e-21 47.38025
3 TC01003777.hg.1  8.687729  8.227119  57.87758 1.380684e-25 3.245665e-21 47.08051
4 TC05003313.hg.1  8.806027  8.610895  56.62782 2.227560e-25 3.927355e-21 46.70139
5 TC0Y000053.hg.1 -6.634259  8.264292 -54.74166 4.678188e-25 6.598397e-21 46.10554
6 TC09002259.hg.1 -7.784896  7.888734 -49.51870 4.197832e-24 4.934062e-20 44.29154

> limma_res_table <- rename(limma_res_table, c("affy_hta_2_0" = "ID"))
> limma_res_table <- merge(limma_res_table,G_list,by="affy_hta_2_0")
[1] affy_hta_2_0      logFC             AveExpr           t                 P.Value           adj.P.Val         B                 ensembl_gene_id.x
 [9] GeneSymbol        ensembl_gene_id.y hgnc_symbol      
<0 rows> (or 0-length row.names)

# So... this is my work around:

> an <- c(G_list$affy_hta_2_0)
> G_list$anno = paste(an,".1", sep = "")
> G_list <- rename(G_list, c("old" = "affy_hta_2_0"))
> G_list <- rename(G_list, c("affy_hta_2_0" = "anno"))
> limma_res_table <- rename(limma_res_table, c("affy_hta_2_0" = "ID"))
> limma_res_table <- merge(limma_res_table,G_list,by="affy_hta_2_0")

> head(limma_res_table)
     affy_hta_2_0       logFC  AveExpr          t   P.Value adj.P.Val         B           old ensembl_gene_id hgnc_symbol
1 TC01000001.hg.1 -0.04142239 4.130454 -0.4328826 0.6693042 0.8680199 -7.574996 TC01000001.hg ENSG00000223972    DDX11L1
2 TC01000001.hg.1 -0.04142239 4.130454 -0.4328826 0.6693042 0.8680199 -7.574996 TC01000001.hg ENSG00000248472    DDX11L9
3 TC01000001.hg.1 -0.04142239 4.130454 -0.4328826 0.6693042 0.8680199 -7.574996 TC01000001.hg ENSG00000288170    DDX11L9
4 TC01000001.hg.1 -0.04142239 4.130454 -0.4328826 0.6693042 0.8680199 -7.574996 TC01000001.hg ENSG00000233614   DDX11L10
5 TC01000001.hg.1 -0.04142239 4.130454 -0.4328826 0.6693042 0.8680199 -7.574996 TC01000001.hg ENSG00000279928           
6 TC01000002.hg.1  0.02323564 3.956609  0.2290388 0.8209495 0.9363863 -7.644989 TC01000002.hg ENSG00000283921  MIR1302-9

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] es_ES.UTF-8/es_ES.UTF-8/es_ES.UTF-8/C/es_ES.UTF-8/es_ES.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] fgsea_1.14.0        Rsubread_2.2.6      limma_3.44.3        ggplot2_3.3.2       pheatmap_1.0.12     dplyr_1.0.2         GEOquery_2.56.0    
 [8] Biobase_2.48.0      BiocGenerics_0.34.0 biomaRt_2.44.4     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5           lattice_0.20-41      tidyr_1.1.2          prettyunits_1.1.1    assertthat_0.2.1     digest_0.6.27        BiocFileCache_1.12.1
 [8] R6_2.5.0             stats4_4.0.3         RSQLite_2.2.1        httr_1.4.2           pillar_1.4.6         rlang_0.4.8          progress_1.2.2      
[15] curl_4.3             rstudioapi_0.13      data.table_1.13.2    blob_1.2.1           S4Vectors_0.26.1     Matrix_1.2-18        labeling_0.4.2      
[22] BiocParallel_1.22.0  statmod_1.4.35       readr_1.4.0          stringr_1.4.0        bit_4.0.4            munsell_0.5.0        compiler_4.0.3      
[29] pkgconfig_2.0.3      askpass_1.1          openssl_1.4.3        tidyselect_1.1.0     gridExtra_2.3        tibble_3.0.4         IRanges_2.22.2      
[36] XML_3.99-0.5         fansi_0.4.1          crayon_1.3.4         dbplyr_2.0.0         withr_2.3.0          rappdirs_0.3.1       grid_4.0.3          
[43] gtable_0.3.0         lifecycle_0.2.0      DBI_1.1.0            magrittr_2.0.1       scales_1.1.1         cli_2.1.0            stringi_1.5.3       
[50] farver_2.0.3         xml2_1.3.2           ellipsis_0.3.1       generics_0.1.0       vctrs_0.3.5          fastmatch_1.1-0      RColorBrewer_1.1-2  
[57] tools_4.0.3          bit64_4.0.5          glue_1.4.2           purrr_0.3.4          hms_0.5.3            AnnotationDbi_1.50.3 colorspace_2.0-0    
[64] memoise_1.1.0

biomaRt • 2.1k views

ADD COMMENT • link updated 3.4 years ago by Mike Smith ★ 6.5k • written 3.4 years ago by Sergio • 0

score 0 · Answer 1 · 2020-12-03

I think removing the .1 from the Affy IDs looks fine. I don't know the source of the discrepancy between the Affymetrix notation and Ensembl, but checking a subset from the Affymetrix annoation files manually looks like they return the same Ensembl gene IDs.

A couple of comments on the biomaRt code:

You don't need both the useEnsembl() and useDatasets() lines. They actually doing the same thing. You could just go with mart <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl") and then use the mart object in your calls to getBM().
Rather than using biomaRt to download all genes targetted by Affy HTA 2.0, you could query with the ID column from your limma table. That might be something like:

## remove ".1" from IDs
affy_id <- gsub(x = limma_res_table$ID, pattern = "\\..*$", "")
## get matching Ensembl IDs
getBM(attributes= c("affy_hta_2_0", "ensembl_gene_id", hgnc_symbol"),
      filters = "affy_hta_2_0",
      values  = affy_id,
      mart = mart

If you really want the annotation for all probes, it might be quicker in the long run to install the pd.hta.2.0 package and use that. Then you don't have to query biomaRt lots of times.

A really minor point (I just want to advertise the functionality), to find the names of the affy attributes you could swap listAttributes(mart)[grep("affy", listAttributes(mart)[,1]),] with searchAttributes(mart, "affy")