Question

Biomart error when mapping gene symbol names

jarod_v6@libero.it ▴ 40

Dear all,

I have performed a differential expression analysis using DESeq2 and tximport on RSEM data.

When I try to map gene symbols, some symbols are not annotated. Here is my code, along with the versions of the tools I am using.

library("biomaRt")

# Connect to the December 2013 Ensembl archive (human genes).
# ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")  # current release instead
ensembl <- useMart(host = "dec2013.archive.ensembl.org",
                   biomart = "ENSEMBL_MART_ENSEMBL",
                   dataset = "hsapiens_gene_ensembl")

# One filter takes one vector of values; the stray "protein_coding" value
# had no matching filter and is dropped here.
genemap <- getBM(attributes = c("ensembl_gene_id", "entrezgene", "hgnc_symbol", "band", "chromosome_name"),
                 filters = "ensembl_gene_id",
                 values = res$ensembl,
                 mart = ensembl)

# Align the annotation rows with the rows of the results table.
idx <- match(res$ensembl, genemap$ensembl_gene_id)
res$entrez <- genemap$entrezgene[idx]
res$hgnc_symbol <- genemap$hgnc_symbol[idx]


> table(res$hgnc_symbol=="HDGFRP2")

FALSE
57905
> table(res$hgnc_symbol=="FUS")

FALSE  TRUE
57904     1
> table(res$hgnc_symbol=="DDIT3")

FALSE  TRUE
57904     1
> table(res$ensembl=="ENSG00000167674")

FALSE  TRUE
57904     1

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=it_IT.UTF-8      
[4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=it_IT.UTF-8    LC_MESSAGES=en_US.UTF-8  
[7] LC_PAPER=it_IT.UTF-8       LC_NAME=C                  LC_ADDRESS=C             
[10] LC_TELEPHONE=C             LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.28.0             ggplot2_2.2.1              genefilter_1.54.2        
[4] pheatmap_1.0.8             tximportData_1.0.2         tximport_1.0.3           
[7] DESeq2_1.12.4              SummarizedExperiment_1.2.3 Biobase_2.32.0           
[10] GenomicRanges_1.24.3       GenomeInfoDb_1.8.7         IRanges_2.6.1            
[13] S4Vectors_0.10.3           BiocGenerics_0.18.0       

loaded via a namespace (and not attached):
[1] locfit_1.5-9.1       splines_3.3.2        lattice_0.20-35      colorspace_1.3-2   
[5] htmltools_0.3.5      base64enc_0.1-3      survival_2.40-1      XML_3.98-1.5       
[9] foreign_0.8-67       DBI_0.6-1            BiocParallel_1.6.6   RColorBrewer_1.1-2 
[13] plyr_1.8.4           stringr_1.2.0        zlibbioc_1.18.0      munsell_0.4.3      
[17] gtable_0.2.0         htmlwidgets_0.8      memoise_1.0.0        labeling_0.3       
[21] latticeExtra_0.6-28  knitr_1.15.1         geneplotter_1.50.0   AnnotationDbi_1.34.4
[25] htmlTable_1.9        Rcpp_0.12.9          acepack_1.4.1        xtable_1.8-2       
[29] backports_1.0.5      scales_0.4.1         checkmate_1.8.2      Hmisc_4.0-2        
[33] annotate_1.50.1      XVector_0.12.1       gridExtra_2.2.1      digest_0.6.12      
[37] stringi_1.1.2        grid_3.3.2           tools_3.3.2          bitops_1.0-6       
[41] magrittr_1.5         lazyeval_0.2.0       RCurl_1.95-4.8       tibble_1.2         
[45] RSQLite_1.1-2        Formula_1.2-1        cluster_2.0.6        Matrix_1.2-8       
[49] data.table_1.10.0    assertthat_0.1       rpart_4.1-10         nnet_7.3-12


Updated 6.9 years ago by James W. MacDonald • written 6.9 years ago by jarod_v6@libero.it

score 0 · Answer 1 · 2017-06-09

This isn't an error on the part of biomaRt or Ensembl. It's just how things are. If you search for HDGFRP2 on the HUGO website, they say it's ENSG00000167674. If you go to Ensembl, they say the gene with that Ensembl ID is called AC011498.1, but Ensembl also says there is an overlapping gene from NCBI with Gene ID 84717, which NCBI says is HDGFRP2.

So Ensembl says there is a gene in that region, which they call AC011498.1, and they recognize that NCBI claims an overlapping gene, 84717, that is different in some sense from Ensembl's gene. I'm sure there are some nitty-gritty details that explain exactly why there isn't agreement, but the end result is that NCBI, Ensembl, and HUGO are different groups that may have different ideas about what is and isn't a gene, where genes are located, and so on.

It's usually better to stick with one group (NCBI or Ensembl) when doing annotation stuff, because otherwise you will keep running into these disagreements.
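
One way to act on that advice is to keep the Ensembl ID as the primary key and treat the HGNC symbol only as a display label. The sketch below uses toy stand-ins for `res` and `genemap`. The `ENSG_toy_*` IDs, the symbol assignments, and the `label` column are made up for illustration and are not real query results. It also assumes that a gene without a symbol shows up either as an empty string or as `NA` after `match()`.

```r
# Toy stand-ins for the objects in the question (illustrative values only).
res <- data.frame(
  ensembl = c("ENSG00000167674", "ENSG_toy_0001", "ENSG_toy_0002", "ENSG_toy_0003"),
  stringsAsFactors = FALSE
)
genemap <- data.frame(
  ensembl_gene_id = c("ENSG_toy_0001", "ENSG00000167674", "ENSG_toy_0002"),
  hgnc_symbol     = c("FUS", "", "DDIT3"),
  stringsAsFactors = FALSE
)

# Same lookup as in the question: align annotation rows to the results table.
idx <- match(res$ensembl, genemap$ensembl_gene_id)
res$hgnc_symbol <- genemap$hgnc_symbol[idx]

# A gene has "no symbol" if its ID was absent from the annotation (NA)
# or present but without a symbol (empty string).
no_symbol <- is.na(res$hgnc_symbol) | res$hgnc_symbol == ""

# Hypothetical 'label' column: the symbol where one exists, else the Ensembl ID.
res$label <- ifelse(no_symbol, res$ensembl, res$hgnc_symbol)

# Rows that would need manual inspection.
print(res[no_symbol, ])
```

Note that `table(res$hgnc_symbol == "HDGFRP2")` silently drops `NA` values, so an explicit check like `no_symbol` gives a clearer picture of how many genes lack a symbol.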