biomaRt error when mapping Ensembl IDs to gene symbols
@jarod_v6liberoit-6654
Italy

Dear all.

I have performed a differential expression analysis using DESeq2 and tximport on RSEM data.

When I try to map gene symbols, some of them are not annotated. Here is some information on my code and the versions of my tools.

library("biomaRt")

# Connect to the December 2013 Ensembl archive
# ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
ensembl <- useMart(host = "dec2013.archive.ensembl.org",
                   biomart = "ENSEMBL_MART_ENSEMBL",
                   dataset = "hsapiens_gene_ensembl")

# A single filter takes a single vector of values, so the stray
# "protein_coding" value (which has no matching filter) is dropped here
genemap <- getBM(attributes = c("ensembl_gene_id", "entrezgene", "hgnc_symbol", "band", "chromosome_name"),
                 filters = "ensembl_gene_id",
                 values = res$ensembl,
                 mart = ensembl)

# Look up each result row in the annotation table
idx <- match(res$ensembl, genemap$ensembl_gene_id)
res$entrez <- genemap$entrezgene[idx]
res$hgnc_symbol <- genemap$hgnc_symbol[idx]


> table(res$hgnc_symbol=="HDGFRP2")

FALSE
57905
> table(res$hgnc_symbol=="FUS")

FALSE  TRUE
57904     1
> table(res$hgnc_symbol=="DDIT3")

FALSE  TRUE
57904     1
> table(res$ensembl=="ENSG00000167674")

FALSE  TRUE
57904     1

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=it_IT.UTF-8      
[4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=it_IT.UTF-8    LC_MESSAGES=en_US.UTF-8  
[7] LC_PAPER=it_IT.UTF-8       LC_NAME=C                  LC_ADDRESS=C             
[10] LC_TELEPHONE=C             LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.28.0             ggplot2_2.2.1              genefilter_1.54.2        
[4] pheatmap_1.0.8             tximportData_1.0.2         tximport_1.0.3           
[7] DESeq2_1.12.4              SummarizedExperiment_1.2.3 Biobase_2.32.0           
[10] GenomicRanges_1.24.3       GenomeInfoDb_1.8.7         IRanges_2.6.1            
[13] S4Vectors_0.10.3           BiocGenerics_0.18.0       

loaded via a namespace (and not attached):
[1] locfit_1.5-9.1       splines_3.3.2        lattice_0.20-35      colorspace_1.3-2   
[5] htmltools_0.3.5      base64enc_0.1-3      survival_2.40-1      XML_3.98-1.5       
[9] foreign_0.8-67       DBI_0.6-1            BiocParallel_1.6.6   RColorBrewer_1.1-2 
[13] plyr_1.8.4           stringr_1.2.0        zlibbioc_1.18.0      munsell_0.4.3      
[17] gtable_0.2.0         htmlwidgets_0.8      memoise_1.0.0        labeling_0.3       
[21] latticeExtra_0.6-28  knitr_1.15.1         geneplotter_1.50.0   AnnotationDbi_1.34.4
[25] htmlTable_1.9        Rcpp_0.12.9          acepack_1.4.1        xtable_1.8-2       
[29] backports_1.0.5      scales_0.4.1         checkmate_1.8.2      Hmisc_4.0-2        
[33] annotate_1.50.1      XVector_0.12.1       gridExtra_2.2.1      digest_0.6.12      
[37] stringi_1.1.2        grid_3.3.2           tools_3.3.2          bitops_1.0-6       
[41] magrittr_1.5         lazyeval_0.2.0       RCurl_1.95-4.8       tibble_1.2         
[45] RSQLite_1.1-2        Formula_1.2-1        cluster_2.0.6        Matrix_1.2-8       
[49] data.table_1.10.0    assertthat_0.1       rpart_4.1-10         nnet_7.3-12

 

@james-w-macdonald-5106
United States

This isn't an error on the part of biomaRt or Ensembl! It's just how things are. If you search for HDGFRP2 on the HUGO website, they say it's ENSG00000167674. But if you go to Ensembl, they say the gene with that Ensembl ID is called AC011498.1, and they also say there is an overlapping gene from NCBI with Gene ID 84717, which NCBI says is HDGFRP2.

So Ensembl says there is a gene in that region, which they call AC011498.1, and they recognize that NCBI claims an overlapping gene (84717) that is different in some sense from Ensembl's gene. I'm sure there are some nitty-gritty details that explain exactly why there isn't agreement, but the end result is that NCBI, Ensembl, and HUGO are different groups that may have different ideas about what is and isn't a gene, where the genes are, and so on.
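
To see how many IDs are affected by this kind of disagreement, one option is to list the Ensembl IDs that come back without an HGNC symbol. The following is only a minimal sketch: the toy `genemap` and `res` stand in for the real objects, the placeholder IDs and `SYMBOL1` are made up for illustration, and it assumes a missing symbol shows up either as an empty string or as `NA`.

```r
# Toy stand-in for the getBM() result. Only ENSG00000167674 comes from
# this thread; the other IDs and SYMBOL1 are hypothetical placeholders.
genemap <- data.frame(
  ensembl_gene_id = c("ENSG00000167674", "ENSG_PLACEHOLDER_1", "ENSG_PLACEHOLDER_2"),
  hgnc_symbol     = c("", "SYMBOL1", NA),
  stringsAsFactors = FALSE
)

# Toy stand-in for the DESeq2 results table
res <- data.frame(
  ensembl = c("ENSG_PLACEHOLDER_1", "ENSG00000167674", "ENSG_NOT_IN_MAP"),
  stringsAsFactors = FALSE
)

# Same lookup as in the question
idx <- match(res$ensembl, genemap$ensembl_gene_id)
res$hgnc_symbol <- genemap$hgnc_symbol[idx]

# Treat NA (ID not found, or no symbol) and "" alike: no HGNC symbol
no_symbol <- is.na(res$hgnc_symbol) | res$hgnc_symbol == ""
missing_ids <- res$ensembl[no_symbol]
print(missing_ids)
```

Checking `missing_ids` against the Ensembl and NCBI websites, as done above for ENSG00000167674, shows whether each case is a genuine naming disagreement or simply an unannotated gene.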

It's usually better to stick with one group (NCBI or Ensembl) when annotating, because otherwise you will keep running into these disagreements.
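
One possible way to stay within a single source is to keep everything keyed on the Ensembl ID and fall back to that ID as the display label whenever no symbol is available, rather than pulling a name from another group. This is a sketch under assumptions, not a recommended pipeline: the `label` column, the placeholder ID, and `SYMBOL1` are hypothetical.

```r
# Toy results table; ENSG00000167674 is from this thread, the rest is made up
res <- data.frame(
  ensembl     = c("ENSG00000167674", "ENSG_PLACEHOLDER_1"),
  hgnc_symbol = c("", "SYMBOL1"),
  stringsAsFactors = FALSE
)

# A usable symbol is one that is neither NA nor an empty string
has_symbol <- !is.na(res$hgnc_symbol) & res$hgnc_symbol != ""

# Use the symbol where there is one, otherwise keep the Ensembl ID
res$label <- ifelse(has_symbol, res$hgnc_symbol, res$ensembl)
print(res)
```

Every row then keeps a unique, traceable identifier from the same annotation source, even when the symbol is missing.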

 
