Question: biomaRt error when associating gene symbol names
5 months ago by
Italy
jarod_v6@libero.it wrote:

Dear all.

I have performed a differential expression analysis using DESeq2 and tximport on RSEM data.

When I try to map gene symbols, some symbols are not annotated. Here is some information on my code and the versions of my tools.

 

 

library("biomaRt")
# ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# Connect to the December 2013 Ensembl archive
ensembl <- useMart(host = "dec2013.archive.ensembl.org", biomart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")

# A single filter takes a single vector of values
genemap <- getBM(attributes = c("ensembl_gene_id", "entrezgene", "hgnc_symbol", "band", "chromosome_name"),
                 filters = "ensembl_gene_id",
                 values = res$ensembl,
                 mart = ensembl)
# Look up each result row in the annotation table (match() keeps the first hit per ID)
idx <- match(res$ensembl, genemap$ensembl_gene_id)
res$entrez <- genemap$entrezgene[idx]
res$hgnc_symbol <- genemap$hgnc_symbol[idx]


> table(res$hgnc_symbol=="HDGFRP2")

FALSE
57905
> table(res$hgnc_symbol=="FUS")

FALSE  TRUE
57904     1
> table(res$hgnc_symbol=="DDIT3")

FALSE  TRUE
57904     1
> table(res$ensembl=="ENSG00000167674")

FALSE  TRUE
57904     1








> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=it_IT.UTF-8      
[4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=it_IT.UTF-8    LC_MESSAGES=en_US.UTF-8  
[7] LC_PAPER=it_IT.UTF-8       LC_NAME=C                  LC_ADDRESS=C             
[10] LC_TELEPHONE=C             LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.28.0             ggplot2_2.2.1              genefilter_1.54.2        
[4] pheatmap_1.0.8             tximportData_1.0.2         tximport_1.0.3           
[7] DESeq2_1.12.4              SummarizedExperiment_1.2.3 Biobase_2.32.0           
[10] GenomicRanges_1.24.3       GenomeInfoDb_1.8.7         IRanges_2.6.1            
[13] S4Vectors_0.10.3           BiocGenerics_0.18.0       

loaded via a namespace (and not attached):
[1] locfit_1.5-9.1       splines_3.3.2        lattice_0.20-35      colorspace_1.3-2   
[5] htmltools_0.3.5      base64enc_0.1-3      survival_2.40-1      XML_3.98-1.5       
[9] foreign_0.8-67       DBI_0.6-1            BiocParallel_1.6.6   RColorBrewer_1.1-2 
[13] plyr_1.8.4           stringr_1.2.0        zlibbioc_1.18.0      munsell_0.4.3      
[17] gtable_0.2.0         htmlwidgets_0.8      memoise_1.0.0        labeling_0.3       
[21] latticeExtra_0.6-28  knitr_1.15.1         geneplotter_1.50.0   AnnotationDbi_1.34.4
[25] htmlTable_1.9        Rcpp_0.12.9          acepack_1.4.1        xtable_1.8-2       
[29] backports_1.0.5      scales_0.4.1         checkmate_1.8.2      Hmisc_4.0-2        
[33] annotate_1.50.1      XVector_0.12.1       gridExtra_2.2.1      digest_0.6.12      
[37] stringi_1.1.2        grid_3.3.2           tools_3.3.2          bitops_1.0-6       
[41] magrittr_1.5         lazyeval_0.2.0       RCurl_1.95-4.8       tibble_1.2         
[45] RSQLite_1.1-2        Formula_1.2-1        cluster_2.0.6        Matrix_1.2-8       
[49] data.table_1.10.0    assertthat_0.1       rpart_4.1-10         nnet_7.3-12

 

5 months ago by
United States
James W. MacDonald wrote:

This isn't an error on the part of biomaRt or Ensembl; it's just how things are. If you search for HDGFRP2 on the HUGO website, they say it's ENSG00000167674. But if you go to Ensembl, they say the gene with that Ensembl ID is called AC011498.1. Ensembl also says there is an overlapping gene from NCBI with Gene ID 84717, which NCBI says is HDGFRP2.

So Ensembl says there is a gene in that region, which they call AC011498.1, and they recognize that NCBI claims an overlapping gene, 84717, that is different in some sense from Ensembl's gene. I'm sure there are some nitty-gritty details that explain exactly why there isn't agreement, but the end result is that NCBI, Ensembl, and HUGO are different groups that may have different ideas about what is and isn't a gene, where the genes are, and so on.
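
To see how many IDs are affected by this kind of disagreement, one option is to count the rows whose symbol came back empty or unmatched. What follows is only a rough sketch on hand-made toy tables standing in for `res` and `genemap`. The values are illustrative and not a real query result, `ENSG00000000000` is a made-up ID, and it assumes that a gene without an HGNC symbol comes back as an empty string rather than `NA`.

```r
# Toy stand-ins for the poster's objects; the contents are illustrative only.
res <- data.frame(
  ensembl = c("ENSG00000089280", "ENSG00000175197",
              "ENSG00000167674", "ENSG00000000000"),  # last ID is made up
  stringsAsFactors = FALSE
)
genemap <- data.frame(
  ensembl_gene_id = c("ENSG00000089280", "ENSG00000175197", "ENSG00000167674"),
  hgnc_symbol     = c("FUS", "DDIT3", ""),  # "" = no HGNC symbol attached (assumed)
  stringsAsFactors = FALSE
)

# Same lookup as in the question
idx <- match(res$ensembl, genemap$ensembl_gene_id)
res$hgnc_symbol <- genemap$hgnc_symbol[idx]

# Treat both "ID not returned at all" (NA) and "returned without a symbol" ("")
# as unannotated.
unannotated <- is.na(res$hgnc_symbol) | res$hgnc_symbol == ""
print(res$ensembl[unannotated])
print(table(unannotated))
```

Note that `table(res$hgnc_symbol == "HDGFRP2")` silently drops `NA` entries, so an explicit check like the one above is a safer way to count what is missing.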

It's usually better to stick with one group (NCBI or Ensembl) when doing annotation, because mixing sources means you will always run into these disagreements.
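
If you stay on the Ensembl side, one way to label genes is to use Ensembl's own name for each ID and fall back to the stable ID when there is none. This is a minimal sketch on a toy table, not a definitive recipe. The column `ensembl_name` is a hypothetical stand-in for whatever attribute holds Ensembl's gene name in the mart release you query (check `listAttributes(ensembl)` for the real name).

```r
# Toy annotation table; in practice the Ensembl-side name would come from the
# same getBM() call. The column name `ensembl_name` is hypothetical.
genemap <- data.frame(
  ensembl_gene_id = c("ENSG00000089280", "ENSG00000167674"),
  hgnc_symbol     = c("FUS", ""),
  ensembl_name    = c("FUS", "AC011498.1"),
  stringsAsFactors = FALSE
)

# Prefer Ensembl's own name; fall back to the stable ID if that is missing too.
has_name <- !is.na(genemap$ensembl_name) & genemap$ensembl_name != ""
genemap$label <- ifelse(has_name, genemap$ensembl_name, genemap$ensembl_gene_id)
print(genemap[, c("ensembl_gene_id", "label")])
```

Every row then carries a label from a single source, even where HGNC and Ensembl disagree about the symbol.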

 
