Question: an error in BioMart regarding gene annotations
gravatar for Bogdan
16 months ago by
Palo Alto, CA, USA
Bogdan580 wrote:

Dear all, 

I am writing to ask please for a piece of help regarding Biomart . I am using the piece of R code below, that was inspired by the package StructuralVariantAnnotation to annotate the Structural Variants from DELLY.

However, the coordinates on chr21 are not annotated properly, in the sense that : shall I input the following coordinates for a breakpoint "chr21:10813930-10813931", it gives me the gene annotations such as "SMIM11B,U2AF1L5,LOC102724652,CRYAA,U2AF1,CBS"

I can post the full code in R, if you wish. Would you please let me know --- has this been noted before ? how shall I fix it ?

thank you very much !

-- bogdan

##################### the piece of R code that I am using is the following : ##############################################################

gns <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)

hits <-, gns, ignore.strand=TRUE))

hits$SYMBOL <- biomaRt::select(, gns[hits$subjectHits]$gene_id, "SYMBOL")$SYMBOL

hits$gene_strand <- as.character(strand(gns[hits$subjectHits]))

hits <- hits %>%
  group_by(queryHits) %>%
  summarise(SYMBOL=paste(SYMBOL, collapse=","), gene_strand=paste0(gene_strand, collapse=""))
biomart annotationdbi • 518 views
ADD COMMENTlink modified 16 months ago by daniel.vantwisk50 • written 16 months ago by Bogdan580

Can you explain why this is surprising? It seems like 6 (Entrez) genes overlap the break point

> olaps = findOverlaps(GRanges("chr21:10813930-10813931"), gns)
> olaps
Hits object with 6 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1        3964
  [2]         1        4053
  [3]         1        4060
  [4]         1        6588
  [5]         1       20165
  [6]         1       22706
  queryLength: 1 / subjectLength: 24183
> gns[subjectHits(olaps)]
GRanges object with 6 ranges and 1 metadata column:
            seqnames           ranges strand |     gene_id
               <Rle>        <IRanges>  <Rle> | <character>
  102723553    chr21 7744962-34407866      + |   102723553
  102724594    chr21 6484661-43107587      - |   102724594
  102724652    chr21 6561284-43172805      + |   102724652
       1409    chr21 6560714-43172805      + |        1409
       7307    chr21 6484623-43107578      - |        7307
        875    chr21 6444869-43076943      - |         875
  seqinfo: 455 sequences (1 circular) from hg38 genome

and that these map to 6 gene SYMBOLs ?

> egid = gns[subjectHits(olaps)]$gene_id
> select(, egid, "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
1 102723553 SMIM11B
2 102724594 U2AF1L5
3 102724652  CRYAA2
4      1409   CRYAA
5      7307   U2AF1
6       875     CBS


ADD REPLYlink written 16 months ago by Martin Morgan ♦♦ 24k

Hi Martin, thank you for your comments and precious help. It has been surprising for me because the coordinates of the BREAKPOINT (i.e.  chr21:10,813,930-10,813,931) are 3 MB away from SMIM11B (chr21:7,748,899-7,777,853), at least according to UCSC genome browser (the link to the gene area is below). I would aim to know only the genes that are intersecting the breakpoint, and not necessarily the genes that are 3Mb away. Any suggestions are welcome. Thanks you.,
ADD REPLYlink modified 16 months ago • written 16 months ago by Bogdan580

Likely the explanation is that the org.* and TxDb.* packages are static and report the information at time of publication from particular sources, whereas UCSC is dynamic and presents the most recent updates of possibly different sources. The information in the annotation objects are

> TxDb.Hsapiens.UCSC.hg38.knownGene
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: hg38
# Organism: Homo sapiens
# Taxonomy ID: 9606
# UCSC Table: knownGene
# UCSC Track: GENCODE v24
# Resource URL:
# Type of Gene ID: Entrez Gene ID
# Full dataset: yes
# miRBase build ID: NA
# transcript_nrow: 197782
# exon_nrow: 581036
# cds_nrow: 293052
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2016-09-29 13:02:09 +0000 (Thu, 29 Sep 2016)
# GenomicFeatures version at creation time: 1.25.18
# RSQLite version at creation time: 1.0.0
OrgDb object:
| Db type: OrgDb
| Supporting package: AnnotationDbi
| ORGANISM: Homo sapiens
| SPECIES: Human
| TAXID: 9606
| GOSOURCENAME: Gene Ontology
| GOSOURCEDATE: 2018-Mar28
| GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
| GPSOURCEDATE: 2018-Mar26
| ENSOURCEDATE: 2017-Dec04
| UPSOURCEDATE: Mon Apr  9 20:58:54 2018

Please see: help('select') for usage information
ADD REPLYlink modified 16 months ago • written 16 months ago by Martin Morgan ♦♦ 24k

There was presumably some something wrong in the source file used at the time of generation, as those 6 genes each span approximately 30Mb, which seems unlikely.

width(gns[hits$subjectHits]) / 1e6 
[1] 26.66290 36.62293 36.61152 36.61209 36.62296 36.63208
ADD REPLYlink modified 16 months ago • written 16 months ago by Mike Smith4.0k

Dear Martin, thank you for your suggestion. To me ... hmmm ... it is still a bit surprising: for example, shall I input the coordinate ("chr21:21004327-21004328") that overlap NCAM2 gene, the result (after running your code) is the following (below). It contains NCAM2 gene, but also those additional genes that are 2-3 MB away.


> select(, egid, "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
1 102723553      SMIM11B
2 102724594      U2AF1L5
3 102724652 LOC102724652
4      1409        CRYAA
5      4685        NCAM2
6      7307        U2AF1
7       875          CBS


ADD REPLYlink written 16 months ago by Bogdan580

The code is doing the correct thing, it's finding entries in TxDb.Hsapiens.UCSC.hg38.knownGene that overlap with your query. However, the 6 genes that keep getting returned have one coordinate incorrect in their annotation and so appear to span a huge stretch of the chromosome and overlap a lot of things.

ADD REPLYlink written 16 months ago by Mike Smith4.0k

Thank you Mike. Yes, I share the same opinions. For now, I will just ignore those genes, until a new release of  "" becomes available. Thank you.

ADD REPLYlink written 16 months ago by Bogdan580

I don't think this is really a biomaRt question.  You're not actually querying a BioMart instance here, the annotation is found in TxDb.Hsapiens.UCSC.hg38.knownGene and the map between symbols is in which are both an AnnotationDb objects.  You're using biomaRt::select() but in the background this is really dispatching the version in AnnotationDbi, so you might get a better response if you add that tag to the question.

ADD REPLYlink modified 16 months ago • written 16 months ago by Mike Smith4.0k
Answer: an error in BioMart regarding gene annotations
gravatar for daniel.vantwisk
16 months ago by
daniel.vantwisk50 wrote:

Martin Morgan and I looked into the issue today and it seems to be a problem on UCSC's end.  There seems to be an issue with mapping between UCSC's knownGene and LocusLink (i.e. the knownToLocusLink table) that can map two separate genes to the same Entrez gene IDs.  For example, SMIM11A and SMIM11B seem to be equated with each other as the Entrez Gene ID '102723553', as shown by this table gleaned from the process of creating the TxDb.Hsapiens.UCSC.hg38.knownGene package using GenomicFeatures makeTxDbPackageFromUCSC function:

         tx_name   gene_id

11929 uc061yyb.1 102723553
11930 uc061yyc.1 102723553
11931 uc032pse.2 102723553
11932 uc061yyd.1 102723553
11933 uc061yye.1 102723553
11934 uc061yyf.1 102723553
11935 uc061yyg.1 102723553
12263 uc002ytw.3 102723553
12264 uc002ytu.5 102723553
12265 uc061zvo.1 102723553
12266 uc061zvp.1 102723553
12267 uc061zvq.1 102723553
12268 uc061zvr.1 102723553
12269 uc061zvs.1 102723553

Transcripts that should be associated with only SMIM11A or SMIM11B are being equated with each other.  The following link shows which transcripts are actually associated with which genes:

This explains some of the irregularities such as the 30Mb spans that Mike Smith pointed out.

We're sending UCSC a report regarding this issue.  Please respond back if this explanation wasn't clear enough.

ADD COMMENTlink written 16 months ago by daniel.vantwisk50

Thank you Daniel, and Martin, for your help, time, suggestions and explanations. I may work then with :

transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene) instead of genes(TxDb.Hsapiens.UCSC.hg38.knownGene).


ADD REPLYlink modified 16 months ago • written 16 months ago by Bogdan580

Any news about package update? Thanks.

ADD REPLYlink written 15 months ago by foehn60
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 438 users visited in the last hour