an error in BioMart regarding gene annotations
1
2
Entering edit mode
Bogdan ▴ 620
@bogdan-2367
Last seen 24 days ago
Palo Alto, CA, USA

Dear all,

I am writing to ask please for a piece of help regarding Biomart . I am using the piece of R code below, that was inspired by the package StructuralVariantAnnotation to annotate the Structural Variants from DELLY.

https://github.com/PapenfussLab/gridss/blob/master/example/somatic-fusion-gene-candidates.R

However, the coordinates on chr21 are not annotated properly, in the sense that : shall I input the following coordinates for a breakpoint "chr21:10813930-10813931", it gives me the gene annotations such as "SMIM11B,U2AF1L5,LOC102724652,CRYAA,U2AF1,CBS"

I can post the full code in R, if you wish. Would you please let me know --- has this been noted before ? how shall I fix it ?

thank you very much !

-- bogdan

##################### the piece of R code that I am using is the following : ##############################################################

gns <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)

hits <- as.data.frame(findOverlaps(gr, gns, ignore.strand=TRUE))

hits$SYMBOL <- biomaRt::select(org.Hs.eg.db, gns[hits$subjectHits]$gene_id, "SYMBOL")$SYMBOL

hits$gene_strand <- as.character(strand(gns[hits$subjectHits]))

hits <- hits %>%
group_by(queryHits) %>%
summarise(SYMBOL=paste(SYMBOL, collapse=","), gene_strand=paste0(gene_strand, collapse=""))
biomart AnnotationDbi • 754 views
1
Entering edit mode

Can you explain why this is surprising? It seems like 6 (Entrez) genes overlap the break point

> olaps = findOverlaps(GRanges("chr21:10813930-10813931"), gns)
> olaps
Hits object with 6 hits and 0 metadata columns:
queryHits subjectHits
<integer>   <integer>
[1]         1        3964
[2]         1        4053
[3]         1        4060
[4]         1        6588
[5]         1       20165
[6]         1       22706
-------
queryLength: 1 / subjectLength: 24183
> gns[subjectHits(olaps)]
GRanges object with 6 ranges and 1 metadata column:
seqnames           ranges strand |     gene_id
<Rle>        <IRanges>  <Rle> | <character>
102723553    chr21 7744962-34407866      + |   102723553
102724594    chr21 6484661-43107587      - |   102724594
102724652    chr21 6561284-43172805      + |   102724652
1409    chr21 6560714-43172805      + |        1409
7307    chr21 6484623-43107578      - |        7307
875    chr21 6444869-43076943      - |         875
-------
seqinfo: 455 sequences (1 circular) from hg38 genome



and that these map to 6 gene SYMBOLs ?

> egid = gns[subjectHits(olaps)]$gene_id > select(org.Hs.eg.db, egid, "SYMBOL") 'select()' returned 1:1 mapping between keys and columns ENTREZID SYMBOL 1 102723553 SMIM11B 2 102724594 U2AF1L5 3 102724652 CRYAA2 4 1409 CRYAA 5 7307 U2AF1 6 875 CBS ADD REPLY 0 Entering edit mode Hi Martin, thank you for your comments and precious help. It has been surprising for me because the coordinates of the BREAKPOINT (i.e. chr21:10,813,930-10,813,931) are 3 MB away from SMIM11B (chr21:7,748,899-7,777,853), at least according to UCSC genome browser (the link to the gene area is below). I would aim to know only the genes that are intersecting the breakpoint, and not necessarily the genes that are 3Mb away. Any suggestions are welcome. Thanks you., https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr21%3A7748899-7777853 ADD REPLY 0 Entering edit mode Likely the explanation is that the org.* and TxDb.* packages are static and report the information at time of publication from particular sources, whereas UCSC is dynamic and presents the most recent updates of possibly different sources. The information in the annotation objects are > TxDb.Hsapiens.UCSC.hg38.knownGene TxDb object: # Db type: TxDb # Supporting package: GenomicFeatures # Data source: UCSC # Genome: hg38 # Organism: Homo sapiens # Taxonomy ID: 9606 # UCSC Table: knownGene # UCSC Track: GENCODE v24 # Resource URL: http://genome.ucsc.edu/ # Type of Gene ID: Entrez Gene ID # Full dataset: yes # miRBase build ID: NA # transcript_nrow: 197782 # exon_nrow: 581036 # cds_nrow: 293052 # Db created by: GenomicFeatures package from Bioconductor # Creation time: 2016-09-29 13:02:09 +0000 (Thu, 29 Sep 2016) # GenomicFeatures version at creation time: 1.25.18 # RSQLite version at creation time: 1.0.0 # DBSCHEMAVERSION: 1.1 > org.Hs.eg.db OrgDb object: | DBSCHEMAVERSION: 2.1 | Db type: OrgDb | Supporting package: AnnotationDbi | DBSCHEMA: HUMAN_DB | ORGANISM: Homo sapiens | SPECIES: Human | EGSOURCEDATE: 2018-Apr4 | EGSOURCENAME: Entrez Gene | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA | CENTRALID: EG | TAXID: 9606 | GOSOURCENAME: Gene Ontology | GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/ | GOSOURCEDATE: 2018-Mar28 | GOEGSOURCEDATE: 2018-Apr4 | GOEGSOURCENAME: Entrez Gene | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA | KEGGSOURCENAME: KEGG GENOME | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes | KEGGSOURCEDATE: 2011-Mar15 | GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens) | GPSOURCEURL: | GPSOURCEDATE: 2018-Mar26 | ENSOURCEDATE: 2017-Dec04 | ENSOURCENAME: Ensembl | ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta | UPSOURCENAME: Uniprot | UPSOURCEURL: http://www.UniProt.org/ | UPSOURCEDATE: Mon Apr 9 20:58:54 2018 Please see: help('select') for usage information ADD REPLY 0 Entering edit mode There was presumably some something wrong in the source file used at the time of generation, as those 6 genes each span approximately 30Mb, which seems unlikely. width(gns[hits$subjectHits]) / 1e6
[1] 26.66290 36.62293 36.61152 36.61209 36.62296 36.63208

0
Entering edit mode

Dear Martin, thank you for your suggestion. To me ... hmmm ... it is still a bit surprising: for example, shall I input the coordinate ("chr21:21004327-21004328") that overlap NCAM2 gene, the result (after running your code) is the following (below). It contains NCAM2 gene, but also those additional genes that are 2-3 MB away.

> select(org.Hs.eg.db, egid, "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
ENTREZID       SYMBOL
1 102723553      SMIM11B
2 102724594      U2AF1L5
3 102724652 LOC102724652
4      1409        CRYAA
5      4685        NCAM2
6      7307        U2AF1
7       875          CBS

0
Entering edit mode

The code is doing the correct thing, it's finding entries in TxDb.Hsapiens.UCSC.hg38.knownGene that overlap with your query. However, the 6 genes that keep getting returned have one coordinate incorrect in their annotation and so appear to span a huge stretch of the chromosome and overlap a lot of things.

0
Entering edit mode

Thank you Mike. Yes, I share the same opinions. For now, I will just ignore those genes, until a new release of  "org.Hs.eg.db" becomes available. Thank you.

0
Entering edit mode

I don't think this is really a biomaRt question.  You're not actually querying a BioMart instance here, the annotation is found in TxDb.Hsapiens.UCSC.hg38.knownGene and the map between symbols is in org.Hs.eg.db which are both an AnnotationDb objects.  You're using biomaRt::select() but in the background this is really dispatching the version in AnnotationDbi, so you might get a better response if you add that tag to the question.

2
Entering edit mode
@danielvantwisk-13028
Last seen 15 months ago

Martin Morgan and I looked into the issue today and it seems to be a problem on UCSC's end.  There seems to be an issue with mapping between UCSC's knownGene and LocusLink (i.e. the knownToLocusLink table) that can map two separate genes to the same Entrez gene IDs.  For example, SMIM11A and SMIM11B seem to be equated with each other as the Entrez Gene ID '102723553', as shown by this table gleaned from the process of creating the TxDb.Hsapiens.UCSC.hg38.knownGene package using GenomicFeatures makeTxDbPackageFromUCSC function:

tx_name   gene_id

11929 uc061yyb.1 102723553
11930 uc061yyc.1 102723553
11931 uc032pse.2 102723553
11932 uc061yyd.1 102723553
11933 uc061yye.1 102723553
11934 uc061yyf.1 102723553
11935 uc061yyg.1 102723553
12263 uc002ytw.3 102723553
12264 uc002ytu.5 102723553
12265 uc061zvo.1 102723553
12266 uc061zvp.1 102723553
12267 uc061zvq.1 102723553
12268 uc061zvr.1 102723553
12269 uc061zvs.1 102723553

Transcripts that should be associated with only SMIM11A or SMIM11B are being equated with each other.  The following link shows which transcripts are actually associated with which genes: https://genome.ucsc.edu/cgi-bin/hgTracks?hgtgroup_map_close=0&hgtgroup_genes_close=0&hgtgroup_phenDis_close=0&hgtgroup_rna_close=0&hgtgroup_expression_close=0&hgtgroup_regulation_close=0&hgtgroup_compGeno_close=0&hgtgroup_varRep_close=0&hgtgroup_rep_close=0&hgsid=680056393_bStG3WiGNgm9raLjtcAGj1CRRv1O&position=SMIM11A&hgt.positionInput=SMIM11A&hgt.jump=go&hgt.suggestTrack=knownGene&db=hg38&c=chr1&l=11102836&r=11267747&pix=1245&dinkL=2.0&dinkR=2.0

This explains some of the irregularities such as the 30Mb spans that Mike Smith pointed out.

We're sending UCSC a report regarding this issue.  Please respond back if this explanation wasn't clear enough.

0
Entering edit mode

Thank you Daniel, and Martin, for your help, time, suggestions and explanations. I may work then with :

0
Entering edit mode

Any news about package update? Thanks.