I have been trying all day to get the gene symbols for a large set (~7000) of hg19 coordinates. Some coordinates will not overlap a gene and some coordinates can overlap with several genes.

I've tried using biomaRt, but it seems I have to query each coordinate one at a time, which is going to take many hours to complete (and I worry about spamming biomaRt; is that possible?). I also tried to use TxDbUCSCKnownGene but it has outdated gene_ids. Finally, is based on hg38 and not hg19 (I think).

I'm probably just overlooking something. How do I solve this? Is there a table of gene annotations with coordinates for hg19 that I can download? I'll just write my own script to look for overlaps if I have to.

Thank you

Simply download a GTF file (for example from GENCODE) that matches the genome build and annotation version you are using in your project, and take the gene annotations from there.

Load the GTF into R using rtracklayer::import and then use the GenomicRanges overlap functions (for example findOverlaps or subsetByOverlaps) to intersect your ranges with the GTF (which is a GRanges object after loading). From there you can filter as needed. Yes, some genomic sites have overlapping genes (for example one on the + and one on the - strand). There is no general answer on how to deal with this; it depends on what you want to do downstream. For code suggestions please add meaningful example data, via dput().
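Without example data, here is only a minimal sketch of the overlap step, not a tested pipeline. It uses toy ranges in place of a real GTF so that it runs on its own. The gene names, coordinates, and the file name in the comment are made up for illustration, and it assumes the GTF carries a `gene_name` column, which you should check in your own file.

```r
library(GenomicRanges)

# Toy stand-in for the gene records of a GTF. With a real file this would be
# something like the following (hypothetical file name, assumed columns):
#   gtf   <- rtracklayer::import("your_hg19_annotation.gtf.gz")
#   genes <- gtf[gtf$type == "gene"]
genes <- GRanges(
  seqnames  = c("chr1", "chr1", "chr2"),
  ranges    = IRanges(start = c(100, 150, 500), end = c(200, 300, 600)),
  strand    = c("+", "-", "+"),
  gene_name = c("GENE_A", "GENE_B", "GENE_C")  # made-up symbols
)

# Toy query coordinates (also made up): the first overlaps two genes,
# the second overlaps one gene, the third overlaps none.
query <- GRanges(
  seqnames = c("chr1", "chr2", "chr3"),
  ranges   = IRanges(start = c(160, 550, 10), end = c(170, 560, 20))
)

# All query/gene overlaps in one vectorised call. The query strand is "*",
# so genes on either strand are matched.
hits <- findOverlaps(query, genes)

# Collapse the gene names per query; queries without any hit stay NA.
symbols   <- rep(NA_character_, length(query))
per_query <- tapply(genes$gene_name[subjectHits(hits)], queryHits(hits),
                    paste, collapse = ";")
symbols[as.integer(names(per_query))] <- per_query
query$gene_symbols <- symbols

query
```

Collapsing multiple genes into one semicolon-separated string is just one way to handle coordinates that hit several genes. Keeping one row per hit (for example via as.data.frame(hits)) may suit you better, depending on what you do afterwards.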

Now why didn't I think of that? I had tunnel vision thinking I needed to use an annotation package. Thank you ATpoint!


