Question

Why does biomart have multiple positions for the same gene

0

Entering edit mode

adeler001 • 0

@adeler001-21743

Last seen 2.2 years ago

Canada

I download annotations of the chromosome start and end position from biomart , I am wondering why some of the genes appear multiple times (see picture) with different start and end positions? Also is there a way to automate deleting these gene duplicates? enter image description here . \

biomaRt • 739 views

ADD COMMENT • link 2.4 years ago adeler001 • 0

score 0 · Answer 1 · 2021-12-02

Some genes appear multiple times because there is evidence that they arise from multiple positions in the genome. If you are asking 'where are the genes located', and we think there are multiple locations for a given gene, you will get more than one position returned. For example, there are several genes that are found on both the X and Y chromosomes (true fact!). Plus there are any number of genes that are thought to be located on various haplotypes and unplaced scaffolds.

There are functions (like duplicated or unique) that can be used to subset that list to single positions. I would also imagine there are ways to do so in Excel, since it appears that's where you have the position data currently.

If you aren't wedded to Ensembl mappings, you could always use the Homo.sapiens package, which will by default exclude any gene(s) that are found in multiple places

> library(Homo.sapiens)
> library(TxDb.Hsapiens.UCSC.hg38.knownGene)
> TxDb(Homo.sapiens) <- TxDb.Hsapiens.UCSC.hg38.knownGene
> gns <- genes(Homo.sapiens, columns = "SYMBOL")
1655 genes were dropped because they have exons located on both strands
  of the same reference sequence or on more than one reference sequence,
  so cannot be represented by a single genomic range.
  Use 'single.strand.genes.only=FALSE' to get all the genes in a
  GRangesList object, or use suppressMessages() to suppress this message.
'select()' returned 1:1 mapping between keys and columns
> gns
GRanges object with 29474 ranges and 1 metadata column:
            seqnames              ranges strand |          SYMBOL
               <Rle>           <IRanges>  <Rle> | <CharacterList>
          1    chr19   58345178-58362751      - |            A1BG
         10     chr8   18386311-18401218      + |            NAT2
        100    chr20   44619522-44652233      - |             ADA
       1000    chr18   27932879-28177946      - |            CDH2
  100008586     chrX   49551278-49568218      + |         GAGE12F
        ...      ...                 ...    ... .             ...
       9990    chr15   34229784-34338060      - |         SLC12A6
       9991     chr9 112217716-112333664      - |           PTBP3
       9992    chr21   34364006-34371381      + |           KCNE2
       9993    chr22   19036282-19122454      - |           DGCR2
       9997    chr22   50523568-50526461      - |            SCO2
  -------
  seqinfo: 640 sequences (1 circular) from hg38 genome