Search
Question: BiomaRT inconsistency for PDGFRA, returns FIP1L1 instead
0
gravatar for Stephane Plaisance | VIB |
13 months ago by

I try to fetch gene symbols from coordinate ranges (in good old hg19 space). It works perfectly except for one region where I get FIP1L1 in first position before the expected PDGFRA while the two genes are not even overlapping. Actually, FIP1L1 is located quite far downstream from PDGFRA.

Could someone explain this behavior and how to avoid it?

Thanks

You can reproduce my result with:

library("biomaRt")

ensembl=useMart(host='feb2014.archive.ensembl.org', 
                biomart='ENSEMBL_MART_ENSEMBL', 
                dataset="hsapiens_gene_ensembl")
getBM(attributes = "hgnc_symbol", 
      filters = c('chromosome_name','start','end'), 
      values = list("4","55127193","55127663"), 
      mart=ensembl)

#  hgnc_symbol
# 1      FIP1L1
# 2      PDGFRA

This is true for several intervals as seen in IGV

UCSC confirms the 'fraud' (disclaimer, this was a joke and I apology if I did hurt people's feeling)

http://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg19&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr4%3A55127193-55127663&hgsid=218264807_Z9OOPdRJRvGjAxC6MfObhoYdLIeU

 

ADD COMMENTlink modified 13 months ago • written 13 months ago by Stephane Plaisance | VIB |50
1
gravatar for James W. MacDonald
13 months ago by
United States
James W. MacDonald45k wrote:

It's not a fraud, but a misunderstanding on your part. And fraud is a pretty strong term to be using in this situation. Ensembl and UCSC are two different groups, with different methods of saying what genes are where. You are simply pointing out a difference in opinion between the two groups.

>library(EnsDb.Hsapiens.v75)
> z <- genes(EnsDb.Hsapiens.v75)
> subsetByOverlaps(z,  GRanges("4:55127193-55127663"))
GRanges object with 2 ranges and 5 metadata columns:
                  seqnames               ranges strand |         gene_id
                     <Rle>            <IRanges>  <Rle> |     <character>
  ENSG00000145216        4 [54243810, 55161439]      + | ENSG00000145216
  ENSG00000134853        4 [55095264, 55164414]      + | ENSG00000134853
                    gene_name    entrezid   gene_biotype seq_coord_system
                  <character> <character>    <character>      <character>
  ENSG00000145216      FIP1L1       81608 protein_coding       chromosome
  ENSG00000134853      PDGFRA        5156 protein_coding       chromosome
  -------
  seqinfo: 273 sequences from GRCh37 genome

> library(Homo.sapiens)
> zz <- genes(Homo.sapiens, columns = "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
> subsetByOverlaps(zz,  GRanges("chr4:55127193-55127663"))
GRanges object with 1 range and 1 metadata column:
       seqnames               ranges strand |          SYMBOL
          <Rle>            <IRanges>  <Rle> | <CharacterList>
  5156     chr4 [54243820, 55164412]      + |          PDGFRA
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome
ADD COMMENTlink written 13 months ago by James W. MacDonald45k
1

Just to echo this, you can clearly see in the Ensembl browser that at least one isoform of FIP1L1 spans PDGFRA.

Ensembl browser

ADD REPLYlink written 13 months ago by Mike Smith2.1k

Dear James,

Thanks very much for answering my post and very sorry that you did not get my joke about 'fraude' (of course I did not mean to be rude, just trying to put it in a funny way).

Concerning your evidence that Ensembl reports a transcript spanning PDGFRA, it is nice but I kind of doubt this one is true (putting PDGFRA out of the picture at once is a bit hard for the Biologist I am - again joking, for the record :-)). I read gene fusion events exist between these two genes that might explain its presence (hypereosinophilic syndrome in http://www.uniprot.org/uniprot/Q6UN15).

Thanks you very much for the code to query UCSC, I am replacing my biomaRt code by this one and it should answer my needs when I succeed in muting the many <'select()' returned 1:1 mapping between keys and columns> messages.

I found this potential exception by looking at 300 locations in IGV and wonder how many other are not in agreement with UCSC/NCBI.

Best regards,

Stephane

ADD REPLYlink written 13 months ago by Stephane Plaisance | VIB |50

But it's not so simple as all that. If you choose to use the UCSC mappings, they perpetrate the exact opposite 'fraude' on you, saying that PDGFRA overlaps FIP1L1:

> fip1l1 <- GRanges("chr4:54186117-54341911")

> subsetByOverlaps(genes(Homo.sapiens, columns = "SYMBOL"), fip1l1)
'select()' returned 1:1 mapping between keys and columns
GRanges object with 4 ranges and 1 metadata column:
         seqnames               ranges strand |          SYMBOL
            <Rle>            <IRanges>  <Rle> | <CharacterList>
  152579     chr4 [53739151, 54232242]      - |           SCFD2
    5156     chr4 [54243820, 55164412]      + |          PDGFRA
   81608     chr4 [54243820, 54326103]      + |          FIP1L1
   84708     chr4 [54326437, 54518682]      - |            LNX1
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

 

Whereas Ensembl doesn't.

> seqlevelsStyle(fip1l1) <- "Ensembl"
> subsetByOverlaps(genes(EnsDb.Hsapiens.v75), fip1l1)
GRanges object with 3 ranges and 5 metadata columns:
                  seqnames               ranges strand |         gene_id
                     <Rle>            <IRanges>  <Rle> |     <character>
  ENSG00000184178        4 [53739149, 54232242]      - | ENSG00000184178
  ENSG00000145216        4 [54243810, 55161439]      + | ENSG00000145216
  ENSG00000072201        4 [54325468, 54567572]      - | ENSG00000072201
                    gene_name    entrezid   gene_biotype seq_coord_system
                  <character> <character>    <character>      <character>
  ENSG00000184178       SCFD2      152579 protein_coding       chromosome
  ENSG00000145216      FIP1L1       81608 protein_coding       chromosome
  ENSG00000072201        LNX1       84708 protein_coding       chromosome
  -------
  seqinfo: 273 sequences from GRCh37 genome

So it comes down (in this instance) to what the two different groups want to call the gene fusion transcript. One says PDGFRA, and the other says FIP1L1.

ADD REPLYlink written 13 months ago by James W. MacDonald45k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 136 users visited in the last hour