Question

Different answer from Homo.sapiens and biomaRt packages!!

0

Entering edit mode

NS ▴ 60

@ns-7498

Last seen 6.9 years ago

United States

I have some genome regions which I am gonna map them to gene symbols.

To do that there are two methods which have different result. I have mentioned one example below:

1) Using 'Homo.sapeins' package

library(Homo.sapiens)
geneRanges <- function(db, column="ENTREZID")
{
  g <- genes(db, columns=column)
  col <- mcols(g)[[column]]
  genes <- granges(g)[rep(seq_along(g), elementLengths(col))]
  mcols(genes)[[column]] <- as.character(unlist(col))
  genes
}

gns <- geneRanges(Homo.sapiens, column="SYMBOL")

splitColumnByOverlap <-  function(query, subject, column="ENTREZID", ...)
{
  olaps <- findOverlaps(query, subject, ...)
  
  f1 <- factor(subjectHits(olaps), levels=seq_len(subjectLength(olaps)))
  
  splitAsList(mcols(query)[[column]][queryHits(olaps)], f1)
}

library(GenomicRanges)

cnv <- makeGRangesFromDataFrame(data.frame(chr="chr2", start=109993208, end=109993264))

gene.Symbol <- splitColumnByOverlap(gns, cnv, "SYMBOL")

The obtained gene symbol is 'SH3RF3'.

2) using 'biomaRt' package

library(GenomicRanges)
library(biomaRt)

mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
region <- "2:109993208:109993264"
gene.Symbol <- getBM(attributes = c("hgnc_symbol", "chromosome_name", "start_position","end_position"),
            filters = c("chromosomal_region"), values=region, mart=mart)


The obtained result is:

hgnc_symbol chromosome_name start_position end_position
1   LINC01123               2      109987063    109996140

Anyone knows why the results are different from each other?

Thanks.

biomart homo.sapiens genome regions gene symbol • 3.7k views

ADD COMMENT • link updated 10.7 years ago by Hervé Pagès 16k • written 10.7 years ago by NS ▴ 60

1

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 11 hours ago

Seattle, WA, United States

Hi,

The usual suspects are:

1) Homo.sapiens and the hsapiens_gene_ensembl data set from the ensembl mart use different gene models: the knownGene track from UCSC for the former, Ensembl genes for the latter.

2) The Homo.sapiens object is a high-level object that bundles together the gene-centric annotations from org.Hs.eg.db and the transcript-centric annotations from TxDb.Hsapiens.UCSC.hg19.knownGene. The show method gives you that information (in a somewhat obfuscated way I must admit, but some work is currently in the process to address this):

> Homo.sapiens
class: OrganismDb 
Annotation resources:
[1] "GO.db"                             "org.Hs.eg.db"                     
[3] "TxDb.Hsapiens.UCSC.hg19.knownGene"
Annotation relationships:
     xDbs           yDbs                                xKeys      yKeys   
[1,] "GO.db"        "org.Hs.eg.db"                      "GOID"     "GO"    
[2,] "org.Hs.eg.db" "TxDb.Hsapiens.UCSC.hg19.knownGene" "ENTREZID" "GENEID"
For more details, please see the show methods for the component objects listed above.

However, the current Ensembl release is based on the GRCh38 assembly (a.k.a. hg38), not hg19.

Note that the Homo.sapiens object is meant to be modular and there is also some work in the process for making it easy for the user to replace the TxDb component with a custom one. Once this is available, you'll be able to do something like:

library(Homo.sapiens)
ens_txdb <- makeTxDbFromBiomart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
txdb(Homo.sapiens) <- ens_txdb  # doesn't work yet!

and try again to compare the Homo.sapiens solution with the pure biomaRt solution for mapping your genome regions to gene symbols.

In the meantime, let's try to do the mapping by using ens_txdb directly:

tx_by_gene <- transcriptsBy(ens_txdb, by="gene")
hits <- findOverlaps(cnv, tx_by_gene)
gene.EnsId <- extractList(names(tx_by_gene), as(hits, "List"))

I can't run the above code at the moment (my call to makeTxDbFromBiomart() is still running after 20 min, the BioMart server seems horribly slow today), but hopefully gene.EnsId will contain ENSG00000204588, the Ensembl gene id for LINC01123.

H.

ADD COMMENT • link 10.7 years ago Hervé Pagès 16k

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 17 hours ago

United States

The short answer is that your region does overlap SH3RF3, and that lincRNA's Entrez Gene ID doesn't appear to be in UCSC:

> gns2 <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)

Entrez Gene ID for SH3RF3 is 344558, and for the lincRNA is 440894

> gns2[mcols(gns2)[,1] %in%  "344558",]
GRanges object with 1 range and 1 metadata column:
         seqnames                 ranges strand |     gene_id
            <Rle>              <IRanges>  <Rle> | <character>
  344558     chr2 [109745997, 110262207]      + |      344558
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome
> gns2[mcols(gns2)[,1] %in%  "440894",]
GRanges object with 0 ranges and 1 metadata column:
   seqnames    ranges strand |     gene_id
      <Rle> <IRanges>  <Rle> | <character>
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Different databases run by different people can have different information in them...

ADD COMMENT • link 10.7 years ago James W. MacDonald 68k

score 2 · Accepted Answer · 2015-05-18

2

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 4 days ago

United States

The version of the genome you are querying against?

ADD COMMENT • link 10.7 years ago Steve Lianoglou ★ 13k