Question: Different answer from Homo.sapiens and biomaRt packages!!
gravatar for NS
3.2 years ago by
United States
NS10 wrote:

I have some genome regions which I am gonna map them to gene symbols.

To do that there are two methods which have different result. I have mentioned one example below:

1) Using 'Homo.sapeins' package    

geneRanges <- function(db, column="ENTREZID")
  g <- genes(db, columns=column)
  col <- mcols(g)[[column]]
  genes <- granges(g)[rep(seq_along(g), elementLengths(col))]
  mcols(genes)[[column]] <- as.character(unlist(col))

gns <- geneRanges(Homo.sapiens, column="SYMBOL")

splitColumnByOverlap <-  function(query, subject, column="ENTREZID", ...)
  olaps <- findOverlaps(query, subject, ...)
  f1 <- factor(subjectHits(olaps), levels=seq_len(subjectLength(olaps)))
  splitAsList(mcols(query)[[column]][queryHits(olaps)], f1)


cnv <- makeGRangesFromDataFrame(data.frame(chr="chr2", start=109993208, end=109993264))

gene.Symbol <- splitColumnByOverlap(gns, cnv, "SYMBOL")

The obtained gene symbol is 'SH3RF3'.

2) using 'biomaRt' package


mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
region <- "2:109993208:109993264"
gene.Symbol <- getBM(attributes = c("hgnc_symbol", "chromosome_name", "start_position","end_position"),
            filters = c("chromosomal_region"), values=region, mart=mart)

The obtained result is:

hgnc_symbol chromosome_name start_position end_position
1   LINC01123               2      109987063    109996140

Anyone knows why the results are different from each other?


ADD COMMENTlink modified 3.2 years ago by Hervé Pagès ♦♦ 13k • written 3.2 years ago by NS10
gravatar for Steve Lianoglou
3.2 years ago by
Steve Lianoglou12k wrote:

The version of the genome you are querying against?


ADD COMMENTlink written 3.2 years ago by Steve Lianoglou12k
gravatar for Hervé Pagès
3.2 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:


The usual suspects are:

1) Homo.sapiens and the hsapiens_gene_ensembl data set from the ensembl mart use different gene models: the knownGene track from UCSC for the former, Ensembl genes for the latter.

2) The Homo.sapiens object is a high-level object that bundles together the gene-centric annotations from and the transcript-centric annotations from TxDb.Hsapiens.UCSC.hg19.knownGene. The show method gives you that information (in a somewhat obfuscated way I must admit, but some work is currently in the process to address this):

> Homo.sapiens
class: OrganismDb 
Annotation resources:
[1] "GO.db"                             ""                     
[3] "TxDb.Hsapiens.UCSC.hg19.knownGene"
Annotation relationships:
     xDbs           yDbs                                xKeys      yKeys   
[1,] "GO.db"        ""                      "GOID"     "GO"    
[2,] "" "TxDb.Hsapiens.UCSC.hg19.knownGene" "ENTREZID" "GENEID"
For more details, please see the show methods for the component objects listed above.

However, the current Ensembl release is based on the GRCh38 assembly (a.k.a. hg38), not hg19.

Note that the Homo.sapiens object is meant to be modular and there is also some work in the process for making it easy for the user to replace the TxDb component with a custom one. Once this is available, you'll be able to do something like:

ens_txdb <- makeTxDbFromBiomart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
txdb(Homo.sapiens) <- ens_txdb  # doesn't work yet!

and try again to compare the Homo.sapiens solution with the pure biomaRt solution for mapping your genome regions to gene symbols.

In the meantime, let's try to do the mapping by using ens_txdb directly:

tx_by_gene <- transcriptsBy(ens_txdb, by="gene")
hits <- findOverlaps(cnv, tx_by_gene)
gene.EnsId <- extractList(names(tx_by_gene), as(hits, "List"))

I can't run the above code at the moment (my call to makeTxDbFromBiomart() is still running after 20 min, the BioMart server seems horribly slow today), but hopefully gene.EnsId will contain ENSG00000204588, the Ensembl gene id for LINC01123.


ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by Hervé Pagès ♦♦ 13k
gravatar for James W. MacDonald
3.2 years ago by
United States
James W. MacDonald46k wrote:

The short answer is that your region does overlap SH3RF3, and that lincRNA's Entrez Gene ID doesn't appear to be in UCSC:

> gns2 <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)

Entrez Gene ID for SH3RF3 is 344558, and for the lincRNA is 440894

> gns2[mcols(gns2)[,1] %in%  "344558",]
GRanges object with 1 range and 1 metadata column:
         seqnames                 ranges strand |     gene_id
            <Rle>              <IRanges>  <Rle> | <character>
  344558     chr2 [109745997, 110262207]      + |      344558
  seqinfo: 93 sequences (1 circular) from hg19 genome
> gns2[mcols(gns2)[,1] %in%  "440894",]
GRanges object with 0 ranges and 1 metadata column:
   seqnames    ranges strand |     gene_id
      <Rle> <IRanges>  <Rle> | <character>
  seqinfo: 93 sequences (1 circular) from hg19 genome

Different databases run by different people can have different information in them...

ADD COMMENTlink written 3.2 years ago by James W. MacDonald46k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 183 users visited in the last hour