Different answer from Homo.sapiens and biomaRt packages!!
3
0
Entering edit mode
NS ▴ 60
@ns-7498
Last seen 2.6 years ago
United States

I have some genome regions which I am gonna map them to gene symbols.

To do that there are two methods which have different result. I have mentioned one example below:

1) Using 'Homo.sapeins' package    

library(Homo.sapiens)
geneRanges <- function(db, column="ENTREZID")
{
  g <- genes(db, columns=column)
  col <- mcols(g)[[column]]
  genes <- granges(g)[rep(seq_along(g), elementLengths(col))]
  mcols(genes)[[column]] <- as.character(unlist(col))
  genes
}

gns <- geneRanges(Homo.sapiens, column="SYMBOL")

splitColumnByOverlap <-  function(query, subject, column="ENTREZID", ...)
{
  olaps <- findOverlaps(query, subject, ...)
  
  f1 <- factor(subjectHits(olaps), levels=seq_len(subjectLength(olaps)))
  
  splitAsList(mcols(query)[[column]][queryHits(olaps)], f1)
}

library(GenomicRanges)

cnv <- makeGRangesFromDataFrame(data.frame(chr="chr2", start=109993208, end=109993264))

gene.Symbol <- splitColumnByOverlap(gns, cnv, "SYMBOL")

The obtained gene symbol is 'SH3RF3'.

2) using 'biomaRt' package

library(GenomicRanges)
library(biomaRt)

mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
region <- "2:109993208:109993264"
gene.Symbol <- getBM(attributes = c("hgnc_symbol", "chromosome_name", "start_position","end_position"),
            filters = c("chromosomal_region"), values=region, mart=mart)


The obtained result is:

hgnc_symbol chromosome_name start_position end_position
1   LINC01123               2      109987063    109996140

Anyone knows why the results are different from each other?

Thanks.

biomart homo.sapiens genome regions gene symbol • 1.5k views
ADD COMMENT
2
Entering edit mode
@steve-lianoglou-2771
Last seen 1 day ago
Denali

The version of the genome you are querying against?

 

ADD COMMENT
1
Entering edit mode
@herve-pages-1542
Last seen 3 hours ago
Seattle, WA, United States

Hi,

The usual suspects are:

1) Homo.sapiens and the hsapiens_gene_ensembl data set from the ensembl mart use different gene models: the knownGene track from UCSC for the former, Ensembl genes for the latter.

2) The Homo.sapiens object is a high-level object that bundles together the gene-centric annotations from org.Hs.eg.db and the transcript-centric annotations from TxDb.Hsapiens.UCSC.hg19.knownGene. The show method gives you that information (in a somewhat obfuscated way I must admit, but some work is currently in the process to address this):

> Homo.sapiens
class: OrganismDb 
Annotation resources:
[1] "GO.db"                             "org.Hs.eg.db"                     
[3] "TxDb.Hsapiens.UCSC.hg19.knownGene"
Annotation relationships:
     xDbs           yDbs                                xKeys      yKeys   
[1,] "GO.db"        "org.Hs.eg.db"                      "GOID"     "GO"    
[2,] "org.Hs.eg.db" "TxDb.Hsapiens.UCSC.hg19.knownGene" "ENTREZID" "GENEID"
For more details, please see the show methods for the component objects listed above.

However, the current Ensembl release is based on the GRCh38 assembly (a.k.a. hg38), not hg19.

Note that the Homo.sapiens object is meant to be modular and there is also some work in the process for making it easy for the user to replace the TxDb component with a custom one. Once this is available, you'll be able to do something like:

library(Homo.sapiens)
ens_txdb <- makeTxDbFromBiomart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
txdb(Homo.sapiens) <- ens_txdb  # doesn't work yet!

and try again to compare the Homo.sapiens solution with the pure biomaRt solution for mapping your genome regions to gene symbols.

In the meantime, let's try to do the mapping by using ens_txdb directly:

tx_by_gene <- transcriptsBy(ens_txdb, by="gene")
hits <- findOverlaps(cnv, tx_by_gene)
gene.EnsId <- extractList(names(tx_by_gene), as(hits, "List"))

I can't run the above code at the moment (my call to makeTxDbFromBiomart() is still running after 20 min, the BioMart server seems horribly slow today), but hopefully gene.EnsId will contain ENSG00000204588, the Ensembl gene id for LINC01123.

H.

ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 6 hours ago
United States

The short answer is that your region does overlap SH3RF3, and that lincRNA's Entrez Gene ID doesn't appear to be in UCSC:

> gns2 <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)

Entrez Gene ID for SH3RF3 is 344558, and for the lincRNA is 440894

> gns2[mcols(gns2)[,1] %in%  "344558",]
GRanges object with 1 range and 1 metadata column:
         seqnames                 ranges strand |     gene_id
            <Rle>              <IRanges>  <Rle> | <character>
  344558     chr2 [109745997, 110262207]      + |      344558
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome
> gns2[mcols(gns2)[,1] %in%  "440894",]
GRanges object with 0 ranges and 1 metadata column:
   seqnames    ranges strand |     gene_id
      <Rle> <IRanges>  <Rle> | <character>
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Different databases run by different people can have different information in them...

ADD COMMENT

Login before adding your answer.

Traffic: 270 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6