Question

mapping gene type or biotype to ENSEMBL ID

Anubhav • 0

Hi all,

I have a list of ENSEMBL IDs of human lncRNAs for which I am trying to figure out the gene type (or biotype), e.g. lincRNA, processed transcript, antisense, sense_overlapping, etc. I have a GTF file (GENCODE v30) which contains the gene_id and gene_type attributes in the 9th column. I could write a script to map my IDs using this GTF file, but I was wondering whether there is an easier way to do it using online tools. I tried biomaRt, but the current Ensembl release collapses the various types of lncRNAs into a single type, i.e. lncRNA. I really want to get the subtype for each lncRNA.

P.S. The last GENCODE version with the lncRNA types 'split up' is v30.

Thanks in advance :)

ensembldb biomaRt org.Hs.eg.db Annotation • 3.9k views

updated 4.2 years ago by Johannes Rainer ★ 2.1k • written 4.2 years ago by Anubhav • 0

This information is also available in the ensembldb EnsDb databases. For Ensembl release 100, for example, you could get the mapping from gene ID to gene biotype with:

> ## Get the EnsDb from AnnotationHub
> library(ensembldb)
> library(AnnotationHub)
> ah <- AnnotationHub()
snapshotDate(): 2020-10-27
> edb <- ah[[names(query(ah, "EnsDb.Hsapiens.v100"))]]
loading from cache
> genes(edb, columns = c("gene_id", "gene_biotype"), return.type = "DataFrame")
DataFrame with 68008 rows and 2 columns
              gene_id   gene_biotype
          <character>    <character>
1     ENSG00000000003 protein_coding
2     ENSG00000000005 protein_coding
...               ...            ...
68007         LRG_998       LRG_gene
68008         LRG_999       LRG_gene

Or, if you have a set of gene IDs, you could query the data like this:

> gene_ids <- c("ENSG00000003096", "ENSG00000006530", "ENSG00000000457", "ENSG00000076356")
> genes(edb, filter = ~ gene_id %in% gene_ids, columns = "gene_biotype", return.type = "DataFrame")
DataFrame with 4 rows and 2 columns
    gene_biotype         gene_id
     <character>     <character>
1 protein_coding ENSG00000000457
2 protein_coding ENSG00000003096
3 protein_coding ENSG00000006530
4 protein_coding ENSG00000076356

Be aware that the rows of the result DataFrame are not in the same order as the query IDs.
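
If the query order matters, one way to restore it is `match()`. The following is only a sketch: `res` is a plain data.frame I built by hand from the output above to stand in for the `genes()` result, and `res_ordered` is a name made up for illustration.

```r
# Toy stand-in for the result of genes(); values copied from the output above
res <- data.frame(
  gene_biotype = rep("protein_coding", 4),
  gene_id = c("ENSG00000000457", "ENSG00000003096",
              "ENSG00000006530", "ENSG00000076356"),
  stringsAsFactors = FALSE
)
gene_ids <- c("ENSG00000003096", "ENSG00000006530",
              "ENSG00000000457", "ENSG00000076356")

# match() returns, for each query ID, the row of res that holds it;
# with a plain data.frame, IDs missing from res would give rows of NA
res_ordered <- res[match(gene_ids, res$gene_id), ]
rownames(res_ordered) <- NULL
```

After this, `res_ordered$gene_id` is in the same order as `gene_ids`, so the biotypes can be bound directly onto the original ID list.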

hope this helps.

cheers, jo

4.2 years ago • Johannes Rainer ★ 2.1k

score 0 · Answer 1 · 2021-01-05

Seems like I got the answer myself after a bit of tinkering. For anyone interested,

# Read the GTF file into R with the rtracklayer package
library(rtracklayer)
gtf <- as.data.frame(import("file.gtf"))  # the data frame has a gene_type column with the required information
# Keep one row per gene, then look up each ENSEMBL ID with match();
# match() is vectorised, so no explicit loop is needed
genes <- gtf[gtf$type == "gene", c("gene_id", "gene_type")]
# my_ids is a placeholder for your own character vector of ENSEMBL IDs
gene_type <- genes$gene_type[match(my_ids, genes$gene_id)]
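
One thing to watch for: gene IDs in GENCODE GTF files carry a version suffix (e.g. `ENSG00000243485.5`), so a list of unversioned IDs will not match them directly. Here is a rough sketch of the lookup with the suffix stripped first. The table `gtf_genes`, the IDs and the gene types in it are all made up for illustration, and the regular expression assumes a plain `.<number>` suffix at the end of the ID.

```r
# Toy table standing in for the gene rows of the GTF data frame;
# the IDs and gene types here are invented, not real annotations
gtf_genes <- data.frame(
  gene_id = c("ENSG00000000001.3", "ENSG00000000002.1", "ENSG00000000003.7"),
  gene_type = c("lincRNA", "antisense", "processed_transcript"),
  stringsAsFactors = FALSE
)
# Hypothetical query list without version suffixes
my_ids <- c("ENSG00000000003", "ENSG00000000001", "ENSG00000000009")

# Drop the ".<version>" suffix so both ID sets have the same form
gtf_genes$gene_id_noversion <- sub("\\.[0-9]+$", "", gtf_genes$gene_id)

# Vectorised lookup; IDs not present in the table get NA
result <- data.frame(
  gene_id = my_ids,
  gene_type = gtf_genes$gene_type[match(my_ids, gtf_genes$gene_id_noversion)],
  stringsAsFactors = FALSE
)
```

The `result` data frame keeps the order of `my_ids`, and any ID that is absent from the annotation shows up with `NA` as its gene type, which makes missing entries easy to spot.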

But if anyone has any ideas about online resources/tools for doing the same, I would love to hear that. Cheers!!