Question

From a ENST to ENSG

0

Entering edit mode

suxxa ▴ 20

@saccovr

Last seen 2.8 years ago

Bolzano

Hi there,

I have a database like this with 6876 obs.

                     baseMean log2FoldChange      lfcSE       pvalue         padj
ENST00000229416.10 850.753253      0.2068899 0.05536265 6.804183e-05 3.349652e-03
ENST00000509541.5  288.905762      0.1790877 0.06232428 1.752013e-03 3.600306e-02
ENST00000487168.1   15.855337      0.2610650 0.07884873 2.235875e-04 8.202311e-03
ENST00000381177.6   57.890737     -0.4073535 0.05360163 2.962595e-15 1.696065e-12
ENST00000381180.8    2.851529     -0.3272547 0.11009314 4.038313e-04 1.277395e-02
ENST00000381184.6   35.539393     -0.5825259 0.08344768 1.771709e-13 9.039495e-11

I would like to get the following informations for each transcript: Ensembl ENSG, Gene name and Gene symbol

What should I do?

Many thanks

AnnotationDbi • 7.2k views

ADD COMMENT • link 3.5 years ago suxxa ▴ 20

score 3 · Accepted Answer · 2021-01-27

3

Entering edit mode

Guido Hooiveld ★ 4.0k

@guido-hooiveld-2020

Last seen 12 hours ago

Wageningen University, Wageningen, the …

Alternatively, regarding annotation you also can make use of the EnsDb's:

1). Notice that you have human ENSEMBL transcipt IDs.

2). Find out which version of ENSEMBL (or GENCODE) was used to map the reads.

3). Johannes Rainer kindly builds many ENSEMBL-based annotation libraries, and makes them available (incl. the most recent ones) through the so-called AnnnotationHub. See e.g. here. Preferably find/use the version mentioned at step 2).

4). To get you started with some code, check here.

Note that you may need to strip the transcript version identifier with gsub before querying the EnsDb. For example:

annotation <- AnnotationDbi:::select(EnsDb.Hsapiens.v101, keys =  sub("\\..*", "", rownames( <<Insert name DESeq2 res object>> ) ), 
                     keytype = "TXID", columns = c("TXID", "TXNAME", "GENEID", "GENENAME", "DESCRIPTION", "GENEBIOTYPE", "TXBIOTYPE", "ENTREZID") )
annotation <- annotation[!duplicated(annotation[,1]),] # To be sure: get rid of duplicates, and only keep 1st hit.

ADD COMMENT • link 3.5 years ago Guido Hooiveld ★ 4.0k

3

Entering edit mode

Adding to Guido's answer:

Point 2 is crucial, always ensure you know exactly which annotation version was used in your analysis.

For point 3, assuming you've Ensembl 101:

library(AnnotationHub)
ah <- AnnotationHub()
query(ah, "EnsDb.Hsapiens.v101")
AnnotationHub with 1 record
# snapshotDate(): 2021-01-14
# names(): AH83216
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2020-04-27
# $title: Ensembl 101 EnsDb for Homo sapiens
# $description: Gene and protein annotations for Homo sapiens based on Ensem...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("101", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl", "Gene",
#   "Protein", "Transcript") 
# retrieve record with 'object[["AH83216"]]' 
edb <- ah[["AH83216"]]

With that EnsDb database you could either use the AnnotationDbi solution from Guido's comment, or you could use the ensembldb provided functions: 1) get a data.frame with all transcript annotations including some gene annotations 2) match your IDs to the IDs in the table.

txs <- transcripts(edb, columns = c("tx_id", "tx_biotype", "tx_id_version", "gc_content", "gene_name", "gene_id"))
          tx_id     tx_biotype      tx_id_version gc_content gene_name
1 ENST00000000233 protein_coding ENST00000000233.10   59.39922      ARF5
2 ENST00000000412 protein_coding  ENST00000000412.8   45.71429      M6PR
3 ENST00000000442 protein_coding ENST00000000442.11   63.80827     ESRRA
4 ENST00000001008 protein_coding  ENST00000001008.6   50.98250     FKBP4
5 ENST00000001146 protein_coding  ENST00000001146.6   56.80473   CYP26B1
6 ENST00000002125 protein_coding  ENST00000002125.9   42.67399   NDUFAF7
          gene_id
1 ENSG00000004059
2 ENSG00000003056
3 ENSG00000173153
4 ENSG00000004478
5 ENSG00000003137
6 ENSG00000003509

There is now also the columns "tx_id_version" available in the EnsDb databases that you could use to match your versioned Ensembl IDs against, e.g.:

ids <- c("ENST00000229416.10", "ENST00000509541.5", "ENST00000487168.1", "ENST00000381177.6")
idx <- match(ids, txs$tx_id_version)
txs[idx, ]
                 tx_id      tx_biotype     tx_id_version gc_content gene_name
NA                <NA>            <NA>              <NA>         NA      <NA>
114107 ENST00000509541  protein_coding ENST00000509541.5   39.82512      GCLC
97027  ENST00000487168 retained_intron ENST00000487168.1   38.66667     KRIT1
NA.1              <NA>            <NA>              <NA>         NA      <NA>
               gene_id
NA                <NA>
114107 ENSG00000001084
97027  ENSG00000001631
NA.1              <NA>

Actually, the NA in here (i.e. the IDs that could not be mapped to tx ids in the EnsDb) tells us that your IDs come from another Ensembl release than Ensembl 101 - otherwise all versioned transcript identifier would match.

ADD REPLY • link 3.5 years ago Johannes Rainer ★ 2.1k

2

Entering edit mode

Thanks Johannes for this very useful addition, especially regarding the column "tx_id_version"!

ADD REPLY • link 3.5 years ago Guido Hooiveld ★ 4.0k

0

Entering edit mode

Thanks you all, I extracted these transcripts from a DESeq2 dataset. I used the GTEx release V8 so I suppose I should use Ensembl 26, is it correct? As from here https://gtexportal.org/home/releaseInfoPage

ADD REPLY • link 3.5 years ago suxxa ▴ 20

1

Entering edit mode

No, a quick search on Google found this info / track (at the UCSC Genome browser):

All GENCODE annotations from V26 (Ensembl 88).

In other words, GENCODE V26 corresponds to Ensembl version 88, and not 26.

ADD REPLY • link 3.5 years ago Guido Hooiveld ★ 4.0k

1

Entering edit mode

First of all, many many thanks Guido and Johannes.

About my script I don't understand how to put my dataframe in comparison with the annotation formula

This is my script

ah <- AnnotationHub()
query(ah, "EnsDb.Hsapiens.v88")
edb <- ah[["AH53715"]]

annotation <- AnnotationDbi:::select(EnsDb.Hsapiens.v88, keys =  sub("\\..*", "", rownames(adipose_ENST), 
                                     keytype = "TXID", columns = c("TXID", "TXNAME", "GENEID", "GENENAME", "DESCRIPTION", "GENEBIOTYPE", "TXBIOTYPE", "ENTREZID") )
annotation <- annotation[!duplicated(annotation[,1]),] # To be sure: get rid of duplicates, and only keep 1st hit.

But these are the errors:

> annotation <- AnnotationDbi:::select(EnsDb.Hsapiens.v88, keys =  sub("\\..*", "", rownames(adipose_ENST)), 
+                                      keytype = "TXID", columns = c("TXID", "TXNAME", "GENEID", "GENENAME", "DESCRIPTION", "GENEBIOTYPE", "TXBIOTYPE", "ENTREZID") )
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'select': oggetto "EnsDb.Hsapiens.v88" non trovato
> annotation <- annotation[!duplicated(annotation[,1]),] # To be sure: get rid of duplicates, and only keep 1st hit.
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'i' in selecting a method for function '[': error in evaluating the argument 'x' in selecting a method for function 'duplicated': object of type 'closure' is not subsettable

Why it doesn't find "EnsDb.Hsapiens.v88"? From the console I see that it was loaded

> edb <- ah[["AH53715"]]
loading from cache

Thanks again for the help

ADD REPLY • link 3.5 years ago suxxa ▴ 20

2

Entering edit mode

With edb <- ah[["AH53715"]] you assign the EnsDb database to the variable edb, so you would have to use edb instead of EnsDb.Hsapiens.v88 in your code above - or alternatively use EnsDb.Hsapiens.v88 <- ah[["AH53715"]].

ADD REPLY • link 3.5 years ago Johannes Rainer ★ 2.1k

0

Entering edit mode

I was blind, I'm sorry... Thanks so much!

ADD REPLY • link 3.5 years ago suxxa ▴ 20

0

Entering edit mode

Thanks Johannes, and the way to find the chromosome is deprecated, is it right?

ADD REPLY • link 3.5 years ago suxxa ▴ 20

0

Entering edit mode

What do you mean with "find the chromosome"? To get the chromosome name you can query the "seq_name" column (i.e. include that in the list of requested column names with parameter columns) - or you get the annotations as a GRanges and you can call seqnames on that.

ADD REPLY • link 3.5 years ago Johannes Rainer ★ 2.1k

0

Entering edit mode

Ok thanks, so I'll use ensembldb not annotationdbi

ADD REPLY • link 3.5 years ago suxxa ▴ 20

1

Entering edit mode

Ciao suxxa, please use 'Add Reply' when responding (not 'Add Answer")

ADD REPLY • link 3.5 years ago Kevin Blighe ★ 4.0k