Regarding annotations of mitochondrial genes: if you're interested in Ensembl IDs (or have Ensembl gene IDs), you might give ensembldb and EnsDb databases a try. You can fetch mitochondrial genes from an EnsDb:
> library(EnsDb.Hsapiens.v75)
> gns <- genes(EnsDb.Hsapiens.v75, filter = ~ seq_name == "MT")
> gns
GRanges object with 37 ranges and 6 metadata columns:
                  seqnames         ranges strand |         gene_id   gene_name
                     <Rle>      <IRanges>  <Rle> |     <character> <character>
  ENSG00000210049       MT [  577,   647]      + | ENSG00000210049       MT-TF
  ENSG00000211459       MT [  648,  1601]      + | ENSG00000211459     MT-RNR1
  ENSG00000210077       MT [ 1602,  1670]      + | ENSG00000210077       MT-TV
  ENSG00000210082       MT [ 1671,  3229]      + | ENSG00000210082     MT-RNR2
  ENSG00000209082       MT [ 3230,  3304]      + | ENSG00000209082      MT-TL1
              ...      ...            ...    ... .             ...         ...
  ENSG00000198695       MT [14149, 14673]      - | ENSG00000198695      MT-ND6
  ENSG00000210194       MT [14674, 14742]      - | ENSG00000210194       MT-TE
  ENSG00000198727       MT [14747, 15887]      + | ENSG00000198727      MT-CYB
  ENSG00000210195       MT [15888, 15953]      + | ENSG00000210195       MT-TT
  ENSG00000210196       MT [15956, 16023]      - | ENSG00000210196       MT-TP
                    gene_biotype seq_coord_system      symbol  entrezid
                     <character>      <character> <character>    <list>
  ENSG00000210049        Mt_tRNA       chromosome       MT-TF        NA
  ENSG00000211459        Mt_rRNA       chromosome     MT-RNR1        NA
  ENSG00000210077        Mt_tRNA       chromosome       MT-TV        NA
  ENSG00000210082        Mt_rRNA       chromosome     MT-RNR2 100616263
  ENSG00000209082        Mt_tRNA       chromosome      MT-TL1        NA
              ...            ...              ...         ...       ...
  ENSG00000198695 protein_coding       chromosome      MT-ND6      4541
  ENSG00000210194        Mt_tRNA       chromosome       MT-TE        NA
  ENSG00000198727 protein_coding       chromosome      MT-CYB      4519
  ENSG00000210195        Mt_tRNA       chromosome       MT-TT        NA
  ENSG00000210196        Mt_tRNA       chromosome       MT-TP        NA
  -------
  seqinfo: 1 sequence from GRCh37 genome
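If you mainly need the gene IDs (e.g. to flag mitochondrial rows in a count table), something along these lines might work. This is only a minimal sketch, assuming EnsDb.Hsapiens.v75 is installed; the `counts` matrix is a made-up toy example and all variable names are my own.

```r
library(EnsDb.Hsapiens.v75)
edb75 <- EnsDb.Hsapiens.v75

# Same query as above, written with an explicit filter object
gns <- genes(edb75, filter = SeqNameFilter("MT"))
mt_ids <- gns$gene_id

# Restrict to protein-coding mitochondrial genes by combining two conditions
mt_pc <- genes(edb75,
               filter = ~ seq_name == "MT" & gene_biotype == "protein_coding")

# Toy count matrix (hypothetical values) with Ensembl gene IDs as row names:
# ENSG00000198727 is MT-CYB, ENSG00000141510 is TP53 (not mitochondrial)
counts <- matrix(c(10, 5, 200, 150), nrow = 2, byrow = TRUE,
                 dimnames = list(c("ENSG00000198727", "ENSG00000141510"),
                                 c("cell1", "cell2")))

# Flag mitochondrial rows and compute the mitochondrial fraction per column
is_mt <- rownames(counts) %in% mt_ids
mt_frac <- colSums(counts[is_mt, , drop = FALSE]) / colSums(counts)
```

The `drop = FALSE` is there so the subsetting still returns a matrix when only a single row matches.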
EnsDb.Hsapiens.v75 is based on the relatively old Ensembl 75 release. If you want more recent annotations, I suggest you get them from AnnotationHub:
> library(AnnotationHub)
> query(AnnotationHub(), "EnsDb.Hsapiens.")
snapshotDate(): 2017-10-27
AnnotationHub with 4 records
# snapshotDate(): 2017-10-27
# $dataprovider: Ensembl
# $species: Homo Sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH53211"]]'
title
AH53211 | Ensembl 87 EnsDb for Homo Sapiens
AH53715 | Ensembl 88 EnsDb for Homo Sapiens
AH56681 | Ensembl 89 EnsDb for Homo Sapiens
AH57757 | Ensembl 90 EnsDb for Homo Sapiens
> edb <- AnnotationHub()[["AH57757"]]
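Once you have the EnsDb from AnnotationHub, it can be queried the same way as the package-based one. A rough sketch, assuming a working internet connection (or a populated local cache) and that the AH57757 record is still available in the snapshot you are using; `ah` and `mt_genes` are just names I picked:

```r
library(AnnotationHub)
library(ensembldb)

# Create the hub object once and reuse it
ah <- AnnotationHub()
edb <- ah[["AH57757"]]

# Check which Ensembl release the database was built from
ensemblVersion(edb)

# Fetch the mitochondrial genes, as with EnsDb.Hsapiens.v75
mt_genes <- genes(edb, filter = ~ seq_name == "MT")
mt_genes$gene_id
```

Note that the coordinates will be on a different genome build than those from EnsDb.Hsapiens.v75 (which is GRCh37), so it is worth checking `seqinfo(mt_genes)` before mixing the two.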
hope that helps,
cheers, jo
Thanks Johannes, that's very helpful. Do you know if there are any major differences between the AH##### databases and Homo.sapiens that I should be aware of?
Just to avoid confusion: AnnotationHub is a central repository for annotation resources. AnnotationHub contains many different databases, among them EnsDb databases, TxDb databases, genomic sequences etc. I wouldn't call them AH#### databases; the AH#### is just the ID of the resource in AnnotationHub. In the example I was extracting an EnsDb database from AnnotationHub. For more information on these you might want to have a look at the ensembldb package (vignettes).
As far as I know, the Homo.sapiens database/resource contains a TxDb database providing the genomic coordinates of genes/transcripts/exons. These TxDb databases are usually based on annotations from UCSC. EnsDb annotations are designed for and built from Ensembl annotations. Their versions match the Ensembl release version on which they are built (e.g. EnsDb.Hsapiens.90 contains all human annotations for Ensembl release 90).
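To see the practical difference, you could compare a UCSC-based TxDb with an EnsDb directly. The sketch below is an illustration only and assumes that TxDb.Hsapiens.UCSC.hg19.knownGene and EnsDb.Hsapiens.v75 are installed; I'm using the TxDb package directly here rather than going through Homo.sapiens, and the variable names are mine.

```r
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(EnsDb.Hsapiens.v75)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# UCSC-style resources name the mitochondrial genome "chrM",
# while Ensembl-based EnsDb databases use "MT"
"chrM" %in% seqlevels(txdb)

# Genes on the mitochondrial genome from each resource
tx_mt <- genes(txdb, filter = list(tx_chrom = "chrM"))
ens_mt <- genes(EnsDb.Hsapiens.v75, filter = ~ seq_name == "MT")

# The gene_id column holds Entrez Gene IDs in this TxDb,
# but Ensembl gene IDs in the EnsDb
head(tx_mt$gene_id)
head(ens_mt$gene_id)
```

So besides the coordinates, the sequence names and the type of gene identifier differ between the two, which matters if you match them against row names of your data.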