Hi Sarah,
An alternative way of annotating your Ensembl gene is to use the Bioconductor package AnnotationHub.
It contains GTF files from Ensembl releases 69 to 81 for all organisms released by Ensembl.
The data is presented as GRanges objects, which can easily be manipulated to get information about genes, exons, CDS, etc.
Load the package
> library(AnnotationHub)
> ah = AnnotationHub()
snapshotDate(): 2015-08-26
Search for a GTF file from Ensembl for Mus musculus, release 80
> gtf <- query(ah, c("gtf","mus musculus", "80", "ensembl"))
> gtf
AnnotationHub with 1 record
# snapshotDate(): 2015-08-26
# names(): AH47076
# $dataprovider: Ensembl
# $species: Mus musculus
# $rdataclass: GRanges
# $title: Mus_musculus.GRCm38.80.gtf
# $description: Gene Annotation for Mus musculus
# $taxonomyid: 10090
# $genome: GRCm38
# $sourcetype: GTF
# $sourceurl: ftp://ftp.ensembl.org/pub/release-80/gtf/mus_musculus/Mus_musculus.GRCm38.80.gtf.gz
# $sourcelastmodifieddate: 2015-05-01
# $sourcesize: 25292510
# $tags: GTF, ensembl, Gene, Transcript, Annotation
# retrieve record with 'object[["AH47076"]]'
Download the file
> gtfFile <- gtf[[1]]
require(“GenomicRanges”)
retrieving 1 resource
|===========================================================================================| 100%
using guess work to populate seqinfo
There were 50 or more warnings (use warnings() to see the first 50)
The file is downloaded as a GRanges object (from the GenomicRanges package) that contains data on all the genes. The Ensembl gene IDs are stored in the "gene_id" metadata column, accessible through mcols().
> gtfFile
GRanges object with 1524100 ranges and 22 metadata columns:
seqnames ranges strand | source type score phase
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer>
[1] 1 [3073253, 3074322] + | havana gene <NA> <NA>
[2] 1 [3073253, 3074322] + | havana transcript <NA> <NA>
[3] 1 [3073253, 3074322] + | havana exon <NA> <NA>
[4] 1 [3102016, 3102125] + | ensembl gene <NA> <NA>
[5] 1 [3102016, 3102125] + | ensembl transcript <NA> <NA>
... ... ... ... ... ... ... ... ...
[1524096] JH584295.1 [708, 752] - | ensembl CDS <NA> 2
[1524097] JH584295.1 [565, 633] - | ensembl exon <NA> <NA>
[1524098] JH584295.1 [565, 633] - | ensembl CDS <NA> 2
[1524099] JH584295.1 [ 66, 109] - | ensembl exon <NA> <NA>
[1524100] JH584295.1 [ 66, 109] - | ensembl CDS <NA> 2
gene_id gene_version gene_name gene_source gene_biotype
<character> <numeric> <character> <character> <character>
[1] ENSMUSG00000102693 1 4933401J01Rik havana TEC
[2] ENSMUSG00000102693 1 4933401J01Rik havana TEC
[3] ENSMUSG00000102693 1 4933401J01Rik havana TEC
[4] ENSMUSG00000064842 1 Gm26206 ensembl snRNA
[5] ENSMUSG00000064842 1 Gm26206 ensembl snRNA
... ... ... ... ... ...
[1524096] ENSMUSG00000095742 1 CAAA01147332.1 ensembl protein_coding
[1524097] ENSMUSG00000095742 1 CAAA01147332.1 ensembl protein_coding
[1524098] ENSMUSG00000095742 1 CAAA01147332.1 ensembl protein_coding
[1524099] ENSMUSG00000095742 1 CAAA01147332.1 ensembl protein_coding
[1524100] ENSMUSG00000095742 1 CAAA01147332.1 ensembl protein_coding
transcript_id transcript_version transcript_name transcript_source
<character> <numeric> <character> <character>
[1] <NA> <NA> <NA> <NA>
[2] ENSMUST00000193812 1 4933401J01Rik-001 havana
[3] ENSMUST00000193812 1 4933401J01Rik-001 havana
[4] <NA> <NA> <NA> <NA>
[5] ENSMUST00000082908 1 Gm26206-201 ensembl
... ... ... ... ...
[1524096] ENSMUST00000179436 1 CAAA01147332.1-201 ensembl
[1524097] ENSMUST00000179436 1 CAAA01147332.1-201 ensembl
[1524098] ENSMUST00000179436 1 CAAA01147332.1-201 ensembl
[1524099] ENSMUST00000179436 1 CAAA01147332.1-201 ensembl
[1524100] ENSMUST00000179436 1 CAAA01147332.1-201 ensembl
transcript_biotype tag exon_number exon_id exon_version
<character> <character> <numeric> <character> <numeric>
[1] <NA> <NA> <NA> <NA> <NA>
[2] TEC basic <NA> <NA> <NA>
[3] TEC basic 1 ENSMUSE00001343744 1
[4] <NA> <NA> <NA> <NA> <NA>
[5] snRNA basic <NA> <NA> <NA>
... ... ... ... ... ...
[1524096] protein_coding basic 5 <NA> <NA>
[1524097] protein_coding basic 6 ENSMUSE00000997159 1
[1524098] protein_coding basic 6 <NA> <NA>
[1524099] protein_coding basic 7 ENSMUSE00001007635 1
[1524100] protein_coding basic 7 <NA> <NA>
transcript_support_level ccds_id protein_id protein_version
<character> <character> <character> <numeric>
[1] <NA> <NA> <NA> <NA>
[2] <NA> <NA> <NA> <NA>
[3] <NA> <NA> <NA> <NA>
[4] <NA> <NA> <NA> <NA>
[5] NA <NA> <NA> <NA>
... ... ... ... ...
[1524096] 5 <NA> ENSMUSP00000137004 1
[1524097] 5 <NA> <NA> <NA>
[1524098] 5 <NA> ENSMUSP00000137004 1
[1524099] 5 <NA> <NA> <NA>
[1524100] 5 <NA> ENSMUSP00000137004 1
-------
seqinfo: 61 sequences (1 circular) from GRCm38 genome; no seqlengths
A simple search shows whether your gene of interest is present and which rows it occupies
> which(mcols(gtfFile)$gene_id=="ENSMUSG00000071528")
[1] 1512778 1512779 1512780 1512781 1512782 1512783 1512784 1512785 1512786 1512787 1512788 1512789
[13] 1512790 1512791
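If you only need a yes/no answer rather than the row indices, %in% does the job. Below is a minimal sketch on a toy GRanges object: `toy` is a hypothetical stand-in for gtfFile, built from a few rows copied from the output above so that it runs without downloading anything. The ID "not_a_real_id" is a made-up placeholder.

```r
library(GenomicRanges)

# Toy stand-in for gtfFile (hypothetical; rows copied from the output above)
toy <- GRanges(
  seqnames = c("1", "19", "19", "19"),
  ranges   = IRanges(start = c(3102016, 47083471, 47083471, 47090573),
                     end   = c(3102125, 47090625, 47090625, 47090625)),
  strand   = c("+", "-", "-", "-"),
  type     = factor(c("gene", "gene", "transcript", "exon")),
  gene_id  = c("ENSMUSG00000064842", rep("ENSMUSG00000071528", 3))
)

id <- "ENSMUSG00000071528"

present <- id %in% toy$gene_id        # TRUE if at least one row carries this ID
n_rows  <- sum(toy$gene_id == id)     # number of rows annotated with this ID
absent  <- "not_a_real_id" %in% toy$gene_id   # placeholder ID, not in the data
```

Note that toy$gene_id is shorthand for mcols(toy)$gene_id, so the same lines should work on gtfFile itself.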
Subset the GRanges object to keep only the rows for your gene of interest,
and store the result in want.
> want <- gtfFile[which(mcols(gtfFile)$gene_id=="ENSMUSG00000071528"),]
> want
GRanges object with 14 ranges and 22 metadata columns:
seqnames ranges strand | source type score phase
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer>
[1] 19 [47083471, 47090625] - | ensembl gene <NA> <NA>
[2] 19 [47083471, 47090625] - | ensembl transcript <NA> <NA>
[3] 19 [47090573, 47090625] - | ensembl exon <NA> <NA>
[4] 19 [47086134, 47086229] - | ensembl exon <NA> <NA>
[5] 19 [47086134, 47086220] - | ensembl CDS <NA> 0
... ... ... ... ... ... ... ... ...
[10] 19 [47083471, 47083569] - | ensembl exon <NA> <NA>
[11] 19 [47090573, 47090625] - | ensembl UTR <NA> <NA>
[12] 19 [47086221, 47086229] - | ensembl UTR <NA> <NA>
[13] 19 [47085955, 47085957] - | ensembl UTR <NA> <NA>
[14] 19 [47083471, 47083569] - | ensembl UTR <NA> <NA>
gene_id gene_version gene_name gene_source gene_biotype transcript_id
<character> <numeric> <character> <character> <character> <character>
[1] ENSMUSG00000071528 3 Usmg5 ensembl protein_coding <NA>
[2] ENSMUSG00000071528 3 Usmg5 ensembl protein_coding ENSMUST00000096014
[3] ENSMUSG00000071528 3 Usmg5 ensembl protein_coding ENSMUST00000096014
[4] ENSMUSG00000071528 3 Usmg5 ensembl protein_coding ENSMUST00000096014
[5] ENSMUSG00000071528 3 Usmg5 ensembl protein_coding ENSMUST00000096014
... ... ... ... ... ... ...
[10] ENSMUSG00000071528 3 Usmg5 ensembl protein_coding ENSMUST00000096014
[11] ENSMUSG00000071528 3 Usmg5 ensembl protein_coding ENSMUST00000096014
[12] ENSMUSG00000071528 3 Usmg5 ensembl protein_coding ENSMUST00000096014
[13] ENSMUSG00000071528 3 Usmg5 ensembl protein_coding ENSMUST00000096014
[14] ENSMUSG00000071528 3 Usmg5 ensembl protein_coding ENSMUST00000096014
transcript_version transcript_name transcript_source transcript_biotype tag
<numeric> <character> <character> <character> <character>
[1] <NA> <NA> <NA> <NA> <NA>
[2] 3 Usmg5-201 ensembl protein_coding basic
[3] 3 Usmg5-201 ensembl protein_coding basic
[4] 3 Usmg5-201 ensembl protein_coding basic
[5] 3 Usmg5-201 ensembl protein_coding basic
... ... ... ... ... ...
[10] 3 Usmg5-201 ensembl protein_coding basic
[11] 3 Usmg5-201 ensembl protein_coding basic
[12] 3 Usmg5-201 ensembl protein_coding basic
[13] 3 Usmg5-201 ensembl protein_coding basic
[14] 3 Usmg5-201 ensembl protein_coding basic
exon_number exon_id exon_version transcript_support_level ccds_id
<numeric> <character> <numeric> <character> <character>
[1] <NA> <NA> <NA> <NA> <NA>
[2] <NA> <NA> <NA> 1 CCDS38014
[3] 1 ENSMUSE00000617995 3 1 CCDS38014
[4] 2 ENSMUSE00000617994 1 1 CCDS38014
[5] 2 <NA> <NA> 1 CCDS38014
... ... ... ... ... ...
[10] 4 ENSMUSE00000617992 3 1 CCDS38014
[11] <NA> <NA> <NA> 1 CCDS38014
[12] <NA> <NA> <NA> 1 CCDS38014
[13] <NA> <NA> <NA> 1 CCDS38014
[14] <NA> <NA> <NA> 1 CCDS38014
protein_id protein_version
<character> <numeric>
[1] <NA> <NA>
[2] <NA> <NA>
[3] <NA> <NA>
[4] <NA> <NA>
[5] ENSMUSP00000093713 3
... ... ...
[10] <NA> <NA>
[11] <NA> <NA>
[12] <NA> <NA>
[13] <NA> <NA>
[14] <NA> <NA>
-------
seqinfo: 61 sequences (1 circular) from GRCm38 genome; no seqlengths
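The same subset can be written a little more compactly. This is only a sketch on a toy object (`toy` is a hypothetical stand-in for gtfFile, with rows copied from the output above); it compares the which() form used for want with plain logical indexing and with subset(), which evaluates its condition against the metadata columns.

```r
library(GenomicRanges)

# Toy stand-in for gtfFile (hypothetical; rows copied from the output above)
toy <- GRanges(
  seqnames = c("1", "19", "19", "19"),
  ranges   = IRanges(start = c(3102016, 47083471, 47083471, 47090573),
                     end   = c(3102125, 47090625, 47090625, 47090625)),
  strand   = c("+", "-", "-", "-"),
  type     = factor(c("gene", "gene", "transcript", "exon")),
  gene_id  = c("ENSMUSG00000064842", rep("ENSMUSG00000071528", 3))
)

id <- "ENSMUSG00000071528"

want_which   <- toy[which(mcols(toy)$gene_id == id)]  # form used in this post
want_logical <- toy[toy$gene_id == id]                # logical indexing
want_subset  <- subset(toy, gene_id == id)            # condition looked up in mcols

# Going one step further: keep only the exon rows of the gene
exons <- want_logical[want_logical$type == "exon"]
```

The which() form has one practical advantage: it silently drops rows where the comparison is NA.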
The "type" column tells you which kinds of features are available for the Ensembl gene ID that you're interested in
> mcols(want)$type
[1] gene transcript exon exon CDS start_codon exon CDS
[9] stop_codon exon UTR UTR UTR UTR
Levels: CDS exon gene Selenocysteine start_codon stop_codon transcript UTR
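To see at a glance how many features of each kind there are, you can tabulate the column. The sketch below rebuilds the 14 values and the factor levels printed above as a plain factor (`type` is just an illustrative variable name), so it runs without the GRanges object. On the real data the equivalent call would be table(mcols(want)$type).

```r
# The 14 feature types printed above, with the same factor levels
type <- factor(
  c("gene", "transcript", "exon", "exon", "CDS", "start_codon", "exon", "CDS",
    "stop_codon", "exon", "UTR", "UTR", "UTR", "UTR"),
  levels = c("CDS", "exon", "gene", "Selenocysteine", "start_codon",
             "stop_codon", "transcript", "UTR")
)

counts <- table(type)   # one count per level, including levels with zero rows
counts
```

Levels that do not occur for this gene, such as Selenocysteine, appear in the table with a count of zero.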
The gene-level information that you want is in the row with type=="gene": the chromosome, the start and end
coordinates, the strand and the external gene name (gene_name).
> want[mcols(want)$type=="gene",]
GRanges object with 1 range and 22 metadata columns:
seqnames ranges strand | source type score phase
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer>
[1] 19 [47083471, 47090625] - | ensembl gene <NA> <NA>
gene_id gene_version gene_name gene_source gene_biotype transcript_id
<character> <numeric> <character> <character> <character> <character>
[1] ENSMUSG00000071528 3 Usmg5 ensembl protein_coding <NA>
transcript_version transcript_name transcript_source transcript_biotype tag
<numeric> <character> <character> <character> <character>
[1] <NA> <NA> <NA> <NA> <NA>
exon_number exon_id exon_version transcript_support_level ccds_id protein_id
<numeric> <character> <numeric> <character> <character> <character>
[1] <NA> <NA> <NA> <NA> <NA> <NA>
protein_version
<numeric>
[1] <NA>
-------
seqinfo: 61 sequences (1 circular) from GRCm38 genome; no seqlengths
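If you would rather have those fields in a small table, the standard GRanges accessors can pull them out. This is a minimal sketch, not the only way to do it: `toy` is a hypothetical stand-in for want, holding the gene and transcript rows copied from the output above, and `info` is just an illustrative name.

```r
library(GenomicRanges)

# Toy stand-in for want (hypothetical; values copied from the output above)
toy <- GRanges(
  seqnames  = c("19", "19"),
  ranges    = IRanges(start = c(47083471, 47083471),
                      end   = c(47090625, 47090625)),
  strand    = c("-", "-"),
  type      = factor(c("gene", "transcript")),
  gene_id   = rep("ENSMUSG00000071528", 2),
  gene_name = rep("Usmg5", 2)
)

gene_row <- toy[toy$type == "gene"]

# Collect the gene-level fields in an ordinary data.frame
info <- data.frame(
  gene_id   = gene_row$gene_id,
  gene_name = gene_row$gene_name,
  chr       = as.character(seqnames(gene_row)),
  start     = start(gene_row),
  end       = end(gene_row),
  strand    = as.character(strand(gene_row)),
  stringsAsFactors = FALSE
)
info
```

Replacing toy with want should give the same one-row table for the real data.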
Hope that helps!
Sonali.