Dear all,
I expect AnnotationHub resource for ensembl v71 and a direct download using rtracklayer should give identical results. The code snipped below shows that there are some gene_identifiers missing in the AnnotationHub extracted data. Is there some additional processing done in the AnnotationHub recipe for ensembl v71? The last section of the code shows there is a mismatch in annotation between the two resources. Does AH7666 really reflects Homo_sapiens.GRCh37.71.gtf?
Kind regards,
Maarten
ah <- AnnotationHub()
updating metadata: retrieving 1 resource
|======================================================================| 100%
snapshotDate(): 2016-10-11
> query(ah, c("ensembl", "sapiens", "71"))
AnnotationHub with 7 records
# snapshotDate(): 2016-10-11
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: FaFile, GRanges
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH7666"]]'
title
AH7666 | Homo_sapiens.GRCh37.71.gtf
AH11241 | Homo_sapiens.GRCh37.71.cdna.all.fa
AH11242 | Homo_sapiens.GRCh37.71.dna_rm.toplevel.fa
AH11243 | Homo_sapiens.GRCh37.71.dna_sm.toplevel.fa
AH11244 | Homo_sapiens.GRCh37.71.dna.toplevel.fa
AH11245 | Homo_sapiens.GRCh37.71.ncrna.fa
AH11246 | Homo_sapiens.GRCh37.71.pep.all.fa
> gtfAH <- ah[["AH7666"]]
require(“GenomicRanges”)
downloading from ‘https://annotationhub.bioconductor.org/fetch/7666’
retrieving 1 resource
|======================================================================| 100%
using guess work to populate seqinfo
> library(rtracklayer)
> gtfRT <- import(metadata(gtfAH)$`Data Source`)
trying URL 'ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz'
ftp data connection made, file length 26733273 bytes
==================================================
downloaded 25.5 MB
> ensRT <- mcols(gtfRT)$gene_id
> ensAH <- mcols(gtfAH)$gene_id
> str(ensRT)
chr [1:2253155] "ENSG00000261516" "ENSG00000261516" "ENSG00000261516" ...
> str(ensAH)
chr [1:2253155] "ENSG00000261516" "ENSG00000261516" "ENSG00000261516" ...
> length(unique(ensRT))
[1] 61893
> length(unique(ensAH))
[1] 37482
> head(setdiff(ensRT, ensAH))
[1] "ENSG00000236896" "ENSG00000212521" "ENSG00000095380" "ENSG00000106785"
[5] "ENSG00000106789" "ENSG00000095383"
> which(mcols(gtfAH)$gene_id == "ENSG00000236896")
integer(0)
> gtfRT[mcols(gtfRT)$gene_id == "ENSG00000236896"]
GRanges object with 2 ranges and 12 metadata columns:
seqnames ranges strand | source type score
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
[1] 9 [100749802, 100749938] - | antisense exon <NA>
[2] 9 [100748833, 100749097] - | antisense exon <NA>
phase gene_id transcript_id exon_number gene_name
<integer> <character> <character> <character> <character>
[1] <NA> ENSG00000236896 ENST00000411981 1 RP11-535C21.3
[2] <NA> ENSG00000236896 ENST00000411981 2 RP11-535C21.3
gene_biotype transcript_name exon_id protein_id
<character> <character> <character> <character>
[1] antisense RP11-535C21.3-001 ENSE00001610477 <NA>
[2] antisense RP11-535C21.3-001 ENSE00001796798 <NA>
-------
seqinfo: 244 sequences from an unspecified genome; no seqlengths
> gr <- gtfRT[mcols(gtfRT)$gene_id == "ENSG00000236896"]
> subsetByOverlaps(gr, gtfAH)
GRanges object with 2 ranges and 12 metadata columns:
seqnames ranges strand | source type score
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
[1] 9 [100749802, 100749938] - | antisense exon <NA>
[2] 9 [100748833, 100749097] - | antisense exon <NA>
phase gene_id transcript_id exon_number gene_name
<integer> <character> <character> <character> <character>
[1] <NA> ENSG00000236896 ENST00000411981 1 RP11-535C21.3
[2] <NA> ENSG00000236896 ENST00000411981 2 RP11-535C21.3
gene_biotype transcript_name exon_id protein_id
<character> <character> <character> <character>
[1] antisense RP11-535C21.3-001 ENSE00001610477 <NA>
[2] antisense RP11-535C21.3-001 ENSE00001796798 <NA>
-------
seqinfo: 244 sequences from an unspecified genome; no seqlengths
> sessionInfo()
R Under development (unstable) (2016-08-25 r71150)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] rtracklayer_1.34.1 GenomicRanges_1.26.2 GenomeInfoDb_1.10.2
[4] IRanges_2.8.1 S4Vectors_0.12.1 AnnotationHub_2.6.4
[7] BiocGenerics_0.20.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.9 compiler_3.4.0
[3] BiocInstaller_1.24.0 XVector_0.14.0
[5] bitops_1.0-6 tools_3.4.0
[7] zlibbioc_1.20.0 digest_0.6.11
[9] RSQLite_1.1-2 memoise_1.0.0
[11] lattice_0.20-34 Matrix_1.2-7.1
[13] shiny_1.0.0 DBI_0.5-1
[15] curl_2.3 yaml_2.1.14
[17] httr_1.2.1 Biostrings_2.42.1
[19] grid_3.4.0 Biobase_2.34.0
[21] R6_2.2.0 AnnotationDbi_1.36.1
[23] XML_3.98-1.5 BiocParallel_1.8.1
[25] Rsamtools_1.26.1 htmltools_0.3.5
[27] GenomicAlignments_1.10.0 SummarizedExperiment_1.4.0
[29] mime_0.5 interactiveDisplayBase_1.12.0
[31] xtable_1.8-2 httpuv_1.3.3
[33] RCurl_1.95-4.8

I don't know if the
GRangesfor gtf files are rebuilt regularly, I guess not, so the only explanation would be that the version ofrtracklayerat the time theGRangeswas built did not import all of the data from the GTF. I don't believe that Ensembl changes there files and adds new content to old releases (in fact, the GTF file was created on March 26 2016).