Search
Question: mismatch between AnnotationHub ensembl v71 hsapiens and direct download using rtracklayer
1
gravatar for mviterson
10 months ago by
mviterson10
Netherlands
mviterson10 wrote:

Dear all,

I expect AnnotationHub resource for ensembl v71 and a direct download using rtracklayer should give identical results. The code snipped below shows that there are some gene_identifiers missing in the AnnotationHub extracted data. Is there some additional processing done in the AnnotationHub recipe for ensembl v71? The last section of the code shows there is a mismatch in annotation between the two resources. Does AH7666  really reflects Homo_sapiens.GRCh37.71.gtf?

Kind regards,

Maarten

ah <- AnnotationHub()
updating metadata: retrieving 1 resource
  |======================================================================| 100%
snapshotDate(): 2016-10-11
> query(ah, c("ensembl", "sapiens", "71"))
AnnotationHub with 7 records
# snapshotDate(): 2016-10-11
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: FaFile, GRanges
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH7666"]]'

            title                                    
  AH7666  | Homo_sapiens.GRCh37.71.gtf               
  AH11241 | Homo_sapiens.GRCh37.71.cdna.all.fa       
  AH11242 | Homo_sapiens.GRCh37.71.dna_rm.toplevel.fa
  AH11243 | Homo_sapiens.GRCh37.71.dna_sm.toplevel.fa
  AH11244 | Homo_sapiens.GRCh37.71.dna.toplevel.fa   
  AH11245 | Homo_sapiens.GRCh37.71.ncrna.fa          
  AH11246 | Homo_sapiens.GRCh37.71.pep.all.fa        
> gtfAH <- ah[["AH7666"]]
require(“GenomicRanges”)
downloading from ‘https://annotationhub.bioconductor.org/fetch/7666’
retrieving 1 resource
  |======================================================================| 100%
using guess work to populate seqinfo
> library(rtracklayer)
> gtfRT <- import(metadata(gtfAH)$`Data Source`)
trying URL 'ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz'
ftp data connection made, file length 26733273 bytes
==================================================
downloaded 25.5 MB

> ensRT <- mcols(gtfRT)$gene_id
> ensAH <- mcols(gtfAH)$gene_id
> str(ensRT)
 chr [1:2253155] "ENSG00000261516" "ENSG00000261516" "ENSG00000261516" ...
> str(ensAH)
 chr [1:2253155] "ENSG00000261516" "ENSG00000261516" "ENSG00000261516" ...
> length(unique(ensRT))
[1] 61893
> length(unique(ensAH))
[1] 37482
> head(setdiff(ensRT, ensAH))
[1] "ENSG00000236896" "ENSG00000212521" "ENSG00000095380" "ENSG00000106785"
[5] "ENSG00000106789" "ENSG00000095383"
> which(mcols(gtfAH)$gene_id == "ENSG00000236896")
integer(0)

> gtfRT[mcols(gtfRT)$gene_id == "ENSG00000236896"]
GRanges object with 2 ranges and 12 metadata columns:
      seqnames                 ranges strand |    source     type     score
         <Rle>              <IRanges>  <Rle> |  <factor> <factor> <numeric>
  [1]        9 [100749802, 100749938]      - | antisense     exon      <NA>
  [2]        9 [100748833, 100749097]      - | antisense     exon      <NA>
          phase         gene_id   transcript_id exon_number     gene_name
      <integer>     <character>     <character> <character>   <character>
  [1]      <NA> ENSG00000236896 ENST00000411981           1 RP11-535C21.3
  [2]      <NA> ENSG00000236896 ENST00000411981           2 RP11-535C21.3
      gene_biotype   transcript_name         exon_id  protein_id
       <character>       <character>     <character> <character>
  [1]    antisense RP11-535C21.3-001 ENSE00001610477        <NA>
  [2]    antisense RP11-535C21.3-001 ENSE00001796798        <NA>
  -------
  seqinfo: 244 sequences from an unspecified genome; no seqlengths
> gr <- gtfRT[mcols(gtfRT)$gene_id == "ENSG00000236896"]
> subsetByOverlaps(gr, gtfAH)
GRanges object with 2 ranges and 12 metadata columns:
      seqnames                 ranges strand |    source     type     score
         <Rle>              <IRanges>  <Rle> |  <factor> <factor> <numeric>
  [1]        9 [100749802, 100749938]      - | antisense     exon      <NA>
  [2]        9 [100748833, 100749097]      - | antisense     exon      <NA>
          phase         gene_id   transcript_id exon_number     gene_name
      <integer>     <character>     <character> <character>   <character>
  [1]      <NA> ENSG00000236896 ENST00000411981           1 RP11-535C21.3
  [2]      <NA> ENSG00000236896 ENST00000411981           2 RP11-535C21.3
      gene_biotype   transcript_name         exon_id  protein_id
       <character>       <character>     <character> <character>
  [1]    antisense RP11-535C21.3-001 ENSE00001610477        <NA>
  [2]    antisense RP11-535C21.3-001 ENSE00001796798        <NA>
  -------
  seqinfo: 244 sequences from an unspecified genome; no seqlengths

> sessionInfo()
R Under development (unstable) (2016-08-25 r71150)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
[1] rtracklayer_1.34.1   GenomicRanges_1.26.2 GenomeInfoDb_1.10.2
[4] IRanges_2.8.1        S4Vectors_0.12.1     AnnotationHub_2.6.4
[7] BiocGenerics_0.20.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9                   compiler_3.4.0               
 [3] BiocInstaller_1.24.0          XVector_0.14.0               
 [5] bitops_1.0-6                  tools_3.4.0                  
 [7] zlibbioc_1.20.0               digest_0.6.11                
 [9] RSQLite_1.1-2                 memoise_1.0.0                
[11] lattice_0.20-34               Matrix_1.2-7.1               
[13] shiny_1.0.0                   DBI_0.5-1                    
[15] curl_2.3                      yaml_2.1.14                  
[17] httr_1.2.1                    Biostrings_2.42.1            
[19] grid_3.4.0                    Biobase_2.34.0               
[21] R6_2.2.0                      AnnotationDbi_1.36.1         
[23] XML_3.98-1.5                  BiocParallel_1.8.1           
[25] Rsamtools_1.26.1              htmltools_0.3.5              
[27] GenomicAlignments_1.10.0      SummarizedExperiment_1.4.0   
[29] mime_0.5                      interactiveDisplayBase_1.12.0
[31] xtable_1.8-2                  httpuv_1.3.3                 
[33] RCurl_1.95-4.8              

 

 

ADD COMMENTlink modified 10 months ago by Valerie Obenchain ♦♦ 6.4k • written 10 months ago by mviterson10

I don't know if the GRanges for gtf files are rebuilt regularly, I guess not, so the only explanation would be that the version of rtracklayer at the time the GRanges was built did not import all of the data from the GTF. I don't believe that Ensembl changes there files and adds new content to old releases (in fact, the GTF file was created on March 26 2016).

ADD REPLYlink written 10 months ago by Johannes Rainer1.0k
0
gravatar for mviterson
10 months ago by
mviterson10
Netherlands
mviterson10 wrote:

The mismatch between annotationhub and direct download using rtracklayer does not only involve ensembl gtf v71. For example, I got similar results for

ftp://ftp.ensembl.org/pub/release-70/gtf/homo_sapiens/Homo_sapiens.GRCh37.70.gtf.gz

Furthermore, all possible snapshot dates gave showed the same mismatch? Is there something wrong with the conversion of AnnotationHub from GTF to GRanges? I try to find the code of this but could not find it yet!

 

ADD COMMENTlink written 10 months ago by mviterson10

have you tried that with something more recent too, e.g. Ensembl 84 or 85?

ADD REPLYlink written 10 months ago by Johannes Rainer1.0k
0
gravatar for Valerie Obenchain
10 months ago by
Valerie Obenchain ♦♦ 6.4k
United States
Valerie Obenchain ♦♦ 6.4k wrote:

Hi,

Prior to July 2016, the GTF files from Ensembl were pre-built and stored in S3 (versions <=83). Once built they are not updated unless a bug is reported. The date of the GTF 71 files in the S3 bucket is November 2013.

As of July 2016, there is no pre-building of the GTF (versions >=84) The metadata in AnnotationHub points to the Ensembl url, the file is downloaded and converted to GTF on the fly. The processing happens in the .get1,GTFFileResource-method in AnnotationHub.

In this code chunk, 'yy' is the url and AnnotationHub:::.tidyGRanges() adds metadata to the GRanges, cleans up the seqinfo and does some sorting.

setMethod(".get1", "GTFFileResource",
    function(x, ...)
{
    message("Importing File into R ..")
    .require("rtracklayer")
    .require("GenomeInfoDb")
    yy <- getHub(x)
    gtf <- rtracklayer::import(cache(yy), format="gtf", genome=yy$genome, ...)
    .tidyGRanges(x, gtf)
})

If the Ensembl 71 GTF files were built in Nov 2013 you'd need to look at the AnnotationHubData (not AnnotationHub) code in Bioconductor 2.12 to see how the files were processed. As you say, something may have changed in rtrackalyer or it could be how rtracklayer::import() was called from the AnnotationHubData recipe.

As Jo mentioned, it would be interesting to see if you find inconsistencies with more current versions. I'm open to modifying the .get1() method if data that should be included are being dropped.

Valerie

ADD COMMENTlink written 10 months ago by Valerie Obenchain ♦♦ 6.4k

Thank you Valerie for the explanation. I will check other Ensembl version as suggested by Johannes.

Just a thought; would it be possible for all version of the Ensembl gtf's to be build on the fly?
 

ADD REPLYlink written 10 months ago by mviterson10

Single-square-bracket subsetting a hub object returns information about the resource, including it's source url. Use that for direct import, rtracklayer::import(ah["AH7666"]$sourceurl)

ADD REPLYlink written 10 months ago by Martin Morgan ♦♦ 20k

Thanks this is a really useful to know!

ADD REPLYlink written 9 months ago by mviterson10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 308 users visited in the last hour