GFF3 and FASTA files from the latest release of Gencode are now available via AnnotationHub (biocVersion 3.2 only).
The GFF3 and FASTA files from the latest Gencode release for Homo sapiens (release 23) can be accessed with the following code snippet:
> library(AnnotationHub)
> ah = AnnotationHub()
snapshotDate(): 2015-08-26
> Human_gff = query(ah, c("Gencode", "gff", "human"))
> Human_gff
AnnotationHub with 9 records
# snapshotDate(): 2015-08-26
# $dataprovider: Gencode
# $species: Homo sapiens
# $rdataclass: GRanges
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
# sourcetype
# retrieve records with, e.g., 'object[["AH49554"]]'
title
AH49554 | gencode.v23.2wayconspseudos.gff3.gz
AH49555 | gencode.v23.annotation.gff3.gz
AH49556 | gencode.v23.basic.annotation.gff3.gz
AH49557 | gencode.v23.chr_patch_hapl_scaff.annotation.gff3.gz
AH49558 | gencode.v23.chr_patch_hapl_scaff.basic.annotation.gff3.gz
AH49559 | gencode.v23.long_noncoding_RNAs.gff3.gz
AH49560 | gencode.v23.polyAs.gff3.gz
AH49561 | gencode.v23.primary_assembly.annotation.gff3.gz
AH49562 | gencode.v23.tRNAs.gff3.gz
> Human_fasta = query(ah, c("Gencode", "fasta", "human"))
> Human_fasta
AnnotationHub with 5 records
# snapshotDate(): 2015-08-26
# $dataprovider: Gencode
# $species: Homo sapiens
# $rdataclass: FaFile
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
# sourcetype
# retrieve records with, e.g., 'object[["AH49563"]]'
title
AH49563 | gencode.v23.chr_patch_hapl_scaff.transcripts.fa.gz
AH49564 | gencode.v23.lncRNA_transcripts.fa.gz
AH49565 | gencode.v23.pc_transcripts.fa.gz
AH49566 | gencode.v23.pc_translations.fa.gz
AH49567 | gencode.v23.transcripts.fa.gz
To see the metadata for a file, subset the hub with the '[' operator; to download the file and load it into R, use the '[[' operator.
> ah["AH49562"]
AnnotationHub with 1 record
# snapshotDate(): 2015-08-26
# names(): AH49562
# $dataprovider: Gencode
# $species: Homo sapiens
# $rdataclass: GRanges
# $title: gencode.v23.tRNAs.gff3.gz
# $description: tRNA structures predicted by tRNA-Scan on reference chromosomes
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: GFF
# $sourceurl: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_23/ge...
# $sourcelastmodifieddate: 2015-07-16
# $sourcesize: 17419
# $tags: gencode, v23, tRNAs, gff3
# retrieve record with 'object[["AH49562"]]'
> gff = ah[["AH49562"]]
require(“rtracklayer”)
retrieving 1 resource
|======================================================================| 100%
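As a minimal sketch of that difference (assuming the same session and snapshot as above), subsetting with '[' stays on the metadata side and downloads nothing, so the fields of a record can be inspected with the usual accessors before deciding what to fetch. The variable name rec is only illustrative.

```r
library(AnnotationHub)
ah <- AnnotationHub()

# '[' returns a smaller AnnotationHub object; nothing is downloaded yet
rec <- ah["AH49562"]
names(rec)      # record identifier(s)
rec$title       # file name of the resource
rec$genome      # genome build the annotation refers to
mcols(rec)      # all metadata columns as a DataFrame

# '[[' downloads the resource (or reuses the local cache) and loads it
gff <- ah[["AH49562"]]
```

Inspecting with '[' first is cheap, which helps when a query returns several similarly named files.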
The GFF3 files are downloaded and read into R as a 'GRanges' object (from the GenomicRanges package), while the FASTA files are indexed, and both the FASTA file and its index are returned together as a 'FaFile' object (from the Rsamtools package).
> class(gff)
[1] "GRanges"
attr(,"package")
[1] "GenomicRanges"
> fas = ah[["AH49567"]]
retrieving 2 resources
|======================================================================| 100%
|======================================================================| 100%
There were 50 or more warnings (use warnings() to see the first 50)
> class(fas)
[1] "FaFile"
attr(,"package")
[1] "Rsamtools"
> fas
class: FaFile
path: /home/sarora/.AnnotationHub/56291
index: /home/sarora/.AnnotationHub/56292
isOpen: FALSE
yieldSize: NA
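Once retrieved, both objects can be used with the standard GenomicRanges and Rsamtools tools. The following is only a sketch of typical first steps, not part of the original announcement: it assumes the GFF3 import provides a 'type' metadata column and that sequence names follow the "chr1" style, and the names chr1, idx and seqs are illustrative.

```r
library(AnnotationHub)
library(GenomicRanges)
library(Rsamtools)

ah  <- AnnotationHub()
gff <- ah[["AH49562"]]   # tRNA predictions, returned as a GRanges
fas <- ah[["AH49567"]]   # transcript sequences, returned as a FaFile

# GRanges: tabulate feature types and keep the features on one chromosome
# (assumes a 'type' column and "chr1"-style sequence names)
table(gff$type)
chr1 <- gff[seqnames(gff) == "chr1"]

# FaFile: the index describes every sequence without loading any of them
idx <- scanFaIndex(fas)

# read only the first three sequences from the indexed file
seqs <- scanFa(fas, param = idx[1:3])
width(seqs)
```

Reading through the index avoids loading the whole transcript file into memory when only a few sequences are needed.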
Similarly, the Gencode GFF3 and FASTA files for the current mouse release (M6) can be found with:
> Mouse_gff = query(ah, c("Gencode", "gff", "mouse"))
> Mouse_fasta = query(ah, c("Gencode", "fasta", "mouse"))
> packageVersion('AnnotationHub')
[1] ‘2.1.40’
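The mouse record identifiers are not listed here, so one way to pick a file is to match on its title. This is a hedged sketch: the title pattern is a guess modelled on the human file names above, and hit and mouse_basic are hypothetical names.

```r
library(AnnotationHub)
ah <- AnnotationHub()
Mouse_gff <- query(ah, c("Gencode", "gff", "mouse"))

# list identifiers next to titles to see what the query returned
data.frame(id = names(Mouse_gff), title = Mouse_gff$title)

# pattern assumed from the human naming scheme; adjust to the titles printed above
hit <- grep("basic\\.annotation\\.gff3\\.gz$", Mouse_gff$title)

if (length(hit) > 0) {
    # download the first matching record by its identifier
    mouse_basic <- Mouse_gff[[names(Mouse_gff)[hit[1]]]]
}
```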
Sonali.