Search
Question: DEXSeq - Where to find and download the appropriate GTF reference file for Python Script
0
gravatar for greg.mulhearn
4 days ago by
Australia
greg.mulhearn0 wrote:

We are using the DEXSeq package and are following the steps in the workflow vignette (“Inferring differential exon usage in RNASeq data with the DEXSeq package”).

We initially downloaded the Biconductor “Pasilla” Data Package (at http://bioconductor.org/packages/release/data/experiment/html/pasilla.html).  Using the sample data files that were provided as part of this package we were able to successfully work our way through the DEXSeq workflow from Section 3 onwards.

Now however, we have gone back and are trying to use the 2 Python scripts described in Section 2 of the DEXSeq workflow, to generate the flattened “.GFF” genome reference (output of the “dexseq_prepare_annotation.py” Python script), and the “.TXT” text files (output of the “dexseq_count.py” Python script).

We downloaded the relevant FastQ files from NCBI – GEO at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE18508. We then uploaded these to the GALAXY hub, and used the TopHat aligner to produce associated BAM files. When running TopHat in GALAXY, we specified “Use a built-in genome”, and selected “Fruit Fly (Drosophila melanogaster): dm3” as the reference genome to align against.

This produced BAM files where the chromosomes are identified in the format “CHRn” (where “n” is the chromosome name).  Therefore, for the input to the  “dexseq_prepare_annotation.py” Python script, we need to supply a GTF file where the chromosome names are specified in the “CHRn” format.  When we downloaded a GTF reference file from the Ensembl website (“http://www.ensembl.org/info/data/ftp/index.html”), the chromosomes were in the “n” format (rather than “CHRn” format).  We then went to the UCSC Table Browser (at “http://genome.ucsc.edu/cgi-bin/hgTables”) and downloaded a GTF reference file for “Drosophila Melanogaster BDGP R5/dm3”.  This file had the chromosomes in the correct “CHRn” format.  However when you look at the contents of this GTF file, the first few lines look like this:

chr4 dm3_ensGene     exon 90287 90475 0.000000   +    .    gene_id "FBtr0309803"; transcript_id "FBtr0309803";

chr4 dm3_ensGene     start_codon     93056 93058 0.000000   +    .    gene_id "FBtr0309803"; transcript_id "FBtr0309803";

chr4 dm3_ensGene     CDS  93056 93220 0.000000   +    0    gene_id "FBtr0309803"; transcript_id "FBtr0309803";

chr4 dm3_ensGene     exon 92947 93220 0.000000   +    .    gene_id "FBtr0309803"; transcript_id "FBtr0309803";

 

The “gene_id” fields are all in the form “FBtr0309803” when we would have expected them to be in the form “FBgn0309803”, (i.e. FBgn___ rather than FBtr___).

Thus when we tried to use this GTF file in the Python scripts, the results did not come out as expected. We have searched high and low for a GTF file where the “gene_id”s are in the FBgn___ format.  We have found a number of sources for GTF files, but it appears that all the ones (e.g. from the Ensembl website) that have the correct “gene_id”, have the chromosome names in the “n’ format (rather than the “CHRn” format).  Conversely, all of the GTF files we found that had the chromosome names in the needed ‘CHRn” format, had the “gene_id” form of FBtr___.

Can anyone please advise where we can obtain a GTF reference file in the correct format to feed in to the “dexseq_prepare_annotation.py” Python script for a BAM file that has chromosomes in the “CHRn” format.

ADD COMMENTlink modified 3 days ago by James W. MacDonald43k • written 4 days ago by greg.mulhearn0
1
gravatar for James W. MacDonald
3 days ago by
United States
James W. MacDonald43k wrote:

You could use the AnnotationHub.

> library(AnnotationHub)
> hub <- AnnotationHub()
snapshotDate(): 2016-10-11
> query(hub, c("drosophila","gtf"))
AnnotationHub with 21 records
# snapshotDate(): 2016-10-11
# $dataprovider: Ensembl
# $species: Drosophila melanogaster
# $rdataclass: GRanges
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH7549"]]'

            title                                        
  AH7549  | Drosophila_melanogaster.BDGP5.70.gtf         
  AH7610  | Drosophila_melanogaster.BDGP5.69.gtf         
  AH7657  | Drosophila_melanogaster.BDGP5.71.gtf         
  AH7717  | Drosophila_melanogaster.BDGP5.72.gtf         
  AH7780  | Drosophila_melanogaster.BDGP5.73.gtf         
  ...       ...                                          
  AH50816 | Drosophila_melanogaster.BDGP6.84.chr.gtf     
  AH50817 | Drosophila_melanogaster.BDGP6.84.gtf         
  AH50985 | Drosophila_melanogaster.BDGP6.85.abinitio.gtf
  AH50986 | Drosophila_melanogaster.BDGP6.85.chr.gtf     
  AH50987 | Drosophila_melanogaster.BDGP6.85.gtf        

Pick the last one, because I probably don't know any better...

> z <- hub[["AH50987"]]
Importing File into R ..
downloading from 'https://annotationhub.bioconductor.org/fetch/57725'
retrieving 1 resource
  |======================================================================| 100%
<does stuff>
> z
GRanges object with 538684 ranges and 16 metadata columns:

<snip>

And now for the trick! This is the most important part...

> seqlevelsStyle(z) <- "UCSC"

And now we can export

> export(z, "tmp.gtf","gtf") 

And at a terminal prompt:

head -n 4 tmp.gtf

##gff-version 2
##date 2017-04-21
##genome-build .        BDGP6
chr3R   FlyBase gene    722370  722621  .       -       .       gene_id "FBgn0085804"; gene_name "CR41571"; gene_source "FlyBase"; gene_biotype "pseudogene";
ADD COMMENTlink written 3 days ago by James W. MacDonald43k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 133 users visited in the last hour