We are using the DEXSeq package and are following the steps in the workflow vignette (“Inferring differential exon usage in RNASeq data with the DEXSeq package”).
We initially downloaded the Biconductor “Pasilla” Data Package (at http://bioconductor.org/packages/release/data/experiment/html/pasilla.html). Using the sample data files that were provided as part of this package we were able to successfully work our way through the DEXSeq workflow from Section 3 onwards.
Now however, we have gone back and are trying to use the 2 Python scripts described in Section 2 of the DEXSeq workflow, to generate the flattened “.GFF” genome reference (output of the “dexseq_prepare_annotation.py” Python script), and the “.TXT” text files (output of the “dexseq_count.py” Python script).
We downloaded the relevant FastQ files from NCBI – GEO at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE18508. We then uploaded these to the GALAXY hub, and used the TopHat aligner to produce associated BAM files. When running TopHat in GALAXY, we specified “Use a built-in genome”, and selected “Fruit Fly (Drosophila melanogaster): dm3” as the reference genome to align against.
This produced BAM files where the chromosomes are identified in the format “CHRn” (where “n” is the chromosome name). Therefore, for the input to the “dexseq_prepare_annotation.py” Python script, we need to supply a GTF file where the chromosome names are specified in the “CHRn” format. When we downloaded a GTF reference file from the Ensembl website (“http://www.ensembl.org/info/data/ftp/index.html”), the chromosomes were in the “n” format (rather than “CHRn” format). We then went to the UCSC Table Browser (at “http://genome.ucsc.edu/cgi-bin/hgTables”) and downloaded a GTF reference file for “Drosophila Melanogaster BDGP R5/dm3”. This file had the chromosomes in the correct “CHRn” format. However when you look at the contents of this GTF file, the first few lines look like this:
chr4 dm3_ensGene exon 90287 90475 0.000000 + . gene_id "FBtr0309803"; transcript_id "FBtr0309803";
chr4 dm3_ensGene start_codon 93056 93058 0.000000 + . gene_id "FBtr0309803"; transcript_id "FBtr0309803";
chr4 dm3_ensGene CDS 93056 93220 0.000000 + 0 gene_id "FBtr0309803"; transcript_id "FBtr0309803";
chr4 dm3_ensGene exon 92947 93220 0.000000 + . gene_id "FBtr0309803"; transcript_id "FBtr0309803";
The “gene_id” fields are all in the form “FBtr0309803” when we would have expected them to be in the form “FBgn0309803”, (i.e. FBgn___ rather than FBtr___).
Thus when we tried to use this GTF file in the Python scripts, the results did not come out as expected. We have searched high and low for a GTF file where the “gene_id”s are in the FBgn___ format. We have found a number of sources for GTF files, but it appears that all the ones (e.g. from the Ensembl website) that have the correct “gene_id”, have the chromosome names in the “n’ format (rather than the “CHRn” format). Conversely, all of the GTF files we found that had the chromosome names in the needed ‘CHRn” format, had the “gene_id” form of FBtr___.
Can anyone please advise where we can obtain a GTF reference file in the correct format to feed in to the “dexseq_prepare_annotation.py” Python script for a BAM file that has chromosomes in the “CHRn” format.