DEXSeq - Where to find and download the appropriate GTF reference file for Python Script
1
0
Entering edit mode
@gregmulhearn-7639
Last seen 4.2 years ago
Australia

We are using the DEXSeq package and are following the steps in the workflow vignette (“Inferring differential exon usage in RNASeq data with the DEXSeq package”).

We initially downloaded the Biconductor “Pasilla” Data Package (at http://bioconductor.org/packages/release/data/experiment/html/pasilla.html).  Using the sample data files that were provided as part of this package we were able to successfully work our way through the DEXSeq workflow from Section 3 onwards.

Now however, we have gone back and are trying to use the 2 Python scripts described in Section 2 of the DEXSeq workflow, to generate the flattened “.GFF” genome reference (output of the “dexseq_prepare_annotation.py” Python script), and the “.TXT” text files (output of the “dexseq_count.py” Python script).

We downloaded the relevant FastQ files from NCBI – GEO at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE18508. We then uploaded these to the GALAXY hub, and used the TopHat aligner to produce associated BAM files. When running TopHat in GALAXY, we specified “Use a built-in genome”, and selected “Fruit Fly (Drosophila melanogaster): dm3” as the reference genome to align against.

This produced BAM files where the chromosomes are identified in the format “CHRn” (where “n” is the chromosome name).  Therefore, for the input to the  “dexseq_prepare_annotation.py” Python script, we need to supply a GTF file where the chromosome names are specified in the “CHRn” format.  When we downloaded a GTF reference file from the Ensembl website (“http://www.ensembl.org/info/data/ftp/index.html”), the chromosomes were in the “n” format (rather than “CHRn” format).  We then went to the UCSC Table Browser (at “http://genome.ucsc.edu/cgi-bin/hgTables”) and downloaded a GTF reference file for “Drosophila Melanogaster BDGP R5/dm3”.  This file had the chromosomes in the correct “CHRn” format.  However when you look at the contents of this GTF file, the first few lines look like this:

chr4 dm3_ensGene     exon 90287 90475 0.000000   +    .    gene_id "FBtr0309803"; transcript_id "FBtr0309803";

chr4 dm3_ensGene     start_codon     93056 93058 0.000000   +    .    gene_id "FBtr0309803"; transcript_id "FBtr0309803";

chr4 dm3_ensGene     CDS  93056 93220 0.000000   +    0    gene_id "FBtr0309803"; transcript_id "FBtr0309803";

chr4 dm3_ensGene     exon 92947 93220 0.000000   +    .    gene_id "FBtr0309803"; transcript_id "FBtr0309803";

The “gene_id” fields are all in the form “FBtr0309803” when we would have expected them to be in the form “FBgn0309803”, (i.e. FBgn___ rather than FBtr___).

Thus when we tried to use this GTF file in the Python scripts, the results did not come out as expected. We have searched high and low for a GTF file where the “gene_id”s are in the FBgn___ format.  We have found a number of sources for GTF files, but it appears that all the ones (e.g. from the Ensembl website) that have the correct “gene_id”, have the chromosome names in the “n’ format (rather than the “CHRn” format).  Conversely, all of the GTF files we found that had the chromosome names in the needed ‘CHRn” format, had the “gene_id” form of FBtr___.

Can anyone please advise where we can obtain a GTF reference file in the correct format to feed in to the “dexseq_prepare_annotation.py” Python script for a BAM file that has chromosomes in the “CHRn” format.

dexseq gtf gff reference genome • 1.8k views
1
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States

You could use the AnnotationHub.

> library(AnnotationHub)
> hub <- AnnotationHub()
snapshotDate(): 2016-10-11
> query(hub, c("drosophila","gtf"))
AnnotationHub with 21 records
# snapshotDate(): 2016-10-11
# $dataprovider: Ensembl #$species: Drosophila melanogaster
# \$rdataclass: GRanges
# additional mcols(): taxonomyid, genome, description,
#   sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH7549"]]'

title
AH7549  | Drosophila_melanogaster.BDGP5.70.gtf
AH7610  | Drosophila_melanogaster.BDGP5.69.gtf
AH7657  | Drosophila_melanogaster.BDGP5.71.gtf
AH7717  | Drosophila_melanogaster.BDGP5.72.gtf
AH7780  | Drosophila_melanogaster.BDGP5.73.gtf
...       ...
AH50816 | Drosophila_melanogaster.BDGP6.84.chr.gtf
AH50817 | Drosophila_melanogaster.BDGP6.84.gtf
AH50985 | Drosophila_melanogaster.BDGP6.85.abinitio.gtf
AH50986 | Drosophila_melanogaster.BDGP6.85.chr.gtf
AH50987 | Drosophila_melanogaster.BDGP6.85.gtf        

Pick the last one, because I probably don't know any better...

> z <- hub[["AH50987"]]
Importing File into R ..
retrieving 1 resource
|======================================================================| 100%
<does stuff>
> z
GRanges object with 538684 ranges and 16 metadata columns:

<snip>

And now for the trick! This is the most important part...

> seqlevelsStyle(z) <- "UCSC"

And now we can export

> export(z, "tmp.gtf","gtf")

And at a terminal prompt:

head -n 4 tmp.gtf

##gff-version 2
##date 2017-04-21
##genome-build .        BDGP6
chr3R   FlyBase gene    722370  722621  .       -       .       gene_id "FBgn0085804"; gene_name "CR41571"; gene_source "FlyBase"; gene_biotype "pseudogene";
0
Entering edit mode

Thanks James for your reply.  I have followed your detailed instructions above and from looking at the resultant object, it looks like this will solve our problem.

However, when I get to the last step within "R"  (> export(z, "tmp.gtf", "gtf")), I get the message: Error: could not find function "export".  Is there another library/package (other than "AnnotationHub") that I need to load to make the "export" function available. I tried re-loading Bioconductor, but that didn't help. I assume that "export" must be part of some other library/package.

Also, when I looked at the "z" object both BEFORE and AFTER running the (seqlevelsStyle(z) <- "UCSC") command, it appears that the only difference is that it appended "CHR" to the front of the chromosome names in the "seqnames" column. Is this correct?

0
Entering edit mode

Yes, you also need rtracklayer, which provides the export function. Sorry for the oversight.

Well, seqlevelsStyle does more than simply adding a chr to the front of the chromosome names, but for our purposes that's the only part that matters.

0
Entering edit mode

Thanks for that James.

However, I have run in to another problem. When I went back in to R (actually I'm using RStudio on a Windows PC), and tried to re-run all the commands in your initial solution steps, I successfully completed the first command ( library(AnnotationHub) ).  However, when I entered the second command ( hub <- AnnotationHub() ), I get the following error message:

> hub <- AnnotationHub()
Error in loadNamespace(name) : there is no package called ‘curl’
database may not be current
database: ‘C:/Users/gmul3410/Documents/AppData/.AnnotationHub/annotationhub.sqlite3’
reason: there is no package called ‘curl’
>

This command worked the first time I used it yesterday.  The only thing that I can think of that has changed since then is that when I was trying to figure out what library I needed to load to get the "export" function working, I re-loaded Bioconductor, which may have installed a new version of Bioconductor, and some of the dependent packages associated with Bioconductor.

So thinking it may now be a compatibility issue between Bioconductor and "R" and maybe RStudio, I upgraded to the latest versions of "R" (Ver 3.4.0), and RStudio (Ver 1.0.143), and Bioconductor (Ver 3.5).

However, rather than fixing the problem, after I upgraded to the latest versions of everything, I now cannot even run the library(AnnotationHub) command.  I now get the following error message:

> library(AnnotationHub)
Error in library(AnnotationHub) :
there is no package called ‘AnnotationHub’

I tried going back to the previous version of "R" (Ver 3.3.3), which was still installed on my PC. This allowed me to successfully run the library(AnnotationHub) command again, but I still get the same error when I try to run the hub <- AnnotationHub()  command.

0
Entering edit mode

If you ever get an error of the form

Error in library(<somepackagename>): there is no package named <somepackagename>

That means you need to install.

library(BiocInstaller)

biocLite("<somepackagename>")
0
Entering edit mode

Thanks James.  I have been away for a week or so, but have been able to successfully follow through the processes in your answer, and everything worked well.  We used the GTF file:  Drosophila_melanogaster.BDGP5.78.gtf (loaded using the command   z <- hub[["AH28596"]] )

However, we now have a need to use a different GTF file (Drosophila_melanogaster.BDGP5.25.62.gtf).  But I was not able to locate this GTF by querying the AnnotationHub library.  However, we have separately downloaded this file from the Ensembl website (ftp://ftp.ensembl.org/pub/release-62/gtf/drosophila_melanogaster) onto a Windows-based file share.

Is there a way to import this (Drosophila_melanogaster.BDGP5.25.62.gtf) file from a Windows file share in to "R" and apply the same "seqlevelsStyle" function to it, (rather than using a command like  z <- hub[["AH28596"]] )

0
Entering edit mode

Yes.

> library(rtracklayer)
> z <- import("Drosophila_melanogaster.BDGP5.25.62.gtf.gz")
> seqlevelsStyle(z) <- "UCSC"
> export(z, "tmp.gtf", "gtf")

Then at a terminal prompt

head Drosophila_melanogaster.BDGP5.25.62.gtf
3R      protein_coding  exon    380     509     .       +       .        gene_id "FBgn0037213"; transcript_id "FBtr0078961"; exon_number "1"; gene_name "CG12581"; transcript_name "CG12581-RB";
3R      protein_coding  exon    578     1913    .       +       .        gene_id "FBgn0037213"; transcript_id "FBtr0078961"; exon_number "2"; gene_name "CG12581"; transcript_name "CG12581-RB";
3R      protein_coding  CDS     1115    1913    .       +       0        gene_id "FBgn0037213"; transcript_id "FBtr0078961"; exon_number "2"; gene_name "CG12581"; transcript_name "CG12581-RB"; protein_id "FBpp0078601";

And after the above we get

head tmp.gtf
##gff-version 2
##source-version rtracklayer 1.36.0
##date 2017-05-12
chr3R   protein_coding  exon    380     509     .       +       .       gene_id "FBgn0037213"; transcript_id "FBtr0078961"; exon_number "1"; gene_name "CG12581"; transcript_name "CG12581-RB";
chr3R   protein_coding  exon    578     1913    .       +       .       gene_id "FBgn0037213"; transcript_id "FBtr0078961"; exon_number "2"; gene_name "CG12581"; transcript_name "CG12581-RB";
chr3R   protein_coding  CDS     1115    1913    .       +       0       gene_id "FBgn0037213"; transcript_id "FBtr0078961"; exon_number "2"; gene_name "CG12581"; transcript_name "CG12581-RB"; protein_id "FBpp0078601"