DEXSeq conversion of GTF to GFF creating incorrect exon coordinates
knuppdav • 0

Hi all,

I have a problem converting the Ensembl mouse GTF file to GFF for DEXSeq. The script "works", but the output looks incorrect, unless I'm doing something wrong. The problem is that it seems to collapse transcripts incorrectly and assign the wrong exon coordinates.

Example:

GTF file (I'm not showing the full GTF lines, just enough to get the point across). Here are two exons from the same gene (Efnb2):

chr8 HAVANA exon 8660484 8661242 . - . gene_id "ENSMUSG00000001300.16"; transcript_id "ENSMUST00000001319.14"; gene_type "protein_coding"; gene_name "Efnb2"; transcript_type "protein_coding"; transcript_name "AC161867.1-001"; exon_number 1;

chr8 HAVANA exon 8639206 8639489 . - . gene_id "ENSMUSG00000001300.16"; transcript_id "ENSMUST00000001319.14"; gene_type "protein_coding"; gene_name "Efnb2"; transcript_type "protein_coding"; transcript_name "AC161867.1-001"; exon_number 2;

From the same GTF file, two exons from the same gene but a DIFFERENT transcript:

chr8 HAVANA exon 8639206 8639363 . - . gene_id "ENSMUSG00000001300.16"; transcript_id "ENSMUST00000152698.1"; gene_type "protein_coding"; gene_name "Efnb2"; transcript_type "protein_coding"; transcript_name "AC161867.1-002"; exon_number 1;

chr8 HAVANA exon 8622331 8622444 . - . gene_id "ENSMUSG00000001300.16"; transcript_id "ENSMUST00000152698.1"; gene_type "protein_coding"; gene_name "Efnb2"; transcript_type "protein_coding"; transcript_name "AC161867.1-002"; exon_number 2;

Take note of the coordinates for each of these exons.

Output in GFF file from DEXSeq:

chr8 dexseq_prepare_annotation.py exonic_part 8617434 8620596 . - . gene_id "ENSMUSG00000001300.16"; transcripts "ENSMUST00000001319.14"; exonic_part_number "001"

chr8 dexseq_prepare_annotation.py exonic_part 8620597 8620976 . - . gene_id "ENSMUSG00000001300.16"; transcripts "ENSMUST00000152698.1+ENSMUST00000001319.14"; exonic_part_number "002"

These are not the right coordinates!

Instead, DEXSeq annotates them like this: exon 2 is now split into exonic parts 005 and 006. Compare the start coordinate of part 005 and the end coordinate of part 006 listed below with the start/stop coordinates of exon 2 from the GTF file:

chr8 dexseq_prepare_annotation.py exonic_part 8639206 8639363 . - . gene_id "ENSMUSG00000001300.16"; transcripts "ENSMUST00000152698.1+ENSMUST00000001319.14"; exonic_part_number "005"

chr8 dexseq_prepare_annotation.py exonic_part 8639364 8639489 . - . gene_id "ENSMUSG00000001300.16"; transcripts "ENSMUST00000001319.14"; exonic_part_number "006"

Has anyone else had this problem, or does anyone know how to fix it? I think it is collapsing exons from different transcripts together, but obviously not in the correct way. When I load the exon coordinates as BED files into IGV, none of them line up as you'd expect.

software error DEXSeq • 1.2k views

Thanks for your post! Sorry, I could not completely follow your message. Is the script doing something different from what is described in the documentation? Based on your post I did not see that this was the case, but I could be wrong.

See this image for reference: https://images.app.goo.gl/QDZWMmsttkBTGaGZA

Or what is the output that you were expecting?
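
For reference, here is a minimal sketch of the kind of flattening the documentation describes, where overlapping exons from different transcripts are cut into non-overlapping "exonic parts". This is not the code of dexseq_prepare_annotation.py; the function name `flatten_exons` and the tuple layout are assumptions made for illustration, and the input is only the four exon lines quoted in the question.

```python
def flatten_exons(exons):
    """Cut exons into disjoint exonic parts.

    exons: list of (start, end, transcript_id), 1-based inclusive
    coordinates as in GTF. Returns a list of
    (start, end, [transcript_ids]) sorted by start coordinate.
    Simplification: adjacent parts with identical transcript sets
    are not merged.
    """
    # Every exon start and every (end + 1) is a potential cut point.
    cuts = sorted({s for s, _, _ in exons} | {e + 1 for _, e, _ in exons})
    parts = []
    for left, right in zip(cuts, cuts[1:]):
        seg_start, seg_end = left, right - 1
        # Transcripts whose exon fully covers this segment.
        covering = sorted(
            {t for s, e, t in exons if s <= seg_start and seg_end <= e}
        )
        if covering:  # skip intronic gaps between exons
            parts.append((seg_start, seg_end, covering))
    return parts


# The four exon records quoted in the question (gene Efnb2, minus strand).
exons = [
    (8660484, 8661242, "ENSMUST00000001319.14"),
    (8639206, 8639489, "ENSMUST00000001319.14"),
    (8639206, 8639363, "ENSMUST00000152698.1"),
    (8622331, 8622444, "ENSMUST00000152698.1"),
]

parts = flatten_exons(exons)
for number, (start, end, transcripts) in enumerate(parts, start=1):
    # Numbering here covers only these four exons, so it will not match
    # the 005/006 numbers in the real GFF, which counts all exons of the gene.
    print(f"{number:03d}", start, end, "+".join(transcripts))
```

With these four exons, the exon starting at 8639206 comes out as two parts: 8639206-8639363 (shared by both transcripts) and 8639364-8639489 (ENSMUST00000001319.14 only). Those are the same coordinates as exonic parts 005 and 006 in the GFF lines above, so that split is what flattening is expected to produce. The sketch also numbers parts by ascending coordinate; if the real script does the same (an assumption worth checking against the documentation), parts 001 and 002 of a minus-strand gene would sit at its lowest coordinates rather than at exon_number 1.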
