Question

DEXSeq Python script dexseq_prepare_annotation.py unable to process my .gtf file

0

Entering edit mode

Raito92 ▴ 60

@raito92-20399

Last seen 3.6 years ago

Italy

Good evening to everyone,

I'm trying to run the workflow DEXSeq to perform a DE analysis on Olea europaea transcripts.

I think I need some help here... More specifically, I'm stuck at the paragraph 2.4 Preparing the annotation. Apparently, I need to run the script dexseqprepareannotation.py to prepare my .GTF file for the workflow. I'm using the release Oe6 from https://denovo.cnag.cat/olive , which is a .GFF3 instead. This is the exact file I've used.

Maybe I'm missing something, I thought it would make more sense to use the .GFF3 directly, but I first converted it to .GTF and then to a .GFF to stick to the workflow, as follows:

gffread OE6A.newFA.gff3 -T -o OE6A.newFA.gtf

python dexseq_prepare_annotation.py OE6A.newFA.gtf OE6A.newFA.gff

After a few seconds, I get this error:

Traceback (most recent call last):
  File "dexseq_prepare_annotation.py", line 127, in <module>
    assert l[i].iv.end <= l[i+1].iv.start, str(l[i+1]) + " starts too early"
AssertionError: <GenomicFeature: exonic_part 'OE6A112357' at Oe6_s02107: 324809 -> 321700 (strand '-')> starts too early

I tried to remove this specific gene from my .GFF, but there are plenty of them, always making the script outputting the same error, just with a different line. And of course, since I found at least a dozen, this will end up potentially cutting out important information.

Here, an user reported a similar problem. Apparently, it's something related to the way my annotation file was built.

Look at the OE6A112357 annotation retrieved from my .GTF.

Oe6_s02107  CNAG    transcript  318359  324919  .   +   .   transcript_id "OE6A112357T1"; gene_id "OE6A112357";
Oe6_s02107  CNAG    exon    318359  318658  .   +   .   transcript_id "OE6A112357T1"; gene_id "OE6A112357";
Oe6_s02107  CNAG    exon    319380  319449  .   +   .   transcript_id "OE6A112357T1"; gene_id "OE6A112357";
Oe6_s02107  CNAG    exon    319924  321129  .   +   .   transcript_id "OE6A112357T1"; gene_id "OE6A112357";
Oe6_s02107  CNAG    exon    321690  324919  .   +   .   transcript_id "OE6A112357T1"; gene_id "OE6A112357";
Oe6_s02107  CNAG    CDS 318600  318658  .   +   0   transcript_id "OE6A112357T1"; gene_id "OE6A112357";
Oe6_s02107  CNAG    CDS 319380  319449  .   +   1   transcript_id "OE6A112357T1"; gene_id "OE6A112357";
Oe6_s02107  CNAG    CDS 319924  321129  .   +   0   transcript_id "OE6A112357T1"; gene_id "OE6A112357";
Oe6_s02107  CNAG    CDS 321690  322262  .   +   0   transcript_id "OE6A112357T1"; gene_id "OE6A112357";
Oe6_s02107  CNAG    transcript  319165  324919  .   +   .   transcript_id "OE6A112357T2"; gene_id "OE6A112357";
Oe6_s02107  CNAG    exon    319165  321129  .   +   .   transcript_id "OE6A112357T2"; gene_id "OE6A112357";
Oe6_s02107  CNAG    exon    321690  324919  .   +   .   transcript_id "OE6A112357T2"; gene_id "OE6A112357";
Oe6_s02107  CNAG    CDS 320125  321129  .   +   0   transcript_id "OE6A112357T2"; gene_id "OE6A112357";
Oe6_s02107  CNAG    CDS 321690  322262  .   +   0   transcript_id "OE6A112357T2"; gene_id "OE6A112357";
**Oe6_s02107    CNAG    transcript  321702  326749  .   -   .   transcript_id "OE6A112357T4"; gene_id "OE6A112357";
Oe6_s02107  CNAG    exon    321702  324810  .   -   .   transcript_id "OE6A112357T4"; gene_id "OE6A112357";
Oe6_s02107  CNAG    exon    325991  326749  .   -   .   transcript_id "OE6A112357T4"; gene_id "OE6A112357";
Oe6_s02107  CNAG    CDS 322942  324810  .   -   0   transcript_id "OE6A112357T4"; gene_id "OE6A112357";
Oe6_s02107  CNAG    CDS 325991  326230  .   -   0   transcript_id "OE6A112357T4"; gene_id "OE6A112357";**

The OE6A112357T4 part is apparently the problem, since that specific transcript is on the - strain, while the others are on the +.

Can anyone help me here?

Can I use my .GFF3 directly for the analysis somehow (I tried and it's not accepted by the scripts), maybe bypassing the problem? If this isn't the case, what may the problem about this script be and how can I solve it?

I'm also attaching my converted .GTF file, as obtained by gffread.

Thanks in advance for your help!

dexseq python gtf gff3 annotation • 2.4k views

ADD COMMENT • link 5.8 years ago • updated 5.7 years ago Raito92 ▴ 60

score 1 · Answer 1 · 2020-05-05

1

Entering edit mode

Alejandro Reyes ★ 1.9k

@alejandro-reyes-5124

Last seen 18 months ago

Novartis Institutes for BioMedical Rese…

Thanks for your detailed message. That's an unusual (and maybe biologically very interesting!?) annotation. HTSeq is interpreting as an error in the annotation, as all the transcripts of a gene should be in the same strand. An easy solution would be to modify the gene identifier for the transcript in the "-" strand or removing that transcript from the gtf annotation.

ADD COMMENT • link 5.7 years ago Alejandro Reyes ★ 1.9k

0

Entering edit mode

Hello, thanks for your answer! I have no idea how accurate the annotation may be, since it wasn't created by my research team. As far as I know, it's only a draft and a better version for this specific species is still to be created. For sure, it's quite interesting and unusual, as you said as well. Thanks for your suggestion, that's exactly what I did, even though it meant cutting out some information. In the end, I removed the genes which lead to this problem (only 18 of them, they were quite few, fortunately) from my annotation file manually. Id did this by just retrieving their names every time I re-ran the script and got an error, removing the corresponding annotation information from my .gtf file, re-inputing the corrected file and so on, till everything worked fine.

ADD REPLY • link 5.7 years ago Raito92 ▴ 60