Question

DEXSeq: dexseq_count.py - Failure parsing GFF attribute line

lcscs12345 • 0

@lcscs12345-9530

Hi,

I get the following error when running dexseq_count.py. The annotation is a flattened version of Mus_musculus.GRCm38.83.gtf downloaded from Ensembl (the same error occurs with flattened Mus_musculus.GRCm38.75.gtf and Mus_musculus.NCBIM37.66.gtf). Should I report this to HTSeq instead? Thank you!

$ python dexseq_count.py -p no -s no accepted_hits.sam flattened.gtf dexseq.out
Traceback (most recent call last):
  File "dexseq_count.py", line 94, in <module>
    for f in  HTSeq.GFF_Reader( gff_file ):
  File "/usr/local/lib/python2.7/dist-packages/HTSeq/__init__.py", line 208, in __iter__
    ( attr, name ) = parse_GFF_attribute_string( attributeStr, True )
  File "/usr/local/lib/python2.7/dist-packages/HTSeq/__init__.py", line 164, in parse_GFF_attribute_string
    raise ValueError, "Failure parsing GFF attribute line"
ValueError: Failure parsing GFF attribute line

$ pip show HTSeq
---
Name: HTSeq
Version: 0.6.1p1
Location: /usr/local/lib/python2.7/dist-packages
Requires:

$ cat /path/to/DEXSeq/DESCRIPTION
Package: DEXSeq
Version: 1.16.7

----------------------------

EDIT:

$ head dexseq.gtf
1       dexseq_prepare_annotation.py    aggregate_gene  3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693"
1       dexseq_prepare_annotation.py    exonic_part     3073253 3074322 .       +       .       transcripts "ENSMUST00000193812"; exonic_part_number "001"; gene_id "ENSMUSG00000102693"
1       dexseq_prepare_annotation.py    aggregate_gene  3102016 3102125 .       +       .       gene_id "ENSMUSG00000064842"
1       dexseq_prepare_annotation.py    exonic_part     3102016 3102125 .       +       .       transcripts "ENSMUST00000082908"; exonic_part_number "001"; gene_id "ENSMUSG00000064842"
1       dexseq_prepare_annotation.py    aggregate_gene  3205901 3671498 .       -       .       gene_id "ENSMUSG00000051951"
1       dexseq_prepare_annotation.py    exonic_part     3205901 3206522 .       -       .       transcripts "ENSMUST00000162897"; exonic_part_number "001"; gene_id "ENSMUSG00000051951"
1       dexseq_prepare_annotation.py    exonic_part     3206523 3207317 .       -       .       transcripts "ENSMUST00000162897+ENSMUST00000159265"; exonic_part_number "002"; gene_id "ENSMUSG00000051951"
1       dexseq_prepare_annotation.py    exonic_part     3213439 3213608 .       -       .       transcripts "ENSMUST00000159265"; exonic_part_number "003"; gene_id "ENSMUSG00000051951"
1       dexseq_prepare_annotation.py    exonic_part     3213609 3214481 .       -       .       transcripts "ENSMUST00000162897+ENSMUST00000159265"; exonic_part_number "004"; gene_id "ENSMUSG00000051951"
1       dexseq_prepare_annotation.py    exonic_part     3214482 3215632 .       -       .       transcripts "ENSMUST00000070533+ENSMUST00000162897+ENSMUST00000159265"; exonic_part_number "005"; gene_id "ENSMUSG00000051951"

I get a different error when using -f bam:

$ python dexseq_count.py -p no -s no -f bam accepted_hits.bam dexseq.gtf dexseq.out
Traceback (most recent call last):
  File "dexseq_count.py", line 94, in <module>
    for f in  HTSeq.GFF_Reader( gff_file ):
  File "/usr/local/lib/python2.7/dist-packages/HTSeq/__init__.py", line 207, in __iter__
    strand, frame, attributeStr ) = line.split( "\t", 8 )
ValueError: need more than 2 values to unpack

----------------------------

DEXSeq • 1.9k views

Posted 8.3 years ago by lcscs12345; updated 8.3 years ago by Alejandro Reyes

Thanks a lot for your detailed report! Could you include the first lines of your flattened gtf file?

Alejandro

Posted 8.3 years ago by Alejandro Reyes

Interestingly, I get a different error for the flattened Homo_sapiens.GRCh37.70.gtf.

$ python dexseq_count.py -p no accepted_hits.sam dexseq.gtf dexseq.out
Traceback (most recent call last):
  File "dexseq_count.py", line 94, in <module>
    for f in  HTSeq.GFF_Reader( gff_file ):
  File "/usr/local/lib/python2.7/dist-packages/HTSeq/__init__.py", line 210, in __iter__
    iv = GenomicInterval( seqname, int(start)-1, int(end), strand )
  File "_HTSeq.pyx", line 62, in HTSeq._HTSeq.GenomicInterval.__init__ (src/_HTSeq.c:2789)
  File "_HTSeq.pyx", line 71, in HTSeq._HTSeq.GenomicInterval.strand.__set__ (src/_HTSeq.c:2910)
ValueError: Strand must be'+', '-', or '.'.

$ head dexseq.gtf
1       dexseq_prepare_annotation.py    aggregate_gene  11869   14412   .       +       .       gene_id "ENSG00000223972"
1       dexseq_prepare_annotation.py    exonic_part     11869   11871   .       +       .       transcripts "ENST00000456328"; exonic_part_number "001"; gene_id "ENSG00000223972"
1       dexseq_prepare_annotation.py    exonic_part     11872   11873   .       +       .       transcripts "ENST00000456328+ENST00000515242"; exonic_part_number "002"; gene_id "ENSG00000223972"
1       dexseq_prepare_annotation.py    exonic_part     11874   12009   .       +       .       transcripts "ENST00000456328+ENST00000515242+ENST00000518655"; exonic_part_number "003"; gene_id "ENSG00000223972"
1       dexseq_prepare_annotation.py    exonic_part     12010   12057   .       +       .       transcripts "ENST00000456328+ENST00000515242+ENST00000450305+ENST00000518655"; exonic_part_number "004"; gene_id "ENSG00000223972"
1       dexseq_prepare_annotation.py    exonic_part     12058   12178   .       +       .       transcripts "ENST00000456328+ENST00000515242+ENST00000518655"; exonic_part_number "005"; gene_id "ENSG00000223972"
1       dexseq_prepare_annotation.py    exonic_part     12179   12227   .       +       .       transcripts "ENST00000456328+ENST00000515242+ENST00000450305+ENST00000518655"; exonic_part_number "006"; gene_id "ENSG00000223972"
1       dexseq_prepare_annotation.py    exonic_part     12595   12612   .       +       .       transcripts "ENST00000518655"; exonic_part_number "007"; gene_id "ENSG00000223972"
1       dexseq_prepare_annotation.py    exonic_part     12613   12697   .       +       .       transcripts "ENST00000456328+ENST00000515242+ENST00000450305+ENST00000518655"; exonic_part_number "008"; gene_id "ENSG00000223972"
1       dexseq_prepare_annotation.py    exonic_part     12698   12721   .       +       .       transcripts "ENST00000456328+ENST00000515242+ENST00000518655"; exonic_part_number "009"; gene_id "ENSG00000223972"
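The three tracebacks so far fail at different stages of reading a single line: splitting it into tab-separated columns, parsing the attribute column, and setting the strand. A minimal sketch for locating the first line that would trip such a reader, using only the standard library. `first_bad_gff_line` is a hypothetical helper, not part of HTSeq or DEXSeq, and its checks are a simplification (and slightly stricter than HTSeq, which splits on at most 8 tabs):

```python
def first_bad_gff_line(lines):
    """Return (line_number, reason) for the first suspicious line, or None.

    Hypothetical helper for illustration; it does not reproduce HTSeq's parser.
    """
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) != 9:
            return number, "expected 9 tab-separated columns, found %d" % len(fields)
        start, end, strand = fields[3], fields[4], fields[6]
        if not (start.isdigit() and end.isdigit()):
            return number, "start/end are not integers"
        if strand not in ("+", "-", "."):
            return number, "invalid strand %r" % strand
    return None
```

It could be run on whichever file the script receives as its annotation argument, for example `first_bad_gff_line(open("dexseq.gtf"))`. A result pointing at line 1 would suggest that the file being read is not the flattened annotation at all.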

Posted 8.3 years ago by lcscs12345

score 0 · Answer 1 · 2016-01-17

Alejandro Reyes ★ 1.9k

@alejandro-reyes-5124

Novartis Institutes for BioMedical Rese…

Hi again,

Strange, I could not reproduce the error message. This is what I did:

wget ftp://ftp.ensembl.org/pub/release-83/gtf/mus_musculus/Mus_musculus.GRCm38.83.gtf.gz
gunzip Mus_musculus.GRCm38.83.gtf.gz
python /g/huber/users/reyes/Rpcks/branches/DEXSeq/inst/python_scripts/dexseq_prepare_annotation.py -r no Mus_musculus.GRCm38.83.gtf Mus_musculus.GRCm38.83.DEXSeq.gtf

And then in python, I ran the part of the script that uses the GFF reader:

>>> import HTSeq
>>> gff_file = "Mus_musculus.GRCm38.83.DEXSeq.gtf"
>>> features = HTSeq.GenomicArrayOfSets( "auto", stranded=True )
>>> for f in  HTSeq.GFF_Reader( gff_file ):
...    if f.type == "exonic_part":
...       f.name = f.attr['gene_id'] + ":" + f.attr['exonic_part_number']
...       features[f.iv] += f
...
>>>

But I did not get an error message. Could you include the code that you are using?
Alejandro

Posted 8.3 years ago by Alejandro Reyes

The script runs on my local Ubuntu machine but not on a RedHat server. On the server, I've tried ActivePython 2.7 + HTSeq 0.6.1 (original post) and Python 2.6 + HTSeq 0.5.4 (errors below). The GFF reader itself works fine when run interactively, however.

$ wget ftp://ftp.ensembl.org/pub/release-83/gtf/mus_musculus/Mus_musculus.GRCm38.83.gtf.gz
$ gunzip Mus_musculus.GRCm38.83.gtf.gz
$ python ~/R/x86_64-redhat-linux-gnu-library/3.2/DEXSeq/python_scripts/dexseq_prepare_annotation.py -r no Mus_musculus.GRCm38.83.gtf Mus_musculus.GRCm38.83.dexseq.gtf
$ python ~/R/x86_64-redhat-linux-gnu-library/3.2/DEXSeq/python_scripts/dexseq_count.py -p no ~/doc/mouse/nih3t3/tophat_out/accepted_hits.sam Mus_musculus.GRCm38.83.dexseq.gtf out
Traceback (most recent call last):
  File "/Network/Servers/biocldap.otago.ac.nz/Volumes/BiochemXsan/student_users/chunshenlim/R/x86_64-redhat-linux-gnu-library/3.2/DEXSeq/python_scripts/dexseq_count.py", line 94, in <module>
    for f in  HTSeq.GFF_Reader( gff_file ):
  File "/usr/lib64/python2.6/site-packages/HTSeq-0.5.4p5-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 221, in __iter__
    ( attr, name ) = parse_GFF_attribute_string( attributeStr, True )
  File "/usr/lib64/python2.6/site-packages/HTSeq-0.5.4p5-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 177, in parse_GFF_attribute_string
    raise ValueError, "Failure parsing GFF attribute line"
ValueError: Failure parsing GFF attribute line

$ python
>>> import HTSeq
>>> gff_file = "Mus_musculus.GRCm38.83.dexseq.gtf"
>>> features = HTSeq.GenomicArrayOfSets( "auto", stranded=True )
>>> for f in  HTSeq.GFF_Reader( gff_file ):
...     if f.type == "exonic_part":
...         f.name = f.attr['gene_id'] + ":" + f.attr['exonic_part_number']
...         features[f.iv] += f
...
>>> quit()
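
Since the same flattened file parses cleanly in the interactive session, one thing worth ruling out is that the script is being handed a different file as its annotation. The DEXSeq vignette calls the script with the flattened annotation first and the alignment file second (`dexseq_count.py [options] <flattened_gff_file> <alignment_file> <output_file>`), whereas the commands in this thread list the SAM/BAM file first. A rough sketch of a format check on the first line of each file; `guess_format` is a hypothetical helper and only a heuristic, not something provided by HTSeq or DEXSeq:

```python
def guess_format(line):
    """Guess 'sam', 'gff', or 'unknown' from one text line (heuristic only)."""
    fields = line.rstrip("\n").split("\t")
    if line.startswith("@") and len(fields[0]) == 3:
        return "sam"  # SAM header lines start with a tag such as @HD, @SQ, @PG
    if len(fields) >= 11 and fields[1].isdigit() and fields[3].isdigit():
        return "sam"  # alignment record: FLAG and POS columns are integers
    if len(fields) == 9 and fields[3].isdigit() and fields[4].isdigit():
        return "gff"  # nine columns with integer start and end
    return "unknown"
```

Reading the first line of the file given as the first positional argument and passing it to `guess_format` would show whether the script is trying to parse alignments as annotation records.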

Posted 8.3 years ago by lcscs12345