Bulak Arpat
Stephen Turner <vustephen at ...> writes:
>
> Alejandro, Simon, Wolfgang, et al.:
>
> I'm trying to use the dexseq_prepare_annotation.py script to parse the
> UCSC hg18 genes.gtf GTF file included with the Illumina igenomes
> packages (http://tophat.cbcb.umd.edu/igenomes.html). I'm getting an
> error:
>
> Traceback (most recent call last):
>   File "/home/sdt5z/bin/dexseq_prepare_annotation.py", line 93, in <module>
>     raise ValueError, "Same name found on two chromosomes: %s, %s" % ( str(l[i]), str(l[i+1]) )
> ValueError: Same name found on two chromosomes: <GenomicFeature: exonic_part 'CFB' at chr6_qbl_hap2: 3167392 -> 3167602 (strand '+')>, <GenomicFeature: exonic_part 'CFB' at chr6_cox_hap1: 3359983 -> 3360325 (strand '+')>
>
> I'm guessing this is because the same gene name is found in two
> separate places. I'm not entirely sure what these two chromosomal
> segments refer to, but I removed them from the GTF file and the python
> script threw another error:
>
> Traceback (most recent call last):
>   File "/home/sdt5z/bin/dexseq_prepare_annotation.py", line 91, in <module>
>     assert l[i].iv.end <= l[i+1].iv.start, str(l[i+1]) + " starts too early"
> AssertionError: <GenomicFeature: exonic_part 'HIST2H3C+HIST2H3A' at chr1: 148079388 -> 148078883 (strand '-')> starts too early
>
> I'm really unsure what to make of this or how to fix it. The script
> works without issues with the Ensembl GTF. Any help would be greatly
> appreciated.
>
> Stephen
>
> -----------------------------------------
> Stephen D. Turner, Ph.D.
> Bioinformatics Core Director
> University of Virginia School of Medicine
> bioinformatics.virginia.edu
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at ...
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
Dear Stephen,

I had the same problem when I tried dexseq_prepare_annotation.py with
the mm9 or mm10 GTF files from the Illumina igenomes collection. As you
mentioned, it worked well with an Ensembl version. Going through the
script and the data files, I realized that all the problems trace back
to one root cause: Ensembl has a unique gene_id for each locus, whereas
the other files have a gene_id generated from the gene_name attribute.
As a result, the same gene_id is repeated for some loci, because there
are multiple (cis/trans) coding regions. As a quick fix, I made the
following modification to the script (around line 28):
f.attr['gene_id'] = f.iv.chrom + '_' + f.attr['gene_id'].replace( ":", "_" ) + f.iv.strand
This generates a 'unique' gene_id for the script by combining the
chromosome name, gene name and strand information. As I said, it is a
quick fix, but so far it seems to work without problems. I hope it is
of use to you.
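
To check whether a given GTF file is affected before patching the script, something along these lines may help. It is only a minimal standalone sketch: it reads plain GTF lines instead of HTSeq feature objects, the helper names (unique_gene_id, loci_per_gene_id) are hypothetical and not part of DEXSeq or HTSeq, and the two example records are made up (only the gene name and the haplotype chromosome names are taken from the error message above).

```python
import re
from collections import defaultdict

# Hypothetical helpers for illustration; not part of DEXSeq or HTSeq.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def unique_gene_id(chrom, gene_id, strand):
    """Combine chromosome, gene name and strand, as in the one-line patch."""
    return chrom + "_" + gene_id.replace(":", "_") + strand


def loci_per_gene_id(gtf_lines):
    """Map each gene_id to the set of (chromosome, strand) pairs it occurs on."""
    loci = defaultdict(set)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = GENE_ID_RE.search(fields[8])
        if match:
            loci[match.group(1)].add((fields[0], fields[6]))
    return loci


# Made-up records: the same gene name on two haplotype chromosomes.
example = [
    'chr6_qbl_hap2\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "CFB"; transcript_id "t1";',
    'chr6_cox_hap1\ttoy\texon\t300\t400\t.\t+\t.\tgene_id "CFB"; transcript_id "t2";',
]

loci = loci_per_gene_id(example)
# gene_ids that occur on more than one chromosome/strand combination
ambiguous = sorted(g for g, where in loci.items() if len(where) > 1)
print(ambiguous)
print(unique_gene_id("chr6_qbl_hap2", "CFB", "+"))
```

Note that two loci sharing a name on the same chromosome and the same strand would still end up with the same combined gene_id, which is why 'unique' is in quotes.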
Best,
Bulak Arpat, PhD
Bioinformatician
Center for Integrative Genomics
University of Lausanne