DEXSeq: problem with dexseq_prepare_annotation.py
Bulak Arpat ▴ 10
Stephen Turner <vustephen at ...> writes:

> Alejandro, Simon, Wolfgang, et al.:
>
> I'm trying to use the dexseq_prepare_annotation.py script to parse the
> UCSC hg18 genes.gtf GTF file included with the Illumina igenomes
> packages (http://tophat.cbcb.umd.edu/igenomes.html). I'm getting an
> error:
>
> Traceback (most recent call last):
>   File "/home/sdt5z/bin/dexseq_prepare_annotation.py", line 93, in <module>
>     raise ValueError, "Same name found on two chromosomes: %s, %s" % ( str(l[i]), str(l[i+1]) )
> ValueError: Same name found on two chromosomes: <GenomicFeature: exonic_part 'CFB' at chr6_qbl_hap2: 3167392 -> 3167602 (strand '+')>, <GenomicFeature: exonic_part 'CFB' at chr6_cox_hap1: 3359983 -> 3360325 (strand '+')>
>
> I'm guessing this is because the same gene name is found in two
> separate places. I'm not entirely sure what these two chromosomal
> segments refer to, but I removed them from the GTF file and the python
> script threw another error:
>
> Traceback (most recent call last):
>   File "/home/sdt5z/bin/dexseq_prepare_annotation.py", line 91, in <module>
>     assert l[i].iv.end <= l[i+1].iv.start, str(l[i+1]) + " starts too early"
> AssertionError: <GenomicFeature: exonic_part 'HIST2H3C+HIST2H3A' at chr1: 148079388 -> 148078883 (strand '-')> starts too early
>
> I'm really unsure what to make of this or how to fix it. The script
> works without issues with the Ensembl GTF. Any help would be greatly
> appreciated.
>
> Stephen
>
> -----------------------------------------
> Stephen D. Turner, Ph.D.
> Bioinformatics Core Director
> University of Virginia School of Medicine
> bioinformatics.virginia.edu

Dear Stephen,

I had the same problem when I tried dexseq_prepare_annotation.py with the mm9 and mm10 GTF files from the Illumina igenomes collection. And, as you mentioned, it worked well with an Ensembl GTF.

Going through the script and the data files, I realized that both problems go back to one root cause: Ensembl has a unique gene_id for each locus, whereas the other files have a gene_id generated from the gene_name attribute. This duplicates the gene_id across several loci wherever a gene name has multiple (cis/trans) coding regions, so the script sees one "gene" on two chromosomes or with exonic parts out of order.

For a quick fix I made the following modification to the script (around line 28):

    f.attr['gene_id'] = f.iv.chrom + '_' + f.attr['gene_id'].replace( ":", "_" ) + f.iv.strand

This generates a 'unique' gene_id for the script by combining the chromosome name, gene name and strand. As I said, it is a quick fix, but it seems to work so far without problems.

I hope it might be of use for you.

Best,

Bulak Arpat, PhD
Bioinformatician
Center for Integrative Genomics
University of Lausanne
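
To illustrate the effect of the quick fix above without needing HTSeq or a GTF file, here is a minimal standalone sketch. The helper name make_unique_gene_id and the tuple layout are made up for this illustration and are not part of dexseq_prepare_annotation.py; the two records are the CFB features from the first traceback.

```python
# Hypothetical helper mirroring the one-line modification quoted above.
# It is NOT part of dexseq_prepare_annotation.py; in the real script the
# values come from f.iv.chrom, f.attr['gene_id'] and f.iv.strand.
def make_unique_gene_id(chrom, gene_id, strand):
    # Same expression as the quick fix: chromosome + '_' + name + strand,
    # with ':' replaced by '_' as the original script already does.
    return chrom + "_" + gene_id.replace(":", "_") + strand


# The two features from the first traceback share the gene name 'CFB'
# but sit on different haplotype chromosomes.
records = [
    ("chr6_qbl_hap2", "CFB", "+"),
    ("chr6_cox_hap1", "CFB", "+"),
]

unique_ids = [make_unique_gene_id(*r) for r in records]
print(unique_ids)  # ['chr6_qbl_hap2_CFB+', 'chr6_cox_hap1_CFB+']
```

With the chromosome and strand folded in, the two CFB loci no longer collide. Two copies of a gene name on the same chromosome and strand would still share an id, and the resulting ids no longer match the gene names in the GTF, so this should be treated as a workaround rather than a general solution.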