To Whom It May Concern,
I am currently working on a project examining differential expression in RNA-seq data with DEXSeq, and I have encountered a few errors when using the dexseq_prepare_annotation.py script to process my .gtf file. My annotation file was generated by NCBI, and I initially had to change 'Parent' to 'gene_id' in the attribute column (please see below). However, I am now getting another error regarding 'transcript_id', and I don't understand why. Any insight into how I can modify the format to meet the requirements for proper .gtf input would be greatly appreciated.
Thank you,
A. Romney
error:
(env2)arom2:~:1015 > python /vol/apps/user/stow/R-3.2.1/lib64/R/library/DEXSeq/python_scripts/dexseq_prepare_annotation.py fhet_prep.gtf Fhet_final.gff
Traceback (most recent call last):
  File "/vol/apps/user/stow/R-3.2.1/lib64/R/library/DEXSeq/python_scripts/dexseq_prepare_annotation.py", line 55, in <module>
    exons[f.iv] += ( f.attr['gene_id'], f.attr['transcript_id'] )
KeyError: 'transcript_id'
Here is a sub-section of the gtf file I am using to show you how the gene_id and transcript_id are identified.
NW_012224401.1 Gnomon exon 71785 71884 . + . ID=id44;gene_id=rna2;Dbxref=GeneID:105915271,Genbank:XM_012849362.1;gbkey=mRNA;gene=csf1r;product=colony stimulating factor 1 receptor%2C transcript variant X1;transcript_id=XM_012849362.1
NW_012224401.1 Gnomon exon 72804 72915 . + . ID=id45;gene_id=rna2;Dbxref=GeneID:105915271,Genbank:XM_012849362.1;gbkey=mRNA;gene=csf1r;product=colony stimulating factor 1 receptor%2C transcript variant X1;transcript_id=XM_012849362.1
NW_012224401.1 Gnomon exon 76564 76791 . + . ID=id46;gene_id=rna2;Dbxref=GeneID:105915271,Genbank:XM_012849362.1;gbkey=mRNA;gene=csf1r;product=colony stimulating factor 1 receptor%2C transcript variant X1;transcript_id=XM_012849362.1
NW_012224401.1 Gnomon mRNA 62183 76791 . + . ID=rna3;gene_id=gene2;Dbxref=GeneID:105915271,Genbank:XM_012849437.1;Name=XM_012849437.1;gbkey=mRNA;gene=csf1r;product=colony stimulating factor 1 receptor%2C transcript variant X2;transcript_id=XM_012849437.1
NW_012224401.1 Gnomon exon 62183 62306 . + . ID=id47;gene_id=rna3;Dbxref=GeneID:105915271,Genbank:XM_012849437.1;gbkey=mRNA;gene=csf1r;product=colony stimulating factor 1 receptor%2C transcript variant X2;transcript_id=XM_012849437.1
NW_012224401.1 Gnomon exon 62565 62634 . + . ID=id48;gene_id=rna3;Dbxref=GeneID:105915271,Genbank:XM_012849437.1;gbkey=mRNA;gene=csf1r;product=colony stimulating factor 1 receptor%2C transcript variant X2;transcript_id=XM_012849437.1
NW_012224401.1 Gnomon exon 63886 64173 . + . ID=id49;gene_id=rna3;Dbxref=GeneID:105915271,Genbank:XM_012849437.1;gbkey=mRNA;gene=csf1r;product=colony stimulating factor 1 receptor%2C transcript variant X2;transcript_id=XM_012849437.1
NW_012224401.1 Gnomon exon 64260 64547 . + . ID=id50;gene_id=rna3;Dbxref=GeneID:105915271,Genbank:XM_012849437.1;gbkey=mRNA;gene=csf1r;product=colony stimulating factor 1 receptor%2C transcript variant X2;transcript_id=XM_012849437.1
NW_012224401.1 Gnomon exon 64630 64766 . + . ID=id51;gene_id=rna3;Dbxref=GeneID:105915271,Genbank:XM_012849437.1;gbkey=mRNA;gene=csf1r;product=colony stimulating factor 1 receptor%2C transcript variant X2;transcript_id=XM_012849437.1
NW_012224401.1 Gnomon exon 64845 65010 . + . ID=id52;gene_id=rna3;Dbxref=GeneID:105915271,Genbank:XM_012849437.1;gbkey=mRNA;gene=csf1r;product=colony stimulating factor 1 receptor%2C transcript variant X2;transcript_id=XM_012849437.1
Hi Romney,
I am not sure exactly what the problem with your annotation file is. However, I noticed that it has many fields, some of them with unusual characters and spaces in the attribute values. The 'attribute=value' style is GFF format; GTF format usually uses 'attribute "value";' instead, with a space rather than an equals sign between the name and the value.
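Since the KeyError means that at least one exon record reached the script without a 'transcript_id' attribute, one way to narrow things down is to list the exon lines that lack it. The following is only a rough sketch, not part of DEXSeq or HTSeq: the helper names parse_attributes and exons_missing are made up for illustration, and it assumes a tab-delimited file with no literal ';' inside attribute values.

```python
import re


def parse_attributes(attr_field):
    """Parse column 9 written as either key=value or key "value" pairs.

    Assumption: values contain no literal ';' (GFF3 escapes it as %3B).
    """
    attrs = {}
    for part in attr_field.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        # The key ends at the first whitespace or '='; the rest is the value.
        match = re.match(r'([^\s=]+)[\s=]+(.*)', part)
        if match:
            attrs[match.group(1)] = match.group(2).strip('"')
    return attrs


def exons_missing(lines, required=("gene_id", "transcript_id")):
    """Return (line_number, missing_keys) for each exon record lacking a key."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records matter for this check
        attrs = parse_attributes(fields[8])
        missing = [key for key in required if key not in attrs]
        if missing:
            problems.append((number, missing))
    return problems


# Possible usage on the file from the question:
# with open("fhet_prep.gtf") as handle:
#     for number, missing in exons_missing(handle):
#         print(number, missing)
```

If this reports any lines, those records would be the first place to look.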
I manually modified your file into the form below (this could of course be done with a small script), and the DEXSeq script worked as expected!
NW_012224401.1 Gnomon exon 71785 71884 . + . gene_id rna2; transcript_id XM_012849362.1
NW_012224401.1 Gnomon exon 72804 72915 . + . gene_id rna2; transcript_id XM_012849362.1
NW_012224401.1 Gnomon exon 76564 76791 . + . gene_id rna2; transcript_id XM_012849362.1
NW_012224401.1 Gnomon mRNA 62183 76791 . + . gene_id gene2; transcript_id XM_012849437.1
NW_012224401.1 Gnomon exon 62183 62306 . + . gene_id rna3; transcript_id XM_012849437.1
NW_012224401.1 Gnomon exon 62565 62634 . + . gene_id rna3; transcript_id XM_012849437.1
NW_012224401.1 Gnomon exon 63886 64173 . + . gene_id rna3; transcript_id XM_012849437.1
NW_012224401.1 Gnomon exon 64260 64547 . + . gene_id rna3; transcript_id XM_012849437.1
NW_012224401.1 Gnomon exon 64630 64766 . + . gene_id rna3; transcript_id XM_012849437.1
NW_012224401.1 Gnomon exon 64845 65010 . + . gene_id rna3; transcript_id XM_012849437.1
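A small script for this kind of conversion could look roughly like the sketch below. It is not the exact procedure used for the lines above: the function names to_gtf_attributes and convert_line are hypothetical, it assumes tab-delimited input with 'key=value' attributes, and it writes the values in double quotes as GTF normally does, which I would expect the DEXSeq script to accept as well.

```python
def to_gtf_attributes(attr_field, keep=("gene_id", "transcript_id")):
    """Rewrite 'key=value;...' attributes as 'key "value";', keeping selected keys."""
    pairs = {}
    for part in attr_field.strip().split(";"):
        # Split on the first '=' only, so values such as Dbxref stay intact.
        key, sep, value = part.strip().partition("=")
        if sep:
            pairs[key] = value
    return " ".join('{} "{}";'.format(key, pairs[key]) for key in keep if key in pairs)


def convert_line(line, keep=("gene_id", "transcript_id")):
    """Convert one annotation line; comments and malformed lines pass through."""
    if line.startswith("#") or not line.strip():
        return line
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return line
    fields[8] = to_gtf_attributes(fields[8], keep)
    return "\t".join(fields) + "\n"


# Possible usage (the output file name is only an example):
# with open("fhet_prep.gtf") as src, open("fhet_prep_clean.gtf", "w") as dst:
#     for line in src:
#         dst.write(convert_line(line))
```

Dropping every attribute except gene_id and transcript_id keeps the file close to the shape shown above; other keys can be added to keep if they are needed downstream.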