Is there a way to ignore non-coding regions in a DEXSeq analysis? I am getting a lot of UTRs and it would be nice to eliminate those. Can this be done in DEXSeq itself, do I need to modify the GTF file somehow, or is there some other solution?
How would you modify the GTF to remove non-coding parts? Since the GTF file doesn't differentiate between coding and non-coding exons, is the best option to remove exons and UTRs and rename CDSs to exons? Since that seems like a somewhat questionable approach, is there a better solution?
Probably an easier option would be to change the dexseq_prepare_annotation.py script here:
exons = HTSeq.GenomicArrayOfSets( "auto", stranded=True )
for f in HTSeq.GFF_Reader( gtf_file ):
    if f.type != "exon":
        continue
    f.attr['gene_id'] = f.attr['gene_id'].replace( ":", "_" )
    exons[f.iv] += ( f.attr['gene_id'], f.attr['transcript_id'] )
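Replace the feature-type check (the line that tests for "exon") with:
    if f.type != "CDS":
so that only CDS records are collected and everything else, including UTRs, is skipped.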
This should work, but I have not tested it!
Alejandro
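For anyone who would rather leave the script untouched, the GTF-rewriting route raised in the question could look roughly like the sketch below. It is untested against DEXSeq and assumes a standard 9-column, tab-separated GTF with the feature type in the third column. The function names cds_to_exon_lines and filter_gtf are made up for illustration and are not part of DEXSeq or HTSeq.

```python
def cds_to_exon_lines(lines):
    """Yield GTF lines, keeping only CDS records and relabelling them as exon.

    Assumption: each record has 9 tab-separated columns, feature type in column 3.
    """
    for line in lines:
        if line.startswith("#"):
            # Pass header/comment lines through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "CDS":
            # Drop exons, UTRs, start/stop codons and malformed lines.
            continue
        # Relabel so a script that only looks for "exon" picks it up.
        # The frame column (column 8) is left as it was.
        fields[2] = "exon"
        yield "\t".join(fields) + "\n"


def filter_gtf(in_path, out_path):
    """Write a CDS-only copy of in_path to out_path (both paths are hypothetical)."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(cds_to_exon_lines(src))
```

The filtered file would then be passed to the unmodified dexseq_prepare_annotation.py in place of the original GTF. Compared with editing the script, this keeps the DEXSeq code stock but adds a second annotation file to keep track of.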