Hi,
I have aligned RNA-Seq data from human samples to GRCh37. The flattened annotation file was created from the same database using dexseq_prepare_annotation.py, and the counts were generated using the dexseq_count.py script. However, the total number of exons for most genes is greater than what has been reported in the literature. For example, SNCA, which is reported to have 5 exons, shows 26 exons in the analysis and in the flattened file. I have previously worked with the mouse genome using GRCm38 and did not have any issues. What could be going wrong?
Thanks in advance for your suggestions,
Aditi
Hey,
Have a look at the DEXSeq vignette. The dexseq_prepare_annotation.py script flattens all isoforms of a gene into a single representation of that gene. During this process it splits overlapping exons into non-overlapping exonic bins. For example, if a gene has two isoforms, and the exon spans coordinates 15 to 50 in isoform A but 15 to 25 in isoform B, it will create two bins: exonic_001:15-25 and exonic_002:26-50.
Most likely that's why you see 26 exonic bins for SNCA rather than the 5 exons you expected.
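To make the binning idea concrete, here is a rough Python sketch of how overlapping exons could be cut into disjoint bins. This is not the actual code from dexseq_prepare_annotation.py; the function name flatten_exons and the exonic_NNN label format are just for illustration (the labels follow the example above), and coordinates are assumed to be inclusive.

```python
def flatten_exons(exons):
    """Split possibly overlapping exons into disjoint bins.

    exons: list of (start, end) tuples with inclusive coordinates,
    pooled from all isoforms of one gene (hypothetical input format).
    """
    # Every exon start, and every position just past an exon end,
    # is a point where the set of covering exons can change.
    points = sorted({s for s, _ in exons} | {e + 1 for _, e in exons})

    bins = []
    for left, right in zip(points, points[1:]):
        seg_start, seg_end = left, right - 1
        # Keep the segment only if at least one exon covers it,
        # so intronic gaps between exons are dropped.
        if any(s <= seg_start and seg_end <= e for s, e in exons):
            bins.append((seg_start, seg_end))
    return bins


# Isoform A exon: 15-50, isoform B exon: 15-25
bins = flatten_exons([(15, 50), (15, 25)])
for i, (start, end) in enumerate(bins, start=1):
    print(f"exonic_{i:03d}:{start}-{end}")
# exonic_001:15-25
# exonic_002:26-50
```

So a gene with many isoforms whose exon boundaries differ slightly will end up with many more bins than annotated exons.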