Question

dexseq zero length exons

0

Entering edit mode

Jose M Garcia Manteiga ▴ 310

@jose-m-garcia-manteiga-6046

Last seen 7.7 years ago

Italy

Dear Alejandro, I have just analysed my results using DEXSeq 1.9. I had prepared the gff annotation using the version of dexseq_prepare.annotation.py in DEXSeq 1.9 that contained the -r parameter but counted with an old version, as you advised to me, of the dexseq_count.py since the newest included a check for a NH tag, which my bam files (SOAPSplice) did not contain. Everything worked fine but when checking some of the exons I realised that there are some, and some amongst the DEU results too, that showed length equal zero. I have checked and there are a number of those amongst the GFF. I have checked in the original gtf file I got from Ensembl (v72 hg19) for one of those genes and I do not understand why there should be one of length 0. Here you are a histogram of all exons showing very short exons and 0 length ones and an example of the gtf and gff file prepared by dexseq_prepare_annotation.py. Is it normal? How can there be exons with the same start and end in the gff file? How can they have reads assigned and even pass the tests? # exons length exon.length<-fData(ecs)$end - fData(ecs)$start hist(log2(exon.length),breaks=100) The gff file 'grepped' for one example: $ grep 'ENSG00000001460' Homo_sapiens.ensembl72.DEXSeq.gff chr1 dexseq_prepare_annotation.py aggregate_gene 24683489 24743424 . - . gene_id "ENSG00000001460" *chr1 dexseq_prepare_annotation.py exonic_part 24683489 24683489 . - . transcripts "ENST00000374409"; exonic_part_number "001"; gene_id "ENSG00000001460"* chr1 dexseq_prepare_annotation.py exonic_part 24683490 24683494 . - . transcripts "ENST00000440416+ENST00000374409"; exonic_part_number "002"; gene_id "ENSG00000001460" chr1 dexseq_prepare_annotation.py exonic_part 24683495 24683526 . - . transcripts "ENST00000468303+ENST00000440416+ENST00000003583+ENST00000374409"; exonic_part_number "003"; gene_id "ENSG00000001460" chr1 dexseq_prepare_annotation.py exonic_part 24683527 24685109 . - . transcripts "ENST00000468303+ENST00000440416+ENST00000374409+ENST00000003583+ENST0 0000337248"; exonic_part_number "004"; gene_id "ENSG00000001460" chr1 dexseq_prepare_annotation.py exonic_part 24687341 24687531 . - . transcripts "ENST00000468303+ENST00000440416+ENST00000374409+ENST00000003583+ENST0 0000337248"; exonic_part_number "005"; gene_id "ENSG00000001460" chr1 dexseq_prepare_annotation.py exonic_part 24695194 24695532 . - . transcripts "ENST00000497384"; exonic_part_number "006"; gene_id "ENSG00000001460" Now the gtf used as input for dexseq_prepare_annotation.py for that gene searching for the start and end coordinate: grep 'ENSG00000001460' hg19.ensGene_withGeneName.gtf | grep '24683489' chr1 hg19_ensGene exon *24683489 24685109* 0.000000 - . gene_id "ENSG00000001460"; transcript_id "*ENST00000374409*"; gene_name "STPG1"; It seems that the gtf is OK and there is an exon that starts at that position but nothing at the end. I checked also the ecs object coming from an older analysis of the same dataset using the two py scripts from the same release that gave me a good results but I wanted to repeat to include the -r parameter, and it also contains a lot of these zero length exons. Is it normal? I have further checked within the Genome Browser and it seems that there is another Transcript with start at +1 with respect to that. If this is so it means that zero-length in the end-start difference from the gff file that I found in the fData of the ecs object is indeed 1bp length and just a matter of 0/1 conventions. If this is so, I would be more relieved, but please confirm. In anycase, may I ask how reliable counts of ~75bp read can be to an exon made from 1 pb difference in the start of two transcripts in the annotation? how can they even pass the filter of p-value for DEU analysis? Shouldn't it be a sort of cut off for this 'pseudo'exons? Thanks again for your help Best Jose -- Jose M. Garcia Manteiga PhD Research Assistant - Data Analysis in Functional Genomics Center for Translational Genomics and BioInformatics Dibit2-Basilica, 3A3 San Raffaele Scientific Institute Via Olgettina 58, 20132 Milano (MI), Italy Tel: +39-02-2643-4751 e-mail: garciamanteiga.josemanuel at hsr.it -------------- next part -------------- A non-text attachment was scrubbed... Name: Exon_length_1.9.pdf Type: application/pdf Size: 5196 bytes Desc: not available URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20140310="" b3021f2f="" attachment.pdf="">

Annotation DEXSeq Annotation DEXSeq • 1.3k views

ADD COMMENT • link updated 10.8 years ago by Alejandro Reyes ★ 1.9k • written 10.8 years ago by Jose M Garcia Manteiga ▴ 310

score 0 · Answer 1 · 2014-03-10

Dear Jose, Thanks a lot for your detailed report! The cases that you mention are "normal" cases that are originated from our exon bining, these are instances where the annotation file contains cases like the one you just describe (having a transcript initiation site or alternative splice site with single base pair differences). These are exon bins covering a single base pair (the length of the exon bins is fData(ecs)$end - fData(ecs)$start + 1). The reason you can get counts and can achieve significance for DEU for small exon bins is because the reads mapping to multiple exon bins are counted in all the exon bins, so these are cases where the read maps to this small exon bins and many other regions of the genome. I think having differences in these exons can still biologically informative despite its length, but it always depends on the analysis you are doing. Say, if you are interested in large cassette skipping events you could get rid of this small bins and reduce the number of tests. Best regards, Alejandro > Dear Alejandro, > I have just analysed my results using DEXSeq 1.9. > I had prepared the gff annotation using the version of > dexseq_prepare.annotation.py <http: dexseq_prepare.annotation.py=""> in > DEXSeq 1.9 that contained the -r parameter but counted with an old > version, as you advised to me, of the dexseq_count.py since the newest > included a check for a NH tag, which my bam files (SOAPSplice) did not > contain. Everything worked fine but when checking some of the exons > I realised that there are some, and some amongst the DEU results too, > that showed length equal zero. I have checked and there are a number > of those amongst the GFF. I have checked in the original gtf file I > got from Ensembl (v72 hg19) for one of those genes and I do not > understand why there should be one of length 0. > Here you are a histogram of all exons showing very short exons and 0 > length ones and an example of the gtf and gff file prepared by > dexseq_prepare_annotation.py. > Is it normal? How can there be exons with the same start and end in > the gff file? How can they have reads assigned and even pass the tests? > > # exons length > exon.length<-fData(ecs)$end - fData(ecs)$start > hist(log2(exon.length),breaks=100) > > The gff file 'grepped' for one example: > > $ grep 'ENSG00000001460' Homo_sapiens.ensembl72.DEXSeq.gff > > chr1dexseq_prepare_annotation.pyaggregate_gene2468348924743424.-.gen e_id > "ENSG00000001460" > > *chr1dexseq_prepare_annotation.pyexonic_part_24683489__24683489_.-.t ranscripts > "ENST00000374409"; exonic_part_number "001"; gene_id "ENSG00000001460"* > > chr1dexseq_prepare_annotation.pyexonic_part2468349024683494.-.transc ripts > "ENST00000440416+ENST00000374409"; exonic_part_number "002"; gene_id > "ENSG00000001460" > > chr1dexseq_prepare_annotation.pyexonic_part2468349524683526.-.transc ripts > "ENST00000468303+ENST00000440416+ENST00000003583+ENST00000374409"; > exonic_part_number "003"; gene_id "ENSG00000001460" > > chr1dexseq_prepare_annotation.pyexonic_part2468352724685109.-.transc ripts > "ENST00000468303+ENST00000440416+ENST00000374409+ENST00000003583+ENS T00000337248"; > exonic_part_number "004"; gene_id "ENSG00000001460" > > chr1dexseq_prepare_annotation.pyexonic_part2468734124687531.-.transc ripts > "ENST00000468303+ENST00000440416+ENST00000374409+ENST00000003583+ENS T00000337248"; > exonic_part_number "005"; gene_id "ENSG00000001460" > > chr1dexseq_prepare_annotation.pyexonic_part2469519424695532.-.transc ripts > "ENST00000497384"; exonic_part_number "006"; gene_id "ENSG00000001460" > > Now the gtf used as input for dexseq_prepare_annotation.py for that > gene searching for the start and end coordinate: > > grep 'ENSG00000001460' hg19.ensGene_withGeneName.gtf | grep '24683489' > > chr1hg19_ensGeneexon*2468348924685109*0.000000-.gene_id > "ENSG00000001460"; transcript_id "*ENST00000374409*"; gene_name "STPG1"; > > It seems that the gtf is OK and there is an exon that starts at that > position but nothing at the end. > > I checked also the ecs object coming from an older analysis of the > same dataset using the two py scripts from the same release that gave > me a good results but I wanted to repeat to include the -r parameter, > and it also contains a lot of these zero length exons. > > Is it normal? > > I have further checked within the Genome Browser and it seems that > there is another Transcript with start at +1 with respect to that. > > If this is so it means that zero-length in the end-start difference > from the gff file that I found in the fData of the ecs object is > indeed 1bp length and just a matter of 0/1 conventions. If this is so, > I would be more relieved, but please confirm. > > In anycase, may I ask how reliable counts of ~75bp read can be to an > exon made from 1 pb difference in the start of two transcripts in the > annotation? how can they even pass the filter of p-value for DEU > analysis? Shouldn't it be a sort of cut off for this 'pseudo'exons? > > > Thanks again for your help > > Best > > Jose > > > > > > > > > > -- > Jose M. Garcia Manteiga PhD > Research Assistant - Data Analysis in Functional Genomics > Center for Translational Genomics and BioInformatics > Dibit2-Basilica, 3A3 > San Raffaele Scientific Institute > Via Olgettina 58, 20132 Milano (MI), Italy > > Tel: +39-02-2643-4751 > e-mail: garciamanteiga.josemanuel at hsr.it > <mailto:garciamanteiga.josemanuel at="" hsr.it="">