DEXSeq - too many exons in gene


@antonio-miguel-de-jesus-domingues-5182

Last seen 2.0 years ago

Germany

Hi Bioconductors,

I happened upon a funny thing in DEXSeq: a gene which appears to have more exons in the final DEXSeq output than the annotation suggests. The gene ENSMUSG00000027854 (screenshot from UCSC attached) appears to have 3 exons in a flattened gene model. However, the DEXSeq results list 13 exons (shown here as the output of htseq-count):

    grep ENSMUSG00000027854 htseq_count_out.txt
    ENSMUSG00000027854:001 0
    ENSMUSG00000027854:002 6
    ENSMUSG00000027854:003 18
    ENSMUSG00000027854:004 0
    ENSMUSG00000027854:005 0
    ENSMUSG00000027854:006 86
    ENSMUSG00000027854:007 0
    ENSMUSG00000027854:008 113
    ENSMUSG00000027854:009 52
    ENSMUSG00000027854:010 76
    ENSMUSG00000027854:011 0
    ENSMUSG00000027854:012 310
    ENSMUSG00000027854:013 554

This comes from the annotation created with:

    dexseq_prepare_annotation.py mm10_ensGene.gtf mm10_ensGene.gff

    grep ENSMUSG00000027854 ../../data/gtf/mm10_ensGene.gff
    chr3 mm10_ensGene.gtf aggregate_gene 102995728 103003914 . + . gene_id "ENSMUSG00000027854"
    chr3 mm10_ensGene.gtf exonic_part 102995728 102995729 . + . transcripts "ENSMUST00000029447"; exonic_part_number "001"; gene_id "ENSMUSG00000027854"
    chr3 mm10_ensGene.gtf exonic_part 102995730 102995794 . + . transcripts "ENSMUST00000029447+ENSMUST00000151065"; exonic_part_number "002"; gene_id "ENSMUSG00000027854"
    chr3 mm10_ensGene.gtf exonic_part 102995795 102995967 . + . transcripts "ENSMUST00000151065+ENSMUST00000029447+ENSMUST00000119450"; exonic_part_number "003"; gene_id "ENSMUSG00000027854"
    chr3 mm10_ensGene.gtf exonic_part 102995968 102996048 . + . transcripts "ENSMUST00000151065"; exonic_part_number "004"; gene_id "ENSMUSG00000027854"
    chr3 mm10_ensGene.gtf exonic_part 102996049 102996155 . + . transcripts "ENSMUST00000151065+ENSMUST00000137332"; exonic_part_number "005"; gene_id "ENSMUSG00000027854"
    chr3 mm10_ensGene.gtf exonic_part 102996156 102996261 . + . transcripts "ENSMUST00000029447+ENSMUST00000137332+ENSMUST00000151065"; exonic_part_number "006"; gene_id "ENSMUSG00000027854"
    chr3 mm10_ensGene.gtf exonic_part 102996262 102997242 . + . transcripts "ENSMUST00000151065"; exonic_part_number "007"; gene_id "ENSMUSG00000027854"
    chr3 mm10_ensGene.gtf exonic_part 102997243 102997351 . + . transcripts "ENSMUST00000029447+ENSMUST00000137332+ENSMUST00000151065"; exonic_part_number "008"; gene_id "ENSMUSG00000027854"
    chr3 mm10_ensGene.gtf exonic_part 102997352 102997385 . + . transcripts "ENSMUST00000029447+ENSMUST00000151065"; exonic_part_number "009"; gene_id "ENSMUSG00000027854"
    chr3 mm10_ensGene.gtf exonic_part 102998490 102998603 . + . transcripts "ENSMUST00000151065+ENSMUST00000029447+ENSMUST00000119450"; exonic_part_number "010"; gene_id "ENSMUSG00000027854"
    chr3 mm10_ensGene.gtf exonic_part 102998604 102999251 . + . transcripts "ENSMUST00000151065"; exonic_part_number "011"; gene_id "ENSMUSG00000027854"
    chr3 mm10_ensGene.gtf exonic_part 103001708 103002194 . + . transcripts "ENSMUST00000029447+ENSMUST00000119450"; exonic_part_number "012"; gene_id "ENSMUSG00000027854"
    chr3 mm10_ensGene.gtf exonic_part 103002195 103003914 . + . transcripts "ENSMUST00000029447"; exonic_part_number "013"; gene_id "ENSMUSG00000027854"

Exon 1 is only 1 base long (?) and exons 1 to 4 are contiguous. As far as I am aware, the DEXSeq model should have flattened all of these into one single "exon". Is this correct? Is the error coming from the GTF? (The gene annotation from the GTF is at the end of this message.)

This is especially concerning for me because I am interested in selecting the first and last exon of genes, using the exon ranking from DEXSeq, to analyze further.

Thanks,
António

    > sessionInfo()
    R version 2.15.2 (2012-10-26)
    Platform: x86_64-pc-linux-gnu (64-bit)

    locale:
    [1] C

    attached base packages:
    [1] grDevices datasets stats graphics utils methods base

    other attached packages:
     [1] DEXSeq_1.4.0 GenomicFeatures_1.10.2 GenomicRanges_1.10.5
     [4] IRanges_1.16.6 data.table_1.8.9 stringr_0.6.2
     [7] ggplot2_0.9.3.1 AnnotationDbi_1.20.2 Biobase_2.18.0
    [10] BiocGenerics_0.4.0

    loaded via a namespace (and not attached):
     [1] BSgenome_1.26.1 Biostrings_2.26.3 DBI_0.2-5 MASS_7.3-23
     [5] RColorBrewer_1.0-5 RCurl_1.95-4.1 RSQLite_0.11.2 Rsamtools_1.10.2
     [9] XML_3.98-1.1 biomaRt_2.14.0 bitops_1.0-6 colorspace_1.2-4
    [13] dichromat_2.0-0 digest_0.6.3 grid_2.15.2 gtable_0.1.2
    [17] hwriter_1.3 labeling_0.2 munsell_0.4.2 parallel_2.15.2
    [21] plyr_1.8 proto_0.3-10 reshape2_1.2.2 rtracklayer_1.18.1
    [25] scales_0.2.3 statmod_1.4.17 stats4_2.15.2 tools_2.15.2
    [29] zlibbioc_1.4.0

    grep ENSMUSG00000027854 ../../data/gtf/mm10_ensGene.gtf
    chr3 ensGene exon 102995728 102995967 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "1"; exon_id "ENSMUST00000029447.1"; gene_name "ENSMUSG00000027854";
    chr3 ensGene CDS 102995809 102995967 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "1"; exon_id "ENSMUST00000029447.1"; gene_name "ENSMUSG00000027854";
    chr3 ensGene exon 102996156 102996261 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "2"; exon_id "ENSMUST00000029447.2"; gene_name "ENSMUSG00000027854";
    chr3 ensGene CDS 102996156 102996261 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "2"; exon_id "ENSMUST00000029447.2"; gene_name "ENSMUSG00000027854";
    chr3 ensGene exon 102997243 102997385 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "3"; exon_id "ENSMUST00000029447.3"; gene_name "ENSMUSG00000027854";
    chr3 ensGene CDS 102997243 102997385 . + 2 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "3"; exon_id "ENSMUST00000029447.3"; gene_name "ENSMUSG00000027854";
    chr3 ensGene exon 102998490 102998603 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "4"; exon_id "ENSMUST00000029447.4"; gene_name "ENSMUSG00000027854";
    chr3 ensGene CDS 102998490 102998603 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "4"; exon_id "ENSMUST00000029447.4"; gene_name "ENSMUSG00000027854";
    chr3 ensGene exon 103001708 103003914 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "5"; exon_id "ENSMUST00000029447.5"; gene_name "ENSMUSG00000027854";
    chr3 ensGene CDS 103001708 103001806 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "5"; exon_id "ENSMUST00000029447.5"; gene_name "ENSMUSG00000027854";
    chr3 ensGene start_codon 102995809 102995811 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "1"; exon_id "ENSMUST00000029447.1"; gene_name "ENSMUSG00000027854";
    chr3 ensGene stop_codon 103001807 103001809 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "1"; exon_id "ENSMUST00000029447.1"; gene_name "ENSMUSG00000027854";
    chr3 ensGene exon 102995730 102997385 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000151065"; exon_number "1"; exon_id "ENSMUST00000151065.1"; gene_name "ENSMUSG00000027854";
    chr3 ensGene exon 102998490 102999251 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000151065"; exon_number "2"; exon_id "ENSMUST00000151065.2"; gene_name "ENSMUSG00000027854";
    chr3 ensGene exon 102995795 102995967 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "1"; exon_id "ENSMUST00000119450.1"; gene_name "ENSMUSG00000027854";
    chr3 ensGene CDS 102995809 102995967 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "1"; exon_id "ENSMUST00000119450.1"; gene_name "ENSMUSG00000027854";
    chr3 ensGene exon 102998490 102998603 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "2"; exon_id "ENSMUST00000119450.2"; gene_name "ENSMUSG00000027854";
    chr3 ensGene CDS 102998490 102998603 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "2"; exon_id "ENSMUST00000119450.2"; gene_name "ENSMUSG00000027854";
    chr3 ensGene exon 103001708 103002194 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "3"; exon_id "ENSMUST00000119450.3"; gene_name "ENSMUSG00000027854";
    chr3 ensGene CDS 103001708 103001806 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "3"; exon_id "ENSMUST00000119450.3"; gene_name "ENSMUSG00000027854";
    chr3 ensGene start_codon 102995809 102995811 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "1"; exon_id "ENSMUST00000119450.1"; gene_name "ENSMUSG00000027854";
    chr3 ensGene stop_codon 103001807 103001809 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "1"; exon_id "ENSMUST00000119450.1"; gene_name "ENSMUSG00000027854";
    chr3 ensGene exon 102996049 102996261 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000137332"; exon_number "1"; exon_id "ENSMUST00000137332.1"; gene_name "ENSMUSG00000027854";
    chr3 ensGene exon 102997243 102997351 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000137332"; exon_number "2"; exon_id "ENSMUST00000137332.2"; gene_name "ENSMUSG00000027854";

--
António Miguel de Jesus Domingues, PhD
Postdoctoral researcher
Deep Sequencing Group - SFB655
Biotechnology Center (Biotec)
Technische Universität Dresden
Fetscherstraße 105
01307 Dresden
Phone: +49 (351) 458 82362
Email: antonio.domingues(at)biotec.tu-dresden.de
--
The Unbearable Lightness of Molecular Biology

[Attachment scrubbed from the mailing list: Internal_tranbscript.pdf (application/pdf, 8751 bytes), https://stat.ethz.ch/pipermail/bioconductor/attachments/20140206/c58f158d/attachment.pdf]

Sequencing Annotation DEXSeq • 3.1k views

ADD COMMENT • link updated 12.0 years ago by Steve Lianoglou ★ 13k • written 12.0 years ago by António Miguel de Jesus Domingues ▴ 510


Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 4 days ago

United States

Hi,

A few comments in line:

On Thu, Feb 6, 2014 at 9:01 AM, António domingues <amjdomingues at gmail.com> wrote:
> Hi Bioconductors,
>
> I happened upon a funny thing in DEXseq: a gene which appears to have more
> exons in the final DEXseq output than the annotation suggests. The gene
> ENSMUSG00000027854 (screen-shot from UCSC in attachment) suggests the 3
> exons in a flattened gene model. However, the DEXSeq results lists 13 exons
> (here showing the output of htseq-count):

Not sure why you say the *gene* only has 3 exons ... you have highlighted one isoform of the gene which has very few exons, but you can see from both your picture and the exon definitions you pasted below for ENSMUSG00000027854 (presumably that's Csde1 :-) that if you consider all of the isoforms of the gene together, it has many more than just three exons.

Know what I mean?

> Between exon1 is only 1 base long (?) and exons1 to 4 are contiguous. As far
> as I am aware, DEXSeq model should have flattened all of these into one
> single "exon". Is this correct? is the error coming from the gtf? (at the
> end of the message there is also the gene annotation in the gtf).

I'm trying to parse the various exon annotations from your email, but I don't see where the 1-width exon is.

Figure 1 from their paper shows pretty clearly how the "break down" of exons is calculated across isoforms to create *counting bins*; just keep in mind that these things are not necessarily "exons" anymore.

> This is specially concerning for me because I am interested in selecting the
> first and last exon of genes, using the exon ranking from DEXSeq, to analyze
> further.

I'm not sure if what I posted was at all helpful, but if someone else doesn't do a better job of providing you with the answer you were looking for, you might try to draw a figure of a gene model (with a few splicing isoforms) and point out what it is, exactly, that you hope to extract from it.

While it's clear what the "first and last" exon of a *single transcript isoform* of a gene might be, it might get hairy when you start summarizing the "counting bins" across multiple isoforms of the same gene.

HTH,
-steve

--
Steve Lianoglou
Computational Biologist
Genentech

ADD COMMENT • link 12.0 years ago Steve Lianoglou ★ 13k
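To make the "first and last exon" point concrete, here is a minimal Python sketch (an illustration only, not DEXSeq or HTSeq code). The exon coordinates are copied from the GTF lines in the original post; the function name `first_and_last_exons` is made up for this example. It shows that each of the four annotated transcripts of ENSMUSG00000027854 has a different first exon and a different last exon, so "first" and "last" are only well defined per transcript.

```python
from collections import defaultdict

# (transcript_id, start, end) for every "exon" record in the GTF excerpt.
# Coordinates are 1-based and inclusive; the gene is on the + strand.
EXONS = [
    ("ENSMUST00000029447", 102995728, 102995967),
    ("ENSMUST00000029447", 102996156, 102996261),
    ("ENSMUST00000029447", 102997243, 102997385),
    ("ENSMUST00000029447", 102998490, 102998603),
    ("ENSMUST00000029447", 103001708, 103003914),
    ("ENSMUST00000151065", 102995730, 102997385),
    ("ENSMUST00000151065", 102998490, 102999251),
    ("ENSMUST00000119450", 102995795, 102995967),
    ("ENSMUST00000119450", 102998490, 102998603),
    ("ENSMUST00000119450", 103001708, 103002194),
    ("ENSMUST00000137332", 102996049, 102996261),
    ("ENSMUST00000137332", 102997243, 102997351),
]


def first_and_last_exons(exons, strand="+"):
    """Return {transcript_id: (first_exon, last_exon)} in transcription order."""
    by_transcript = defaultdict(list)
    for transcript_id, start, end in exons:
        by_transcript[transcript_id].append((start, end))
    result = {}
    for transcript_id, tx_exons in by_transcript.items():
        # On the minus strand the 5'-most exon has the largest coordinates.
        ordered = sorted(tx_exons, reverse=(strand == "-"))
        result[transcript_id] = (ordered[0], ordered[-1])
    return result


ends = first_and_last_exons(EXONS)
for transcript_id, (first, last) in sorted(ends.items()):
    print(transcript_id, "first:", first, "last:", last)
```

Any summary across isoforms (for example "the 5'-most bin of the gene") is a separate choice that has to be made explicitly.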


Hi Steve,

thank you for the comments. First of all, my apologies: I sent the wrong screenshot. It should have been the one (attached) for Sike1. Long day. Anyway, see my replies below to the points that are still valid.

On 02/06/2014 06:54 PM, Steve Lianoglou wrote:
> Hi,
>
> A few comments in line:
>
> On Thu, Feb 6, 2014 at 9:01 AM, António domingues
> <amjdomingues at gmail.com> wrote:
>> Hi Bioconductors,
>>
>> I happened upon a funny thing in DEXseq: a gene which appears to have more
>> exons in the final DEXseq output than the annotation suggests. The gene
>> ENSMUSG00000027854 (screen-shot from UCSC in attachment) suggests the 3
>> exons in a flattened gene model. However, the DEXSeq results lists 13 exons
>> (here showing the output of htseq-count):
> Not sure why you say the *gene* only has 3 exons ... you have
> highlighted one isoform of the gene which has very few exons, but you
> can from both your picture and the exons definitions you pasted below
> for ENSMUSG00000027854 (presumably that's Csde1 :-) that if you
> consider all of the isoforms of the gene together, it has many more
> than just three exons.
>
> Know what I mean?

It is not Csde1 :s

>> Between exon1 is only 1 base long (?) and exons1 to 4 are contiguous. As far
>> as I am aware, DEXSeq model should have flattened all of these into one
>> single "exon". Is this correct? is the error coming from the gtf? (at the
>> end of the message there is also the gene annotation in the gtf).
> I'm trying to parse the various exon annotations from your email, but
> I don't see where the 1-width exon is.

This one:

    chr3 mm10_ensGene.gtf exonic_part 102995728 102995729 . + . transcripts "ENSMUST00000029447"; exonic_part_number "001"; gene_id "ENSMUSG00000027854"

Unless I calculated it incorrectly.

> Figure 1 from their paper shows pretty clearly how the "break down" of
> exons are calculated across isoforms to create *counting bins* -- just
> keep in mind that these things are not necessarily "exons" anymore.

Yes, I am aware of that, but I should have been clearer about the distinction between "exon" and counting bin. I think that with the new screenshot it will become more apparent what I mean.

>> This is specially concerning for me because I am interested in selecting the
>> first and last exon of genes, using the exon ranking from DEXSeq, to analyze
>> further.
> I'm not sure if what I posted was at all helpful, but if someone else
> doesn't do a better job of providing you with the answer you were
> looking for, you might try to draw a figure of a gene model (with a
> few splicing isoforms) and point out what it is, exactly, that you
> hope to extract from it.
>
> While it's clear what "First and last" exon of a *single transcript
> isoform* of a gene might be, it might get hairy when you start
> summarizing the "counting bins" across multiple isoforms of the same
> gene.

True. I am only using the DEXSeq results as a quick and dirty approach before I get data from other tools which handle this better. For example, MISO has annotations for alternative polyadenylation and Cufflinks provides some information on alternative promoter usage. Regardless, if the gene model is incorrect (which I hope it is not, and this is only me being thick), then the DEXSeq results for some counting bins may not be trustworthy.

> Oh, and by the way:
>
>> Hi Bioconductors,
>>
>> I happened upon a funny thing in DEXseq: a gene which appears to have more
>> exons in the final DEXseq output than the annotation suggests. The gene
>> ENSMUSG00000027854 (screen-shot from UCSC in attachment) suggests the 3
>> exons in a flattened gene model.
> I'd argue that the isoform of the gene that you highlighted in your
> original screen shot only has *two* exons
>
> -steve

ehehe, correct.

> HTH,
> -steve

Cheers,
António

[Attachment scrubbed from the mailing list: ENSMUSG00000027854.png (image/png, 12348 bytes), https://stat.ethz.ch/pipermail/bioconductor/attachments/20140206/2c8d57a1/attachment.png]

ADD REPLY • link 12.0 years ago António Miguel de Jesus Domingues ▴ 510
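On the width question: GFF/GTF coordinates are 1-based and inclusive on both ends, so the width of a feature is end - start + 1. A minimal Python sketch of that arithmetic, applied to the `exonic_part` line quoted above (the helper name `feature_width` is made up for this example; real GFF files are tab-separated, and a whitespace split is used here only because the pasted line is space-separated):

```python
def feature_width(gff_line):
    """Width in bases of a GFF/GTF feature (1-based, inclusive coordinates)."""
    # Columns: seqname, source, feature, start, end, score, strand, frame, attributes
    fields = gff_line.split(None, 8)
    start, end = int(fields[3]), int(fields[4])
    return end - start + 1


line = (
    'chr3 mm10_ensGene.gtf exonic_part 102995728 102995729 . + . '
    'transcripts "ENSMUST00000029447"; exonic_part_number "001"; '
    'gene_id "ENSMUSG00000027854"'
)
width = feature_width(line)
print(width)
```

Subtracting the two coordinates without adding 1 is the usual source of an off-by-one here.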


Hi Antonio,

I counted 13 exonic bins by eye. What do you find to be amiss there? Remember that you're not using a flattened/union gene model with DEXSeq, but rather pretty much the exact opposite (maybe it should be called a "disjoint gene model"?).

BTW, that first bin is actually 2bp wide.

Regards,
Devon

____________________________________________
Devon Ryan, Ph.D.
Email: dpryan at dpryan.com
Tel: +49 (0)178 298-6067
Molecular and Cellular Cognition Lab
German Centre for Neurodegenerative Diseases (DZNE)
Ludwig-Erhard-Allee 2
53175 Bonn, Germany

ADD REPLY • link 12.0 years ago Devon Ryan ▴ 200
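The "disjoint gene model" can be reproduced by hand from the GTF exon records in the original post. The following is a minimal Python sketch of the idea, not the actual code of `dexseq_prepare_annotation.py`: every exon start and every position just past an exon end is treated as a breakpoint, and each stretch between consecutive breakpoints that is covered by at least one exon becomes a counting bin. The function name `disjoint_bins` is an assumption made for this illustration.

```python
# (transcript_id, start, end) for every "exon" record in the GTF excerpt;
# coordinates are 1-based and inclusive.
EXONS = [
    ("ENSMUST00000029447", 102995728, 102995967),
    ("ENSMUST00000029447", 102996156, 102996261),
    ("ENSMUST00000029447", 102997243, 102997385),
    ("ENSMUST00000029447", 102998490, 102998603),
    ("ENSMUST00000029447", 103001708, 103003914),
    ("ENSMUST00000151065", 102995730, 102997385),
    ("ENSMUST00000151065", 102998490, 102999251),
    ("ENSMUST00000119450", 102995795, 102995967),
    ("ENSMUST00000119450", 102998490, 102998603),
    ("ENSMUST00000119450", 103001708, 103002194),
    ("ENSMUST00000137332", 102996049, 102996261),
    ("ENSMUST00000137332", 102997243, 102997351),
]


def disjoint_bins(exons):
    """Split exons into non-overlapping bins, each with the set of transcripts covering it."""
    # Breakpoints: each exon start, and the first position after each exon end.
    cuts = sorted({s for _, s, _ in exons} | {e + 1 for _, _, e in exons})
    bins = []
    for lo, hi in zip(cuts, cuts[1:]):
        seg_start, seg_end = lo, hi - 1
        # Between two consecutive breakpoints an exon either covers the whole
        # segment or does not touch it at all.
        covering = frozenset(
            tx for tx, s, e in exons if s <= seg_start and seg_end <= e
        )
        if covering:  # segments covered by no exon are intronic and are skipped
            bins.append((seg_start, seg_end, covering))
    return bins


bins = disjoint_bins(EXONS)
for number, (start, end, transcripts) in enumerate(bins, start=1):
    print(f"{number:03d}", start, end, end - start + 1, "+".join(sorted(transcripts)))
```

Run on these coordinates, the sketch yields 13 bins whose boundaries match the `exonic_part` lines in the GFF from the original post, with a first bin spanning 102995728 to 102995729.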


Hi Devon,

thank you for the clarification. I thought DEXSeq used a union model, but under the "disjoint gene model" it all makes sense now.

Best,
António

ADD REPLY • link 12.0 years ago António Miguel de Jesus Domingues ▴ 510
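For contrast, a union (flattened) model merges every overlapping or directly adjacent exon into one interval. A minimal Python sketch of that merge, again using the exon coordinates from the GTF in the original post (the function name `union_exons` is made up for this illustration and is not part of DEXSeq or HTSeq):

```python
# (start, end) of every "exon" record in the GTF excerpt, all transcripts
# pooled; coordinates are 1-based and inclusive.
EXON_INTERVALS = [
    (102995728, 102995967), (102996156, 102996261), (102997243, 102997385),
    (102998490, 102998603), (103001708, 103003914),  # ENSMUST00000029447
    (102995730, 102997385), (102998490, 102999251),  # ENSMUST00000151065
    (102995795, 102995967), (102998490, 102998603),
    (103001708, 103002194),                          # ENSMUST00000119450
    (102996049, 102996261), (102997243, 102997351),  # ENSMUST00000137332
]


def union_exons(intervals):
    """Merge overlapping or directly adjacent intervals into a union model."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


union = union_exons(EXON_INTERVALS)
for start, end in union:
    print(start, end, end - start + 1)
```

On these coordinates the merge produces three intervals, largely because the long first exon of ENSMUST00000151065 spans several exons of the other transcripts. That would explain why a flattened track suggests 3 exons while the disjoint counting bins number 13.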


Hi Antonio,

As an extra comment, the binning that we do in DEXSeq is not mandatory for using DEXSeq to test for alternative exon usage. For example, DEXSeq can also be used with counts from exon-exon junction reads (as in tools like MISO) or with a "union" model like the one you mentioned. Another example of a very creative use is the one from a paper by Steve (10.1101/gad.229328.113), where they adapted their counting bins to test specifically for alternative 3' UTR lengthening independent of changes in gene expression.

I guess all of these are alternatives to trying to quantify and assemble transcript isoforms, which still has some limitations (e.g. 10.1038/nmeth.2714).

Best regards,
Alejandro

ADD REPLY • link 12.0 years ago Alejandro Reyes ★ 1.9k
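On the width question raised above: GFF/GTF coordinates are 1-based and inclusive at both ends, so the width of a feature is `end - start + 1`. Below is a minimal Python sketch of that arithmetic, using the first five `exonic_part` lines from the original post. The helper name `gff_width` is made up for illustration and is not part of DEXSeq or HTSeq.

```python
def gff_width(start: int, end: int) -> int:
    """Width of a GFF/GTF feature; both coordinates are 1-based and inclusive."""
    return end - start + 1


# First five exonic parts of ENSMUSG00000027854, copied from the
# flattened GFF quoted in the original post: (bin number, start, end).
bins = [
    ("001", 102995728, 102995729),
    ("002", 102995730, 102995794),
    ("003", 102995795, 102995967),
    ("004", 102995968, 102996048),
    ("005", 102996049, 102996155),
]

# Width of every counting bin.
widths = {name: gff_width(start, end) for name, start, end in bins}

# Two bins abut when the next one starts exactly one base after the previous end.
adjacent = all(nxt[1] == cur[2] + 1 for cur, nxt in zip(bins, bins[1:]))

for name, start, end in bins:
    print(name, start, end, widths[name])
print("bins 001-005 are back to back:", adjacent)
```

Bin 001 comes out 2 bp wide rather than 1, and the five bins sit back to back with no gaps, which is why they can look like a single exon in a genome browser even though they are counted separately.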


Hi,

On Fri, Feb 7, 2014 at 7:12 AM, Alejandro Reyes <alejandro.reyes at embl.de> wrote:
> Hi Antonio,
>
> As an extra comment, the binning that we do in DEXSeq is not mandatory to
> use DEXSeq for testing for alternative exon usage. For example, DEXSeq can
> be also used introducing counts from exon-exon junction reads (as in tools
> like MISO) or a "union" model like the one you mentioned. Another example of
> a very creative use is the one from a paper by Steve
> (10.1101/gad.229328.113), where they adapted their counting bins to test
> specifically for alternative 3' UTR lengthening independent from changes in
> gene expression.

Alejandro: thanks for the kind words and virtual citation ;-) DEXSeq was an instrumental part of our analysis and we were very fortunate to see it early on in bioc (even before your publication), so thank you for putting your work out there for all of us to benefit.

Antonio: if you're working on ApA (I'm gathering yes ;-) and you have any Q's about our work, I'd be happy to chat about it offline.

-steve

--
Steve Lianoglou
Computational Biologist
Genentech

ADD REPLY • link 12.0 years ago Steve Lianoglou ★ 13k


Hi Alejandro,

thank you for the extra information in particular the referral to Steve's paper. It eluded me, but now I have some weekend reading.

Best,
António

--
António Miguel de Jesus Domingues, PhD
Postdoctoral researcher
Deep Sequencing Group - SFB655
Biotechnology Center (Biotec)
Technische Universität Dresden
Fetscherstraße 105
01307 Dresden
Phone: +49 (351) 458 82362
Email: antonio.domingues(at)biotec.tu-dresden.de
--
The Unbearable Lightness of Molecular Biology

On 02/07/2014 04:12 PM, Alejandro Reyes wrote:
> Hi Antonio,
>
> As an extra comment, the binning that we do in DEXSeq is not mandatory
> to use DEXSeq for testing for alternative exon usage. For example,
> DEXSeq can be also used introducing counts from exon-exon junction
> reads (as in tools like MISO) or a "union" model like the one you
> mentioned. Another example of a very creative use is the one from a
> paper by Steve (10.1101/gad.229328.113), where they adapted their
> counting bins to test specifically for alternative 3' UTR lengthening
> independent from changes in gene expression.
>
> I guess all these are alternative approaches to try to quantify and
> assemble transcript isoforms, that still has some limitations (e.g.
> 10.1038/nmeth.2714).
>
> Best regards,
> Alejandro
>
>> Hi Devon,
>>
>> thank you for the clarification. I thought DEXSeq used a union model,
>> but under the "disjoint gene model" it all makes sense now.
>>
>> Best,
>> António
>>
>> On 06/02/14 19:42, Devon Ryan wrote:
>>> Hi Antonio,
>>>
>>> I counted 13 exonic bins by eye. What do you find to be amiss there?
>>> Remember that you're not using a flattened/union gene model with
>>> DEXseq, but rather pretty much the exact opposite (maybe it should
>>> be called a "disjoint gene model"?).
>>>
>>> BTW, that first bin is actually 2bp wide.
>>>
>>> Regards,
>>> Devon
>>>
>>> ____________________________________________
>>> Devon Ryan, Ph.D.
>>> Email: dpryan at dpryan.com
>>> Tel: +49 (0)178 298-6067
>>> Molecular and Cellular Cognition Lab
>>> German Centre for Neurodegenerative Diseases (DZNE)
>>> Ludwig-Erhard-Allee 2
>>> 53175 Bonn, Germany
>>>
>>> On Feb 6, 2014, at 7:12 PM, António domingues wrote:
>>>
>>>> Hi Steve,
>>>>
>>>> thank for the comments. First of all, my apologies, I have sent the
>>>> wrong screenshot. It should have been the one (attached) for Sike1.
>>>> Long day. Anyway, see my replies bellow to the points that are
>>>> still valid.

ADD REPLY • link 12.0 years ago António Miguel de Jesus Domingues ▴ 510
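The "disjoint gene model" Devon describes can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the code of `dexseq_prepare_annotation.py`: the function `disjoint_bins`, the transcript names and the coordinates are all invented, and strand, overlapping genes and GFF parsing are ignored.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based and inclusive as in GFF


def disjoint_bins(transcripts: Dict[str, List[Exon]]) -> List[Tuple[int, int, List[str]]]:
    """Split the exons of all transcripts into non-overlapping counting bins."""
    # Every exon start, and every position just past an exon end,
    # is a potential bin boundary.
    cuts = sorted({p for exons in transcripts.values()
                   for start, end in exons
                   for p in (start, end + 1)})
    bins = []
    for left, right in zip(cuts, cuts[1:]):
        start, end = left, right - 1
        # Transcripts with an exon that fully covers this segment.
        covering = sorted(
            name for name, exons in transcripts.items()
            if any(s <= start and end <= e for s, e in exons)
        )
        if covering:  # segments covered by no exon are intronic, skip them
            bins.append((start, end, covering))
    return bins


# Hypothetical gene with two isoforms whose first exons start 2 bp apart.
toy = {
    "tx_a": [(100, 200), (300, 400)],
    "tx_b": [(102, 250), (300, 350)],
}

for start, end, names in disjoint_bins(toy):
    print(start, end, "+".join(names))
```

On this toy input a union model would report two exons (100-250 and 300-400), whereas the disjoint model yields five bins, the first of them only 2 bp wide because the two isoforms start two bases apart. That is the same pattern as bins 001 to 004 of ENSMUSG00000027854.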


Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 4 days ago

United States

Oh, and by the way:

On Thu, Feb 6, 2014 at 9:01 AM, António domingues <amjdomingues at gmail.com> wrote:
> Hi Bioconductors,
>
> I happened upon a funny thing in DEXseq: a gene which appears to have more
> exons in the final DEXseq output than the annotation suggests. The gene
> ENSMUSG00000027854 (screen-shot from UCSC in attachment) suggests the 3
> exons in a flattened gene model.

I'd argue that the isoform of the gene that you highlighted in your original screen shot only has *two* exons :-)

-steve

--
Steve Lianoglou
Computational Biologist
Genentech

ADD COMMENT • link 12.0 years ago Steve Lianoglou ★ 13k
