António Miguel de Jesus Domingues
Hi Bioconductors,
I happened upon a funny thing in DEXSeq: a gene which appears to have
more exons in the final DEXSeq output than the annotation suggests.
For the gene ENSMUSG00000027854, the flattened gene model (screen-shot
from UCSC in attachment) suggests 3 exons. However, the DEXSeq results
list 13 exons (here showing the output of htseq-count):
grep ENSMUSG00000027854 htseq_count_out.txt
ENSMUSG00000027854:001 0
ENSMUSG00000027854:002 6
ENSMUSG00000027854:003 18
ENSMUSG00000027854:004 0
ENSMUSG00000027854:005 0
ENSMUSG00000027854:006 86
ENSMUSG00000027854:007 0
ENSMUSG00000027854:008 113
ENSMUSG00000027854:009 52
ENSMUSG00000027854:010 76
ENSMUSG00000027854:011 0
ENSMUSG00000027854:012 310
ENSMUSG00000027854:013 554
This comes from the annotation created with:
dexseq_prepare_annotation.py mm10_ensGene.gtf mm10_ensGene.gff
grep ENSMUSG00000027854 ../../data/gtf/mm10_ensGene.gff
chr3 mm10_ensGene.gtf aggregate_gene 102995728 103003914 . + . gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 102995728 102995729 . + . transcripts "ENSMUST00000029447"; exonic_part_number "001"; gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 102995730 102995794 . + . transcripts "ENSMUST00000029447+ENSMUST00000151065"; exonic_part_number "002"; gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 102995795 102995967 . + . transcripts "ENSMUST00000151065+ENSMUST00000029447+ENSMUST00000119450"; exonic_part_number "003"; gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 102995968 102996048 . + . transcripts "ENSMUST00000151065"; exonic_part_number "004"; gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 102996049 102996155 . + . transcripts "ENSMUST00000151065+ENSMUST00000137332"; exonic_part_number "005"; gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 102996156 102996261 . + . transcripts "ENSMUST00000029447+ENSMUST00000137332+ENSMUST00000151065"; exonic_part_number "006"; gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 102996262 102997242 . + . transcripts "ENSMUST00000151065"; exonic_part_number "007"; gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 102997243 102997351 . + . transcripts "ENSMUST00000029447+ENSMUST00000137332+ENSMUST00000151065"; exonic_part_number "008"; gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 102997352 102997385 . + . transcripts "ENSMUST00000029447+ENSMUST00000151065"; exonic_part_number "009"; gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 102998490 102998603 . + . transcripts "ENSMUST00000151065+ENSMUST00000029447+ENSMUST00000119450"; exonic_part_number "010"; gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 102998604 102999251 . + . transcripts "ENSMUST00000151065"; exonic_part_number "011"; gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 103001708 103002194 . + . transcripts "ENSMUST00000029447+ENSMUST00000119450"; exonic_part_number "012"; gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 103002195 103003914 . + . transcripts "ENSMUST00000029447"; exonic_part_number "013"; gene_id "ENSMUSG00000027854"
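The part lengths and the gaps between parts can be checked directly
from the coordinates in the GFF. This is only a minimal sketch, not
part of DEXSeq: the PARTS list is copied by hand from the GFF lines,
the function names are made up for illustration, and it assumes GFF
coordinates are 1-based and inclusive.

```python
# Exonic parts of ENSMUSG00000027854 as (part number, start, end),
# copied by hand from the flattened GFF.
PARTS = [
    ("001", 102995728, 102995729),
    ("002", 102995730, 102995794),
    ("003", 102995795, 102995967),
    ("004", 102995968, 102996048),
    ("005", 102996049, 102996155),
    ("006", 102996156, 102996261),
    ("007", 102996262, 102997242),
    ("008", 102997243, 102997351),
    ("009", 102997352, 102997385),
    ("010", 102998490, 102998603),
    ("011", 102998604, 102999251),
    ("012", 103001708, 103002194),
    ("013", 103002195, 103003914),
]


def part_lengths(parts):
    """Length in bases of each part (assumes 1-based, inclusive coordinates)."""
    return {number: end - start + 1 for number, start, end in parts}


def contiguous_runs(parts):
    """Group part numbers into runs in which each part starts
    exactly one base after the previous part ends."""
    runs = []
    previous_end = None
    for number, start, end in parts:
        if previous_end is not None and start == previous_end + 1:
            runs[-1].append(number)
        else:
            runs.append([number])
        previous_end = end
    return runs


if __name__ == "__main__":
    for number, length in part_lengths(PARTS).items():
        print(number, length)
    for run in contiguous_runs(PARTS):
        print("contiguous:", run[0], "to", run[-1])
```

Under these assumptions the 13 parts fall into three contiguous
blocks (001 to 009, 010 to 011, 012 to 013), which would be consistent
with the 3 blocks visible in the UCSC view.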
Exon 1 is only 2 bases long (?) and exons 1 to 4 are contiguous. As
far as I am aware, the DEXSeq model should have flattened all of these
into one single "exon". Is this correct? Is the error coming from the
gtf? (At the end of the message there is also the gene annotation in
the gtf.)
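One way to see where 13 parts could come from is to cut the gene at
every exon start and end found in any transcript, and to keep each
resulting piece that is covered by at least one exon. The sketch below
does this for the "exon" records of the four transcripts in the gtf.
It is an illustration under that assumption, not the code of
dexseq_prepare_annotation.py; EXONS and flatten are hypothetical names.

```python
# Exon records (start, end) per transcript, copied by hand from the
# "exon" lines of the gtf for ENSMUSG00000027854.
EXONS = {
    "ENSMUST00000029447": [
        (102995728, 102995967), (102996156, 102996261),
        (102997243, 102997385), (102998490, 102998603),
        (103001708, 103003914),
    ],
    "ENSMUST00000151065": [(102995730, 102997385), (102998490, 102999251)],
    "ENSMUST00000119450": [
        (102995795, 102995967), (102998490, 102998603),
        (103001708, 103002194),
    ],
    "ENSMUST00000137332": [(102996049, 102996261), (102997243, 102997351)],
}


def flatten(exons_by_transcript):
    """Split all exons into disjoint pieces at every exon boundary.

    Returns a list of (start, end, transcripts) with 1-based inclusive
    coordinates; pieces not covered by any exon are dropped.
    """
    # Cut positions: every exon start, and the base after every exon end.
    cuts = sorted(
        {
            position
            for exons in exons_by_transcript.values()
            for start, end in exons
            for position in (start, end + 1)
        }
    )
    pieces = []
    for left, right in zip(cuts, cuts[1:]):
        piece_start, piece_end = left, right - 1
        covering = sorted(
            transcript
            for transcript, exons in exons_by_transcript.items()
            if any(s <= piece_start and piece_end <= e for s, e in exons)
        )
        if covering:
            pieces.append((piece_start, piece_end, covering))
    return pieces


if __name__ == "__main__":
    for index, (start, end, transcripts) in enumerate(flatten(EXONS), 1):
        print(f"{index:03d}", start, end, "+".join(transcripts))
```

With these inputs the sketch yields 13 pieces whose coordinates match
the exonic_part lines in the GFF, so the count seems to follow from
the transcripts in the gtf rather than from a corrupted file.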
This is especially concerning for me because I am interested in
selecting the first and last exon of genes, using the exon ranking
from DEXSeq, to analyze further.
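To make the intended selection concrete, here is a minimal sketch of
picking the lowest- and highest-numbered part per gene from count
lines of the form "GENE:NNN count". The function name is hypothetical,
and the sketch only ranks by part number: whether part 001 is the 5'
end for minus-strand genes is not addressed and would need checking.

```python
from collections import defaultdict


def first_and_last_parts(lines):
    """Return {gene: ((first_number, count), (last_number, count))}
    from lines such as 'ENSMUSG00000027854:001 0'.

    Lines that do not look like 'GENE:NNN count' are skipped.
    """
    parts = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 2 or ":" not in fields[0]:
            continue
        # Split on the last colon so the gene ID is kept whole.
        gene, number = fields[0].rsplit(":", 1)
        parts[gene].append((int(number), int(fields[1])))
    result = {}
    for gene, items in parts.items():
        items.sort()  # order by part number
        result[gene] = (items[0], items[-1])
    return result


if __name__ == "__main__":
    # A few of the count lines for ENSMUSG00000027854 from the grep output.
    sample = [
        "ENSMUSG00000027854:001 0",
        "ENSMUSG00000027854:002 6",
        "ENSMUSG00000027854:012 310",
        "ENSMUSG00000027854:013 554",
    ]
    print(first_and_last_parts(sample))
```

For this gene the selection would return the 2-base part 001 as the
"first exon", which is why the ranking matters for the downstream
analysis.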
Thanks,
António
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] C
attached base packages:
[1] grDevices datasets stats graphics utils methods base
other attached packages:
[1] DEXSeq_1.4.0 GenomicFeatures_1.10.2 GenomicRanges_1.10.5
[4] IRanges_1.16.6 data.table_1.8.9 stringr_0.6.2
[7] ggplot2_0.9.3.1 AnnotationDbi_1.20.2 Biobase_2.18.0
[10] BiocGenerics_0.4.0
loaded via a namespace (and not attached):
[1] BSgenome_1.26.1 Biostrings_2.26.3 DBI_0.2-5 MASS_7.3-23
[5] RColorBrewer_1.0-5 RCurl_1.95-4.1 RSQLite_0.11.2 Rsamtools_1.10.2
[9] XML_3.98-1.1 biomaRt_2.14.0 bitops_1.0-6 colorspace_1.2-4
[13] dichromat_2.0-0 digest_0.6.3 grid_2.15.2 gtable_0.1.2
[17] hwriter_1.3 labeling_0.2 munsell_0.4.2 parallel_2.15.2
[21] plyr_1.8 proto_0.3-10 reshape2_1.2.2 rtracklayer_1.18.1
[25] scales_0.2.3 statmod_1.4.17 stats4_2.15.2 tools_2.15.2
[29] zlibbioc_1.4.0
grep ENSMUSG00000027854 ../../data/gtf/mm10_ensGene.gtf
chr3 ensGene exon 102995728 102995967 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "1"; exon_id "ENSMUST00000029447.1"; gene_name "ENSMUSG00000027854";
chr3 ensGene CDS 102995809 102995967 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "1"; exon_id "ENSMUST00000029447.1"; gene_name "ENSMUSG00000027854";
chr3 ensGene exon 102996156 102996261 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "2"; exon_id "ENSMUST00000029447.2"; gene_name "ENSMUSG00000027854";
chr3 ensGene CDS 102996156 102996261 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "2"; exon_id "ENSMUST00000029447.2"; gene_name "ENSMUSG00000027854";
chr3 ensGene exon 102997243 102997385 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "3"; exon_id "ENSMUST00000029447.3"; gene_name "ENSMUSG00000027854";
chr3 ensGene CDS 102997243 102997385 . + 2 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "3"; exon_id "ENSMUST00000029447.3"; gene_name "ENSMUSG00000027854";
chr3 ensGene exon 102998490 102998603 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "4"; exon_id "ENSMUST00000029447.4"; gene_name "ENSMUSG00000027854";
chr3 ensGene CDS 102998490 102998603 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "4"; exon_id "ENSMUST00000029447.4"; gene_name "ENSMUSG00000027854";
chr3 ensGene exon 103001708 103003914 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "5"; exon_id "ENSMUST00000029447.5"; gene_name "ENSMUSG00000027854";
chr3 ensGene CDS 103001708 103001806 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "5"; exon_id "ENSMUST00000029447.5"; gene_name "ENSMUSG00000027854";
chr3 ensGene start_codon 102995809 102995811 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "1"; exon_id "ENSMUST00000029447.1"; gene_name "ENSMUSG00000027854";
chr3 ensGene stop_codon 103001807 103001809 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "1"; exon_id "ENSMUST00000029447.1"; gene_name "ENSMUSG00000027854";
chr3 ensGene exon 102995730 102997385 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000151065"; exon_number "1"; exon_id "ENSMUST00000151065.1"; gene_name "ENSMUSG00000027854";
chr3 ensGene exon 102998490 102999251 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000151065"; exon_number "2"; exon_id "ENSMUST00000151065.2"; gene_name "ENSMUSG00000027854";
chr3 ensGene exon 102995795 102995967 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "1"; exon_id "ENSMUST00000119450.1"; gene_name "ENSMUSG00000027854";
chr3 ensGene CDS 102995809 102995967 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "1"; exon_id "ENSMUST00000119450.1"; gene_name "ENSMUSG00000027854";
chr3 ensGene exon 102998490 102998603 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "2"; exon_id "ENSMUST00000119450.2"; gene_name "ENSMUSG00000027854";
chr3 ensGene CDS 102998490 102998603 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "2"; exon_id "ENSMUST00000119450.2"; gene_name "ENSMUSG00000027854";
chr3 ensGene exon 103001708 103002194 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "3"; exon_id "ENSMUST00000119450.3"; gene_name "ENSMUSG00000027854";
chr3 ensGene CDS 103001708 103001806 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "3"; exon_id "ENSMUST00000119450.3"; gene_name "ENSMUSG00000027854";
chr3 ensGene start_codon 102995809 102995811 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "1"; exon_id "ENSMUST00000119450.1"; gene_name "ENSMUSG00000027854";
chr3 ensGene stop_codon 103001807 103001809 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "1"; exon_id "ENSMUST00000119450.1"; gene_name "ENSMUSG00000027854";
chr3 ensGene exon 102996049 102996261 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000137332"; exon_number "1"; exon_id "ENSMUST00000137332.1"; gene_name "ENSMUSG00000027854";
chr3 ensGene exon 102997243 102997351 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000137332"; exon_number "2"; exon_id "ENSMUST00000137332.2"; gene_name "ENSMUSG00000027854";
--
António Miguel de Jesus Domingues, PhD
Postdoctoral researcher
Deep Sequencing Group - SFB655
Biotechnology Center (Biotec)
Technische Universität Dresden
Fetscherstraße 105
01307 Dresden
Phone: +49 (351) 458 82362
Email: antonio.domingues(at)biotec.tu-dresden.de
--
The Unbearable Lightness of Molecular Biology
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Internal_tranbscript.pdf
Type: application/pdf
Size: 8751 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20140206/c58f158d/attachment.pdf>