Hello,
I want to use DEXSeq on introns and from what I read from some threads, it is possible to do and recommended. However I'm having a few problems and questions.
DEXSeq is dividing exons into exon-bins with the script dexseq_prepare_annotation.py. If two exons, from different transcripts but the same gene, overlap each other, then it creates unique bins. Let's apply this to introns. So I create my set of introns out of a Ensembl gtf file with a script. Many of these introns (e.g. intron 1 from isoform A of gene Z) will overlap with exons (e.g. exon 1 from isoform B of Gene Z) and after creating the introns-bins and counting, bins which overlap with exon will have more reads compared to introns which are pure introns (no overlap with any exon).
1. I'd like to now if this will have influence on the calculation for the usage?
So I was thinking about a more stringent approach and only using intronic parts which don't overlap with exons.
2. Is this a good approach?
I know you will loose the information about intron-isoform relation but at the moment I'm just interested to see if there is a difference in the intron usage between my condition.
So I have created a file which only has unique introns and ran into a problem while using dexseq_prepare_annotation.py. Ensembl75 has a gene http://feb2014.archive.ensembl.org/Mus_musculus/Gene/Summary?db=core;g=ENSMUSG00000092329;r=8:119910841-124345722;t=ENSMUST00000127664 which has 16 ( = 15 introns) and the distance between the 1st and 2nd exon is over 4 Mb. Now if I run the script on the gene and it's 15 introns, then I get 362 intron-bins and 738 intron-bins with the aggregate option set to True. From what I could see the problem seems to be in Step 3 of the script. Somehow the large first intron is split into the intron parts of other genes within the large intron. Right now, I would just remove the gene from the gff and do a recount on the data.
Now I ran DEXSeq and this is my sampleTable:
sSampleTable <- data.frame(row.names=c("S1", "S2", "S3", "S4", "S5", "S6"), Type=c("Het", "Het", "Het", "Ko", "Ko", "Ko"))
and this is my design:
~ sample + exon + Type:exon I get the following error: sdxd <- estimateExonFoldChanges(sdxd, BPPARAM=multicore) Error in estimateExonFoldChanges(sdxd, BPPARAM = multicore) : The value of the parameter fitExpToVar,'condition', is not a column name of colData
Is the fitExpToVar parameter set to condition by default? After changing renaming Type to condition, it worked. But wouldn't it be better to not have the default?
Thx
Mathias
HI Mathias,
You could also specify fitExpToVar="Type" when calling estimateExonFoldChanges instead of renaming Type to condition.
Alejandro
Hey Alejandro,
ok thx, I'll do that next time. One question if one has more than one factor like Type and Sequencing, as it is described in the vignette. Which one should one use for fitExpToVar?