Question

Eliminating minor exon bins from DEXSeq / associated python scripts

0

Entering edit mode

np ▴ 10

@np-8420

Last seen 9.8 years ago

United States

hello,

i want to use the DEXSeq pipeline to get exon counts... and i have followed the vignette. everything works well except that I am interested in counts from only the major exons (i.e. annotated exons).

is there a way to quickly eliminate the non-major exon bins that the associated DEXSeq python scripts generate? i understand there is likely a very good reason for including so many bins, but for my purposes i would like only major exon counts. i have been struggling to find documentation for how to do this.

sincerely,

np

dexseq deseq2 • 2.0k views

ADD COMMENT • link 10.3 years ago • updated 10.2 years ago np ▴ 10

0

Entering edit mode

Hello,

I would suggest to subset the original gtf file, to create a new gtf file with a selection of the transcript isoforms that you are interested before running dexseq_prepare_annotation.py. This way, you would eliminate lots of isoforms and avoid unnecessary exon binning.

Alejandro

ADD REPLY • link 10.3 years ago Alejandro Reyes ★ 1.9k

0

Entering edit mode

Hello Alejandro,

This is helpful -- I will read about how to subset the .gtf. One question I had -- is there any way within DESeq to get the coordinates of the exon bins? For our research question, we will eventually have a limited number of gene candidates to focus on and just want to graphically display the counts across exons in individual samples without really performing the full DEU analyses.

It would be nice to have a full DEXSeq count dataset for all genes:exons, and then analyze individual gene candidates for quick interrogations as they come up.

As an example...GAPDH is annotated to have only 9 exons and using the full ENCODE gtf we get 35 exon bins. Is there a way within DEXSeq to determine which of the 35 bins represent the 9 major exons? If we can figure this out, we'll just extract the appropriate exons for each candidate as they come up -- if that makes sense.

np

ADD REPLY • link 10.3 years ago np ▴ 10

score 1 · Answer 1 · 2015-08-24

1

Entering edit mode

Alejandro Reyes ★ 1.9k

@alejandro-reyes-5124

Last seen 16 months ago

Novartis Institutes for BioMedical Rese…

Hi,

The coordinates of the exon bins and to which transcripts it overlaps can be found in the slot rowRanges(dxd) of a DEXSeqDataSet object. In the slot rowRanges(dxd)$transcripts, you will find a list of transcript from which each exon bin was derived.

The tricky part here would be to identify the transcript identifiers that you are interested on. What annotation are you using when you mention that GAPDH has only 9 exons? Are these annotated in the "gold" transcripts from ensembl, http://www.ensembl.org/Help/View?id=151, )?

Once you have a set of transcript identifiers, you could (1) either subset your DEXSeqDataSet by eliminating those exon bins that are not overlapping with any of the set of transcripts by using the rowRanges(dxd)$transcripts or (2) subset the initial gtf file based on the transcript identifiers that you are interested on. This script can do the job for you:

https://github.com/areyesq/useful-short-scripts/blob/master/subsetGTFByAttribute.pl

Regarding displaying the number of counts for a specific exon, have you looked at the plotDEXSeq function?

Alejandro

ADD COMMENT • link 10.2 years ago Alejandro Reyes ★ 1.9k

0

Entering edit mode

Hello Alejandro,

Thank you for replying. I am getting the number of exons from the UCSC browser -- I found that most of them had an approximate exon # of 9, although there could be +/- an exon or two based on transcript isoform. The gtf I used to prepare the annotations was the full gtf provided by ENCODE. Perhaps I should subset the gtf using your script and identify only the "gold transcripts" to get the major exon bins that I am looking for.

For the plotDEXSeq function - this is exactly what our group wanted. However, I noticed that in order to use it, you need a DEXSeqResults object. Unfortunately, the samples that we have sequenced do not have biological replicates (patient tumor tissue) so I cannot produce a DEXSeqResults object since I cannot estimate dispersions. Is there a way to plot the counts from a DEXSeqDataSet using the plotDEXSeq function? If so, I would love to use the DEXplots.

np

ADD REPLY • link 10.2 years ago np ▴ 10

0

Entering edit mode

Hi @np,

Thanks a lot for your feedback! I have added to plotDEXSeq the possibility to input a DEXSeqDataSet object for plotting the normalized read counts without the need to estimate dispersions for all the exons. The scale will be in log10 without any vst transformation (since one would need the dispersion function for this transformation), for example:

library(DEXSeq)
data("pasillaDEXSeqDataSet", package="pasilla")
dxd <- estimateSizeFactors(dxd)
plotDEXSeq( dxd, "FBgn0010909", norCounts=TRUE, expression=FALSE)

These changes should be available in DEXSeq 1.15.11

Alejandro Reyes

ADD REPLY • link 10.2 years ago Alejandro Reyes ★ 1.9k

score 0 · Answer 2 · 2015-08-24

0

Entering edit mode

np ▴ 10

@np-8420

Last seen 9.8 years ago

United States

(deleted)

ADD COMMENT • link 10.2 years ago np ▴ 10