Assaf Gordon
Hello,
I have a question regarding the gene model source to use with DESeq.
Assuming the following workflow:
1. Map reads to genome (bowtie/tophat/bwa/etc.).
2. Count hits-per-gene (HTSeq / CoverageBed / etc.).
3. Repeat 1,2 for all samples, merge together into one table.
4. Run DESeq on merged table.
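To make step 3 concrete, here is a minimal sketch of merging per-sample count files into one table. It assumes each sample's counts arrive as two tab-separated columns (gene ID, count), that summary lines starting with `__` (as recent htseq-count versions emit) should be skipped, and that a gene missing from a sample counts as 0. The function names `parse_counts` and `merge_counts` are hypothetical, not part of any tool mentioned above.

```python
def parse_counts(text):
    """Parse 'gene<TAB>count' lines into a dict, skipping '__' summary lines."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("__"):
            continue  # assumption: '__no_feature' etc. are not genes
        gene, value = line.split("\t")
        counts[gene] = int(value)
    return counts


def merge_counts(per_sample):
    """Merge {sample: {gene: count}} into a header plus one row per gene."""
    samples = list(per_sample)
    genes = sorted(set().union(*(c.keys() for c in per_sample.values())))
    # assumption: a gene absent from a sample's file has zero hits
    rows = [[g] + [per_sample[s].get(g, 0) for s in samples] for g in genes]
    return ["gene"] + samples, rows


# Made-up example data for two samples
sample_a = "geneA\t10\ngeneB\t0\n__no_feature\t5\n"
sample_b = "geneA\t7\ngeneC\t3\n"

header, rows = merge_counts({
    "sample_a": parse_counts(sample_a),
    "sample_b": parse_counts(sample_b),
})
for row in [header] + rows:
    print("\t".join(str(x) for x in row))
```

The resulting table has one row per gene and one column per sample, which is the shape of raw integer count table that DESeq takes as input.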
My question is about step 2:
What is the recommended gene model to use when counting hits-per-gene?
RefSeq-Genes, UCSC Known Genes, Ensembl Genes and others come to mind,
but those usually contain multiple transcripts per gene as different
records - would that skew the DESeq results?
(Note that I'm interested in gene-level differential expression, not
worried about isoform-level differential expression).
I've read previous discussions about transcript vs. gene level [1] and
exon level considerations [2] but perhaps I've missed the bottom line:
Is it OK to have multiple isoforms per gene (treating each transcript
as a "gene record", which will result in some double-counting of reads),
or do I need to pre-process the gene model file to ensure there are
no overlaps (e.g. by merging all isoforms of a single gene)?
Or is some post-processing of the DESeq results (from
nbinomTest()) needed to "normalize" genes with multiple isoforms?
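To illustrate the two options, here is a small sketch contrasting per-transcript counting with a merged ("union") gene model. It is a toy under stated assumptions: intervals are half-open `(start, end)`, each read is reduced to a single position, and the gene, transcripts, and reads are made up. The helper names `merge_intervals`, `union_gene_model`, and `count_hits` are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def union_gene_model(transcripts):
    """Collapse (gene_id, transcript_id, exons) records into one record per gene."""
    by_gene = {}
    for gene_id, _transcript_id, exons in transcripts:
        by_gene.setdefault(gene_id, []).extend(exons)
    return {gene: merge_intervals(exons) for gene, exons in by_gene.items()}


def count_hits(read_positions, records):
    """Count reads whose position falls inside any interval of each record."""
    return {
        name: sum(any(s <= p < e for s, e in ivs) for p in read_positions)
        for name, ivs in records.items()
    }


# Made-up gene with two isoforms sharing the second exon
transcripts = [
    ("geneA", "tx1", [(100, 200), (300, 400)]),
    ("geneA", "tx2", [(150, 250), (300, 400)]),
]
reads = [120, 180, 320, 500]

per_transcript = count_hits(reads, {tx: ex for _g, tx, ex in transcripts})
per_gene = count_hits(reads, union_gene_model(transcripts))
print(per_transcript)  # each transcript treated as its own "gene record"
print(per_gene)        # isoforms merged before counting
```

In this toy case the three reads inside geneA yield five hits in total when each transcript is its own record, but three hits with the merged model, which is the double-counting I am asking about.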
Any suggestions and comments will be appreciated (or corrections, if
something above is wrong).
Thanks,
-gordon
[1] http://article.gmane.org/gmane.science.biology.informatics.conductor/38805/
[2] http://article.gmane.org/gmane.science.biology.informatics.conductor/38915/