I'm running DESeq and edgeR on RNA-Seq data that was already processed by
RSEM (downloaded from the TCGA web site).
Since these methods require raw read counts, I'm using the raw_count
column of the RSEM output, but I'm not sure this is the right thing to do (is
it the actual raw count required?).
Here's an example of the RSEM output file downloaded from TCGA:
I am not familiar with the RSEM software, but you have non-integer values in the raw_count column, for example 31.95 and 258.35. Non-integer values are not appropriate for DESeq or edgeR analysis. From a quick Google search, it seems that these raw counts involve assigning fractions of ambiguously mapped reads (but you should check with the RSEM developers). If you don't have access to any lower-level data, the next best option is to round the raw_count values and proceed.
Update (12/13/15): See Simon's response below. Also, after looking into the RSEM method, I've come around and now recommend using rounded estimated gene-level counts from RSEM as input to DESeq2.
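To make the rounding step concrete, here is a minimal sketch in plain Python. The gene IDs and the two-column tab-separated layout are assumptions made for illustration; only the values 31.95 and 258.35 come from the post. A real TCGA file may have more columns.

```python
import csv
import io

# Hypothetical excerpt of an RSEM gene-level file. The gene IDs and the
# column layout are assumptions; 31.95 and 258.35 are the values quoted above.
RSEM_TEXT = "gene_id\traw_count\nGENE_A\t31.95\nGENE_B\t258.35\nGENE_C\t0.00\n"


def rounded_counts(text):
    """Return {gene_id: integer count} from tab-separated RSEM-style text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    # round() with a single argument returns an int; exact ties (x.5) go to
    # the nearest even integer.
    return {row["gene_id"]: round(float(row["raw_count"])) for row in reader}


counts = rounded_counts(RSEM_TEXT)
print(counts)
```

The resulting integers are what would then be assembled into the genes-by-samples count matrix that the count-based tools expect.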
On 20/03/13 14:15, dvir.tau at gmail.com wrote:
> I'm running DESeq and EdgeR on RNA-Seq data that was already processed by
> RSEM (downloaded from TCGA web site).
> Since these methods require the raw read counts I'm using the raw_count
> column of the RSEM output but I'm not sure this is the right thing to do (is
> it the actual raw count required ?)
The real issue is not that your counts are not integer, but that RSEM
gives you counts per isoform rather than per gene. Now, if you have
very similar isoforms, RSEM will be unable to decide which isoform to
assign a read to and just spread them proportionally over both. Hence,
even if only one of the two isoforms is differentially expressed, you
will incorrectly see differential expression for both isoforms.
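A toy calculation may help show why this matters. It is a deliberately simplified model, not RSEM's actual estimation procedure: it assumes the two isoforms are so similar that every read is ambiguous, and that the reads are split in a fixed proportion (the hypothetical `share_iso1` parameter).

```python
def spread_ambiguous(true_iso1, true_iso2, share_iso1=0.5):
    """Split the pooled reads of two indistinguishable isoforms.

    Toy assumption: the quantifier cannot tell the isoforms apart, so it
    pools their reads and gives a fixed share to each one.
    """
    total = true_iso1 + true_iso2
    return total * share_iso1, total * (1 - share_iso1)


# True expression: only isoform 1 changes (100 -> 200); isoform 2 stays at 100.
control = spread_ambiguous(100, 100)
treated = spread_ambiguous(200, 100)

# Apparent fold change per isoform after the proportional split.
fold_changes = (treated[0] / control[0], treated[1] / control[1])
print(fold_changes)
```

Under these assumptions both isoforms show the same apparent fold change, although the true fold changes are 2.0 for the first isoform and 1.0 for the second.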
This is why the output of isoform quantification methods such as RSEM or
MMSeq is not suitable as input for differential expression tests.
At the very minimum, you also need the information about the uncertainty
of the assignments of reads to isoforms. In fact, RSEM provides this
information if you run it in its Bayesian mode, but this seems to be
hardly ever done in practice.
If you really need to perform differential expression analysis on a
level finer than whole gene expression, you should either use a tool for
differential exon usage testing, such as our DEXSeq package, or one that
combines isoform abundance estimation and testing for differences in a
unified framework, such as BitSeq. In both cases, you will need the
If you are fine with staying on the gene level for your analysis, you
need to get counts per gene, not per isoform. I am not familiar enough
with RSEM, though, to tell you whether adding up the counts from all
isoforms per gene would be a good idea.
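For reference, the mechanics of adding up isoform-level counts per gene are simple. The sketch below does not answer whether this is statistically sound for RSEM output, which is the open question above. The isoform IDs, gene IDs, and counts are all hypothetical.

```python
from collections import defaultdict

# Hypothetical (isoform_id, gene_id, estimated_count) records; none of these
# values come from a real RSEM run.
ISOFORM_ROWS = [
    ("iso1", "GENE_A", 20.25),
    ("iso2", "GENE_A", 11.75),
    ("iso3", "GENE_B", 258.5),
]


def gene_level_counts(rows):
    """Sum the estimated counts of all isoforms that belong to the same gene."""
    totals = defaultdict(float)
    for _isoform_id, gene_id, count in rows:
        totals[gene_id] += count
    return dict(totals)


gene_counts = gene_level_counts(ISOFORM_ROWS)
print(gene_counts)
```

The per-gene sums can still be non-integer, so they would need rounding before being passed to a count-based method.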