Hi Simon,
I totally agree with your statements on the importance of generating
appropriate read counting tables. In my email I should have used as an
example "sum of coverage" per feature rather than mean coverage since
I
don't want to direct the discussion into an area that has been
discussed
many times. Is it really so unreasonable to use this type of discrete
raw expression values (sum of cov/feature) instead of read counts, and
if so why?
Thomas
On Tue, May 21, 2013 at 07:01:39PM +0000, Simon Anders wrote:
> Hi Thomas
>
> On 21/05/13 17:49, Thomas Girke wrote:
> > Because of these complications, I am sometimes wondering whether
one
> > couldn't use for many RNA-Seq use cases coverage values (e.g. mean
> > coverage) as raw expression measure instead of read counts. Has
anyone
> > systematically tested whether this would be a suitable approach
for the
> > downstream DEG analysis? Right now everyone believes RNA-Seq
analysis
> > requires read counting, but honestly I don't see why that is.
Perhaps
> > the benefits of this are so minor that it is not worth dealing
with a
> > different type of expression measure.
>
> The "complications" I had in mind apply to mean coverage as well as
to
> reads.
>
> Actually, at least in my personal opinion, it is quite clear which
rules
> one should stick to when obtaining the count table, and therefore,
> obtaining a correct count table is not at all complicated if one
thinks
> carefully enough about it.
>
> The main issue is this: Imagine, a number of reads can be mapped to
two
> distinct genes, either because the genes overlap, or, more commonly,
> because the genes share repetitive sequence. If you count your reads
for
> both genes then both genes will appear differentially expressed even
if
> only one gene actually is. Hence, one must discard reads that
cannot be
> unambiguously assigned to one gene. Of course, genes which lose
reads in
> this manner will have to low expression estimates, i.e., you incur a
> negative bias. However, this bias cancels out when comparing the
same
> gene across samples, i.e., it is not a reason for concern in
> differential expression testing.
>
> Consequently, a method which aims at getting _unbiased_ point
estimates
> of expression strength is typically unsuitable as input for strength
for
> differential expression testing. The point that counting for
expression
> estimation and for DE testing are different tasks is subtle and
often
> overlooked. Mistakes arising from this can cause strange effects.
>
> A particularly common mistake is to map the reads to transcripts,
not
> genes, or to count overlaps not with annotated genes but with
annotated
> transcripts. Of course, most reads will map to several transcripts
> because most transcript isoforms of a given gene overlap heavily. If
one
> counts reads mapping to multiple transcripts for each of these, one
gets
> severe artifacts in downstream analysis, and if one discards
multiply
> mapping reads one loses most of the genes.
>
> I do not know what the original poster meant when she said that she
> counted for UCSC transcripts (rather than genes) but if she took
care to
> avoid the common pitfalls just described, she must have employed a
quite
> sophisticated construction. This is why my first question was how
she
> obtained the counts.
>
> > Right now everyone believes RNA-Seq analysis
> > requires read counting, but honestly I don't see why that is.
>
> Just to clarify this: I don't think that this is universally
believed.
> It is only that a method is typically designed for a specific type
of
> data, and several commonly used Bioconductor packages for RNA-Seq
> analysis, namely edgeR, DESeq, BaySeq, DSS, expect count data,
because
> the specific statistical methods used in these packages are build on
the
> assumption that they get read counts. Obviously, they will not give
> correct results if they are given any other kind of data. This is
stated
> quite clearly in the vignettes, but nevertheless people keep asking
> whether they can supply arbitrary other kind of data such as rounded
> coverage values. And this insistence in using methods in a manner
> clearly violating instructions is, quite frankly, a bit frustrating.
So
> when you observed that people like me seem to be quite insistent on
the
> importance of using counts this may have been with respect to these
very
> common question, which are of course specific to certain methods.
>
> Simon