Dear all,
I use edgeR for differential expression analysis on an RNA-seq dataset,
but I have found that edgeR is very sensitive to outlier samples. For
example, suppose that for one gene the overall expression pattern is
similar between the control group and the experimental group, but a
single sample behaves very differently from the others. That gene is
then very likely to be falsely detected as differentially expressed.
Can anyone please tell me whether the algorithm has an option that
controls the impact of outliers?
I'm thinking of using the median read count instead of the mean read
count to fit the NB distribution and to estimate the dispersions. Is
there an option for this in edgeR? Or is there any other RNA-seq DE
analysis package that is less sensitive to outliers?
The outlying sample may differ from gene to gene, so we can't simply
remove a whole sample from the analysis.
Yuan
Dear Yuan
On 2012-01-27 06:19, Yuan Tian wrote:
> I use edgeR for differential expression analysis on a RNAseq dataset.
> But I found that edgeR is very sensitive to outlier samples. For
> example, for one gene, overall the expression pattern is similar
> between control group and experimental group, but there is one single
> sample which behaves very differently from the others, then this gene
> is very likely to be falsely detected as differentially expressed. So
[...]
> Or is there any other RNAseq DE analysis package which is less
> sensitive to outliers?
You may want to give our DESeq package a try, as we made some major
changes in the previous release to address precisely this issue.
Simon
hi,
On Thu, 2012-01-26 at 21:19 -0800, Yuan Tian wrote:
> Dear all,
>
> I use edgeR for differential expression analysis on a RNAseq dataset. But I found
> that edgeR is very sensitive to outlier samples. For example, for one gene, overall
> the expression pattern is similar between control group and experimental group, but
> there is one single sample which behaves very differently from the others, then this
> gene is very likely to be falsely detected as differentially expressed. So can anyone
> please tell me if there's any option in the algorithm that can control the outlier impact?
>
> I'm thinking to use median read count value instead of mean read count value to fit the
> NB distribution, and to estimate the dispersions. Just wondering if there's an option
> available in edgeR? Or is there any other RNAseq DE analysis package which is less
> sensitive to outliers?
i think what you're referring to is illustrated in figure 2 of the
vignette of the tweeDEseq package, whose underlying statistical model
can address this kind of situation.
> The outlier sample might be different when you look at different genes, so we can't take
> the whole sample out in the analysis.
there might be a number of reasons why "outlier" count values show up,
but a sensible one is just biological variability (Hansen et al., Nat.
Biotech., 29:572-573, 2011, doi:10.1038/nbt.1910). thus not only can
you not take the sample out, but the count value in that sample might
be true biology. if your experimental conditions involve a lot of
biological variability, you may need to work with more biological
replicates.
cheers,
robert.
Dear Yuan,
The edgeR empirical-Bayes algorithm is actually somewhat resistant to
outliers, because it allows for gene-specific variability, unlike
algorithms that treat the variance as a function of the mean. However,
the edgeR algorithm is designed to deal more with routine gene-specific
biological variation than with true outliers. We prefer to detect and
remove true outliers in the data checking steps rather than accommodate
them as part of the dispersion estimation algorithm.
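For reference, the distinction drawn above can be sketched with the negative binomial variance. The symbols are notation assumed for this illustration and do not come from the message itself: y_gi is the read count for gene g in library i, mu_gi is its mean, and phi_g is the dispersion for gene g.

```latex
\operatorname{Var}(y_{gi}) = \mu_{gi} + \phi_g \, \mu_{gi}^{2}
```

With a gene-specific (tagwise) phi_g, a highly variable gene can carry a larger variance than other genes of the same abundance. A model in which the dispersion is purely a function of the mean assigns every gene of a given abundance the same variance.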
Let me say that the second paragraph of your email is hard to
understand, because you cannot give either median or mean counts to
edgeR. You must give actual read counts. I wonder whether your problems
are caused by inputting inappropriate data into edgeR?
edgeR has a great many options, and it would certainly help in writing
a response to know which ones you are using already. For the purposes
of this email, I am going to assume that you are doing a valid analysis
using true read counts, and that you have used either
estimateTagwiseDisp() or estimateGLMTagwiseDisp() with default
settings.
If you do have some substantial outliers, here are some options:
1. First, filter genes before analysis as in the edgeR User's Guide
case studies that deal with RNA-Seq data. Suppose that you have 4
control libraries and 4 experimental: then keep genes only if they
satisfy a minimum count-per-million (cpm>1 say) in at least four
samples. This eliminates genes affected by RNA-Seq artifacts, that is,
genes whose counts are zero except in one or two samples.
2. Plot trended and tagwise dispersion estimates against abundance to
look for outliers.
3. Test for outliers using the gof() function.
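4. Reduce the prior.n setting to a smaller value, so that each gene's
dispersion estimate relies more on that gene's own data and less on the
shared estimate.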
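As a rough illustration of the filtering rule in option 1, here is a minimal sketch in plain Python. It is not edgeR code: the function names, the dictionary layout and the toy counts are all assumptions made for this example, and the CPM here uses the raw library sizes with no normalization factors.

```python
def counts_per_million(counts):
    """Convert raw read counts to counts-per-million (CPM).

    `counts` maps gene name -> list of raw counts, one per library.
    """
    n_samples = len(next(iter(counts.values())))
    # Library size = total number of reads in each sample (column sum).
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [1e6 * c / lib_sizes[j] for j, c in enumerate(row)]
        for gene, row in counts.items()
    }


def filter_genes(counts, min_cpm=1.0, min_samples=4):
    """Keep genes whose CPM exceeds `min_cpm` in at least `min_samples` libraries."""
    cpm = counts_per_million(counts)
    keep = [
        gene for gene, values in cpm.items()
        if sum(v > min_cpm for v in values) >= min_samples
    ]
    return {gene: counts[gene] for gene in keep}


# Hypothetical toy data: 4 control libraries followed by 4 experimental ones.
toy_counts = {
    "geneA": [120, 95, 130, 110, 240, 260, 225, 250],  # expressed everywhere
    "geneB": [0, 0, 0, 0, 0, 0, 900, 0],               # zero except one sample
    "geneC": [0, 0, 0, 0, 35, 40, 28, 33],             # expressed in one group only
}
filtered = filter_genes(toy_counts)
print(sorted(filtered))
```

Setting the threshold to the size of the smaller group (four here) retains a gene such as geneC that is expressed in only one condition, while a gene such as geneB, whose only non-zero count is a single spike, is dropped before any dispersion is estimated.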
If none of this solves your problems, you might try the voom() function
in the limma package instead. (See the limma User's Guide.) This
approach is more flexible in adapting automatically to gene-specific
variability in RNA-Seq data than the edgeR algorithm, and has proved
successful on some high-variability datasets.
Best wishes
Gordon
> Date: Thu, 26 Jan 2012 21:19:55 -0800
> From: Yuan Tian <ytianidyll at ucla.edu>
> To: Bioconductor mailing list <bioconductor at r-project.org>
> Subject: [BioC] how edgeR control the outliers?