Dear list,
I have a DGElist object in edgeR, already processed with
calcNormFactors, estimateCommonDispersion and
estimateTagWiseDispersion. Now, I would like to identify tagwise
outliers in my data, I thought I could estimate NB distribution for
each tag. Given that a NB is defined by two parameters (r and p), I
assume that r = 1/x$tagwise.dispersion, how can I get tagwise p from
DGEList dataframe?
Thanks
d

Dear Davide,
I'm not sure what you meant by "identify tagwise outliers".
Are you interested in finding genes having outlier tagwise
dispersions?
If so, you can use the function estimateDisp() with robust=TRUE.
The detected outlier genes will be given smaller prior.df in the
output.
If you use r and p to parameterize the NB distribution, then r=1/phi
and
p=(phi*mu)/(1+phi*mu), where phi is the dispersion and mu is the mean.
The values of mu can be obtained from the fitted values in glmFit().
Best wishes,
Yunshun
------------------------------
Message: 23
Date: Wed, 19 Mar 2014 09:33:45 +0100
From: Davide Cittaro <cittaro.davide@hsr.it>
To: "bioconductor at r-project.org list" <bioconductor at="" r-project.org="">
Subject: [BioC] tagwise parameters for negative binomial distribution
in edgeR
Message-ID: <b6a4308d-647d-4239-b176-1eb4c7e12fc2 at="" hsr.it="">
Content-Type: text/plain; charset=us-ascii
Dear list,
I have a DGElist object in edgeR, already processed with
calcNormFactors,
estimateCommonDispersion and estimateTagWiseDispersion. Now, I would
like to
identify tagwise outliers in my data, I thought I could estimate NB
distribution for each tag. Given that a NB is defined by two
parameters (r
and p), I assume that r = 1/x$tagwise.dispersion, how can I get
tagwise p
from DGEList dataframe?
Thanks
d
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:4}}

Dear Davide,
Do you want to identify tags (genes) with dispersion values that are
so
high (relative to other genes with similar count sizes) that they
should
be considered outliers?
The easiest way to do this is to use
d <- estimateDisp(d, design, robust=TRUE)
and then look at the output values for prior.df:
summary(d$prior.df)
Any tag with a small prior.df is considered an outlier. You can sort
tags
by their prior.df values to select the most significant outliers.
Note that the methodology used by the estimateDisp() robust procedure
is
more complicated than simply using NB probabilities, because one has
to
take into acccount the genome-wide distribution of the dispersion
values
as well as accounting for the fact that the fitted values (p) have
been
estimated from the same data. The methodology is mostly explained in:
http://www.statsci.org/smyth/pubs/edgeRChapterPreprint.pdfhttp://www.statsci.org/smyth/pubs/RobustEBayesPreprint.pdf
Best wishes
Gordon
> From: Davide Cittaro <cittaro.davide at="" hsr.it="">
> To: "bioconductor at r-project.org list" <bioconductor at="" r-project.org="">
> Subject: [BioC] tagwise parameters for negative binomial
distribution
> in edgeR
>
> Dear list,
> I have a DGElist object in edgeR, already processed with
> calcNormFactors, estimateCommonDispersion and
estimateTagWiseDispersion.
> Now, I would like to identify tagwise outliers in my data, I thought
I
> could estimate NB distribution for each tag. Given that a NB is
defined
> by two parameters (r and p), I assume that r =
1/x$tagwise.dispersion,
> how can I get tagwise p from DGEList dataframe?
> Thanks
>
> d
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:4}}

Hi Davide,
Just to add another option, there is also estimateGLMRobustDisp(),
which is a wrapper for an iteratively re-weighted scheme that down
weights outliers -- effectively using a constant prior degrees of
freedom. This was developed independently of estimateDisp() from a
different perspective. But, what this gives is a matrix of weights
(so, one for each observation, not just for each tag) where identified
outliers should exhibit lower weights. So, you could use this to
identify outliers observation-wise or tag-wise (e.g. take column sum
of weights). You'd want >= 3 replicates per condition for this one
though.
In code, you could do something like:
d <- estimateGLMRobustDisp(d, design)
summary(d$weights)
More details can be found at:
http://arxiv.org/abs/1312.3382http://imlspenticton.uzh.ch/robinson_lab/edgeR_robust/supplement.pdf
(in particular, Supplementary Figure 8, which shows ROC curves for
ability to separate [simulated] outliers by weights/residuals and yet
another option is DESeq2's Cook's [observation-wise or max-by-tag]
distance; we don't have a curve for estimateDisp!)
Best regards, Mark
----------
Prof. Dr. Mark Robinson
Bioinformatics, Institute of Molecular Life Sciences
University of Zurich
http://ow.ly/riRea
On 20.03.2014, at 01:04, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote:
> Dear Davide,
>
> Do you want to identify tags (genes) with dispersion values that are
so high (relative to other genes with similar count sizes) that they
should be considered outliers?
>
> The easiest way to do this is to use
>
> d <- estimateDisp(d, design, robust=TRUE)
>
> and then look at the output values for prior.df:
>
> summary(d$prior.df)
>
> Any tag with a small prior.df is considered an outlier. You can
sort tags by their prior.df values to select the most significant
outliers.
>
> Note that the methodology used by the estimateDisp() robust
procedure is more complicated than simply using NB probabilities,
because one has to take into acccount the genome-wide distribution of
the dispersion values as well as accounting for the fact that the
fitted values (p) have been estimated from the same data. The
methodology is mostly explained in:
>
> http://www.statsci.org/smyth/pubs/edgeRChapterPreprint.pdf
> http://www.statsci.org/smyth/pubs/RobustEBayesPreprint.pdf
>
> Best wishes
> Gordon
>
>> From: Davide Cittaro <cittaro.davide at="" hsr.it="">
>> To: "bioconductor at r-project.org list" <bioconductor at="" r-project.org="">
>> Subject: [BioC] tagwise parameters for negative binomial
distribution
>> in edgeR
>>
>> Dear list,
>
>> I have a DGElist object in edgeR, already processed with
calcNormFactors, estimateCommonDispersion and
estimateTagWiseDispersion. Now, I would like to identify tagwise
outliers in my data, I thought I could estimate NB distribution for
each tag. Given that a NB is defined by two parameters (r and p), I
assume that r = 1/x$tagwise.dispersion, how can I get tagwise p from
DGEList dataframe?
>
>> Thanks
>>
>> d
>
>
______________________________________________________________________
> The information in this email is confidential and
intend...{{dropped:4}}
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor

Hi Mark,
Thanks for the comment.
On 20/mar/2014, at 14:58, Mark Robinson <mark.robinson at="" imls.uzh.ch="">
wrote:
> You'd want >= 3 replicates per condition for this one though.
>
Unfortunately my dataset is made of >200 unique samples :-( What
should I expect (except for a ton of warnings...)?
I'm going to read everything you (and Gordon) posted.
Thanks
d
> In code, you could do something like:
>
> d <- estimateGLMRobustDisp(d, design)
> summary(d$weights)
>
> More details can be found at:
>
> http://arxiv.org/abs/1312.3382
>
> http://imlspenticton.uzh.ch/robinson_lab/edgeR_robust/supplement.pdf
> (in particular, Supplementary Figure 8, which shows ROC curves for
ability to separate [simulated] outliers by weights/residuals and yet
another option is DESeq2's Cook's [observation-wise or max-by-tag]
distance; we don't have a curve for estimateDisp!)
>
>
> Best regards, Mark
>
>
> ----------
> Prof. Dr. Mark Robinson
> Bioinformatics, Institute of Molecular Life Sciences
> University of Zurich
> http://ow.ly/riRea
>
>
>
>
>
>
> On 20.03.2014, at 01:04, Gordon K Smyth <smyth at="" wehi.edu.au="">
wrote:
>
>> Dear Davide,
>>
>> Do you want to identify tags (genes) with dispersion values that
are so high (relative to other genes with similar count sizes) that
they should be considered outliers?
>>
>> The easiest way to do this is to use
>>
>> d <- estimateDisp(d, design, robust=TRUE)
>>
>> and then look at the output values for prior.df:
>>
>> summary(d$prior.df)
>>
>> Any tag with a small prior.df is considered an outlier. You can
sort tags by their prior.df values to select the most significant
outliers.
>>
>> Note that the methodology used by the estimateDisp() robust
procedure is more complicated than simply using NB probabilities,
because one has to take into acccount the genome-wide distribution of
the dispersion values as well as accounting for the fact that the
fitted values (p) have been estimated from the same data. The
methodology is mostly explained in:
>>
>> http://www.statsci.org/smyth/pubs/edgeRChapterPreprint.pdf
>> http://www.statsci.org/smyth/pubs/RobustEBayesPreprint.pdf
>>
>> Best wishes
>> Gordon
>>
>>> From: Davide Cittaro <cittaro.davide at="" hsr.it="">
>>> To: "bioconductor at r-project.org list" <bioconductor at="" r-project.org="">
>>> Subject: [BioC] tagwise parameters for negative binomial
distribution
>>> in edgeR
>>>
>>> Dear list,
>>
>>> I have a DGElist object in edgeR, already processed with
calcNormFactors, estimateCommonDispersion and
estimateTagWiseDispersion. Now, I would like to identify tagwise
outliers in my data, I thought I could estimate NB distribution for
each tag. Given that a NB is defined by two parameters (r and p), I
assume that r = 1/x$tagwise.dispersion, how can I get tagwise p from
DGEList dataframe?
>>
>>> Thanks
>>>
>>> d
>>
>>
______________________________________________________________________
>> The information in this email is confidential and
intend...{{dropped:4}}
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor

Hi Mark,
how would the presence of observation outliers potentially causing
dispersion outliers be handled in voom?
Many thanks, Ina
----- Original Message -----
From: "Mark Robinson" <mark.robinson@imls.uzh.ch>
To: "Davide Cittaro" <cittaro.davide at="" hsr.it="">
Cc: "Bioconductor mailing list" <bioconductor at="" r-project.org="">,
"xiaobei.zhou at uzh.ch Zhou" <xiaobei.zhou at="" uzh.ch="">, "Gordon Smyth"
<smyth at="" wehi.edu.au="">
Sent: Thursday, March 20, 2014 9:58:05 AM
Subject: Re: [BioC] tagwise parameters for negative binomial
distribution in edgeR
Hi Davide,
Just to add another option, there is also estimateGLMRobustDisp(),
which is a wrapper for an iteratively re-weighted scheme that down
weights outliers -- effectively using a constant prior degrees of
freedom. This was developed independently of estimateDisp() from a
different perspective. But, what this gives is a matrix of weights
(so, one for each observation, not just for each tag) where identified
outliers should exhibit lower weights. So, you could use this to
identify outliers observation-wise or tag-wise (e.g. take column sum
of weights). You'd want >= 3 replicates per condition for this one
though.
In code, you could do something like:
d <- estimateGLMRobustDisp(d, design)
summary(d$weights)
More details can be found at:
http://arxiv.org/abs/1312.3382http://imlspenticton.uzh.ch/robinson_lab/edgeR_robust/supplement.pdf
(in particular, Supplementary Figure 8, which shows ROC curves for
ability to separate [simulated] outliers by weights/residuals and yet
another option is DESeq2's Cook's [observation-wise or max-by-tag]
distance; we don't have a curve for estimateDisp!)
Best regards, Mark
----------
Prof. Dr. Mark Robinson
Bioinformatics, Institute of Molecular Life Sciences
University of Zurich
http://ow.ly/riRea
On 20.03.2014, at 01:04, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote:
> Dear Davide,
>
> Do you want to identify tags (genes) with dispersion values that are
so high (relative to other genes with similar count sizes) that they
should be considered outliers?
>
> The easiest way to do this is to use
>
> d <- estimateDisp(d, design, robust=TRUE)
>
> and then look at the output values for prior.df:
>
> summary(d$prior.df)
>
> Any tag with a small prior.df is considered an outlier. You can
sort tags by their prior.df values to select the most significant
outliers.
>
> Note that the methodology used by the estimateDisp() robust
procedure is more complicated than simply using NB probabilities,
because one has to take into acccount the genome-wide distribution of
the dispersion values as well as accounting for the fact that the
fitted values (p) have been estimated from the same data. The
methodology is mostly explained in:
>
> http://www.statsci.org/smyth/pubs/edgeRChapterPreprint.pdf
> http://www.statsci.org/smyth/pubs/RobustEBayesPreprint.pdf
>
> Best wishes
> Gordon
>
>> From: Davide Cittaro <cittaro.davide at="" hsr.it="">
>> To: "bioconductor at r-project.org list" <bioconductor at="" r-project.org="">
>> Subject: [BioC] tagwise parameters for negative binomial
distribution
>> in edgeR
>>
>> Dear list,
>
>> I have a DGElist object in edgeR, already processed with
calcNormFactors, estimateCommonDispersion and
estimateTagWiseDispersion. Now, I would like to identify tagwise
outliers in my data, I thought I could estimate NB distribution for
each tag. Given that a NB is defined by two parameters (r and p), I
assume that r = 1/x$tagwise.dispersion, how can I get tagwise p from
DGEList dataframe?
>
>> Thanks
>>
>> d
>
>
______________________________________________________________________
> The information in this email is confidential and
intend...{{dropped:4}}
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor

Hi Ina,
I don't think voom has any special consideration for observation
outliers, but limma's 'eBayes' function has a 'robust' argument which
I
believe has the same effect as the corresponding argument in edgeR's
'estimateDisp', i.e. dealing with outlier tags that have abnormally
high (or low) variance.
-Ryan
On Fri 21 Mar 2014 08:48:23 AM PDT, Ina Hoeschele wrote:
> Hi Mark,
> how would the presence of observation outliers potentially
causing dispersion outliers be handled in voom?
> Many thanks, Ina
>
>
> ----- Original Message -----
> From: "Mark Robinson" <mark.robinson at="" imls.uzh.ch="">
> To: "Davide Cittaro" <cittaro.davide at="" hsr.it="">
> Cc: "Bioconductor mailing list" <bioconductor at="" r-project.org="">,
"xiaobei.zhou at uzh.ch Zhou" <xiaobei.zhou at="" uzh.ch="">, "Gordon Smyth"
<smyth at="" wehi.edu.au="">
> Sent: Thursday, March 20, 2014 9:58:05 AM
> Subject: Re: [BioC] tagwise parameters for negative binomial
distribution in edgeR
>
> Hi Davide,
>
> Just to add another option, there is also estimateGLMRobustDisp(),
which is a wrapper for an iteratively re-weighted scheme that down
weights outliers -- effectively using a constant prior degrees of
freedom. This was developed independently of estimateDisp() from a
different perspective. But, what this gives is a matrix of weights
(so, one for each observation, not just for each tag) where identified
outliers should exhibit lower weights. So, you could use this to
identify outliers observation-wise or tag-wise (e.g. take column sum
of weights). You'd want >= 3 replicates per condition for this one
though.
>
> In code, you could do something like:
>
> d <- estimateGLMRobustDisp(d, design)
> summary(d$weights)
>
> More details can be found at:
>
> http://arxiv.org/abs/1312.3382
>
> http://imlspenticton.uzh.ch/robinson_lab/edgeR_robust/supplement.pdf
> (in particular, Supplementary Figure 8, which shows ROC curves for
ability to separate [simulated] outliers by weights/residuals and yet
another option is DESeq2's Cook's [observation-wise or max-by-tag]
distance; we don't have a curve for estimateDisp!)
>
>
> Best regards, Mark
>
>
> ----------
> Prof. Dr. Mark Robinson
> Bioinformatics, Institute of Molecular Life Sciences
> University of Zurich
> http://ow.ly/riRea
>
>
>
>
>
>
> On 20.03.2014, at 01:04, Gordon K Smyth <smyth at="" wehi.edu.au="">
wrote:
>
>> Dear Davide,
>>
>> Do you want to identify tags (genes) with dispersion values that
are so high (relative to other genes with similar count sizes) that
they should be considered outliers?
>>
>> The easiest way to do this is to use
>>
>> d <- estimateDisp(d, design, robust=TRUE)
>>
>> and then look at the output values for prior.df:
>>
>> summary(d$prior.df)
>>
>> Any tag with a small prior.df is considered an outlier. You can
sort tags by their prior.df values to select the most significant
outliers.
>>
>> Note that the methodology used by the estimateDisp() robust
procedure is more complicated than simply using NB probabilities,
because one has to take into acccount the genome-wide distribution of
the dispersion values as well as accounting for the fact that the
fitted values (p) have been estimated from the same data. The
methodology is mostly explained in:
>>
>> http://www.statsci.org/smyth/pubs/edgeRChapterPreprint.pdf
>> http://www.statsci.org/smyth/pubs/RobustEBayesPreprint.pdf
>>
>> Best wishes
>> Gordon
>>
>>> From: Davide Cittaro <cittaro.davide at="" hsr.it="">
>>> To: "bioconductor at r-project.org list" <bioconductor at="" r-project.org="">
>>> Subject: [BioC] tagwise parameters for negative binomial
distribution
>>> in edgeR
>>>
>>> Dear list,
>>
>>> I have a DGElist object in edgeR, already processed with
calcNormFactors, estimateCommonDispersion and
estimateTagWiseDispersion. Now, I would like to identify tagwise
outliers in my data, I thought I could estimate NB distribution for
each tag. Given that a NB is defined by two parameters (r and p), I
assume that r = 1/x$tagwise.dispersion, how can I get tagwise p from
DGEList dataframe?
>>
>>> Thanks
>>>
>>> d
>>
>>
______________________________________________________________________
>> The information in this email is confidential and
intend...{{dropped:4}}
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor

Dear Gordon,
thanks for the answer.
On 20/mar/2014, at 01:04, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote:
> Dear Davide,
>
> Do you want to identify tags (genes) with dispersion values that are
so
> high (relative to other genes with similar count sizes) that they
should
> be considered outliers?
Mmm, actually I would like to identify the sample that is an outlier
for a specific gene, that's why I thought I could focus on tagwise
distribution.
>
> The easiest way to do this is to use
>
> d <- estimateDisp(d, design, robust=TRUE)
>
> and then look at the output values for prior.df:
>
> summary(d$prior.df)
>
> Any tag with a small prior.df is considered an outlier. You can
sort tags
> by their prior.df values to select the most significant outliers.
Does this identify a tag that is an outlier over all samples?
>
> Note that the methodology used by the estimateDisp() robust
procedure is
> more complicated than simply using NB probabilities, because one has
to
> take into acccount the genome-wide distribution of the dispersion
values
> as well as accounting for the fact that the fitted values (p) have
been
> estimated from the same data. The methodology is mostly explained
in:
>
> http://www.statsci.org/smyth/pubs/edgeRChapterPreprint.pdf
> http://www.statsci.org/smyth/pubs/RobustEBayesPreprint.pdf
>
I have a lot to read :-)
Thanks
d
> Best wishes
> Gordon
>
>> From: Davide Cittaro <cittaro.davide at="" hsr.it="">
>> To: "bioconductor at r-project.org list" <bioconductor at="" r-project.org="">
>> Subject: [BioC] tagwise parameters for negative binomial
distribution
>> in edgeR
>>
>> Dear list,
>
>> I have a DGElist object in edgeR, already processed with
>> calcNormFactors, estimateCommonDispersion and
estimateTagWiseDispersion.
>> Now, I would like to identify tagwise outliers in my data, I
thought I
>> could estimate NB distribution for each tag. Given that a NB is
defined
>> by two parameters (r and p), I assume that r =
1/x$tagwise.dispersion,
>> how can I get tagwise p from DGEList dataframe?
>
>> Thanks
>>
>> d
>
>
______________________________________________________________________
> The information in this email is confidential and
inte...{{dropped:6}}

On Thu, 20 Mar 2014, Davide Cittaro wrote:
> On 20/mar/2014, at 01:04, Gordon K Smyth <smyth at="" wehi.edu.au="">
wrote:
>
>> Do you want to identify tags (genes) with dispersion values that
are so
>> high (relative to other genes with similar count sizes) that they
should
>> be considered outliers?
>
> Mmm, actually I would like to identify the sample that is an outlier
for
> a specific gene, that's why I thought I could focus on tagwise
> distribution.
See Mark Robinson's post.
It depends on your purpose however. Do you want to downweight/ignore
outliers, or do you want to identify them because they are
interesting?
>> The easiest way to do this is to use
>>
>> d <- estimateDisp(d, design, robust=TRUE)
>>
>> and then look at the output values for prior.df:
>>
>> summary(d$prior.df)
>>
>> Any tag with a small prior.df is considered an outlier. You can
sort tags
>> by their prior.df values to select the most significant outliers.
>
> Does this identify a tag that is an outlier over all samples?
Basically yes. We distinguish dispersion outliers and observation
outliers. An observation outlier is an individual count that is an
outlier (relative to other counts for the same gene). A dispersion
outlier is a gene that shows much more variability between replicates
than
other genes at the same cpm level. A dispersion outlier may arise
from
one or more observation outliers, but not necessarily. It could also
arise from systematically larger variability.
Gordon
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:4}}

On 21/mar/2014, at 01:47, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote:
>>
>> Mmm, actually I would like to identify the sample that is an
outlier for
>> a specific gene, that's why I thought I could focus on tagwise
>> distribution.
>
> See Mark Robinson's post.
>
> It depends on your purpose however. Do you want to
downweight/ignore
> outliers, or do you want to identify them because they are
interesting?
In this case outliers may be relevant, especially the less
represented. I'm running the estimateGLMRobustDisp approach (although
it takes a loong time)
>>> Any tag with a small prior.df is considered an outlier. You can
sort tags
>>> by their prior.df values to select the most significant outliers.
>>
>> Does this identify a tag that is an outlier over all samples?
>
> Basically yes. We distinguish dispersion outliers and observation
> outliers. An observation outlier is an individual count that is an
> outlier (relative to other counts for the same gene). A dispersion
> outlier is a gene that shows much more variability between
replicates than
> other genes at the same cpm level. A dispersion outlier may arise
from
> one or more observation outliers, but not necessarily. It could
also
> arise from systematically larger variability.
Thanks for the explanation.
Best,
d