I was wondering if anyone is aware of a gene
set enrichment algorithm for RNA-Seq data that:
1) does not require a specification of differentially
expressed (DE) genes i.e.no need to use a hard
p-value threshold cutoff for determining the DE gene
list)
2) uses subject sampling instead of gene sampling
to obtain the p-value (i.e.this would maintain
gene-gene correlations)
Basically, I'm looking for a
self-contained/subject sampling method (e.g.
SAM-GS for microarray data) or a "hybrid" method
(e.g. GSEA for microarray data). The only gene set
enrichment algorithm that I am aware of for RNA-Seq
data is GOSeq, but it uses a competitive/gene
sampling method (i.e. Fisher's Exact Test).
Note, the ideas of self-contained vs competitive and
subject sampling vs gene sampling come from the
following paper: Goeman JJ, B?hlmann P.Analyzing
gene expression data in terms of gene sets:
methodological issues. Bioinformatics. 2007 Apr 15;23(8)
Something like GSEA-SNP is close to what I want.
It uses a test-statistic that is suitable for discrete data
and uses subject sampling to calculate the p-values.
Thanks,
Julie
Dear Julie,
Subject sampling (used by GSEA and other softwares) actually makes
quite
strong distributional assumptions, in that the units being permuted
are
treated as independent and exchangeable under the null hypothesis. In
particular, this assumes that the units are homoscedastic. However
RNA-Seq counts for different libraries can be of very different sizes,
and
hence will be heteroscedastic. And this will be so even under the
null
hypothesis when the sequencing depths are different.
This one of the reasons why I prefer parametric gene set methods
(roast,
camera, romer), because heteroscedasticity and dependence between the
samples can be unravelled.
Heteroscedasticity will be less a problem when the library sizes are
similar or if the counts are large. When the biological CV in the
data is
relatively large, the E-values from voom() for each gene become
roughly
homoscedastic relatively quickly at moderate counts sizes.
The idea of voom() is that the output values can be treated as
continuous.
This can never be perfect for low counts, but genes with all low
counts
are never going to be significantly DE anyway. In my experience,
getting
the mean-variance relationship correct is more important than the
exact
distibutional law or distinguishing between discrete and continuous.
Best wishes
Gordon
--------------- original message ----------------------
[BioC] gene set enrichment analysis of RNA-Seq data
Julie Leonard julie.leonard at syngenta.com
Fri Apr 13 22:52:27 CEST 2012
Thanks Gordon-
I'll definitely look into using camera. I have a naive question
though. If I use voom() to transform the RNA-Seq count data, does
this
transformation change the data from discrete to continuous? And if
so,
can I just use the E values in the EList object that is outputted from
voom() as input to GSEA (Broad version)? I would assume I wouldn't
need
the weights b/c GSEA does not assume a specific distribution (it is
empirically calculated via subject sampling), so I wouldn't need to
adjust
the data for heteroscedasticity to fit a "normal" or any other pre-set
distribution??
Thanks,
Julie
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:4}}
Dear Julie,
A good question. As far as I know, there is as yet no such method.
What
I am doing for this purpose for the time being is to use voom() in the
limma package to transform the RNA-Seq counts to a scale on which
microarray methods can be used, then using roast(). See page 104 of
the
limma User's Guide for examples of this:
http://bioconductor.org/packages/2.10/bioc/vignettes/limma/inst/doc/us
ersguide.pdf
Note that roast() is a self-contained gene set test with the ability
to
use linear models and weights:
http://www.ncbi.nlm.nih.gov/pubmed/20610611
Another gene set enrichment option that works fine with RNA-Seq data
is
camera(). This is a competitive test, but without the usual
disadvantage
of gene sampling in that it estimates and adjusts for inter-gene
correlation. camera() is currently setup to automatically use the
weights
that come out of voom(), meaning that camera() respects the mean-
variance
relationship of RNA-Seq data. We have used it successfully on RNA-Seq
data.
Best wishes
Gordon
------------ original message ------------------
[BioC] gene set enrichment analysis of RNA-Seq data
Julie Leonard julie.leonard at syngenta.com
Thu Apr 12 23:06:54 CEST 2012
I was wondering if anyone is aware of a gene
set enrichment algorithm for RNA-Seq data that:
1) does not require a specification of differentially
expressed (DE) genes i.e.no need to use a hard
p-value threshold cutoff for determining the DE gene
list)
2) uses subject sampling instead of gene sampling
to obtain the p-value (i.e.this would maintain
gene-gene correlations)
Basically, I'm looking for a
self-contained/subject sampling method (e.g.
SAM-GS for microarray data) or a "hybrid" method
(e.g. GSEA for microarray data). The only gene set
enrichment algorithm that I am aware of for RNA-Seq
data is GOSeq, but it uses a competitive/gene
sampling method (i.e. Fisher's Exact Test).
Note, the ideas of self-contained vs competitive and
subject sampling vs gene sampling come from the
following paper: Goeman JJ, Bhlmann P.Analyzing
gene expression data in terms of gene sets:
methodological issues. Bioinformatics. 2007 Apr 15;23(8)
Something like GSEA-SNP is close to what I want.
It uses a test-statistic that is suitable for discrete data
and uses subject sampling to calculate the p-values.
Thanks,
Julie
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:4}}
Thanks Gordon-
I'll definitely look into using camera. I have a naive question
though. If
I use voom() to transform the RNA-Seq count data, does this
transformation
change the data from discrete to continuous? And if so, can I just
use the E
values in the EList object that is outputted from voom() as input to
GSEA (Broad
version)? I would assume I wouldn't need the weights b/c GSEA does
not assume a
specific distribution (it is empirically calculated via subject
sampling), so I
wouldn't need to adjust the data for heteroscedasticity to fit a
"normal" or any
other pre-set distribution??
Thanks,
Julie
Gordon K Smyth <smyth at="" ...=""> writes:
>
> Dear Julie,
>
> A good question. As far as I know, there is as yet no such method.
What
> I am doing for this purpose for the time being is to use voom() in
the
> limma package to transform the RNA-Seq counts to a scale on which
> microarray methods can be used, then using roast(). See page 104 of
the
> limma User's Guide for examples of this:
>
> http://bioconductor.org/packages/2.10/bioc/vignettes/limma/inst/doc/
usersguide.pdf
>
> Note that roast() is a self-contained gene set test with the ability
to
> use linear models and weights:
>
> http://www.ncbi.nlm.nih.gov/pubmed/20610611
>
> Another gene set enrichment option that works fine with RNA-Seq data
is
> camera(). This is a competitive test, but without the usual
disadvantage
> of gene sampling in that it estimates and adjusts for inter-gene
> correlation. camera() is currently setup to automatically use the
weights
> that come out of voom(), meaning that camera() respects the mean-
variance
> relationship of RNA-Seq data. We have used it successfully on RNA-
Seq
> data.
>
> Best wishes
> Gordon
>
> ------------ original message ------------------
> [BioC] gene set enrichment analysis of RNA-Seq data
> Julie Leonard julie.leonard at syngenta.com
> Thu Apr 12 23:06:54 CEST 2012
>
> I was wondering if anyone is aware of a gene
> set enrichment algorithm for RNA-Seq data that:
>
> 1) does not require a specification of differentially
> expressed (DE) genes i.e.no need to use a hard
> p-value threshold cutoff for determining the DE gene
> list)
>
> 2) uses subject sampling instead of gene sampling
> to obtain the p-value (i.e.this would maintain
> gene-gene correlations)
>
> Basically, I'm looking for a
> self-contained/subject sampling method (e.g.
> SAM-GS for microarray data) or a "hybrid" method
> (e.g. GSEA for microarray data). The only gene set
> enrichment algorithm that I am aware of for RNA-Seq
> data is GOSeq, but it uses a competitive/gene
> sampling method (i.e. Fisher's Exact Test).
> Note, the ideas of self-contained vs competitive and
> subject sampling vs gene sampling come from the
> following paper: Goeman JJ, Bhlmann P.Analyzing
> gene expression data in terms of gene sets:
> methodological issues. Bioinformatics. 2007 Apr 15;23(8)
>
> Something like GSEA-SNP is close to what I want.
> It uses a test-statistic that is suitable for discrete data
> and uses subject sampling to calculate the p-values.
>
> Thanks,
> Julie
>
>
______________________________________________________________________
> The information in this email is confidential and
intend...{{dropped:4}}
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at ...
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
Dear Gordon,
We (my colleagues and myself) read your post/papers (TMM, limma User
Guide,
edgeR user Guide and paper and voom vignette) with great interest and
we are
glad you took the time to address this issue.
We have a couple of additional questions.
In a previous email you said:
"However RNA-Seq counts for different libraries
can be of very different sizes, and hence will be heteroscedastic.".
Then the question is: why it is not sufficient to use the TMM
normalized
data as it takes into account the different library size, but instead
you
propose to follow the voom transformation procedure?
Additionally we find that differentially expressed genes identified by
edgeR are substantially different from those identified by limma after
voom transformation.
Whereas we expect this behavior, due to the different statistical
model and the transformation itself,
it is always a reason of concern.
Best,
Paolo
Dear Paolo,
Well, first of all, let me say that edgeR always takes into account
the
library sizes, whether or not you use TMM or calcNormFactors.
The point though is that TMM is not a transformation. It is specific
to
the log-linear modelling approach used in edgeR. It does not
transform
the count data into a form that can be input to permutation-based
statistical methods designed for pathway analysis of microarray data.
The edgeR User's Guide says
"These adjustments are offsets in the models used for testing
DE and do not transform the counts in any way."
Here is the point: the negative binomial approach is very powerful for
genewise or transcriptwise differential expression analysis, but it is
difficult to extend it to gene set analysis. That is not to say that
such
an extension can't be done, but it will be a lot of work. On the
other
hand, limma-voom gives a pipeline that works right now.
You ask about the differences between edgeR and limma-voom for
genewise
differential expression of RNA-Seq data. I don't think the
differences
between the two methods are generally as great as you suggest. The
main
difference is that limma-voom is able to adapt to the amount of
gene-specific dispersion heterogeneity in the data, whereas edgeR does
not
do this automatically. In other words, limma estimates df.prior
whereas
this is preset (but user changeable) in edgeR. So, if you have data
where
some genes show much greater inter-library variation than others, then
limma will give relatively stronger preference to genes that are
consistent between replicates, whereas edgeR will give relatively
stronger
preference to genes with larger fold changes. These are differences
of
degrees only. edgeR can be tuned to give behavior closer to limma,
although the tuning will be different for each dataset.
edgeR can separate biological from measurement variation, which is
useful
for interpretation and which limma-voom can't do. It is also
potentially
more statistically powerful when some of the counts are small.
However
limma-voom is a reliable choice and, after doing many simulations, I
feel
that I can recommend it with confidence. My lab is using both
packages in
our day to day work analysing RNA-Seq data for our collaborators.
I will try to clarify the relationship between edgeR and voom more
completely in a paper to be submitted.
Best wishes
Gordon
> Date: Thu, 26 Apr 2012 22:44:05 +0000
> From: Paolo Guarnieri <pg2296 at="" c2b2.columbia.edu="">
> To: <bioconductor at="" stat.math.ethz.ch="">
> Subject: Re: [BioC] gene set enrichment analysis of RNA-Seq data
>
> Dear Gordon,
>
> We (my colleagues and myself) read your post/papers (TMM, limma User
> Guide, edgeR user Guide and paper and voom vignette) with great
interest
> and we are glad you took the time to address this issue.
>
> We have a couple of additional questions.
> In a previous email you said:
> "However RNA-Seq counts for different libraries
> can be of very different sizes, and hence will be heteroscedastic.".
> Then the question is: why it is not sufficient to use the TMM
normalized
> data as it takes into account the different library size, but
instead
> you propose to follow the voom transformation procedure?
>
> Additionally we find that differentially expressed genes identified
by
> edgeR are substantially different from those identified by limma
after
> voom transformation. Whereas we expect this behavior, due to the
> different statistical model and the transformation itself, it is
always
> a reason of concern.
>
> Best,
> Paolo
> Gordon K Smyth <smyth at="" ...=""> writes:
>>
>> Dear Julie,
>>
>> A good question. As far as I know, there is as yet no such method.
What
>> I am doing for this purpose for the time being is to use voom() in
the
>> limma package to transform the RNA-Seq counts to a scale on which
>> microarray methods can be used, then using roast(). See page 104
of the
>> limma User's Guide for examples of this:
>>
>> http://bioconductor.org/packages/2.10/bioc/vignettes/limma/inst/doc
/usersguide.pdf
>>
>> Note that roast() is a self-contained gene set test with the
ability to
>> use linear models and weights:
>>
>> http://www.ncbi.nlm.nih.gov/pubmed/20610611
>>
>> Another gene set enrichment option that works fine with RNA-Seq
data is
>> camera(). This is a competitive test, but without the usual
disadvantage
>> of gene sampling in that it estimates and adjusts for inter-gene
>> correlation. camera() is currently setup to automatically use the
weights
>> that come out of voom(), meaning that camera() respects the mean-
variance
>> relationship of RNA-Seq data. We have used it successfully on RNA-
Seq
>> data.
>>
>> Best wishes
>> Gordon
>>
>> ------------ original message ------------------
>> [BioC] gene set enrichment analysis of RNA-Seq data
>> Julie Leonard julie.leonard at syngenta.com
>> Thu Apr 12 23:06:54 CEST 2012
>>
>> I was wondering if anyone is aware of a gene
>> set enrichment algorithm for RNA-Seq data that:
>>
>> 1) does not require a specification of differentially
>> expressed (DE) genes i.e.no need to use a hard
>> p-value threshold cutoff for determining the DE gene
>> list)
>>
>> 2) uses subject sampling instead of gene sampling
>> to obtain the p-value (i.e.this would maintain
>> gene-gene correlations)
>>
>> Basically, I'm looking for a
>> self-contained/subject sampling method (e.g.
>> SAM-GS for microarray data) or a "hybrid" method
>> (e.g. GSEA for microarray data). The only gene set
>> enrichment algorithm that I am aware of for RNA-Seq
>> data is GOSeq, but it uses a competitive/gene
>> sampling method (i.e. Fisher's Exact Test).
>> Note, the ideas of self-contained vs competitive and
>> subject sampling vs gene sampling come from the
>> following paper: Goeman JJ, Bhlmann P.Analyzing
>> gene expression data in terms of gene sets:
>> methodological issues. Bioinformatics. 2007 Apr 15;23(8)
>>
>> Something like GSEA-SNP is close to what I want.
>> It uses a test-statistic that is suitable for discrete data
>> and uses subject sampling to calculate the p-values.
>>
>> Thanks,
>> Julie
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:4}}