Question: gene set enrichment analysis of RNA-Seq data
6.0 years ago, Paolo Guarnieri (United States) wrote:
Dear Gordon,

Thank you very much for your detailed explanations and clarifications. Very useful.

Thanks,
Paolo

-----Original Message-----
From: Gordon K Smyth [mailto:smyth@wehi.EDU.AU]
Sent: Saturday, April 28, 2012 22:38
To: Paolo Guarnieri
Cc: Bioconductor mailing list
Subject: gene set enrichment analysis of RNA-Seq data

Dear Paolo,

Well, first of all, let me say that edgeR always takes into account the library sizes, whether or not you use TMM or calcNormFactors.

The point though is that TMM is not a transformation. It is specific to the log-linear modelling approach used in edgeR. It does not transform the count data into a form that can be input to permutation-based statistical methods designed for pathway analysis of microarray data. The edgeR User's Guide says "These adjustments are offsets in the models used for testing DE and do not transform the counts in any way."

Here is the point: the negative binomial approach is very powerful for genewise or transcriptwise differential expression analysis, but it is difficult to extend it to gene set analysis. That is not to say that such an extension can't be done, but it will be a lot of work. On the other hand, limma-voom gives a pipeline that works right now.

You ask about the differences between edgeR and limma-voom for genewise differential expression of RNA-Seq data. I don't think the differences between the two methods are generally as great as you suggest. The main difference is that limma-voom is able to adapt to the amount of gene-specific dispersion heterogeneity in the data, whereas edgeR does not do this automatically. In other words, limma estimates df.prior whereas this is preset (but user changeable) in edgeR. So, if you have data where some genes show much greater inter-library variation than others, then limma will give relatively stronger preference to genes that are consistent between replicates, whereas edgeR will give relatively stronger preference to genes with larger fold changes.
These are differences of degree only. edgeR can be tuned to give behavior closer to limma, although the tuning will be different for each dataset. edgeR can separate biological from measurement variation, which is useful for interpretation and which limma-voom can't do. It is also potentially more statistically powerful when some of the counts are small. However, limma-voom is a reliable choice and, after doing many simulations, I feel that I can recommend it with confidence.

My lab is using both packages in our day-to-day work analysing RNA-Seq data for our collaborators. I will try to clarify the relationship between edgeR and voom more completely in a paper to be submitted.

Best wishes
Gordon

> Date: Thu, 26 Apr 2012 22:44:05 +0000
> From: Paolo Guarnieri <pg2296 at ...>
> To: <bioconductor at ...>
> Subject: Re: [BioC] gene set enrichment analysis of RNA-Seq data
>
> Dear Gordon,
>
> We (my colleagues and myself) read your post/papers (TMM, limma User
> Guide, edgeR User Guide and paper, and voom vignette) with great
> interest and we are glad you took the time to address this issue.
>
> We have a couple of additional questions.
> In a previous email you said:
> "However RNA-Seq counts for different libraries can be of very
> different sizes, and hence will be heteroscedastic."
> Then the question is: why is it not sufficient to use the TMM
> normalized data, as it takes into account the different library sizes,
> but instead you propose to follow the voom transformation procedure?
>
> Additionally, we find that differentially expressed genes identified by
> edgeR are substantially different from those identified by limma after
> voom transformation. Although we expect this behavior, due to the
> different statistical model and the transformation itself, it is
> always a cause for concern.
>
> Best,
> Paolo

> Gordon K Smyth <smyth at ...> writes:
>>
>> Dear Julie,
>>
>> A good question. As far as I know, there is as yet no such method.
>> What I am doing for this purpose for the time being is to use voom()
>> in the limma package to transform the RNA-Seq counts to a scale on
>> which microarray methods can be used, then using roast(). See page
>> 104 of the limma User's Guide for examples of this:
>>
>> sersguide.pdf
>>
>> Note that roast() is a self-contained gene set test with the ability
>> to use linear models and weights.
>>
>> Another gene set enrichment option that works fine with RNA-Seq data
>> is camera(). This is a competitive test, but without the usual
>> disadvantage of gene sampling in that it estimates and adjusts for
>> inter-gene correlation. camera() is currently set up to automatically
>> use the weights that come out of voom(), meaning that camera()
>> respects the mean-variance relationship of RNA-Seq data. We have
>> used it successfully on RNA-Seq data.
>>
>> Best wishes
>> Gordon
>>
>> ------------ original message ------------------
>> [BioC] gene set enrichment analysis of RNA-Seq data
>> Julie Leonard julie.leonard at
>> Thu Apr 12 23:06:54 CEST 2012
>>
>> I was wondering if anyone is aware of a gene set enrichment algorithm
>> for RNA-Seq data that:
>>
>> 1) does not require a specification of differentially expressed (DE)
>> genes (i.e. no need to use a hard p-value threshold cutoff for
>> determining the DE gene list)
>>
>> 2) uses subject sampling instead of gene sampling to obtain the
>> p-value (i.e. this would maintain gene-gene correlations)
>>
>> Basically, I'm looking for a self-contained/subject sampling method
>> (e.g. SAM-GS for microarray data) or a "hybrid" method (e.g. GSEA for
>> microarray data). The only gene set enrichment algorithm that I am
>> aware of for RNA-Seq data is GOSeq, but it uses a competitive/gene
>> sampling method (i.e. Fisher's Exact Test).
>> Note, the ideas of self-contained vs competitive and subject sampling
>> vs gene sampling come from the following paper: Goeman JJ, Bühlmann
>> P. Analyzing gene expression data in terms of gene sets:
>> methodological issues. Bioinformatics. 2007 Apr 15;23(8)
>>
>> Something like GSEA-SNP is close to what I want.
>> It uses a test-statistic that is suitable for discrete data and uses
>> subject sampling to calculate the p-values.
>>
>> Thanks,
>> Julie
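
Editorial sketch: the thread's distinction between normalization factors (which rescale library sizes and leave the counts alone) and a transformation such as voom() (which moves the counts to a log scale) can be illustrated with a few lines of Python. This is only a rough sketch, not the limma or edgeR implementation. The function name `voom_style_log_cpm` and the toy counts are hypothetical, and the 0.5 and 1 offsets follow my understanding of the log-CPM step in voom(). The precision weights that voom() derives from the mean-variance trend are omitted entirely.

```python
import math


def voom_style_log_cpm(counts, norm_factors=None):
    """Return log2 counts-per-million for a genes x libraries count table.

    counts: list of rows, one row per gene, one column per library.
    norm_factors: optional per-library scaling factors (e.g. TMM-style);
                  they rescale the library sizes, never the counts.
    """
    n_libs = len(counts[0])
    # Raw library size = total count in each column.
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_libs)]
    if norm_factors is None:
        norm_factors = [1.0] * n_libs
    # Effective library size = raw library size x normalization factor.
    eff_sizes = [n * f for n, f in zip(lib_sizes, norm_factors)]
    # Small offsets (0.5 and 1) keep the logarithm finite for zero counts.
    return [
        [math.log2((c + 0.5) / (e + 1.0) * 1e6) for c, e in zip(row, eff_sizes)]
        for row in counts
    ]


# Hypothetical toy data: 3 genes, 2 libraries; library 2 is sequenced twice as deep.
counts = [
    [10, 20],
    [0, 0],
    [990, 1980],
]

plain = voom_style_log_cpm(counts)
scaled = voom_style_log_cpm(counts, norm_factors=[1.0, 0.5])

for gene, (p, s) in enumerate(zip(plain, scaled), start=1):
    print(f"gene {gene}: plain={[round(v, 3) for v in p]} scaled={[round(v, 3) for v in s]}")
```

The input table is never modified: the factors only change the effective library size that each count is divided by. That matches the edgeR User's Guide sentence quoted above, and it is why a separate transformation step is needed before microarray-style gene set tests can be applied.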