gene set enrichment analysis of RNA-Seq data

0

Entering edit mode

Julie Leonard ▴ 110

@julie-leonard-5222

Last seen 10.2 years ago

I was wondering if anyone is aware of a gene set enrichment algorithm for RNA-Seq data that: 1) does not require a specification of differentially expressed (DE) genes i.e.no need to use a hard p-value threshold cutoff for determining the DE gene list) 2) uses subject sampling instead of gene sampling to obtain the p-value (i.e.this would maintain gene-gene correlations) Basically, I'm looking for a self-contained/subject sampling method (e.g. SAM-GS for microarray data) or a "hybrid" method (e.g. GSEA for microarray data). The only gene set enrichment algorithm that I am aware of for RNA-Seq data is GOSeq, but it uses a competitive/gene sampling method (i.e. Fisher's Exact Test). Note, the ideas of self-contained vs competitive and subject sampling vs gene sampling come from the following paper: Goeman JJ, B?hlmann P.Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics. 2007 Apr 15;23(8) Something like GSEA-SNP is close to what I want. It uses a test-statistic that is suitable for discrete data and uses subject sampling to calculate the p-values. Thanks, Julie

Microarray goseq Microarray goseq • 4.2k views

ADD COMMENT • link updated 12.6 years ago by Gordon Smyth 51k • written 12.6 years ago by Julie Leonard ▴ 110

2

Entering edit mode

Gordon Smyth 51k

@gordon-smyth

Last seen 15 minutes ago

WEHI, Melbourne, Australia

Dear Julie, Subject sampling (used by GSEA and other softwares) actually makes quite strong distributional assumptions, in that the units being permuted are treated as independent and exchangeable under the null hypothesis. In particular, this assumes that the units are homoscedastic. However RNA-Seq counts for different libraries can be of very different sizes, and hence will be heteroscedastic. And this will be so even under the null hypothesis when the sequencing depths are different. This one of the reasons why I prefer parametric gene set methods (roast, camera, romer), because heteroscedasticity and dependence between the samples can be unravelled. Heteroscedasticity will be less a problem when the library sizes are similar or if the counts are large. When the biological CV in the data is relatively large, the E-values from voom() for each gene become roughly homoscedastic relatively quickly at moderate counts sizes. The idea of voom() is that the output values can be treated as continuous. This can never be perfect for low counts, but genes with all low counts are never going to be significantly DE anyway. In my experience, getting the mean-variance relationship correct is more important than the exact distibutional law or distinguishing between discrete and continuous. Best wishes Gordon --------------- original message ---------------------- [BioC] gene set enrichment analysis of RNA-Seq data Julie Leonard julie.leonard at syngenta.com Fri Apr 13 22:52:27 CEST 2012 Thanks Gordon- I'll definitely look into using camera. I have a naive question though. If I use voom() to transform the RNA-Seq count data, does this transformation change the data from discrete to continuous? And if so, can I just use the E values in the EList object that is outputted from voom() as input to GSEA (Broad version)? I would assume I wouldn't need the weights b/c GSEA does not assume a specific distribution (it is empirically calculated via subject sampling), so I wouldn't need to adjust the data for heteroscedasticity to fit a "normal" or any other pre-set distribution?? Thanks, Julie ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 12.6 years ago Gordon Smyth 51k

0

Entering edit mode

Gordon Smyth 51k

@gordon-smyth

Last seen 15 minutes ago

WEHI, Melbourne, Australia

Dear Julie, A good question. As far as I know, there is as yet no such method. What I am doing for this purpose for the time being is to use voom() in the limma package to transform the RNA-Seq counts to a scale on which microarray methods can be used, then using roast(). See page 104 of the limma User's Guide for examples of this: http://bioconductor.org/packages/2.10/bioc/vignettes/limma/inst/doc/us ersguide.pdf Note that roast() is a self-contained gene set test with the ability to use linear models and weights: http://www.ncbi.nlm.nih.gov/pubmed/20610611 Another gene set enrichment option that works fine with RNA-Seq data is camera(). This is a competitive test, but without the usual disadvantage of gene sampling in that it estimates and adjusts for inter-gene correlation. camera() is currently setup to automatically use the weights that come out of voom(), meaning that camera() respects the mean- variance relationship of RNA-Seq data. We have used it successfully on RNA-Seq data. Best wishes Gordon ------------ original message ------------------ [BioC] gene set enrichment analysis of RNA-Seq data Julie Leonard julie.leonard at syngenta.com Thu Apr 12 23:06:54 CEST 2012 I was wondering if anyone is aware of a gene set enrichment algorithm for RNA-Seq data that: 1) does not require a specification of differentially expressed (DE) genes i.e.no need to use a hard p-value threshold cutoff for determining the DE gene list) 2) uses subject sampling instead of gene sampling to obtain the p-value (i.e.this would maintain gene-gene correlations) Basically, I'm looking for a self-contained/subject sampling method (e.g. SAM-GS for microarray data) or a "hybrid" method (e.g. GSEA for microarray data). The only gene set enrichment algorithm that I am aware of for RNA-Seq data is GOSeq, but it uses a competitive/gene sampling method (i.e. Fisher's Exact Test). Note, the ideas of self-contained vs competitive and subject sampling vs gene sampling come from the following paper: Goeman JJ, Bhlmann P.Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics. 2007 Apr 15;23(8) Something like GSEA-SNP is close to what I want. It uses a test-statistic that is suitable for discrete data and uses subject sampling to calculate the p-values. Thanks, Julie ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 12.6 years ago Gordon Smyth 51k

0

Entering edit mode

Julie Leonard ▴ 110

@julie-leonard-5222

Last seen 10.2 years ago

Thanks Gordon- I'll definitely look into using camera. I have a naive question though. If I use voom() to transform the RNA-Seq count data, does this transformation change the data from discrete to continuous? And if so, can I just use the E values in the EList object that is outputted from voom() as input to GSEA (Broad version)? I would assume I wouldn't need the weights b/c GSEA does not assume a specific distribution (it is empirically calculated via subject sampling), so I wouldn't need to adjust the data for heteroscedasticity to fit a "normal" or any other pre-set distribution?? Thanks, Julie

ADD COMMENT • link 12.6 years ago Julie Leonard ▴ 110

0

Entering edit mode

Paolo Guarnieri ▴ 20

@paolo-guarnieri-5256

Last seen 9.9 years ago

United States

Gordon K Smyth <smyth at="" ...=""> writes: > > Dear Julie, > > A good question. As far as I know, there is as yet no such method. What > I am doing for this purpose for the time being is to use voom() in the > limma package to transform the RNA-Seq counts to a scale on which > microarray methods can be used, then using roast(). See page 104 of the > limma User's Guide for examples of this: > > http://bioconductor.org/packages/2.10/bioc/vignettes/limma/inst/doc/ usersguide.pdf > > Note that roast() is a self-contained gene set test with the ability to > use linear models and weights: > > http://www.ncbi.nlm.nih.gov/pubmed/20610611 > > Another gene set enrichment option that works fine with RNA-Seq data is > camera(). This is a competitive test, but without the usual disadvantage > of gene sampling in that it estimates and adjusts for inter-gene > correlation. camera() is currently setup to automatically use the weights > that come out of voom(), meaning that camera() respects the mean- variance > relationship of RNA-Seq data. We have used it successfully on RNA- Seq > data. > > Best wishes > Gordon > > ------------ original message ------------------ > [BioC] gene set enrichment analysis of RNA-Seq data > Julie Leonard julie.leonard at syngenta.com > Thu Apr 12 23:06:54 CEST 2012 > > I was wondering if anyone is aware of a gene > set enrichment algorithm for RNA-Seq data that: > > 1) does not require a specification of differentially > expressed (DE) genes i.e.no need to use a hard > p-value threshold cutoff for determining the DE gene > list) > > 2) uses subject sampling instead of gene sampling > to obtain the p-value (i.e.this would maintain > gene-gene correlations) > > Basically, I'm looking for a > self-contained/subject sampling method (e.g. > SAM-GS for microarray data) or a "hybrid" method > (e.g. GSEA for microarray data). The only gene set > enrichment algorithm that I am aware of for RNA-Seq > data is GOSeq, but it uses a competitive/gene > sampling method (i.e. Fisher's Exact Test). > Note, the ideas of self-contained vs competitive and > subject sampling vs gene sampling come from the > following paper: Goeman JJ, Bhlmann P.Analyzing > gene expression data in terms of gene sets: > methodological issues. Bioinformatics. 2007 Apr 15;23(8) > > Something like GSEA-SNP is close to what I want. > It uses a test-statistic that is suitable for discrete data > and uses subject sampling to calculate the p-values. > > Thanks, > Julie > > ______________________________________________________________________ > The information in this email is confidential and intend...{{dropped:4}} > > _______________________________________________ > Bioconductor mailing list > Bioconductor at ... > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > Dear Gordon, We (my colleagues and myself) read your post/papers (TMM, limma User Guide, edgeR user Guide and paper and voom vignette) with great interest and we are glad you took the time to address this issue. We have a couple of additional questions. In a previous email you said: "However RNA-Seq counts for different libraries can be of very different sizes, and hence will be heteroscedastic.". Then the question is: why it is not sufficient to use the TMM normalized data as it takes into account the different library size, but instead you propose to follow the voom transformation procedure? Additionally we find that differentially expressed genes identified by edgeR are substantially different from those identified by limma after voom transformation. Whereas we expect this behavior, due to the different statistical model and the transformation itself, it is always a reason of concern. Best, Paolo

ADD COMMENT • link 12.6 years ago Paolo Guarnieri ▴ 20

0

Entering edit mode

Gordon Smyth 51k

@gordon-smyth

Last seen 15 minutes ago

WEHI, Melbourne, Australia

Dear Paolo, Well, first of all, let me say that edgeR always takes into account the library sizes, whether or not you use TMM or calcNormFactors. The point though is that TMM is not a transformation. It is specific to the log-linear modelling approach used in edgeR. It does not transform the count data into a form that can be input to permutation-based statistical methods designed for pathway analysis of microarray data. The edgeR User's Guide says "These adjustments are offsets in the models used for testing DE and do not transform the counts in any way." Here is the point: the negative binomial approach is very powerful for genewise or transcriptwise differential expression analysis, but it is difficult to extend it to gene set analysis. That is not to say that such an extension can't be done, but it will be a lot of work. On the other hand, limma-voom gives a pipeline that works right now. You ask about the differences between edgeR and limma-voom for genewise differential expression of RNA-Seq data. I don't think the differences between the two methods are generally as great as you suggest. The main difference is that limma-voom is able to adapt to the amount of gene-specific dispersion heterogeneity in the data, whereas edgeR does not do this automatically. In other words, limma estimates df.prior whereas this is preset (but user changeable) in edgeR. So, if you have data where some genes show much greater inter-library variation than others, then limma will give relatively stronger preference to genes that are consistent between replicates, whereas edgeR will give relatively stronger preference to genes with larger fold changes. These are differences of degrees only. edgeR can be tuned to give behavior closer to limma, although the tuning will be different for each dataset. edgeR can separate biological from measurement variation, which is useful for interpretation and which limma-voom can't do. It is also potentially more statistically powerful when some of the counts are small. However limma-voom is a reliable choice and, after doing many simulations, I feel that I can recommend it with confidence. My lab is using both packages in our day to day work analysing RNA-Seq data for our collaborators. I will try to clarify the relationship between edgeR and voom more completely in a paper to be submitted. Best wishes Gordon > Date: Thu, 26 Apr 2012 22:44:05 +0000 > From: Paolo Guarnieri <pg2296 at="" c2b2.columbia.edu=""> > To: <bioconductor at="" stat.math.ethz.ch=""> > Subject: Re: [BioC] gene set enrichment analysis of RNA-Seq data > > Dear Gordon, > > We (my colleagues and myself) read your post/papers (TMM, limma User > Guide, edgeR user Guide and paper and voom vignette) with great interest > and we are glad you took the time to address this issue. > > We have a couple of additional questions. > In a previous email you said: > "However RNA-Seq counts for different libraries > can be of very different sizes, and hence will be heteroscedastic.". > Then the question is: why it is not sufficient to use the TMM normalized > data as it takes into account the different library size, but instead > you propose to follow the voom transformation procedure? > > Additionally we find that differentially expressed genes identified by > edgeR are substantially different from those identified by limma after > voom transformation. Whereas we expect this behavior, due to the > different statistical model and the transformation itself, it is always > a reason of concern. > > Best, > Paolo > Gordon K Smyth <smyth at="" ...=""> writes: >> >> Dear Julie, >> >> A good question. As far as I know, there is as yet no such method. What >> I am doing for this purpose for the time being is to use voom() in the >> limma package to transform the RNA-Seq counts to a scale on which >> microarray methods can be used, then using roast(). See page 104 of the >> limma User's Guide for examples of this: >> >> http://bioconductor.org/packages/2.10/bioc/vignettes/limma/inst/doc /usersguide.pdf >> >> Note that roast() is a self-contained gene set test with the ability to >> use linear models and weights: >> >> http://www.ncbi.nlm.nih.gov/pubmed/20610611 >> >> Another gene set enrichment option that works fine with RNA-Seq data is >> camera(). This is a competitive test, but without the usual disadvantage >> of gene sampling in that it estimates and adjusts for inter-gene >> correlation. camera() is currently setup to automatically use the weights >> that come out of voom(), meaning that camera() respects the mean- variance >> relationship of RNA-Seq data. We have used it successfully on RNA- Seq >> data. >> >> Best wishes >> Gordon >> >> ------------ original message ------------------ >> [BioC] gene set enrichment analysis of RNA-Seq data >> Julie Leonard julie.leonard at syngenta.com >> Thu Apr 12 23:06:54 CEST 2012 >> >> I was wondering if anyone is aware of a gene >> set enrichment algorithm for RNA-Seq data that: >> >> 1) does not require a specification of differentially >> expressed (DE) genes i.e.no need to use a hard >> p-value threshold cutoff for determining the DE gene >> list) >> >> 2) uses subject sampling instead of gene sampling >> to obtain the p-value (i.e.this would maintain >> gene-gene correlations) >> >> Basically, I'm looking for a >> self-contained/subject sampling method (e.g. >> SAM-GS for microarray data) or a "hybrid" method >> (e.g. GSEA for microarray data). The only gene set >> enrichment algorithm that I am aware of for RNA-Seq >> data is GOSeq, but it uses a competitive/gene >> sampling method (i.e. Fisher's Exact Test). >> Note, the ideas of self-contained vs competitive and >> subject sampling vs gene sampling come from the >> following paper: Goeman JJ, Bhlmann P.Analyzing >> gene expression data in terms of gene sets: >> methodological issues. Bioinformatics. 2007 Apr 15;23(8) >> >> Something like GSEA-SNP is close to what I want. >> It uses a test-statistic that is suitable for discrete data >> and uses subject sampling to calculate the p-values. >> >> Thanks, >> Julie ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 12.6 years ago Gordon Smyth 51k

Login before adding your answer.