Question

Packages for GO and KEGG analysis on RNAseq data

0

Entering edit mode

Merienne Nicolas ▴ 120

@merienne-nicolas-6729

Last seen 6.4 years ago

Switzerland

Dear all,

We are two biologists (so very new in bioinformatic field...) working with RNAseq data and having little "troubles" with pathways analysis. We performed mRNA sequencing on 4 distinct cell populations to compare their transcriptional profile (platform Illumina HiSeq 2000). Row reads were mapped using TopHat and differential analysis was performed with edgeR+voom+limma packages. Our final output is a table (.txt file) for each contrast containing our 16058 expressed genes with respective log fold change, expression values (normalized) and adjusted p-values. We wish to perform pathway enrichment analysis (first GO for a global level and KEGG for a more precise analysis) to determine which pathways are enriched/depleted in specific cell population compared to the others in order to infer cell-type specific functional signatures. However, we have difficulties to find an optimal method to do this. We tried several packages (e.g gage, goseq) and web-based softwares (e.g GeneGO, AmiGO) and found different outputs (sometimes opposite results). What could be the more "validated" method/package for these analysis? In addition, we found differences considering the input data (raw reads, log FC, a list of differentially expressed genes?) and finally we don't understand what should be the input data for the analysis (we think that it is dependent of the package/method used?). So, does anyone of you experiences with GO/KEGG for RNAseq and maybe help us to use a good quantification method please?

Thank you very much for your help.

Best,
Nicolas

GO gage goseq KEGG • 7.8k views

ADD COMMENT • link updated 9.5 years ago by Gordon Smyth 50k • written 9.6 years ago by Merienne Nicolas ▴ 120

0

Entering edit mode

Dear Steve, Julie and Gordon,

Thank you for your advices. We will first go through goseq analysis in detail to try fully understand statistic concepts behind this king of analysis. In our case, it should be sufficient as we are just interested to illustrate some pathways/ontologies overrepresented in specific cell populations. We will come back to the post if we encounter problems...

Thank you all.

Very best,
Nicolas

ADD REPLY • link updated 9.5 years ago by Gordon Smyth 50k • written 9.6 years ago by Merienne Nicolas ▴ 120

score 3 · Answer 1 · 2014-09-14

Dear Nicolas,

I think you are suffering from the "paradox of choice"

http://en.wikipedia.org/wiki/The_Paradox_of_Choice

which says that too much choice leads to more unhappiness and less action even when all the choices are potentially good ones.

Pathway analysis is a very large and very active research area, and there are a large number of alternative methods. This is because there is no unique definition of what a pathway is, and because there are a large number of different ways to rank pathways, and because there is no consensus as to which ranking will best match biological significance. Perhaps no consensus is even possible, because different methods may give better results in different circumstances.

Some methods simply count the number of genes associated with each GO term in your DE list (overlap methods). Other methods work with the test statistics or fold changes for each gene (gene set tests). Some methods adjust for inter-correlations between genes and some don't. Some adjust RNA-seq specific biases like gene length. Some methods test each GO term in isolation whereas others test them relative to each other. It is not surprising that methods that use different information will potentially give different results.

You are asking which is the most validated method, but in truth there are many validated methods. I would however avoid methods which work purely on logFC, because low expression genes may have large fold changes from RNA-seq just by chance. Steve suggested roast and camera, which are implemented in both limma and edgeR. Roast and camera can be viewed as improvements on logFC methods in that they use logFC but also take into account the mean-variance trend as well as correlations between genes.

In the end though, it might be best for you to stick to a simple statistical approach that you can understand. This is better than trying out lots of different methods without fully understanding how they differ. Do you simply want to test for enrichment of publicly available GO and KEGG terms in your list of top DE genes? This is the oldest and simplest type of pathway analysis you can do, and it is still very popular. In that case, goseq is an obvious choice -- it does exactly this, but also takes into account some of the special features of RNA-seq data. Any other GO overlap method (like AmiGO, DAVID, GOstat or GOStats) will give similar results, but without adjusting for gene length bias. Beware however that commercial products like geneGO and IPA use their own proprietry databases, rather than the usual public GO databases, and hence will give different results.

Best wishes
Gordon

PS. In limma for Bioconductor 3.0 you can now run a goseq style analysis on a limma linear model fit very conveniently using the goana() function in limma if you have an Entrez Gene ID for each gene or feature.

Gordon Smyth · Answer 2 · 2014-09-12

Hi Nicolas,

Instead of using a "straight up" GO analysis, I'd try using doing GSEA using camera or roast (from edgeR, or limma (if you are using limma::voom)), and use the GO and KEGG categories as gene sets.

You can get the GO gene sets (among others) in an R consumable format from here:

http://bioinf.wehi.edu.au/software/MSigDB/index.html

You presumable already have the KEGG annotations in a consumable format (perhaps you are using the KEGG.db bioconductor package).

The nice thing about using camera and (m)roast is that you already have you can simply reuse your "setup", ie. design matrix + count/expression data to test for enrichment the same way you are already testing for differential expression. There will be some work to parse the gene sets into index vectors (you'll know what I mean when you start reading through the documentation and trying the example) -- it's not conceptually difficult to do, but you just need to be careful (in the bookkeeping sense) -- but it'll be worth it.

One thing to keep in mind, though, is that if there's no "signal" in the expression data, sometimes it means that there really is no signal from simply looking at expression data alone, so at some point you might have to resign to stop looking for the next method in hopes of finding something that pops.

HTH,
-steve

--
Steve Lianoglou
Computational Biologist
Genentech

Gordon Smyth · Answer 3 · 2014-09-12

0

Entering edit mode

Julie Zhu ★ 4.3k

@julie-zhu-3596

Last seen 5 months ago

United States

Dear Nicolas,

You are welcome to try getEnrichedGo and getEnrichedPath in ChIPpeakAnno package.

Best regards,
Julie

ADD COMMENT • link updated 9.5 years ago by Gordon Smyth 50k • written 9.6 years ago by Julie Zhu ★ 4.3k