Hello Forum,
I want to perform gene set enrichment analysis of whole genome. Can it be done or I have to have some selective genes? Thanks for any input in this regard.
Hello Forum,
I want to perform gene set enrichment analysis of whole genome. Can it be done or I have to have some selective genes? Thanks for any input in this regard.
For GSEA you don't specify a set of differentially expressed genes. Instead, you specify sets of genes and then decide if those genes appear in aggregate to be differentially expressed in your samples, where 'differentially expressed in aggregate' can take at least two forms; competitive or self-contained. You can also test a whole battery of gene sets (like the Broad c1-c6 gene sets) or just a few.
If you are already using limma, then you can do a competitive GSEA for a battery of gene sets using romer() or for a few sets using camera(), and the analogous self-contained GSEA using mroast() and roast(). Or basic GSEA using geneSetTest(). See 10.GeneSetTests for more information, as well as http://bioinformatics.oxfordjournals.org/content/23/8/980.short for more information about the various types of GSEA.
Hi Jim,
I am trying to use geneSetTest() with t-value as statistics from top table.
geneSetTest(index=, statistics=abs(tt$t), alternative="mixed", ranks.only=TRUE, nsim=9999)
What values goes in index argument? The documentation says it is selected gene sets but I don't know which gene sets are available.
Thank you.
The index argument is a vector of TRUE/FALSE, equal to the number of genes in your experiment, where the index is TRUE for genes that are in your gene set. How you get that vector is dependent on how you are defining your gene sets, etc. Without knowing more about the gene sets that you are interested in, I can't give much more help.
One way is to use the idConverter() function from AnnotationDbi, which will map from say human to rat. I wrote an Rscript a while ago that can be used to do human to rat or mouse:
## usage: R CMD Rscript parseBroad_nonhuman.R thedir thespecies wheretoput ## where thedir is the dir containing the broad entrez-gene based data (broad_by_eg) ## thespecies is mouse or rat (currently) ## wheretoput is the dir that the output should be saved in - no need to create first. ## the args are space delimited args <- commandArgs(TRUE) if(length(args) != 3) stop(paste("\nUsage: R CMD Rscript parseBroad_nonhuman.R", " <thedir> <thespecies> <wheretoput>\n", "<thedir> = the directory containing Entrez Gene based Gene sets", "(currently /data2/romer/broad_by_eg)\n", "<thespecies> = the species to map to, currently either mouse or rat\n", "<wheretoput> = the directory in which to output the mapped gene sets\n\n", "Note that these arguments are space-delimited, and the function is best run", " under nohup. Example:\n", "nohup R CMD Rscript parseBroad_nonhuman.R broad_by_eg mouse mouse2 &\n\n", sep = ""), call. = FALSE) makeGeneSets <- function(thedir, species, wheretoput){ mapper <- c("MUSMU","RATNO") names(mapper) <- c("mouse","rat") thespecies <- mapper[species] require("AnnotationDbi", quietly = TRUE, character.only = TRUE) fls <- paste(thedir, dir(thedir, pattern = "gmt$"), sep = "/") thenames <- c("Positional", "Chemical_genetic_perturbations","Biocarta", "KEGG", "Reactome","Canonical_pathway","microRNA", "Transcription_factor_targets","Cancer_gene_neighborhoods","Cancer_modules", "GO_biological_process","GO_cellular_component", "GO_molecular_function", "Oncogenic") parseBroad <- function(thefile){ len <- as.numeric(strsplit(system(paste("wc -l", thefile), intern=TRUE), " ")[[1]][1]) con <- file(thefile, "r") lst <- list(length = len) nam <- vector(length = len) for(i in seq_len(len)){ theline <- readLines(con, 1L) theline <- strsplit(theline, "\t") nam[i] <- theline[[1]][1] lst[[i]] <- theline[[1]][-c(1:2)] } close(con) names(lst) <- nam return(lst) } convertIt <- function(x, species){ tryCatch(idConverter(x, "HOMSA", species, "EG", "SYMBOL"), error = function(e) NULL) } broad <- lapply(fls, parseBroad) ## the canonical pathways include reactome, biocarta and KEGG, so subset ind <- !names(broad[[6]]) %in% c(names(broad[[3]]), names(broad[[4]]), names(broad[[5]])) broad[[6]] <- broad[[6]][ind] names(broad) <- thenames broad <- lapply(broad, function(x) lapply(x, convertIt, species = thespecies)) ## remove empty gene sets broad <- lapply(broad, function(x) x[!sapply(x, is.null)]) if(!file.exists(wheretoput)) dir.create(wheretoput) save(broad, file = paste(wheretoput, "/broad_", species, "_byset.Rdata", sep = "")) } makeGeneSets(args[1], args[2], args[3])
If you have one or more of the gmt files from the Broad, you can use that script to make an Rdata file for rat. I ran it yesterday, and the genes for the PI3K/AKT gene set for rat are:
structure(c("Gsk3a", "Mapkap1", "Rictor", "Tbc1d4", "Raf1", "Casp9", "Map3k5", "Foxo3", "Prkdc", "Ywhaq", "Ywhah", "Mtor", "Foxo1", "Ywhag", "Cdkn1a", "Ywhae", "Chuk", "Bad", "Kpna1", "Ywhaz", "Ywhab", "Akt1", "Akt2", "Gsk3b", "Mlst8"), .Names = c("2931", "79109", "253260", "9882", "5894", "842", "4217", "2309", "5591", "10971", "7533", "2475", "2308", "7532", "1026", "7531", "1147", "572", "3836", "7534", "7529", "207", "208", "2932", "64223"))
Which you can paste into an R session to create a named vector (names are Entrez Gene IDs, values are HUGO symbols). This has 25 of the 38 genes found in the human version of this pathway (REACTOME_PI3K_AKT_ACTIVATION).
I hadn't seen docopt. Thanks for the pointer.
I tend to keep lots of previous versions of R/Bioc laying about, for that special PI who invariably comes back like two years later, wanting me to add 'just one more thing' to their analysis. Rather than using the current R/Bioc, I usually opt to use whatever iteration was used for their original analysis (which I know because I always drop a sessionInfo() at the end of the Sweave/knitr document), so the only thing that will change will be the 'one more thing' that they wanted me to add.
Anyway, because I have lots of different R versions installed, I tend not to hard code anything to point to the current or system-wide installation of R because I may well need to use a particular version. So most of my scripts either use R CMD Rscript (or e.g., R-2.15.1 CMD Rscript), or I just write .R files and use R-3.0.1 --vanilla < mysuperl33tscript.R, etc.
I thought I was pretty clear, but perhaps not. I went here (http://www.broadinstitute.org/gsea/msigdb/index.jsp), and having registered already, went to the downloads page, and downloaded all the entrez.gmt files, for each of the gene sets (I actually did this a while ago - I have the 3.1 versions, whereas the current ones at the Broad are 4.0).
I then ran the script that I supplied in an earlier comment (see above), to convert those human Entrez Gene IDs to rat Entrez Gene IDs and symbols. This script does all of the gene sets at once, creating a .Rdata file that I loaded into R and then extracted the converted genes for the REACTOME_PI3K_AKT_ACTIVATION gene set, which is, I believe, the one you are interested in.
And to answer your question, no you cannot use these genes directly. As I mentioned in a (much) earlier comment, you need a logical (TRUE/FALSE) vector that is the same length as the number of rows in your MArrayLM object, where each TRUE or FALSE indicates whether or not the corresponding gene (in the MArrayLM object) is in (TRUE) or not in (FALSE) the gene set of interest.
So your homework is to annotate the genes in your MArrayLM object with either their Entrez Gene or symbol, and then use the genes I have supplied to generate the logical vector, which you can then use as input to geneSetTest() or camera().
The GeneSetEnrichment biocViews shows relevant packages, although James's curated list is probably more helpful:
http://www.bioconductor.org/packages/release/BiocViews.html#___GeneSetEnrichment
I think you will have to map to human Entrez or HGNC gene symbols, unless you know of rat gene sets that you want to test.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Could you please update your question with more detail? Minimally adding information on what type of data do you have, for instance, would be a good start ... what experiment was run, under what conditions, etc.
It's hard to understand what you really want to do as the question currently stands.
Hello Steve,
I have gene expression data 28000 probes ( Affymetrix-rat) and 11 samples (case and control). I did run differential gene expression analysis using limma package but did not get significance based on the FDR values. Now I want to perform gene set enrichment analysis rather than individual gene analysis. My question is how to select the list of genes since I did not get any significant genes. Or if I could select all genes for performing GSEA?
Thanks !!!