Hello,
I'm trying to do some gene set enrichment analysis with the gage
package.
I am using the output of DESeq2
's variance stabilizing transformation as the expression matrix for gage
. Unlike the example(s) in the gage
vignette, I don't have a 'control' and 'target' sample but rather three biological replicates of four different cell types. Therefore, I would like to find enrichment of gene sets in each cell type using all of the samples as a reference.
Here is an example of the command I used to run gage
, in this case to compare the three replicates from the first cell type (columns 1–3 of exprs) to the 'background':
gage(exprs, gsets, ref = c(1:12), samp = c(1:3), compare = 'unpaired')
I get stat.mean
values that look realistic based on manual analysis of some of the gene sets. However, all of the p.val
s and q.val
s are NaN
. The results are similar whether I use raw counts, transformed counts or calculated TPM values.
I'd be grateful for any comments on what I'm doing wrong here. I'm not even certain that it's appropriate to use gage
like this (compare one sample to the background using transformed expression values).
It's not practical to share the data so I made a "simulation", it's pretty naff but I hope it shows what I mean anyway. In this case I simulate 3 biological replicates of 4 samples, a–d, and then try to compare 'a' to the 'background'.
Thanks for reading,
Tom
library(gage) # make up some gene IDs set.seed(1) GID = paste0('GID_', sample(c(1000:9999), 1000, replace = F)) # four "samples" a <- rnorm(1000, 7, 2.3) b <- rnorm(1000, 7, 2.3) c <- rnorm(1000, 7, 2.3) d <- rnorm(1000, 7, 2.3) # simulate three replicates from each sample exprs <- data.frame(row.names = GID) for (x in list(a, b, c, d)){ for (i in 1:3) { exprs <- cbind(exprs, sapply(x, function(y) rnorm(1, y, 0.2))) } } colnames(exprs) <- paste0(rep(letters[1:4], each = 3), c(1, 2, 3)) # simulate 10 gene sets of 50-100 genes l <- sample(c(50:100), 10, replace = T) gsets <- lapply(l, function(x) sample(GID, x, replace = F)) names(gsets) <- paste0('SET', 1:length(gsets)) # run gage on the simulated dataset, comparing e.g. sample 'a' to the background samp <- grep('a', colnames(exprs)) res <- gage(exprs = exprs, gsets = gsets, ref = c(1:dim(exprs)[2]), samp = samp, compare = 'unpaired') > res$greater[,'p.val'] SET1 SET2 SET3 SET4 SET5 SET6 SET7 SET8 SET9 SET10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN > sessionInfo() R version 3.2.0 (2015-04-16) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.2 LTS locale: [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8 [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8 [7] LC_PAPER=en_AU.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] gage_2.18.0 loaded via a namespace (and not attached): [1] IRanges_2.2.0 png_0.1-7 Biostrings_2.36.0 [4] GenomeInfoDb_1.4.0 DBI_0.3.1 stats4_3.2.0 [7] RSQLite_1.0.0 graph_1.46.0 httr_0.6.1 [10] zlibbioc_1.14.0 XVector_0.8.0 S4Vectors_0.6.0 [13] tools_3.2.0 stringr_0.6.2 Biobase_2.28.0 [16] parallel_3.2.0 BiocGenerics_0.14.0 AnnotationDbi_1.30.0 [19] KEGGREST_1.8.0
Thanks a lot for the reply. The reason I chose to compare each sample against all the samples ('background') rather than against only the other samples is that it is more interesting for my analysis to know if geneset X is 'more enriched' in sample A than it is in samples B, C and D than to know if geneset X is 'slightly enriched' in all samples. What would you recommend to make this comparison? Could I just scale the test statistics for each geneset across the four samples?
I tried using
ref = c(1:dim(exprs)[2])[ - samp]
but I decided that it didn't really pass the eyeball test. For example, I have already studied the expression of geneset Y by plotting transformed, scaled expression values on heatmaps, and looking at plots of normalised expression for members of the geneset. The members are highly expressed across the samples, but most prominently in sample D. However, usinggage
withref = c(1:dim(exprs)[2])[ - samp]
, for this geneset I see that sample D has the lowest test statistic, and that the test statistic is negative in all samples.Would it be better to use
DESeq2
to produce L2FC values for 'each sample against all the others', i.e. do 4 separate comparisons withDESeq2
, rather than used transformed expression values?The problem I'm having is that
gage
seems to be designed for a 'treated vs. untreated'-type design. It's not clear from the vignettes how to applygage
to compare several different samples, none of which are technically controls. I guess it's not the right tool for the question, but thanks again for your responses!Thankyou!
ref=samp=NULL
is what I was missing. Looks good now (using L2FC generated with DESeq2).