Gene set enrichment analysis with gage package returning NaN p-values
1
0
Entering edit mode
Tom • 0
@tom-7247
Last seen 5.7 years ago
IRD, Montpellier, France

Hello,

I'm trying to do some gene set enrichment analysis with the gage package.

I am using the output of DESeq2's variance stabilizing transformation as the expression matrix for gage. Unlike the example(s) in the gage vignette, I don't have a 'control' and 'target' sample but rather three biological replicates of four different cell types. Therefore, I would like to find enrichment of gene sets in each cell type using all of the samples as a reference.

Here is an example of the command I used to run gage, in this case to compare the three replicates from the first cell type (columns 1–3 of exprs) to the 'background':

gage(exprs, gsets, ref = c(1:12), samp = c(1:3), compare = 'unpaired')

I get stat.mean values that look realistic based on manual analysis of some of the gene sets. However, all of the p.vals and q.vals are NaN. The results are similar whether I use raw counts, transformed counts or calculated TPM values.

I'd be grateful for any comments on what I'm doing wrong here. I'm not even certain that it's appropriate to use gage like this (compare one sample to the background using transformed expression values).

It's not practical to share the data so I made a "simulation", it's pretty naff but I hope it shows what I mean anyway. In this case I simulate 3 biological replicates of 4 samples, a–d, and then try to compare 'a' to the 'background'.

Tom

library(gage)

# make up some gene IDs
set.seed(1)
GID = paste0('GID_', sample(c(1000:9999), 1000, replace = F))

# four "samples"
a <- rnorm(1000, 7, 2.3)
b <- rnorm(1000, 7, 2.3)
c <- rnorm(1000, 7, 2.3)
d <- rnorm(1000, 7, 2.3)

# simulate three replicates from each sample
exprs <- data.frame(row.names = GID)
for (x in list(a, b, c, d)){
for (i in 1:3) {
exprs <- cbind(exprs, sapply(x, function(y) rnorm(1, y, 0.2)))
}
}
colnames(exprs) <- paste0(rep(letters[1:4], each = 3), c(1, 2, 3))

# simulate 10 gene sets of 50-100 genes
l <- sample(c(50:100), 10, replace = T)
gsets <- lapply(l, function(x) sample(GID, x, replace = F))
names(gsets) <- paste0('SET', 1:length(gsets))

# run gage on the simulated dataset, comparing e.g. sample 'a' to the background
samp <- grep('a', colnames(exprs))
res <- gage(exprs = exprs, gsets = gsets, ref = c(1:dim(exprs)[2]), samp = samp, compare = 'unpaired')

> res\$greater[,'p.val']

SET1  SET2  SET3  SET4  SET5  SET6  SET7  SET8  SET9 SET10
NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
[1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
[5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] gage_2.18.0

loaded via a namespace (and not attached):
[1] IRanges_2.2.0        png_0.1-7            Biostrings_2.36.0
[4] GenomeInfoDb_1.4.0   DBI_0.3.1            stats4_3.2.0
[7] RSQLite_1.0.0        graph_1.46.0         httr_0.6.1
[10] zlibbioc_1.14.0      XVector_0.8.0        S4Vectors_0.6.0
[13] tools_3.2.0          stringr_0.6.2        Biobase_2.28.0
[16] parallel_3.2.0       BiocGenerics_0.14.0  AnnotationDbi_1.30.0
[19] KEGGREST_1.8.0      

gage gsea rnaseq • 1.7k views
0
Entering edit mode
Luo Weijun ★ 1.5k
@luo-weijun-1783
Last seen 3 months ago
United States
This is not a gage problem. Your assign all sample columns to ref (control) including the experiments (samp). This make no sense biologically. This is not good statistically, you will get some columns with all 0 in the pair wised comparison between experiments and controls because some ref and samp are the same column. You should do: ref = c(1:dim(exprs)[2])[ - samp] # instead of ref = c(1:dim(exprs)[2]) you res will be normal after that.
0
Entering edit mode

Thanks a lot for the reply. The reason I chose to compare each sample against all the samples ('background') rather than against only the other samples is that it is more interesting for my analysis to know if geneset X is 'more enriched' in sample A than it is in samples B, C and D than to know if geneset X is 'slightly enriched' in all samples. What would you recommend to make this comparison? Could I just scale the test statistics for each geneset across the four samples?

I tried using ref = c(1:dim(exprs)[2])[ - samp] but I decided that it didn't really pass the eyeball test. For example, I have already studied the expression of geneset Y by plotting transformed, scaled expression values on heatmaps, and looking at plots of normalised expression for members of the geneset. The members are highly expressed across the samples, but most prominently in sample D. However, using gage with ref = c(1:dim(exprs)[2])[ - samp], for this geneset I see that sample D has the lowest test statistic, and that the test statistic is negative in all samples.

Would it be better to use DESeq2 to produce L2FC values for 'each sample against all the others', i.e. do 4 separate comparisons with DESeq2, rather than used transformed expression values?

0
Entering edit mode
Not sure I understand you. I would suggest you: 1) make sure your data are normalized across samples, 2) go through gage tutorial and documentation so that you can use it correctly. If you want to compare one sample (D) against all other samples (A-C), you can do it in 2 ways with gage: samp.cn=4 ref.cn=1:3 #1-on-1 comparison res1=gage(exprs, gsets, ref = ref.cn, samp = samp.cn, compare = 'unpaired') #group-on-group comparison res2=gage(exprs, gsets, ref = ref.cn, samp = samp.cn, compare = 'as.group') -------------------------------------------- On Mon, 4/27/15, Tom [bioc] <noreply@bioconductor.org> wrote: Subject: [bioc] C: Gene set enrichment analysis with gage package returning NaN p-values To: luo_weijun@yahoo.com Date: Monday, April 27, 2015, 4:58 AM Activity on a post you are following on support.bioconductor.org User Tom wrote Comment: Gene set enrichment analysis with gage package returning NaN p-values: Thanks a lot for the reply. The reason I chose to compare each sample against all the samples ('background') rather than against only the other samples is that it is more interesting for my analysis to know if geneset X is 'more enriched' in sample A than it is in samples B, C and D than to know if geneset X is 'slightly enriched' in all samples. What would you recommend to make this comparison? Could I just scale the test statistics for each geneset across the four samples? I tried using ref = c(1:dim(exprs)[2])[ - samp] but I decided that it didn't really pass the eyeball test. For example, I have already studied the expression of geneset Y by plotting transformed, scaled expression values on heatmaps, and looking at plots of normalised expression for members of the geneset. The members are highly expressed across the samples, but most prominently in sample D. However, using gage with�ref = c(1:dim(exprs)[2])[ - samp], for this geneset I see that sample D has the lowest test statistic, and that the test statistic is negative in all samples. Would it be better to use DESeq2 to produce L2FC values for 'each sample against all the others', i.e. do 4 separate comparisons with DESeq2, rather than used transformed expression values? You may reply via email or visit C: Gene set enrichment analysis with gage package returning NaN p-values
0
Entering edit mode

The problem I'm having is that gage seems to be designed for a 'treated vs. untreated'-type design. It's not clear from the vignettes how to apply gage to compare several different samples, none of which are technically controls. I guess it's not the right tool for the question, but thanks again for your responses!

0
Entering edit mode
No, GAGE is equally applicable to multiple-state comparisons or more sophisticated analysis. You just need to do whatever test (like F-test) at on genes first, then feed the result statistics (a vector or matrix) into gage. In you case, if you want to compare each sample to a virtual reference state, i.e. the mean of all samples, you may do something like: ms=apply(exprs, 1, mean) exprs2=exprs-ms res=gage(exprs2, gsets, ref = NULL, samp = NULL) Notice ref=samp=NULL when the comparison or test was done on genes already.
0
Entering edit mode

Thankyou! ref=samp=NULL is what I was missing. Looks good now (using L2FC generated with DESeq2).