#### The support.bioconductor.org editor has been updated to markdown! Please see more info at: Tutorial: Updated Support Site Editor

Question: confused about J-G statistic of GSEAlm
0
10.3 years ago by
Mark W Kimpel830
Mark W Kimpel830 wrote:
I'm confused about how the J-G statistic of Jiang and Gentleman (2007) is calculated. To check the results I was getting, I calculated the statistic with my own code and came up with a very different answer to that obtained using gsealmPerm (which uses GSNormalize). Before proceeding to my code, I also note that it looks like the statistic is calculated slightly differently in the paper of Oron, et al (2008) and in the GSEAlm vignette. The former does not subtract t statistics from the mean t stat of the entire dataset (t.ref) but the latter does. Am I interpreting that correctly and, if so, why is it presented differently in the 2 sources? In the code below, which is a self-contained example, I construct an artificial ExpressionSet with a fold change difference between groups A and B of phenotype y in probesets corresponding to a GeneSet. I then calculate the statistic using gsealmPerm and then using parametric assumptions (which should hold considering how the dataset was constructed) using my own code. As can be seen, very different answers are obtained. What am I misunderstanding? Here is the code followed by my output and sessionInfo(). First come a few necessary functions with a hack of mine to output the observedStatistic from gsealmPerm: ################################################### gsealmPerm.MWK <- function (eSet, formula = "", mat, nperm, na.rm = TRUE, ...) { nSamp = ncol(eSet) if (formula == "") { nvar <- 0 } else { xvarnames <- all.vars(formula) nvar <- length(xvarnames) } obsRaw <- lmPerGene(eSet = eSet, formula = formula, na.rm = na.rm) if (nvar > 0) { observedStats = GSNormalize(obsRaw$tstat[2, ], incidence = mat, ...) } else { observedStats = GSNormalize(t(obsRaw$tstat), incidence = mat, fun2 = identity, ...) } perm.eset = eSet i <- 1L if (nvar > 0) { permMat <- matrix(0, nrow = nrow(eSet), ncol = nperm) while (i < (nperm + 1)) { if (nvar >= 2) { splitter = pData(eSet)[, xvarnames[2]] if (nvar > 2) splitter = as.list(pData(eSet)[, xvarnames[2:nvar]]) label.perm = unsplit(lapply(split(1:nSamp, splitter), sample), splitter) pData(perm.eset)[, xvarnames[1]] <- pData(eSet)[label.perm, xvarnames[1]] } else if (nvar == 1) { pData(perm.eset)[, xvarnames[1]] <- pData(eSet)[sample(1:nSamp), xvarnames[1]] } temp.results <- lmPerGene(eSet = perm.eset, formula = formula, na.rm = na.rm) permMat[, i] <- temp.results$tstat[2, ] i <- i + 1L } permMat <- GSNormalize(permMat, incidence = mat, ...) rownames(permMat) = rownames(mat) } else if (nvar == 0) { permMat <- matrix(0, nrow = nrow(mat), ncol = nperm) rownames(permMat) = rownames(mat) for (i in 1:nperm) permMat[, i] = GSNormalize(t(obsRaw$tstat), incidence = mat[, sample(1:ncol(mat))], fun2 = identity, ...) } output <- data.frame(pvalFromPermMat(observedStats, permMat), observedStats) } #####################################3 pvalFromPermMat <- function(obs, perms) { #needed because this is an internal function of the GSEAlm package N <- ncol(perms) pvals <- matrix(as.double(NA), nr=nrow(perms), ncol=2) dimnames(pvals) <- list(rownames(perms), c("Lower", "Upper")) tempObs <- rep(obs, ncol(perms)) dim(tempObs) <- dim(perms) pvals[ , 1] <- rowSums(perms <= tempObs)/N pvals[ , 2] <- rowSums(perms >= tempObs)/N pvals } ######################### ## mat: A 0/1 incidence matrix with each row representing a gene set ## and each column representing a gene. A 1 indicates ## membership of a gene in a gene set. makeGSmat <- function(eset, geneSet.lst){ amat <- matrix(0, nrow= length(geneSet.lst), ncol = length(featureNames(eset))) for (i in 1:nrow(amat)){ amat[i,] <- featureNames(eset) %in% geneIds(geneSet.lst[[i]]) } rownames(amat) <- names(geneSet.lst) amat } #################################################### ########### # simulate non-parametric GSEA require(GSEABase); require(GSEAlm) set.seed(123) eset.mat <- matrix(rnorm(n = 12*12000, mean = 10, sd = 1), ncol = 12) rownames(eset.mat) <- paste("A", 1:12000, sep = "_") gs <- GeneSet(paste("A", unique(floor(runif(n = 50, min = 1, max = 300))), sep = "_"), setIdentifier = "test.gs") eset.mat[geneIds(gs),1:6] <- eset.mat[geneIds(gs),1:6] + 0.5 df <- data.frame(x = 1:12, y = c(rep("A", 6), rep("B", 6))) adf <- as(df, "AnnotatedDataFrame") exprsSet <- new("ExpressionSet", exprs = eset.mat, phenoData = adf) # gs.lst <- listtest.gs = gs) ##################################### amat <- makeGSmat(exprsSet, gs.lst) #################################################3 GSEAResults.nonpara <- gsealmPerm.MWK(eSet = exprsSet, formula = ~y, mat = amat, nperm = 1000) GSEAResults.nonpara ################################################### # simulate parametric GSEA gene.ts <- (lmPerGene(eSet = exprsSet, formula = ~y)$tstat)[2,] pos <- match(geneIds(gs), names(gene.ts)) gene.ts[pos] t.ref <- mean(gene.ts) summation <- 0 GSEAResults.para <- { for (i in 1:length(gene.ts[geneIds(gs)])){ summation <- summation + (gene.ts[geneIds(gs)][i] - t.ref) } GSEAResults.para <- summation/length(gene.ts[geneIds(gs)]) GSEAResults.para } GSEAResults.para GSEAResults.nonpara GSEAResults.nonpara$observedStats/length(gene.ts[geneIds(gs)]) #################################################### > GSEAResults.para A_121 -1.083127 > GSEAResults.nonpara Lower Upper observedStats test.gs 0 1 -7.435567 > GSEAResults.nonpara$observedStats/length(gene.ts[geneIds(gs)]) [1] -0.1582036 > sessionInfo() R version 2.9.0 Under development (unstable) (2008-10-28 r46793) x86_64-unknown-linux-gnu locale: LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US .UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_N AME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTI FICATION=C attached base packages: [1] splines tools stats graphics grDevices utils datasets [8] methods base other attached packages: [1] GSEAlm_1.3.0 GSEABase_1.5.0 affycoretools_1.15.0 [4] annaffy_1.15.0 KEGG.db_2.2.5 gcrma_2.15.1 [7] matchprobes_1.15.0 biomaRt_1.17.0 GOstats_2.9.0 [10] Category_2.9.7 genefilter_1.23.0 survival_2.34-1 [13] RBGL_1.19.0 annotate_1.21.1 xtable_1.5-4 [16] GO.db_2.2.5 RSQLite_0.7-1 DBI_0.2-4 [19] AnnotationDbi_1.5.5 limma_2.17.3 affy_1.21.0 [22] Biobase_2.3.3 graph_1.21.0 loaded via a namespace (and not attached): [1] affyio_1.11.2 cluster_1.11.11 preprocessCore_1.5.1 [4] RCurl_0.92-0 XML_1.98-1 > ------------------------------------------------------------ Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry Indiana University School of Medicine 15032 Hunter Court, Westfield, IN 46074 (317) 490-5129 Work, & Mobile & VoiceMail (317) 399-1219 Home Skype: mkimpel "The real problem is not whether machines think but whether men do." -- B. F. Skinner ****************************************************************** [[alternative HTML version deleted]] go gsealm • 653 views ADD COMMENTlink modified 10.3 years ago by rgentleman5.5k • written 10.3 years ago by Mark W Kimpel830 Answer: confused about J-G statistic of GSEAlm 0 10.3 years ago by rgentleman5.5k United States rgentleman5.5k wrote: Hi Mark, Good questions, I am not sure if either Zhen or Assaf will be reading this, and if so they might have some different views/ideas. Mark Kimpel wrote: > I'm confused about how the J-G statistic of Jiang and Gentleman (2007) is > calculated. To check the results I was getting, I calculated the statistic > with my own code and came up with a very different answer to that obtained > using gsealmPerm (which uses GSNormalize). Before proceeding to my code, I I don't think that there is an absolute prescription here, but rather an algorithm, that probably needs a bit of tweaking in any given situation. That said, we also learned a bit between the two papers, and adjusting for the mean of the t-statistics seems like a good idea (in the parametric model). > also note that it looks like the statistic is calculated slightly > differently in the paper of Oron, et al (2008) and in the GSEAlm vignette. > The former does not subtract t statistics from the mean t stat of the entire > dataset (t.ref) but the latter does. Am I interpreting that correctly and, > if so, why is it presented differently in the 2 sources? I don't follow here - what do you mean "presented so differently", certainly the papers are different, and some details are different, as I said we learned some things in the intervening years. Probably most important, for parametric fits, is to remove the effect that arises if the t-statistics don't sum to zero. In those situations, what you will see in some cases is that larger gene sets are preferred (and the size/nature of this effect depends on how different the mean of the t-stats is from zero). It is of less interest in a permutational approach, as in that case all permutations have the same difference and it essentially cancels out. An easier way to the end, using your data is as follows: library(genefilter) rtt = rowttests(exprsSet, factor(exprsSet$y)) ##y should have been a factor > sum(rtt$statistic[amat[1,]])/sqrt(sum(amat)) [1] 7.031002 (the p-value is pretty close to that below) About what you get the non-para below. This one when you think of a null model where the individual t-stats are centered about zero. When the sample sizes are large-ish then the number of df is large, and the t distribution is similar to the Normal (0,1) and hence sum(t's)/sqrt(number) is approx N(0,1) under the null. But this assumption essentially breaks down when the sum of the t-stats is not zero. So we developed an approach that accounts for that (and allows for the variance to change as well). This second approach involves fitting a linear model with an intercept > lm(rtt$statistic~amat[1,]) Call: lm(formula = rtt$statistic ~ amat[1, ]) Coefficients: (Intercept) amat[1, ] -0.002797 1.087386 about what you get for the para below, but deciding if the gene set is predictive is not quite as simple as just looking at the coefficient: > summary(lm(rtt$statistic~amat[1,])) Call: lm(formula = rtt$statistic ~ amat[1, ]) Residuals: Min 1Q Median 3Q Max -4.818258 -0.711718 -0.002836 0.695385 6.999615 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.002797 0.010133 -0.276 0.783 amat[1, ] 1.087386 0.161908 6.716 1.95e-11 *** --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 Residual standard error: 1.108 on 11998 degrees of freedom Multiple R-squared: 0.003745, Adjusted R-squared: 0.003662 F-statistic: 45.11 on 1 and 11998 DF, p-value: 1.951e-11 so the coefficient is really significant (and divided by its standard error comes close to the other stat, above) You can't really compare the observed statistics (as you seemed to), it is their standard errors (and the model) that tells you whether they are important. best wishes Robert > > In the code below, which is a self-contained example, I construct an > artificial ExpressionSet with a fold change difference between groups A and > B of phenotype y in probesets corresponding to a GeneSet. I then calculate > the statistic using gsealmPerm and then using parametric assumptions (which > should hold considering how the dataset was constructed) using my own code. > As can be seen, very different answers are obtained. > > What am I misunderstanding? > > Here is the code followed by my output and sessionInfo(). First come a few > necessary functions with a hack of mine to output the observedStatistic from > gsealmPerm: > > ################################################### > gsealmPerm.MWK <- function (eSet, formula = "", mat, nperm, na.rm = TRUE, > ...) > { > nSamp = ncol(eSet) > if (formula == "") { > nvar <- 0 > } else { > xvarnames <- all.vars(formula) > nvar <- length(xvarnames) > } > obsRaw <- lmPerGene(eSet = eSet, formula = formula, na.rm = na.rm) > if (nvar > 0) { > observedStats = GSNormalize(obsRaw$tstat[2, ], incidence = mat, > ...) > } > else { > observedStats = GSNormalize(t(obsRaw$tstat), incidence = mat, > fun2 = identity, ...) > } > perm.eset = eSet > i <- 1L > if (nvar > 0) { > permMat <- matrix(0, nrow = nrow(eSet), ncol = nperm) > while (i < (nperm + 1)) { > if (nvar >= 2) { > splitter = pData(eSet)[, xvarnames[2]] > if (nvar > 2) > splitter = as.list(pData(eSet)[, xvarnames[2:nvar]]) > label.perm = unsplit(lapply(split(1:nSamp, splitter), > sample), splitter) > pData(perm.eset)[, xvarnames[1]] <- pData(eSet)[label.perm, > xvarnames[1]] > } > else if (nvar == 1) { > pData(perm.eset)[, xvarnames[1]] <- > pData(eSet)[sample(1:nSamp), > xvarnames[1]] > } > temp.results <- lmPerGene(eSet = perm.eset, formula = formula, > na.rm = na.rm) > permMat[, i] <- temp.results$tstat[2, ] > i <- i + 1L > } > permMat <- GSNormalize(permMat, incidence = mat, ...) > rownames(permMat) = rownames(mat) > } > else if (nvar == 0) { > permMat <- matrix(0, nrow = nrow(mat), ncol = nperm) > rownames(permMat) = rownames(mat) > for (i in 1:nperm) permMat[, i] = GSNormalize(t(obsRaw$tstat), > incidence = mat[, sample(1:ncol(mat))], fun2 = identity, > ...) > } > output <- data.frame(pvalFromPermMat(observedStats, permMat), > observedStats) > } > #####################################3 > pvalFromPermMat <- function(obs, perms) { #needed because this is an > internal function of the GSEAlm package > N <- ncol(perms) > pvals <- matrix(as.double(NA), nr=nrow(perms), ncol=2) > dimnames(pvals) <- list(rownames(perms), c("Lower", "Upper")) > > tempObs <- rep(obs, ncol(perms)) > dim(tempObs) <- dim(perms) > pvals[ , 1] <- rowSums(perms <= tempObs)/N > pvals[ , 2] <- rowSums(perms >= tempObs)/N > pvals > } > ######################### > ## mat: A 0/1 incidence matrix with each row representing a gene set > ## and each column representing a gene. A 1 indicates > ## membership of a gene in a gene set. > makeGSmat <- function(eset, geneSet.lst){ > amat <- matrix(0, nrow= length(geneSet.lst), ncol = > length(featureNames(eset))) > for (i in 1:nrow(amat)){ > amat[i,] <- featureNames(eset) %in% geneIds(geneSet.lst[[i]]) > } > rownames(amat) <- names(geneSet.lst) > amat > } > #################################################### > ########### > # simulate non-parametric GSEA > require(GSEABase); require(GSEAlm) > set.seed(123) > eset.mat <- matrix(rnorm(n = 12*12000, mean = 10, sd = 1), ncol = 12) > rownames(eset.mat) <- paste("A", 1:12000, sep = "_") > gs <- GeneSet(paste("A", unique(floor(runif(n = 50, min = 1, max = 300))), > sep = "_"), setIdentifier = "test.gs") > eset.mat[geneIds(gs),1:6] <- eset.mat[geneIds(gs),1:6] + 0.5 > df <- data.frame(x = 1:12, y = c(rep("A", 6), rep("B", 6))) > adf <- as(df, "AnnotatedDataFrame") > exprsSet <- new("ExpressionSet", exprs = eset.mat, phenoData = adf) > # > gs.lst <- listtest.gs = gs) > ##################################### > amat <- makeGSmat(exprsSet, gs.lst) > #################################################3 > GSEAResults.nonpara <- gsealmPerm.MWK(eSet = exprsSet, formula = ~y, mat = > amat, nperm = 1000) > GSEAResults.nonpara > ################################################### > # simulate parametric GSEA > gene.ts <- (lmPerGene(eSet = exprsSet, formula = ~y)$tstat)[2,] > pos <- match(geneIds(gs), names(gene.ts)) > gene.ts[pos] > t.ref <- mean(gene.ts) > > summation <- 0 > GSEAResults.para <- { > for (i in 1:length(gene.ts[geneIds(gs)])){ > summation <- summation + (gene.ts[geneIds(gs)][i] - t.ref) > } > GSEAResults.para <- summation/length(gene.ts[geneIds(gs)]) > GSEAResults.para > } > GSEAResults.para > GSEAResults.nonpara > GSEAResults.nonpara$observedStats/length(gene.ts[geneIds(gs)]) > #################################################### >> GSEAResults.para > A_121 > -1.083127 >> GSEAResults.nonpara > Lower Upper observedStats > test.gs 0 1 -7.435567 >> GSEAResults.nonpara$observedStats/length(gene.ts[geneIds(gs)]) > [1] -0.1582036 >> sessionInfo() > R version 2.9.0 Under development (unstable) (2008-10-28 r46793) > x86_64-unknown-linux-gnu > > locale: > LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_ US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC _NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDEN TIFICATION=C > > attached base packages: > [1] splines tools stats graphics grDevices utils datasets > [8] methods base > > other attached packages: > [1] GSEAlm_1.3.0 GSEABase_1.5.0 affycoretools_1.15.0 > [4] annaffy_1.15.0 KEGG.db_2.2.5 gcrma_2.15.1 > [7] matchprobes_1.15.0 biomaRt_1.17.0 GOstats_2.9.0 > [10] Category_2.9.7 genefilter_1.23.0 survival_2.34-1 > [13] RBGL_1.19.0 annotate_1.21.1 xtable_1.5-4 > [16] GO.db_2.2.5 RSQLite_0.7-1 DBI_0.2-4 > [19] AnnotationDbi_1.5.5 limma_2.17.3 affy_1.21.0 > [22] Biobase_2.3.3 graph_1.21.0 > > loaded via a namespace (and not attached): > [1] affyio_1.11.2 cluster_1.11.11 preprocessCore_1.5.1 > [4] RCurl_0.92-0 XML_1.98-1 > ------------------------------------------------------------ > Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry > Indiana University School of Medicine > > 15032 Hunter Court, Westfield, IN 46074 > > (317) 490-5129 Work, & Mobile & VoiceMail > (317) 399-1219 Home > Skype: mkimpel > > "The real problem is not whether machines think but whether men do." -- B. > F. Skinner > ****************************************************************** > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Robert Gentleman, PhD Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 PO Box 19024 Seattle, Washington 98109-1024 206-667-7700 rgentlem at fhcrc.org
Robert, Thank you for the explanation, it gives me a better understanding of what is going on. After reading you email, I realized that the big discrepancies in the observed t-stats that I noted were in fact do to a computational error on my part. If you look at my code, you will see that I forgot to take the sqrt of the size of the GeneSet. I had stared at this for a couple of days and didn't see it. As for the differences in the algorithms, what I noted was not between the two published papers but between your latest and the current GSEAlm vignette. Although I understand the distinction you make between a parametric vs. permutation based approach, this doesn't come across clearly in the vignette, which appears to closely follow you latest paper. Perhaps a bit of explanation on this part could help future readers. As a non-statistician, clearly I have some studying to do to completely understand both your papers. Questions may follow. If they involved more theory than application (as in how to use packages), is it still appropriate to post to the BioC help list? This would seem to benefit your target audience, but may be a bit outside the posting guidelines for the list. Mark ------------------------------------------------------------ Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry Indiana University School of Medicine 15032 Hunter Court, Westfield, IN 46074 (317) 490-5129 Work, & Mobile & VoiceMail (317) 399-1219 Home Skype: mkimpel "The real problem is not whether machines think but whether men do." -- B. F. Skinner ****************************************************************** On Wed, Nov 19, 2008 at 12:32 AM, Robert Gentleman <rgentlem@fhcrc.org>wrote: > Hi Mark, > Good questions, I am not sure if either Zhen or Assaf will be reading > this, > and if so they might have some different views/ideas. > > Mark Kimpel wrote: > > I'm confused about how the J-G statistic of Jiang and Gentleman (2007) is > > calculated. To check the results I was getting, I calculated the > statistic > > with my own code and came up with a very different answer to that > obtained > > using gsealmPerm (which uses GSNormalize). Before proceeding to my code, > I > > I don't think that there is an absolute prescription here, but rather an > algorithm, that probably needs a bit of tweaking in any given situation. > That > said, we also learned a bit between the two papers, and adjusting for the > mean > of the t-statistics seems like a good idea (in the parametric model). > > > also note that it looks like the statistic is calculated slightly > > differently in the paper of Oron, et al (2008) and in the GSEAlm > vignette. > > The former does not subtract t statistics from the mean t stat of the > entire > > dataset (t.ref) but the latter does. Am I interpreting that correctly > and, > > if so, why is it presented differently in the 2 sources? > > I don't follow here - what do you mean "presented so differently", > certainly > the papers are different, and some details are different, as I said we > learned > some things in the intervening years. Probably most important, for > parametric > fits, is to remove the effect that arises if the t-statistics don't sum to > zero. > In those situations, what you will see in some cases is that larger gene > sets > are preferred (and the size/nature of this effect depends on how different > the > mean of the t-stats is from zero). It is of less interest in a > permutational > approach, as in that case all permutations have the same difference and it > essentially cancels out. > > An easier way to the end, using your data is as follows: > > library(genefilter) > rtt = rowttests(exprsSet, factor(exprsSet$y)) ##y should have been a > factor > > > sum(rtt$statistic[amat[1,]])/sqrt(sum(amat)) > [1] 7.031002 > > (the p-value is pretty close to that below) > > About what you get the non-para below. This one when you think of a null > model > where the individual t-stats are centered about zero. When the sample sizes > are > large-ish then the number of df is large, and the t distribution is similar > to > the Normal (0,1) and hence sum(t's)/sqrt(number) is approx N(0,1) under the > null. But this assumption essentially breaks down when the sum of the > t-stats > is not zero. So we developed an approach that accounts for that (and > allows for > the variance to change as well). This second approach involves fitting a > linear > model with an intercept > > > lm(rtt$statistic~amat[1,]) > > Call: > lm(formula = rtt$statistic ~ amat[1, ]) > > Coefficients: > (Intercept) amat[1, ] > -0.002797 1.087386 > > about what you get for the para below, but deciding if the gene set is > predictive is not quite as simple as just looking at the coefficient: > > > summary(lm(rtt$statistic~amat[1,])) > > Call: > lm(formula = rtt$statistic ~ amat[1, ]) > > Residuals: > Min 1Q Median 3Q Max > -4.818258 -0.711718 -0.002836 0.695385 6.999615 > > Coefficients: > Estimate Std. Error t value Pr(>|t|) > (Intercept) -0.002797 0.010133 -0.276 0.783 > amat[1, ] 1.087386 0.161908 6.716 1.95e-11 *** > --- > Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 > > Residual standard error: 1.108 on 11998 degrees of freedom > Multiple R-squared: 0.003745, Adjusted R-squared: 0.003662 > F-statistic: 45.11 on 1 and 11998 DF, p-value: 1.951e-11 > > so the coefficient is really significant (and divided by its standard error > comes close to the other stat, above) > > You can't really compare the observed statistics (as you seemed to), it is > their > standard errors (and the model) that tells you whether they are important. > > best wishes > Robert > > > > > In the code below, which is a self-contained example, I construct an > > artificial ExpressionSet with a fold change difference between groups A > and > > B of phenotype y in probesets corresponding to a GeneSet. I then > calculate > > the statistic using gsealmPerm and then using parametric assumptions > (which > > should hold considering how the dataset was constructed) using my own > code. > > As can be seen, very different answers are obtained. > > > > > > What am I misunderstanding? > > > > Here is the code followed by my output and sessionInfo(). First come a > few > > necessary functions with a hack of mine to output the observedStatistic > from > > gsealmPerm: > > > > ################################################### > > gsealmPerm.MWK <- function (eSet, formula = "", mat, nperm, na.rm = TRUE, > > ...) > > { > > nSamp = ncol(eSet) > > if (formula == "") { > > nvar <- 0 > > } else { > > xvarnames <- all.vars(formula) > > nvar <- length(xvarnames) > > } > > obsRaw <- lmPerGene(eSet = eSet, formula = formula, na.rm = na.rm) > > if (nvar > 0) { > > observedStats = GSNormalize(obsRaw$tstat[2, ], incidence = mat, > > ...) > > } > > else { > > observedStats = GSNormalize(t(obsRaw$tstat), incidence = mat, > > fun2 = identity, ...) > > } > > perm.eset = eSet > > i <- 1L > > if (nvar > 0) { > > permMat <- matrix(0, nrow = nrow(eSet), ncol = nperm) > > while (i < (nperm + 1)) { > > if (nvar >= 2) { > > splitter = pData(eSet)[, xvarnames[2]] > > if (nvar > 2) > > splitter = as.list(pData(eSet)[, xvarnames[2:nvar]]) > > label.perm = unsplit(lapply(split(1:nSamp, splitter), > > sample), splitter) > > pData(perm.eset)[, xvarnames[1]] <- > pData(eSet)[label.perm, > > xvarnames[1]] > > } > > else if (nvar == 1) { > > pData(perm.eset)[, xvarnames[1]] <- > > pData(eSet)[sample(1:nSamp), > > xvarnames[1]] > > } > > temp.results <- lmPerGene(eSet = perm.eset, formula = > formula, > > na.rm = na.rm) > > permMat[, i] <- temp.results$tstat[2, ] > > i <- i + 1L > > } > > permMat <- GSNormalize(permMat, incidence = mat, ...) > > rownames(permMat) = rownames(mat) > > } > > else if (nvar == 0) { > > permMat <- matrix(0, nrow = nrow(mat), ncol = nperm) > > rownames(permMat) = rownames(mat) > > for (i in 1:nperm) permMat[, i] = GSNormalize(t(obsRaw$tstat), > > incidence = mat[, sample(1:ncol(mat))], fun2 = identity, > > ...) > > } > > output <- data.frame(pvalFromPermMat(observedStats, permMat), > > observedStats) > > } > > #####################################3 > > pvalFromPermMat <- function(obs, perms) { #needed because this is an > > internal function of the GSEAlm package > > N <- ncol(perms) > > pvals <- matrix(as.double(NA), nr=nrow(perms), ncol=2) > > dimnames(pvals) <- list(rownames(perms), c("Lower", "Upper")) > > > > tempObs <- rep(obs, ncol(perms)) > > dim(tempObs) <- dim(perms) > > pvals[ , 1] <- rowSums(perms <= tempObs)/N > > pvals[ , 2] <- rowSums(perms >= tempObs)/N > > pvals > > } > > ######################### > > ## mat: A 0/1 incidence matrix with each row representing a gene set > > ## and each column representing a gene. A 1 indicates > > ## membership of a gene in a gene set. > > makeGSmat <- function(eset, geneSet.lst){ > > amat <- matrix(0, nrow= length(geneSet.lst), ncol = > > length(featureNames(eset))) > > for (i in 1:nrow(amat)){ > > amat[i,] <- featureNames(eset) %in% geneIds(geneSet.lst[[i]]) > > } > > rownames(amat) <- names(geneSet.lst) > > amat > > } > > #################################################### > > ########### > > # simulate non-parametric GSEA > > require(GSEABase); require(GSEAlm) > > set.seed(123) > > eset.mat <- matrix(rnorm(n = 12*12000, mean = 10, sd = 1), ncol = 12) > > rownames(eset.mat) <- paste("A", 1:12000, sep = "_") > > gs <- GeneSet(paste("A", unique(floor(runif(n = 50, min = 1, max = > 300))), > > sep = "_"), setIdentifier = "test.gs") > > eset.mat[geneIds(gs),1:6] <- eset.mat[geneIds(gs),1:6] + 0.5 > > df <- data.frame(x = 1:12, y = c(rep("A", 6), rep("B", 6))) > > adf <- as(df, "AnnotatedDataFrame") > > exprsSet <- new("ExpressionSet", exprs = eset.mat, phenoData = adf) > > # > > gs.lst <- listtest.gs = gs) > > ##################################### > > amat <- makeGSmat(exprsSet, gs.lst) > > #################################################3 > > GSEAResults.nonpara <- gsealmPerm.MWK(eSet = exprsSet, formula = ~y, mat > = > > amat, nperm = 1000) > > GSEAResults.nonpara > > ################################################### > > # simulate parametric GSEA > > gene.ts <- (lmPerGene(eSet = exprsSet, formula = ~y)$tstat)[2,] > > pos <- match(geneIds(gs), names(gene.ts)) > > gene.ts[pos] > > t.ref <- mean(gene.ts) > > > > summation <- 0 > > GSEAResults.para <- { > > for (i in 1:length(gene.ts[geneIds(gs)])){ > > summation <- summation + (gene.ts[geneIds(gs)][i] - t.ref) > > } > > GSEAResults.para <- summation/length(gene.ts[geneIds(gs)]) > > GSEAResults.para > > } > > GSEAResults.para > > GSEAResults.nonpara > > GSEAResults.nonpara$observedStats/length(gene.ts[geneIds(gs)]) > > #################################################### > >> GSEAResults.para > > A_121 > > -1.083127 > >> GSEAResults.nonpara > > Lower Upper observedStats > > test.gs 0 1 -7.435567 > >> GSEAResults.nonpara\$observedStats/length(gene.ts[geneIds(gs)]) > > [1] -0.1582036 > >> sessionInfo() > > R version 2.9.0 Under development (unstable) (2008-10-28 r46793) > > x86_64-unknown-linux-gnu > > > > locale: > > > LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_ US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC _NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDEN TIFICATION=C > > > > attached base packages: > > [1] splines tools stats graphics grDevices utils datasets > > [8] methods base > > > > other attached packages: > > [1] GSEAlm_1.3.0 GSEABase_1.5.0 affycoretools_1.15.0 > > [4] annaffy_1.15.0 KEGG.db_2.2.5 gcrma_2.15.1 > > [7] matchprobes_1.15.0 biomaRt_1.17.0 GOstats_2.9.0 > > [10] Category_2.9.7 genefilter_1.23.0 survival_2.34-1 > > [13] RBGL_1.19.0 annotate_1.21.1 xtable_1.5-4 > > [16] GO.db_2.2.5 RSQLite_0.7-1 DBI_0.2-4 > > [19] AnnotationDbi_1.5.5 limma_2.17.3 affy_1.21.0 > > [22] Biobase_2.3.3 graph_1.21.0 > > > > loaded via a namespace (and not attached): > > [1] affyio_1.11.2 cluster_1.11.11 preprocessCore_1.5.1 > > [4] RCurl_0.92-0 XML_1.98-1 > > ------------------------------------------------------------ > > Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry > > Indiana University School of Medicine > > > > 15032 Hunter Court, Westfield, IN 46074 > > > > (317) 490-5129 Work, & Mobile & VoiceMail > > (317) 399-1219 Home > > Skype: mkimpel > > > > "The real problem is not whether machines think but whether men do." -- > B. > > F. Skinner > > ****************************************************************** > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@stat.math.ethz.ch > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > Robert Gentleman, PhD > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > PO Box 19024 > Seattle, Washington 98109-1024 > 206-667-7700 > rgentlem@fhcrc.org > [[alternative HTML version deleted]]