Question

PLAGE method fails in GSVA if an overlap between gene sets and genes in expressions is less than two

0

Entering edit mode

krassowski.michal ▴ 40

@krassowskimichal-19240

Last seen 5.0 years ago

I've been using GSVA with three methods: gsva, ssgsea and z-score for some time now, with success and without problems. However, when I tried to use the forth method - PLAGE - I found out that it does not work with some of the gene sets. Specifically I've discovered that using MSigDB Kegg collection (c2.cp.kegg); while all other methods proceed correctly, plage raises following error:

 Error in dimnames(x) <- dn :
  length of 'dimnames' [1] not equal to array extent

My traceback:

> traceback()
7: `rownames<-`(`*tmp*`, value = names(geneSets))
6: `rownames<-`(`*tmp*`, value = names(geneSets))
5: plage(expr, gset.idx.list, parallel.sz, parallel.type, verbose)
4: .gsva(expr, mapped.gset.idx.list, method, kcdf, rnaseq, abs.ranking,
       parallel.sz, parallel.type, mx.diff, tau, kernel, ssgsea.norm,
       verbose)
3: .local(expr, gset.idx.list, ...)
2: gsva(M, geneSet, method = "plage", parallel.sz = 1)
1: gsva(M, geneSet, method = "plage", parallel.sz = 1)

It only occurs with some combinations of gene sets, not with others. Here is a full code to reproduce my findings, based on the benchmarking function from GSVA documentation, combined with real data from MSigDB:

p <- 1000
n <- 10
gs.sz <- 30
S2N <- 0.5
fracDEgs <- 0.5

set.seed(1)

sizeDEgs <- round(fracDEgs * gs.sz)
group.n <- round(n / 2)

sampleEffect <- rnorm(n, mean=0, sd=1)
sampleEffectDE <- rnorm(n, mean=S2N, sd=0.5)
probeEffect <- rnorm(p, mean=0, sd=1)
noise <- matrix(rnorm(p*n, mean=0, sd=1), nrow=p, ncol=n)
noiseDE <- matrix(rnorm(p*n, mean=0, sd=1), nrow=p, ncol=n)
M <- outer(probeEffect, sampleEffect, "+") + noise
M2 <- outer(probeEffect, sampleEffectDE, "+") + noiseDE
M[1:sizeDEgs, 1:group.n] <- M2[1:sizeDEgs, 1:group.n]

rownames(M) <- as.character(1:nrow(M))

fromString <- function(str){
  strsplit(str, ' ')[[1]]
}

# 1. works
geneSets <- list(
  KEGG_GLYCOLYSIS_GLUCONEOGENESIS = fromString("55902 2645 5232 5230 5162 5160 5161 55276 7167 84532 2203 125 3099 126 3098 3101 127 5224 128 5223 124 230 501 92483 5313 160287 2023 5315 5214 669 5106 5105 219 217 218 10327 8789 5213 5211 3948 2597 2027 2026 441531 131 130 3945 220 221 222 223 224 130589 226 1738 1737 229 57818 3939 2538 5236 2821"),
  KEGG_CITRATE_CYCLE_TCA_CYCLE = fromString("3420 1743 5106 1431 5162 5105 5160 642502 5161 283398 2271 6392 4967 6390 3419 6391 3418 3417 48 47 4191 1738 4190 1737 55753 5091 6389 8802 8803 8801 3421 50")
)
es.plage <- gsva(M, geneSets, method = "plage", parallel.sz = 1)

# 2. works
geneSets <- list(
  KEGG_GLYCOLYSIS_GLUCONEOGENESIS = fromString("55902 2645 5232 5230 5162 5160 5161 55276 7167 84532 2203 125 3099 126 3098 3101 127 5224 128 5223 124 230 501 92483 5313 160287 2023 5315 5214 669 5106 5105 219 217 218 10327 8789 5213 5211 3948 2597 2027 2026 441531 131 130 3945 220 221 222 223 224 130589 226 1738 1737 229 57818 3939 2538 5236 2821"),
  KEGG_CITRATE_CYCLE_TCA_CYCLE = fromString("3420 1743 5106 1431 5162 5105 5160 642502 5161 283398 2271 6392 4967 6390 3419 6391 3418 3417 48 47 4191 1738 4190 1737 55753 5091 6389 8802 8803 8801 3421 50"),
  KEGG_PENTOSE_PHOSPHATE_PATHWAY = fromString("6120 22934 55276 25796 5634 8789 5213 5211 6888 7086 2203 84076 5226 64080 226 230 229 9563 729020 221823 5631 51071 2539 5236 8277 5214 2821")
)

es.plage <- gsva(M, geneSets, method = "plage", parallel.sz = 1)

# 3. works
geneSets <- list(
  KEGG_PENTOSE_AND_GLUCURONATE_INTERCONVERSIONS = fromString("54575 54576 6120 54577 54578 54490 54579 51084 7358 10941 2990 54600 51181 729020 27294 10720 7360 9942 7365 231 7364 7363 79799 54657 7367 54658 54659 7366")
)

es.plage <- gsva(M, geneSets, method = "plage", parallel.sz = 1)

# 4. does NOT work, raises the error mentioned above - why?
geneSets <- list(
  KEGG_GLYCOLYSIS_GLUCONEOGENESIS = fromString("55902 2645 5232 5230 5162 5160 5161 55276 7167 84532 2203 125 3099 126 3098 3101 127 5224 128 5223 124 230 501 92483 5313 160287 2023 5315 5214 669 5106 5105 219 217 218 10327 8789 5213 5211 3948 2597 2027 2026 441531 131 130 3945 220 221 222 223 224 130589 226 1738 1737 229 57818 3939 2538 5236 2821"),
  KEGG_CITRATE_CYCLE_TCA_CYCLE = fromString("3420 1743 5106 1431 5162 5105 5160 642502 5161 283398 2271 6392 4967 6390 3419 6391 3418 3417 48 47 4191 1738 4190 1737 55753 5091 6389 8802 8803 8801 3421 50"),
  KEGG_PENTOSE_AND_GLUCURONATE_INTERCONVERSIONS = fromString("54575 54576 6120 54577 54578 54490 54579 51084 7358 10941 2990 54600 51181 729020 27294 10720 7360 9942 7365 231 7364 7363 79799 54657 7367 54658 54659 7366")
)

es.plage <- gsva(M, geneSets, method = "plage", parallel.sz = 1)

Here are some observations:

- there are mismatches (the genes in the gene sets are not trimmed to include only the genes from `M`).
- when I increase the `p` to 10000 (so there will be more genes matching between gene sets and the expression matrix) the last example does not fail any longer.

Therefore, I've tried to trim the genes in the problematic gene sets:

trimmedGeneSets = sapply(geneSets, function(l) l[l %in% rownames(M)])

> trimmedGeneSets
$KEGG_GLYCOLYSIS_GLUCONEOGENESIS
[1] "125" "126" "127" "128" "124" "230" "501" "669" "219" "217" "218" "131" "130" "220" "221" "222" "223" "224" "226" "229"

$KEGG_CITRATE_CYCLE_TCA_CYCLE
[1] "48" "47" "50"

$KEGG_PENTOSE_AND_GLUCURONATE_INTERCONVERSIONS
[1] "231"

but it was still failing:

es.plage <- gsva(M, trimmedGeneSets, method = "plage", parallel.sz = 1)
Error in dimnames(x) <- dn :
  length of 'dimnames' [1] not equal to array extent

While it's obvious that a gene set with overlap of one would not constitute a meaningful pathway, it does work in the example 3: it is strange that the same gene sets do not fail when using KEGG_PENTOSE_AND_GLUCURONATE_INTERCONVERSIONS or (KEGG_GLYCOLYSIS_GLUCONEOGENESIS and KEGG_CITRATE_CYCLE_TCA_CYCLE - example two) separately.

Here is my session info:

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.10

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] GSVA_1.30.0          GSEABase_1.44.0      graph_1.60.0         annotate_1.58.0      XML_3.98-1.16       
 [6] AnnotationDbi_1.44.0 IRanges_2.14.10      S4Vectors_0.20.1     Biobase_2.40.0       BiocGenerics_0.26.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18       magrittr_1.5       bit_1.1-14         xtable_1.8-2       R6_2.2.2           blob_1.1.1        
 [7] shinythemes_1.1.2  tools_3.5.1        grid_3.5.1         DBI_1.0.0          htmltools_0.3.6    bit64_0.9-7       
[13] digest_0.6.16      shiny_1.2.0        RColorBrewer_1.1-2 geneplotter_1.60.0 later_0.7.5        promises_1.0.1    
[19] bitops_1.0-6       RCurl_1.95-4.11    mime_0.5           memoise_1.1.0      RSQLite_2.1.1      compiler_3.5.1    
[25] httpuv_1.4.5.1

gsva plage differential expression gene set enrichment • 1.6k views

ADD COMMENT • link 6.2 years ago krassowski.michal ▴ 40

score 1 · Accepted Answer · 2019-01-07

I hypothesized that the gene sets with overlap of one are the issue; I tried another such one - for entrez ids limited to 1000 one of such gene sets is KEGG_GALACTOSE_METABOLISM and the problem appeared again.

Then I filtered out all gene sets with overlap lower than two and tried to run my program again. This solved the issue for me.

Here is my final code:

# transform from GeneSetCollection class object to a named list
geneSets <- geneIds(c2.cp.kegg)

overlaps <- sapply(geneSets, function(genes) genes[genes %in% rownames(M)])
subset <- overlaps[sapply(overlaps, function(genes) length(genes) > 1)]

es.plage <- gsva(M, subset, method = "gsva", parallel.sz = 1)

While I have solved my issue while writing the question, I decided to share my post anyway so anyone else having the same error could solve it quickly. Hope that this solution is useful.