Question

GSVA: problem with chip annotation

0

Entering edit mode

q017mm • 0

@q017mm-6963

Last seen 10.2 years ago

Germany

Dear Bioconductors,

I recently started using Bioconductor (sorry to be a total newbie, but i'm just finding my feet with R...) and I would like to perform a geneset variation analysis (GSVA).

I have the following data:

RNA-expression-file with Gene Symbols in the first column and the corresponding expression values for the individual samples in the following columns (simplified sample):

NAME	sample 1	sample 2	sample 3	sample 4	sample 5
A1BG	5.045152135	3.675982862	4.034678849	1.586550605	1.416436753
A2BP1	1.216242033	26.73442081	19.98573476	2.379825908	6.041536357
A2LD1	0.720736019	7.351965723	2.796126272	0.741539957	1.040647411
A2ML1	7.83800421	26.5761249	9.458037859	9.467568285	9.134571716
A2M	793.6429722	914.6689604	1045.075652	619.8239331	624.4462602
A4GALT	3.941525105	8.143445287	2.402041361	1.879717565	6.590766935
AAAS	31.39706284	18.34473744	30.41960192	65.66939896	30.612378

Phenotype file (simplified sample):

	xy_mutation
sample 1	1
sample 2	0
sample 3	1
sample 4	0
sample 5	1

Geneset (gmt) file (simplified sample):

geneset_1	geneset_1	A1BG	A2BP1	A2LD1	A2ML1	A2M	A4GALT	AAAS
geneset_2	geneset_2	A2LD1	A2ML1	A2M	A4GALT	AAAS
geneset_3	geneset_3	A2BP1	A2LD1	A2ML1

The following code initially works fine for processing the above shown data:

#load mRNA expression data

dataDirectory <- system.file("extdata", package = "Biobase")
exprsFile <- "C:/expression.txt"
exprs <- as.matrix(read.table(exprsFile, header = TRUE,
sep = "\t", row.names = 1, as.is = TRUE))
 
#load phenotype data
pDataFile <- "C:/phenotype.txt"
pData <- read.table(pDataFile, row.names=1, header=TRUE, sep="\t")
phenoData <- new("AnnotatedDataFrame",data=pData)

#check
all(rownames(pData)==colnames(exprs))

#load genesets from gmt file
gsc <- getGmt("C:/GS.txt", geneIdType=SymbolIdentifier(), collectionType=NullCollection(), sep="\t"
)

#build expressionset
exampleSet <- ExpressionSet(assayData=exprs,
phenoData=phenoData)

#perform gsva
gsva(exampleSet, gsc, min.sz=4, max.sz=500, verbose=TRUE)

However here I get the following error:

Mapping identifiers between gene sets and feature names
Fehler in GSEABase::mapIdentifiers(gset.idx.list, GSEABase::AnnoOrEntrezIdentifier(annotation(expr))) :
Fehler bei der Auswertung des Argumentes 'to' bei der Methodenauswahl
für Funktion 'mapIdentifiers': Fehler in validObject(.Object) :
invalid class “ScalarCharacter” object: ScalarCharacter must have length one

So the problem starts when mapping the identifiers between gene sets and feature names - I guess the problem may be that I built my ExpressionSet object without the annotation argument. I thought this is the right way, since my expression file already contains gene symbols (and not a chip-ID (like Affy-ID or Illumina-ID)) which already corresponds to the gene symbols in my geneset. How can i fix this problem?

Thank you

Philipp

gsva • 5.0k views

ADD COMMENT • link updated 10.2 years ago by Robert Castelo ★ 3.4k • written 10.2 years ago by q017mm • 0

score 0 · Answer 1 · 2014-11-03

hi Philipp,

since identifiers in both, the input expression data and the input gene sets are of the same kind (gene symbols) you can directly pass to the gsva() function the expression data 'matrix', instead of the 'ExpressionSet' object. To do that you just use the 'exprs()' "get" method for 'ExpressionSet' objects:

library(Biobase)
gsva(exprs(exampleSet), gsc, min.sz=4, max.sz=500, verbose=TRUE)

If you want to know what combinations of expression data object and gene-set objects the function gsva() can take, just type:

showMethods("gsva")
Function: gsva (package GSVA)
expr="ExpressionSet", gset.idx.list="GeneSetCollection", annotation="missing"
expr="ExpressionSet", gset.idx.list="list", annotation="missing"
expr="matrix", gset.idx.list="GeneSetCollection", annotation="character"
expr="matrix", gset.idx.list="list", annotation="missing"

Note that while with 'ExpressionSet' objects there is no need to provide an 'annotation' argument, with a matrix object storing the expression values this is optional. In you case you need not to provide an 'annotation' argument because you are using the same type of gene identifier in both, the expression data matrix and in the collection of gene sets, but if they were different, you could use this third argument to inform 'gsva()' how to map the identifiers without having to build a fully-fledged 'ExpressionSet' object.

cheers,

robert.

score 0 · Answer 2 · 2014-11-03

Hi Robert,

Thanks for your super-fast reply and the explanation!

As suggested I now used the 'exprs()' "get" method for my 'ExpressionSet' object. It seems to work (at least the aforementioned error message disappeared) - however after performing gsva I now get a different error message:

gsva(exprs(exampleSet), gsc, min.sz=4, max.sz=500, verbose=TRUE) 

Error in base::match(x, table, nomatch = nomatch, incomparables = incomparables, : 'match' requires vector arguments

> traceback()
10: base::match(x, table, nomatch = nomatch, incomparables = incomparables, 
        ...)
9: match(x, y)
8: match(x, y)
7: na.omit(match(x, y))
6: FUN(X[[1L]], ...)
5: lapply(gset.idx.list, function(x, y) na.omit(match(x, y)), rownames(expr))
4: lapply(gset.idx.list, function(x, y) na.omit(match(x, y)), rownames(expr))
3: .local(expr, gset.idx.list, annotation, ...)
2: gsva(exprs(exampleSet), gsc, min.sz = 4, max.sz = 500, verbose = TRUE)
1: gsva(exprs(exampleSet), gsc, min.sz = 4, max.sz = 500, verbose = TRUE)
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=German_Austria.1252  LC_CTYPE=German_Austria.1252    LC_MONETARY=German_Austria.1252 LC_NUMERIC=C                    LC_TIME=German_Austria.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] sigPathway_1.32.1    GSVAdata_1.0.0       hgu95a.db_2.14.0     org.Hs.eg.db_2.14.0  RSQLite_1.0.0        DBI_0.3.1            GSVA_1.12.0          GSEABase_1.26.0      graph_1.42.0         annotate_1.42.1      AnnotationDbi_1.26.1
[12] GenomeInfoDb_1.0.2   RColorBrewer_1.0-5   limma_3.20.9         genefilter_1.46.1    Biobase_2.24.0       BiocGenerics_0.10.0 

loaded via a namespace (and not attached):
[1] IRanges_1.22.10 splines_3.1.1   stats4_3.1.1    survival_2.37-7 tools_3.1.1     XML_3.98-1.1    xtable_1.7-4   
>



Any further suggestions how I can fix this?

Best regards,

Philipp

score 0 · Answer 3 · 2014-11-04

hi Philipp,

the error you get is caused by the specific example expression data you are using. In the 'NAME' column containing the gene symbol there is an extra space right after each symbol and before the tab separator. If you remove that space, it should work fine.

However, the error provided by GSVA is certainly not informative of what was happening, so I have corrected the code to force the 'gsva()' function giving a more informative error. In the newer version, you would be getting the following error:

gsva(exprs(exampleSet), gsc, min.sz=4, max.sz=500, verbose=TRUE)
Error in .local(expr, gset.idx.list, ...) :
  No identifiers in the gene sets could be matched to the identifiers in the expression data.

To work with this newer version, you have to update to the latest release of Bioconductor 3.0. From the version numbers I see in your sessionInfo() output, they seem to correspond to an earlier version. See http://www.bioconductor.org/install for how to upgrade to the current release version.

The corrected version of GSVA in the current release will be version 1.14.1 and will be available via biocLite() on thursday.

cheers,

robert.

score 0 · Answer 4 · 2014-11-04

Hi Robert,

Thanks for your reply! I now updated to the latest release of Bioconductor 3.0. Furthermore I checked the NAME column of my expression-set as suggested, however there is no extra-space right after the symbol and before the tab seperator (therefore I´m still getting the aforementioned error message...).

Would you be so kind to double-check my sample data - you can find the data via https://www.dropbox.com/sh/i4l9cip0a6l2yun/AADbAQleayJZYcUkNRR1sL8Fa?dl=0

Would greatly appreciate your help!

Best regards,
Philipp

checked agin my files

score 0 · Answer 5 · 2014-11-05

hi Phillip,

since I copied & pasted the data from this web page I guess that extra space must have sneaked into the page somehow, an Unix 'od -c' command on the first two lines reveals it:

0000000   N   A   M   E      \t   s   a   m   p   l   e       1      \t
0000020   s   a   m   p   l   e       2      \t   s   a   m   p   l   e
0000040       3      \t   s   a   m   p   l   e       4      \t   s   a
0000060   m   p   l   e       5  \n   A   1   B   G      \t   5   .   0
0000100   4   5   1   5   2   1   3   5      \t   3   .   6   7   5   9
0000120   8   2   8   6   2      \t   4   .   0   3   4   6   7   8   8
0000140   4   9      \t   1   .   5   8   6   5   5   0   6   0   5    
0000160  \t   1   .   4   1   6   4   3   6   7   5   3

but anyway, the original data you put on the dropbox do not have these extra space characters and the script runs smoothly in my linux box (i do not have a windows machine to try it) with the last version of GSVA 1.14.1 which will become available tomorrow thursday via biocLite(), since it takes a couple of days at the BioC build machines to propagate changes to the software.

if you already updated to BioC 3.0 then tomorrow just do

biocLite("GSVA")

and answer "y" to the question whether you want to update any other package (if the question pops up). Verify that the installation upgraded GSVA to version 1.14.1 by doing:

library(GSVA)
sessionInfo()

and checking that the version is indeed 1.14.1.

please let me know if this works. Do not hesitate to report any other problem, users experience is what helps improving the software!

cheers,

robert.

score 0 · Answer 6 · 2014-11-06

0

Entering edit mode

q017mm • 0

@q017mm-6963

Last seen 10.2 years ago

Germany

Hi Robert,

just updated to GSVA 1.14.1 - now gsva() works without producing any errors!

Thanks for all your help!

Best regards,

Philipp

ADD COMMENT • link 10.2 years ago q017mm • 0

score 0 · Answer 7 · 2014-11-06

I now tried to continue my analysis (by adhering to the code of the leukemia sample data shown in the GSVA documentation (chapter 4.1))...

The Leukemia sample code from the documation works fine:

data(leukemia)
leukemia_eset
head(pData(leukemia_eset))
table(leukemia_eset$subtype)

cacheDir <- system.file("extdata", package="GSVA")
cachePrefix <- "cache4vignette_"
cache(leukemia_es <- gsva(leukemia_eset, c2BroadSets, min.sz=10, max.sz=500, verbose=TRUE), dir=cacheDir, prefix=cachePrefix)
adjPvalueCutoff <- 0.001
logFCcutoff <- log2(2)
design <- model.matrix(~ factor(leukemia_es$subtype))
colnames(design) <- c("ALL", "MLLvsALL")
fit <- lmFit(leukemia_es, design)
fit <- eBayes(fit)
allGeneSets <- topTable(fit, coef="MLLvsALL", number=Inf)
DEgeneSets <- topTable(fit, coef="MLLvsALL", number=Inf, p.value=adjPvalueCutoff, adjust="BH")
res <- decideTests(fit, p.value=adjPvalueCutoff)
summary(res)
GSVAsco <- exprs(leukemia_es[rownames(DEgeneSets), ])
colorLegend <- c("darkred", "darkblue")
names(colorLegend) <- c("ALL", "MLL")
sample.color.map <- colorLegend[pData(leukemia_es)[, "subtype"]]
names(sample.color.map) <- colnames(GSVAsco)
sampleClustering <- hclust(as.dist(1-cor(GSVAsco, method="spearman")), method="complete")
geneSetClustering <- hclust(as.dist(1-cor(t(GSVAsco), method="pearson")), method="complete")
heatmap(GSVAsco, ColSideColors=sample.color.map, xlab="samples", ylab="Gene sets and pathways", margins=c(2, 20), labRow=substr(gsub("_", " ", gsub("^KEGG_|^REACTOME_|^BIOCARTA_", "", rownames(GSVAsco))), 1, 35), labCol="", scale="row", Colv=as.dendrogram(sampleClustering), Rowv=as.dendrogram(geneSetClustering))
legend("topleft", names(colorLegend), fill=colorLegend, inset=0.01, bg="white")
dev.off()

when I try the code with my own data...

dataDirectory <- system.file("extdata", package = "Biobase")
exprsFile <- "D:/Dropbox/R_test/expression.txt" 
exprs <- as.matrix(read.table(exprsFile, header = TRUE, sep = "\t", row.names = 1, as.is = TRUE))

pDataFile <- "D:/Dropbox/R_test/phenotype.txt"
pData <- read.table(pDataFile, row.names=1, header=TRUE, sep="\t")
phenoData <- new("AnnotatedDataFrame",data=pData)

all(rownames(pData)==colnames(exprs))
[1] TRUE

gsc <- getGmt("D:/Dropbox/R_test/geneset.txt")

exampleSet <- ExpressionSet(assayData=exprs, phenoData=phenoData)

cacheDir <- system.file("extdata", package="GSVA")
cachePrefix <- "cache4vignette_"
cache(exampleSet_gsva <- gsva(exprs(exampleSet), gsc, min.sz=5, max.sz=500, verbose=TRUE), dir=cacheDir, prefix=cachePrefix)
adjPvalueCutoff <- 0.001
logFCcutoff <- log2(2)
design <- model.matrix(~ factor(exampleSet_gsva$xy_mut))

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels

This error results from to a problem with my phenotype data: although exampleSet$xy_mut returns "[1] 1 0 1 1 1 0 1 0", exampleSet_gsva$xy_mut returns "NULL" .... but why does my exampleSet_gsva not contain the information on my phenotype?

score 0 · Answer 8 · 2014-11-07

hi Phillip,

first note that the instruction you copied from the vignette:

cache(leukemia_es <- gsva(leukemia_eset, c2BroadSets, min.sz=10, max.sz=500, verbose=TRUE), dir=cacheDir, prefix=cachePrefix)

is incomplete, if you look up the vignette, pg. 8, you will see that it says

cache(leukemia_es <- gsva(leukemia_eset, c2BroadSets, min.sz=10, max.sz=500, verbose=TRUE)$es.obs, dir=cacheDir, prefix=cachePrefix)

so in principle, with your data you should also be doing:

cache(exampleSet_gsva <- gsva(exprs(exampleSet), gsc, min.sz=5, max.sz=500, verbose=TRUE)$es.obs, dir=cacheDir, prefix=cachePrefix)

Note that as described in page 6, the call to the function "cache()" is just for caching the results during the building of the vignette to save time during its computation. I understand this may be misleading and difficult to figure out, but you do not need to call this function, so for you it should be sufficient to do:

exampleSet_gsva <- gsva(exprs(exampleSet), gsc, min.sz=5, max.sz=500, verbose=TRUE)$es.obs

The next thing to realize is that gsva() only returns an 'ExpressionSet' object if the input expression data is also an 'ExpressionSet' object. In your case, however, we are not providing an 'ExpressionSet' object (because our initial discussion that the main identifiers were not Entrez IDs, etc.) but a plain matrix. Therefore, what you get in return is also al matrix. This is not bad, you can simply acces the phenotypic data through the original 'ExpressionSet' object by doing:

design <- model.matrix(~ factor(exampleSet$xy_mut))

if you would like to also have a fully-fledge 'ExpressionSet' object with the GSVA scores, then the input data matrix should be an 'ExpressionSet' object. For that to happen, you should replace the symbol identifiers in the input 'ExpressionSet' object by Entrez Gene identifiers and set the annotation slot to 'org.Hs.eg.db'. But as I point out before, using the phenotypic data from the input 'ExpressionSet' object should work fine as the set of samples is the same.

cheers,

robert.