GSVA: problem with chip annotation
8
0
Entering edit mode
q017mm • 0
@q017mm-6963
Last seen 6.9 years ago
Germany

Dear Bioconductors,

I recently started using Bioconductor (sorry to be a total newbie, but i'm just finding my feet with R...) and I would like to perform a geneset variation analysis (GSVA).

I have the following data:

RNA-expression-file with Gene Symbols in the first column and the corresponding expression values for the individual samples in the following columns (simplified sample):

 NAME sample 1 sample 2 sample 3 sample 4 sample 5 A1BG 5.045152135 3.675982862 4.034678849 1.586550605 1.416436753 A2BP1 1.216242033 26.73442081 19.98573476 2.379825908 6.041536357 A2LD1 0.720736019 7.351965723 2.796126272 0.741539957 1.040647411 A2ML1 7.83800421 26.5761249 9.458037859 9.467568285 9.134571716 A2M 793.6429722 914.6689604 1045.075652 619.8239331 624.4462602 A4GALT 3.941525105 8.143445287 2.402041361 1.879717565 6.590766935 AAAS 31.39706284 18.34473744 30.41960192 65.66939896 30.612378

Phenotype file (simplified sample):

 xy_mutation sample 1 1 sample 2 0 sample 3 1 sample 4 0 sample 5 1

Geneset (gmt) file (simplified sample):

 geneset_1 geneset_1 A1BG A2BP1 A2LD1 A2ML1 A2M A4GALT AAAS geneset_2 geneset_2 A2LD1 A2ML1 A2M A4GALT AAAS geneset_3 geneset_3 A2BP1 A2LD1 A2ML1

The following code initially works fine for processing the above shown data:

#load mRNA expression data

dataDirectory <- system.file("extdata", package = "Biobase")
exprsFile <- "C:/expression.txt"
sep = "\t", row.names = 1, as.is = TRUE))

#load phenotype data
pDataFile <- "C:/phenotype.txt"
phenoData <- new("AnnotatedDataFrame",data=pData)

#check
all(rownames(pData)==colnames(exprs))

#load genesets from gmt file
gsc <- getGmt("C:/GS.txt", geneIdType=SymbolIdentifier(), collectionType=NullCollection(), sep="\t"
)

#build expressionset
exampleSet <- ExpressionSet(assayData=exprs,
phenoData=phenoData)

#perform gsva
gsva(exampleSet, gsc, min.sz=4, max.sz=500, verbose=TRUE)

However here I get the following error:

Mapping identifiers between gene sets and feature names
Fehler in GSEABase::mapIdentifiers(gset.idx.list, GSEABase::AnnoOrEntrezIdentifier(annotation(expr))) :
Fehler bei der Auswertung des Argumentes 'to' bei der Methodenauswahl
für Funktion 'mapIdentifiers': Fehler in validObject(.Object) :
invalid class “ScalarCharacter” object: ScalarCharacter must have length one

So the problem starts when mapping the identifiers between gene sets and feature names - I guess the problem may be that I built my ExpressionSet object without the annotation argument. I thought this is the right way, since my expression file already contains gene symbols (and not a chip-ID (like Affy-ID or Illumina-ID)) which already corresponds to the gene symbols in my geneset. How can i fix this problem?

Thank you

Philipp

gsva • 2.3k views
0
Entering edit mode
Robert Castelo ★ 2.7k
@rcastelo
Last seen 11 weeks ago
Barcelona/Universitat Pompeu Fabra

hi Philipp,

since identifiers in both, the input expression data and the input gene sets are of the same kind (gene symbols) you can directly pass to the gsva() function the expression data 'matrix', instead of the 'ExpressionSet' object. To do that you just use the 'exprs()' "get" method for 'ExpressionSet' objects:

library(Biobase)
gsva(exprs(exampleSet), gsc, min.sz=4, max.sz=500, verbose=TRUE)

If you want to know what combinations of expression data object and gene-set objects the function gsva() can take, just type:

showMethods("gsva")
Function: gsva (package GSVA)
expr="ExpressionSet", gset.idx.list="GeneSetCollection", annotation="missing"
expr="ExpressionSet", gset.idx.list="list", annotation="missing"
expr="matrix", gset.idx.list="GeneSetCollection", annotation="character"
expr="matrix", gset.idx.list="list", annotation="missing"

Note that while with 'ExpressionSet' objects there is no need to provide an 'annotation' argument, with a matrix object storing the expression values this is optional. In you case you need not to provide an 'annotation' argument because you are using the same type of gene identifier in both, the expression data matrix and in the collection of gene sets, but if they were different, you could use this third argument to inform 'gsva()' how to map the identifiers without having to build a fully-fledged 'ExpressionSet' object.

cheers,

robert.

0
Entering edit mode
q017mm • 0
@q017mm-6963
Last seen 6.9 years ago
Germany

Hi Robert,

As suggested I now used the 'exprs()' "get" method for my 'ExpressionSet' object. It seems to work (at least the aforementioned error message disappeared) - however after performing gsva I now get a different error message:

gsva(exprs(exampleSet), gsc, min.sz=4, max.sz=500, verbose=TRUE)

Error in base::match(x, table, nomatch = nomatch, incomparables = incomparables, : 'match' requires vector arguments

> traceback()
10: base::match(x, table, nomatch = nomatch, incomparables = incomparables,
...)
9: match(x, y)
8: match(x, y)
7: na.omit(match(x, y))
6: FUN(X[[1L]], ...)
5: lapply(gset.idx.list, function(x, y) na.omit(match(x, y)), rownames(expr))
4: lapply(gset.idx.list, function(x, y) na.omit(match(x, y)), rownames(expr))
3: .local(expr, gset.idx.list, annotation, ...)
2: gsva(exprs(exampleSet), gsc, min.sz = 4, max.sz = 500, verbose = TRUE)
1: gsva(exprs(exampleSet), gsc, min.sz = 4, max.sz = 500, verbose = TRUE)
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=German_Austria.1252  LC_CTYPE=German_Austria.1252    LC_MONETARY=German_Austria.1252 LC_NUMERIC=C                    LC_TIME=German_Austria.1252

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] sigPathway_1.32.1    GSVAdata_1.0.0       hgu95a.db_2.14.0     org.Hs.eg.db_2.14.0  RSQLite_1.0.0        DBI_0.3.1            GSVA_1.12.0          GSEABase_1.26.0      graph_1.42.0         annotate_1.42.1      AnnotationDbi_1.26.1
[12] GenomeInfoDb_1.0.2   RColorBrewer_1.0-5   limma_3.20.9         genefilter_1.46.1    Biobase_2.24.0       BiocGenerics_0.10.0

loaded via a namespace (and not attached):
[1] IRanges_1.22.10 splines_3.1.1   stats4_3.1.1    survival_2.37-7 tools_3.1.1     XML_3.98-1.1    xtable_1.7-4
>
 Any further suggestions how I can fix this? Best regards, Philipp  
0
Entering edit mode
Robert Castelo ★ 2.7k
@rcastelo
Last seen 11 weeks ago
Barcelona/Universitat Pompeu Fabra

hi Philipp,

the error you get is caused by the specific example expression data you are using. In the 'NAME' column containing the gene symbol there is an extra space right after each symbol and before the tab separator. If you remove that space, it should work fine.

However, the error provided by GSVA is certainly not informative of what was happening, so I have corrected the code to force the 'gsva()' function giving a more informative error. In the newer version, you would be getting the following error:

gsva(exprs(exampleSet), gsc, min.sz=4, max.sz=500, verbose=TRUE)
Error in .local(expr, gset.idx.list, ...) :
No identifiers in the gene sets could be matched to the identifiers in the expression data.

To work with this newer version, you have to update to the latest release of Bioconductor 3.0. From the version numbers I see in your sessionInfo() output, they seem to correspond to an earlier version. See http://www.bioconductor.org/install for how to upgrade to the current release version.

The corrected version of GSVA in the current release will be version 1.14.1 and will be available via biocLite() on thursday.

cheers,

robert.

0
Entering edit mode
q017mm • 0
@q017mm-6963
Last seen 6.9 years ago
Germany

Hi Robert,

Thanks for your reply! I now updated to the latest release of Bioconductor 3.0. Furthermore I checked the NAME column of my expression-set as suggested, however there is no extra-space right after the symbol and before the tab seperator (therefore I´m still getting the aforementioned error message...).

Would you be so kind to double-check my sample data - you can find the data via https://www.dropbox.com/sh/i4l9cip0a6l2yun/AADbAQleayJZYcUkNRR1sL8Fa?dl=0

Best regards,
Philipp

checked agin my files

0
Entering edit mode
Robert Castelo ★ 2.7k
@rcastelo
Last seen 11 weeks ago
Barcelona/Universitat Pompeu Fabra

hi Phillip,

since I copied & pasted the data from this web page I guess that extra space must have sneaked into the page somehow, an Unix 'od -c' command on the first two lines reveals it:

0000000   N   A   M   E      \t   s   a   m   p   l   e       1      \t
0000020   s   a   m   p   l   e       2      \t   s   a   m   p   l   e
0000040       3      \t   s   a   m   p   l   e       4      \t   s   a
0000060   m   p   l   e       5  \n   A   1   B   G      \t   5   .   0
0000100   4   5   1   5   2   1   3   5      \t   3   .   6   7   5   9
0000120   8   2   8   6   2      \t   4   .   0   3   4   6   7   8   8
0000140   4   9      \t   1   .   5   8   6   5   5   0   6   0   5
0000160  \t   1   .   4   1   6   4   3   6   7   5   3

but anyway, the original data you put on the dropbox do not have these extra space characters and the script runs smoothly in my linux box (i do not have a windows machine to try it) with the last version of GSVA 1.14.1 which will become available tomorrow thursday via biocLite(), since it takes a couple of days at the BioC build machines to propagate changes to the software.

if you already updated to BioC 3.0 then tomorrow just do

biocLite("GSVA")

and answer "y" to the question whether you want to update any other package (if the question pops up). Verify that the installation upgraded GSVA to version 1.14.1 by doing:

library(GSVA)
sessionInfo()

and checking that the version is indeed 1.14.1.

please let me know if this works. Do not hesitate to report any other problem,  users experience is what helps improving the software!

cheers,

robert.

0
Entering edit mode
q017mm • 0
@q017mm-6963
Last seen 6.9 years ago
Germany

Hi Robert,

just updated to GSVA 1.14.1 - now gsva() works without producing any errors!

Best regards,

Philipp

0
Entering edit mode
q017mm • 0
@q017mm-6963
Last seen 6.9 years ago
Germany

I now tried to continue my analysis (by adhering to the code of the leukemia sample data shown in the GSVA documentation (chapter 4.1))...

The Leukemia sample code from the documation works fine:

data(leukemia)
leukemia_eset
table(leukemia_eset$subtype) cacheDir <- system.file("extdata", package="GSVA") cachePrefix <- "cache4vignette_" cache(leukemia_es <- gsva(leukemia_eset, c2BroadSets, min.sz=10, max.sz=500, verbose=TRUE), dir=cacheDir, prefix=cachePrefix) adjPvalueCutoff <- 0.001 logFCcutoff <- log2(2) design <- model.matrix(~ factor(leukemia_es$subtype))
colnames(design) <- c("ALL", "MLLvsALL")
fit <- lmFit(leukemia_es, design)
fit <- eBayes(fit)
allGeneSets <- topTable(fit, coef="MLLvsALL", number=Inf)
summary(res)
GSVAsco <- exprs(leukemia_es[rownames(DEgeneSets), ])
colorLegend <- c("darkred", "darkblue")
names(colorLegend) <- c("ALL", "MLL")
sample.color.map <- colorLegend[pData(leukemia_es)[, "subtype"]]
names(sample.color.map) <- colnames(GSVAsco)
sampleClustering <- hclust(as.dist(1-cor(GSVAsco, method="spearman")), method="complete")
geneSetClustering <- hclust(as.dist(1-cor(t(GSVAsco), method="pearson")), method="complete")
heatmap(GSVAsco, ColSideColors=sample.color.map, xlab="samples", ylab="Gene sets and pathways", margins=c(2, 20), labRow=substr(gsub("_", " ", gsub("^KEGG_|^REACTOME_|^BIOCARTA_", "", rownames(GSVAsco))), 1, 35), labCol="", scale="row", Colv=as.dendrogram(sampleClustering), Rowv=as.dendrogram(geneSetClustering))
legend("topleft", names(colorLegend), fill=colorLegend, inset=0.01, bg="white")
dev.off()

when I try the code with my own data...

dataDirectory <- system.file("extdata", package = "Biobase")
exprsFile <- "D:/Dropbox/R_test/expression.txt"
exprs <- as.matrix(read.table(exprsFile, header = TRUE, sep = "\t", row.names = 1, as.is = TRUE))

pDataFile <- "D:/Dropbox/R_test/phenotype.txt"
phenoData <- new("AnnotatedDataFrame",data=pData)

all(rownames(pData)==colnames(exprs))
[1] TRUE

gsc <- getGmt("D:/Dropbox/R_test/geneset.txt")

exampleSet <- ExpressionSet(assayData=exprs, phenoData=phenoData)

cacheDir <- system.file("extdata", package="GSVA")
cachePrefix <- "cache4vignette_"
cache(exampleSet_gsva <- gsva(exprs(exampleSet), gsc, min.sz=5, max.sz=500, verbose=TRUE), dir=cacheDir, prefix=cachePrefix)
logFCcutoff <- log2(2)
design <- model.matrix(~ factor(exampleSet_gsva$xy_mut)) Error in contrasts<-(*tmp*, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels This error results from to a problem with my phenotype data: although exampleSet$xy_mut returns "[1] 1 0 1 1 1 0 1 0", exampleSet_gsva$xy_mut returns "NULL" .... but why does my exampleSet_gsva not contain the information on my phenotype? ADD COMMENT 0 Entering edit mode Robert Castelo ★ 2.7k @rcastelo Last seen 11 weeks ago Barcelona/Universitat Pompeu Fabra hi Phillip, first note that the instruction you copied from the vignette: cache(leukemia_es <- gsva(leukemia_eset, c2BroadSets, min.sz=10, max.sz=500, verbose=TRUE), dir=cacheDir, prefix=cachePrefix) is incomplete, if you look up the vignette, pg. 8, you will see that it says cache(leukemia_es <- gsva(leukemia_eset, c2BroadSets, min.sz=10, max.sz=500, verbose=TRUE)$es.obs, dir=cacheDir, prefix=cachePrefix)

so in principle, with your data you should also be doing:

cache(exampleSet_gsva <- gsva(exprs(exampleSet), gsc, min.sz=5, max.sz=500, verbose=TRUE)$es.obs, dir=cacheDir, prefix=cachePrefix) Note that as described in page 6, the call to the function "cache()" is just for caching the results during the building of the vignette to save time during its computation. I understand this may be misleading and difficult to figure out, but you do not need to call this function, so for you it should be sufficient to do: exampleSet_gsva <- gsva(exprs(exampleSet), gsc, min.sz=5, max.sz=500, verbose=TRUE)$es.obs

The next thing to realize is that gsva() only returns an 'ExpressionSet' object if the input expression data is also an 'ExpressionSet' object. In your case, however, we are not providing an 'ExpressionSet' object (because our initial discussion that the main identifiers were not Entrez IDs, etc.) but a plain matrix. Therefore, what you get in return is also al matrix. This is not bad, you can simply acces the phenotypic data through the original 'ExpressionSet' object by doing:

design <- model.matrix(~ factor(exampleSet\$xy_mut))

if you would like to also have a fully-fledge 'ExpressionSet' object with the GSVA scores, then the input data matrix should be an 'ExpressionSet' object. For that to happen, you should replace the symbol identifiers in the input 'ExpressionSet' object by Entrez Gene identifiers and set the annotation slot to 'org.Hs.eg.db'. But as I point out before, using the phenotypic data from the input 'ExpressionSet' object should work fine as the set of samples is the same.

cheers,

robert.