Creating custom geneset
pbachali ▴ 50
Last seen 4.5 years ago

I am a graduate student doing my master's in Bioinformatics. As part of my project I am performing microarray data analysis. I am trying to find unknown phenotypes using the GSVA package, and to do that I need some reference to compare against. As far as I can tell, GSVA by default uses the c2 (curated gene sets) collection of MSigDB. In my analysis, however, I have a list of genes which are differentially expressed in known phenotypes, and I am trying to generate my own gene set from that list. I believe I need to use the GSEA package to do this, but I am not sure how to proceed in creating gene sets from a list of genes. Any help would be really appreciated.


Thanks in advance.

gseabase • 4.0k views
Last seen 7 days ago
United States

It would probably be easiest to just use the camera function in limma.

Robert Castelo ★ 3.3k
Last seen 1 day ago
Barcelona/Universitat Pompeu Fabra


If I understand the question correctly, you are just asking how to feed self-constructed gene sets to the 'gsva()' function. The help page of this function, which you can access by typing:

library(GSVA)
?gsva

says that the first argument called 'expr' is the gene expression data provided either as an 'ExpressionSet' object or as a 'matrix' object and the second argument called 'gset.idx.list' are the gene sets provided either as a 'list' object or as a 'GeneSetCollection' object.

If the gene identifiers of your gene set of interest are in the same nomenclature as the gene identifiers in your expression data, then the simplest approach is to build a list of gene sets. See the following minimal example:



library(GSVA)
library(GSVAdata)

data(leukemia)
ls()
[1] "leukemia_eset"

## we collect a random sample of gene identifiers from our expression data
## to build a toy gene set for illustrative purposes only

geneids <- sample(featureNames(leukemia_eset), size=100, replace=FALSE)

res <- gsva(leukemia_eset, list(GS1=geneids))
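In practice the gene set would come from your own results rather than from a random sample. Here is a minimal sketch in base R, assuming a hypothetical data frame 'de' with columns 'probe_id', 'logFC' and 'adj.P.Val' (names and values invented for illustration, mimicking a limma-style table) and a character vector 'expr_ids' standing in for the feature names of the expression data you want to score:

```r
## hypothetical differential expression table (values invented for illustration)
de <- data.frame(probe_id  = c("p1", "p2", "p3", "p4", "p5", "p6"),
                 logFC     = c(2.1, -1.8, 0.2, 1.5, -2.4, 0.9),
                 adj.P.Val = c(0.001, 0.004, 0.600, 0.020, 0.010, 0.300),
                 stringsAsFactors = FALSE)

## stand-in for featureNames() of the expression data to be scored
expr_ids <- c("p1", "p2", "p3", "p4", "p6", "p7")

## keep significant probes; the 0.05 cutoff is an arbitrary choice
sig <- de[de$adj.P.Val < 0.05, ]

## one set for up-regulated and one for down-regulated probes
gene_sets <- list(UP   = sig$probe_id[sig$logFC > 0],
                  DOWN = sig$probe_id[sig$logFC < 0])

## drop identifiers that are absent from the expression data
gene_sets <- lapply(gene_sets, intersect, expr_ids)
```

A named list like 'gene_sets' could then be given to 'gsva()' in place of 'list(GS1=geneids)'. Splitting by direction of change is only one possible design; a single set with all significant probes would work the same way.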

If your gene set of interest is built with genes whose identifiers follow a different nomenclature from the one in the expression data, then it is useful to build the gene sets as a 'GeneSetCollection' object. This is usually the case with gene sets curated from the literature using gene symbols, while the expression data may be based on probe, Entrez or Ensembl identifiers. With a 'GeneSetCollection', GSVA will use the BioC infrastructure to automagically map between the two types of gene identifiers. To build such an object please consult the documentation of the 'GSEABase' package.
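As a rough sketch of what that could look like, assuming the 'GSEABase' package is installed and using three made-up gene symbols and the set name "GS1" purely for illustration:

```r
library(GSEABase)

## made-up gene symbols, for illustration only
symbols <- c("TP53", "MYC", "BRCA1")

## declare the identifier type so that it can be mapped later on
gs <- GeneSet(symbols, geneIdType = SymbolIdentifier(), setName = "GS1")

## a collection may hold one or many such gene sets
gsc <- GeneSetCollection(list(gs))
gsc
```

As far as I know, the mapping relies on the 'annotation' slot of the 'ExpressionSet' object naming the matching annotation package, so it is worth checking that slot before calling 'gsva()' with the collection.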



pbachali ▴ 50
Last seen 4.5 years ago

Hi Robert,

That's a clear and neat explanation. I now understand that if my gene identifiers are the same as those in my expression set, I can make a gene set object.

Let me explain my scenario clearly. I have three phenotypes: "active patients", "inactive patients" and "control patients" (the controls are simply healthy individuals). In the first dataset I know the phenotypes (cohorts): active patients and control patients. I have done the microarray data analysis on it and found the genes differentially expressed in "Active patients vs. Control patients". The output for this first dataset is the list of differentially expressed genes with their probe_ids, p-values, FDR-corrected p-values and log fold changes. The second dataset contains control samples and patient samples that are either "Active patients" or "Inactive patients", but we do not know which patient samples are active and which are inactive. I am assuming that GSVA can identify the unknown phenotypes here.

My plan is to compare the expression set built from the control samples and the unknown phenotype (i.e. active or inactive patient samples) with the differentially expressed genes identified between "Control samples vs. Known phenotype (i.e. Active patient samples)". My first question is: can I make the gene set object using the differentially expressed genes I identified in the first dataset (with known phenotypes)? From your previous answer, I believe I can do this by matching the gene identifiers in the output of my first dataset with the gene identifiers of my expression set (built from the control samples and the unknown phenotype). I would then use the gsva function, which creates the matrix of gene set enrichment scores. However, I am unable to interpret the output I get after applying the gsva function to my eset with the gene set object. Here is a sample one I have made using your previous example:

geneids <- sample(featureNames(leukemia_eset), size = 100, replace=FALSE)

res <- gsva(leukemia_eset, list(GS1=geneids))

summary(res)
            Length Class         Mode
es.obs      1      ExpressionSet S4  
bootstrap   2      -none-        list
p.vals.sign 0      -none-        NULL

Now, how can I use res to find my unknown phenotypes? Is res a matrix of gene set enrichment scores that assigns some kind of rank to the genes?

Is it possible to use the res object to generate a heatmap, so that I can see how genes are expressed in the unknown samples compared to the known samples?

I know I have asked many questions. I am having a really hard time figuring out this issue, and I cannot move forward in my research until I work out this step. Any suggestions or ideas are much appreciated.



