Question

GSEA using clusterProfiler

0

Entering edit mode

giuseppe0525 • 0

@giuseppe0525-14327

Last seen 5.6 years ago

Hello, everyone

I 'm working on GSEA using clusterProfiler and came across two problems.

1. How should I prepare geneList?

2. What does TERM2GENE mean?

I've worked out the differentially expressed genes from mouse microarray data.

Then I tried the following script but failed to obtain the results.

"

library(clusterProfiler)

geneList <- results$gene.id
geneList <- levels(geneList)
geneList = sort(geneList, decreasing = TRUE)
head(geneList)

term <- rep("ischemia24h",length.out = 844)
head(term)

TERM2GENE <- cbind(term, geneList)
head(TERM2GENE)

GSEA(geneList,
     exponent = 1,
     nPerm = 1000,
     minGSSize = 2,
     maxGSSize = 5000,
     pvalueCutoff = 0.05,
     pAdjustMethod = "BH",
     TERM2GENE,
     TERM2NAME = NA,
     verbose = TRUE,
     seed = FALSE,
     by = "fgsea")

"

##

preparing geneSet collections...
--> Expected input gene ID: 94094,223650,17926,12831,29875,68178
Error in check_gene_id(geneList, geneSets) : 
  --> No gene can be mapped....

Thank you!

GSEA clusterProfiler • 12k views

ADD COMMENT • link 6.5 years ago • updated 6.4 years ago giuseppe0525 • 0

1

Entering edit mode

Hi,

It seems that you have the same issue as in your previous question regarding clusterProfiler. Have you checked that your supplied gene list of the right type and from the right species? It would be helpful if you also showed the output of the function:

>head(geneList)

Regarding the TERM2GENE I found this section in the vignette to the package:

"TERM2GENE is a data.frame with first column of term ID and second column of corresponding mapped gene and TERM2NAME is a data.frame with first column of term ID and second column of corresponding term name. TERM2NAME is optional."

ADD REPLY • link 6.5 years ago thokall ▴ 160

score 0 · Answer 1 · 2017-11-20

0

Entering edit mode

giuseppe0525 • 0

@giuseppe0525-14327

Last seen 5.6 years ago

It seems that the problem resulted from the format of geneList, but I don't know how to create a data collection in the same format.

I checked the example provided by the developer of the package and found that the geneList was a numeric vector whose element contained two numbers. One is the ID of gene, and the other seems to indicate the difference of expression among groups.

The description of the geneList is shown as followed:

> data(geneList, package="DOSE")
> class(geneList)
# [1] "numeric"
> head(geneList)
# 4312 8318 10874 55143 55388 991
#4.572613 4.514594 4.418218 4.144075 3.876258 3.677857
> length(geneList)
[1] 12495

Would you tell me how to generate data in such format? I'd really appreciate your help.

ADD COMMENT • link 6.4 years ago giuseppe0525 • 0

0

Entering edit mode

Please specify how you prepared the data you would like to use.

ADD REPLY • link 6.4 years ago Guido Hooiveld ★ 3.9k

0

Entering edit mode

Thanks for your reply.

In brief, I obtained the differentially expressed genes from microarray data based on Limma.

The results are organized in a data frame with columns like gene ID, logFC, p value, adj. p value etc. In addition, I have the expression set from which I got the DEGs.

The following are parts of the results and expression set:

> head(results)
logFC AveExpr t P.Value adj.P.Val B ID gene.symbols gene.id change
1421207_at 4.818821 4.522985 44.58554 2.695409e-11 6.069522e-07 13.89607 1421207_at Lif 16878 UP
1449984_at 7.304190 6.107650 28.30740 1.205558e-09 8.434865e-06 11.91560 1449984_at Cxcl2 20310 UP
1416431_at 6.592971 5.841120 27.52500 1.522888e-09 8.434865e-06 11.76300 1416431_at Tubb6 67951 UP
1426063_a_at 3.948953 4.134480 27.31784 1.621853e-09 8.434865e-06 11.72128 1426063_a_at Gem 14579 UP
1418945_at 7.182747 6.057138 26.72206 1.948932e-09 8.434865e-06 11.59811 1418945_at Mmp3 17392 UP
1457644_s_at 7.437478 6.163691 26.26859 2.247499e-09 8.434865e-06 11.50108 1457644_s_at <NA> <NA> UP

head(eset)
GSM805714.CEL GSM805715.CEL GSM805716.CEL GSM805717.CEL GSM805741.CEL GSM805742.CEL GSM805743.CEL GSM805744.CEL
1415670_at 11.02038 11.16112 10.97783 11.08388 10.83234 10.94515 10.81384 10.92230
1415671_at 12.74677 12.72211 12.62761 12.64847 12.67121 12.51231 12.52843 12.66846
1415672_at 11.72221 11.68903 11.75651 11.78890 12.03472 11.91181 12.04743 11.95583
1415673_at 10.17084 10.15074 10.22884 10.11947 10.42166 10.68337 10.17406 10.55455
1415674_a_at 10.71065 10.70821 10.75942 10.60990 10.36245 10.51380 10.63063 10.61658
1415675_at 11.45855 11.48796 11.40529 11.41186 11.19422 11.13612 11.30387 11.20107

Although I could not use the gseGO() and GSEA() functions correctly, I managed to use the enrichGO() and enrichKEGG() functions, which require gene as input. So I speculate the reason why I failed to operate gseGO() and GSEA() functions is that I cannot create a geneList in the same format as the example of clusterprofiler.

Looking forward to your reply.

ADD REPLY • link 6.4 years ago giuseppe0525 • 0

1

Entering edit mode

I always use the limma output (which is of class MArrayLM) directly by subsetting/extracting the contrast of interest. It should work very similar when having your data in a data frame.

#check comparisons/contrasts that were made (and are stored in the fit object)
colnames(fit2$contrasts)

#[1] "A7"    "M19"    "MvsA.delta"

# Extract contrast of interest.
# I like to use the moderated t-value, but you can also use logFC or log(pvalue).
# In case of latter, be sure to use 'signed-ranked' list, so up/down is clear.
# Gene symbols are also stored in limma object, and used as rownames. This is done
# because in the list of gene sets also symbols are used. 

A7.t <- fit2$t[,"A7"]
names(A7.t) <- fit2$genes$ID
A7.t<- sort(A7.t, decreasing = TRUE)


> head(A7.t)
       ARNTL        NFIL3      FAM124B LOC105369382         SGK1         NRG2
   14.261763     9.442942     8.845451     8.795412     8.523587     8.460763
>


class(fit2)
#[1] "MArrayLM"
#attr(,"package")
#[1] "limma"

Run GSEA:
# I use a subset of information from the KEGG databases (as *.gmt; 239 gene sets) that I obtained outside of clusterProfiler.

gsea.A <- GSEA(geneList=A7.t, nPerm=10000, minGSSize = 15, maxGSSize = 500, pvalueCutoff = 0.05, pAdjustMethod = "BH", TERM2GENE=pathways, TERM2NAME=GS2name.hs)

#> dim(pathways)
[1] 19836     2
# content pathways
> head(pathways)
                                        ont   gene
1 hsa00970.Aminoacyl.tRNA.biosynthesis.KEGG  FARSB
2 hsa00970.Aminoacyl.tRNA.biosynthesis.KEGG  WARS2
3 hsa00970.Aminoacyl.tRNA.biosynthesis.KEGG  FARS2
4 hsa00970.Aminoacyl.tRNA.biosynthesis.KEGG   PSTK
5 hsa00970.Aminoacyl.tRNA.biosynthesis.KEGG  MTFMT
6 hsa00970.Aminoacyl.tRNA.biosynthesis.KEGG TARSL2
>

# conversion of KEGG descriptions to 'nicer' descriptions
> dim(GS2name.hs)
[1] 239   2
> head(GS2name.hs)
     names.set.hs                                      descr.set.hs                      
[1,] "hsa00970.Aminoacyl.tRNA.biosynthesis.KEGG"       "Aminoacyl-tRNA biosynthesis"     
[2,] "hsa02010.ABC.transporters.KEGG"                  "ABC transporters"                
[3,] "hsa03008.Ribosome.biogenesis.in.eukaryotes.KEGG" "Ribosome biogenesis in eukaryotes"
[4,] "hsa03010.Ribosome.KEGG"                          "Ribosome"                        
[5,] "hsa03013.RNA.transport.KEGG"                     "RNA transport"                   
[6,] "hsa03015.mRNA.surveillance.pathway.KEGG"         "mRNA surveillance pathway"       
>

> gsea.A <- GSEA(geneList=A7.t, nPerm=10000, minGSSize = 15, maxGSSize = 500, pvalueCutoff = 0.05, pAdjustMethod = "BH", TERM2GENE=pathways, TERM2NAME=GS2name.hs)
preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...
>