How can I limit the number of displayed genes in heatplot
0
0
Entering edit mode
Melanie • 0
@47c2ea1e
Last seen 15 months ago
Germany

I would like to limit the number of displayed genes (so not the categories, which is easily done with showCategory) in the enrichplot heatplot, but have not found a way to do it. I am running the gseGO Function to get my enriched gene sets and then passing that to heatplot. I don't want to limit the number of genes before the gse analysis itself and I also don't want to change the number of the maxGeneSetSize, but would like to only limit the heatplot to the top n (for example 10) genes per gene set to be plotted. If I plot it as it is now, it is impossible to read any of the gene names, because there are just way too many per set.


gse <- gseGO(geneList=gene_list,
ont ="ALL",
keyType = "ENSEMBL",
pvalueCutoff = config_pvalueCutoff,
verbose = TRUE,
OrgDb = organism)

readable_gene_names <- setReadable(gse, 'org.Hs.eg.db', 'ENSEMBL')
heatplot(readable_gene_names, foldChange=gene_list, showCategory=5)

sessionInfo( )

R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.6.2

heatplot genes overlapping

heatplot enrichplot • 2.0k views
ADD COMMENT
0
Entering edit mode

You asked a rather straightforward question, but in my opinion the answer is less straightforward...

Let me start by saying that from a 'technical' perspective it is indeed possible to limit the number of genes that should be plotted'. To do that you have to limit the core enrichment genes present in results object. See code below in which I apply a small 'hack' to replace for each GO category the full set of core enrichment genes in thecore_enrichment column of the gseaResult object by 5 randomly selected core enrichment genes present in that category. These 5 randomly selected genes are then plotted.

The difficult question is: which genes would you like to show in the heatplot? In other words, how exactly will you define which of the core enriched genes to show, and which ones not? The nice thing of the heatplot is that is shows the overlap of core enriched genes between the GO categories. However, if you, for example, select only 5 random genes, the information on genes present in multiple gene sets is lost. This may dramatically affect the interpretation of the results... (which also becomes apparent when comparing the heatplots below).

Just my 2 cents....

> ## run gseGO using sample data set.
> ## note that these use entrez ids, and not ensembl ids.
>
> library(clusterProfiler)
> library(org.Hs.eg.db)
> data(geneList, package="DOSE")
> 
> ego3 <- gseGO(geneList     = geneList,
+               OrgDb        = org.Hs.eg.db,
+               ont          = "ALL",
+               minGSSize    = 100,
+               maxGSSize    = 500,
+               eps          = 0,
+               pvalueCutoff = 0.05,
+               verbose      = FALSE)
> 
> 
> readable_gene_names <- setReadable(ego3, 'org.Hs.eg.db', 'ENTREZID')
> ## show heatplot
> heatplot(readable_gene_names, foldChange=geneList, showCategory=5)

enter image description here

> ## limit number of genes to be plotted; below per category 5 random genes are selected.
> ## This is done in 3 steps: split content on forward slashes (using function str_split),
> ## for each list element select 5 random genes (using lapply with the function sample and the size argument = 5),
> ## and then collapse the 5 genes into a single string separated by a forward slash
> ## (using sapply with the function paste and the collapse argument = "/" )
> library(stringr)
> random.core.genes  <- sapply(
+                         lapply(
+                            str_split(as.data.frame(readable_gene_names)[ ,"core_enrichment"] , "/") ,
+                         sample, size=5),      ## 5=number of random genes to select
+                       paste, collapse="/")
>
>
> ## Perform 'hack' by replacing core_enrichment in the gseaResult object.
> readable_gene_names@result$core_enrichment <- random.core.genes
>
> ## show heatplot
> heatplot(readable_gene_names, foldChange=geneList, showCategory=5)

enter image description here

ADD REPLY
0
Entering edit mode

Hi Guido,

thank you so much for taking the time to look into this and your proposed solution (Sorry it took me a bit to reply)! I see the issue with actually selecting the relevant genes to display from each set. I am still pretty new with this stuff (trying to figure it out for my master thesis), so am I understanding correctly that within each core enrichment set the result object does not actually have any ordering of how much each of the genes in the set contributes to the set? I was thinking if there was such an order, then I could select the top 5 contributing genes to that gene set. Maybe taking aside how this is kind of a hacky solution, how do people usually deal with this issue? Is the heatmap plot only meant for much smaller list of input genes? I would assume a lot of people would run into a similar issue.

Thank you so much!

ADD REPLY
0
Entering edit mode

AFAIK the core-enriched genes are indeed ordered based on their ranking metric.

In other words, the first gene listed is the one belonging to the gene set tested that is most at the top (or bottom) of the ranked input list. So if you would select the first 5 genes, then these would correspond to the 5 most changed genes (based on the ranking metric).

For fine-tuning of the heatplot you should realize 2 things;

  1. the function heatplot has an argument called label_format (default = 30) that sets the space used for plotting the descriptions of the GO categories. Lowering this number generates more space for plotting the genes

  2. in essence a ggplot2-based graph is generated when executing the function heatplot. You can thus use the ggplot2 syntax to modify the heatplot. Below some code to get you inspired.

Lastly, I noticed you used the argument ont ="ALL" when calling gseGO. You could also consider limiting the GO categories to BP, MF or CC separately.

Code using the unmodified output from gseGO, and reducing the space for plotting the description of the GO categories to 15:

> p1 <- heatplot(readable_gene_names,
+   showCategory = 5,
+   foldChange = geneList,
+   label_format = 15
+ )
> p1  ## print the plot to the screen 
>

enter image description here

> ## rotate the plot
> p2 <- p1 + ggplot2::coord_flip()
> p2   ## print the plot to the screen
>

enter image description here

> ## reduce names of genes at y-axis
> p3 <- p2 + theme(axis.text.y = element_text(size = 5))
> p3   ## print the plot to the screen

enter image description here

> ## example of changing the color scale
> p4 <- p3 + viridis::scale_fill_viridis()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
> p4    ## print the plot to the screen
>

enter image description here

> ## combine last 2 plots in a single plot with A and B panel using function plot_grid from package cowplot.
> cowplot::plot_grid(p3, p4, ncol=1, labels=LETTERS[1:2])

enter image description here

ADD REPLY
0
Entering edit mode

Thank you for the explanation! I was having the same problem except I am using heatplot for KEGG analysis, the thing is while I am trying to use size >5 I get error like the following, could you please explain what might be the problem and how to fix it?

 random.core.genes <- sapply(
  lapply(str_split(as.data.frame(keggx)[ ,"geneID"] , "/"),
          sample, size=10 , replace=TRUE),
  paste, collapse="/")

#output
Error in sample.int(length(x), size, replace, prob) : 
  cannot take a sample larger than the population when 'replace = FALSE'

thank you in advance!

Regards,

Ruba

ADD REPLY
0
Entering edit mode

it worked again.

ADD REPLY

Login before adding your answer.

Traffic: 1009 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6