Hi there guys.
I have gene expression data collected from nCounter with a panel of around 700 genes (Pancancer immune profiling panel). I performed differential expression analysis and now I want to see which immune-related pathways are over or under expressed in my samples. In my nCounter samples I do not have normal tissue so I can´t perform a differential expression analysis between tumor and non-tumor samples (instead I can make between tumor subgroups, but that is not of interest for the discussion).
So, in order to overcome this I downloaded a public dataset that consists of several tumor and non-tumor datasets that were joined through batch-correction. The data is on Log2 RUV-normalized format.
So in order to perform GSEA analysis between a subgroup of the tumor samples and the non-tumor samples I tried to use both the fgsea and clusterProfiler packages in R, but both packages either provide me with p-significant values but no significant adjusted p-values or with only one pathway being significant. I am using the C7 ImmuneSigDB genesets for both packages. Below I send you my data and script for the gsea analysis with both packages.
I am wondering if the fact that I am using a small and already "enriched" (for immune pathways) panel can have an impact on my results. Also if the data format can have something to do with it.
Should I define the background genes of the fgsea() function to the genes present in my panel? If so how can I do that?
Hope you can help me, Thank you in advance
> head(data_limma, c(6L, 6L))
Patient0 Patient1 Patient2 Patient3 Patient4 Patient5
A2M 0.2895 1.2463 -0.1309 -0.8332 -0.5526 0.0017
ABCB1 0.5023 1.2426 -0.5671 -0.5815 -0.1716 1.1230
ABCF1 0.0848 0.5559 0.1182 0.0412 -0.0118 0.1299
ACKR2 -0.0767 0.2744 0.3907 -0.3495 0.0176 0.2393
ADA -0.0226 -0.0375 0.7206 -0.0117 -0.3979 -0.1945
ADGRE2 0.1158 0.1684 -0.2642 0.1162 -0.3081 0.0663
fgsea script
genes <- rownames(de_genes_SHH_NORMAL)
de_genes_SHH_NORMAL$symbol <- genes
library(tidyverse)
res2 <- de_genes_SHH_NORMAL %>%
dplyr::select(symbol, logFC) %>%
na.omit() %>%
distinct() %>%
group_by(symbol) %>%
arrange(desc(logFC))
library(msigdbr)
hs_immune_df <- msigdbr(species = "Homo sapiens",
category = "C7",
subcategory = "IMMUNESIGDB")
ranks <- deframe(res2)
head(ranks, 20)
fgsea_SHH <- fgsea(pathways = hs_immune_df, stats=ranks, nperm = 1000)
fgsea_SHH2 <- fgseaMultilevel(pathways = hs_immune_df, stats = ranks, maxSize = 500)
fgseaResTidy <- fgsea_SHH %>%
as_tibble() %>%
arrange(desc(NES))
# Show in a nice table:
fgseaResTidy %>%
dplyr::select(-leadingEdge, -ES) %>%
arrange(padj) %>%
DT::datatable()
ClusterProfiler script
library(clusterProfiler)
hs_immune_df <- msigdbr(species = "Homo sapiens",
category = "C7",
subcategory = "IMMUNESIGDB")
Pre-Ranked gene list
The GSEA() function takes a pre-ranked (sorted) named vector of statistics, where the names in the vector are gene identifiers. This is step 1 â gene-level statistics.
lfc_vector <- de_genes_SHH_NORMAL %>%
# Extract a vector of `log2FoldChange` named by `gene_symbol`
dplyr::pull(logFC, name = symbol)
lfc_vector <- sort(lfc_vector, decreasing = TRUE)
# Look at first entries of the log2 fold change vector
head(lfc_vector)
# Look at the last entries of the log2 fold change vector
tail(lfc_vector)
Run GSEA
Now for the analysis!
gsea_results <- GSEA(geneList = lfc_vector, # ordered ranked gene list
minGSSize = 25, # minimum gene set size
maxGSSize = 500, # maximum gene set set
pvalueCutoff = 0.05,
pAdjustMethod = "fdr", # correction for multiple hypothesis testing
TERM2GENE = dplyr::select(hs_immune_df,
gs_name,
gene_symbol))
#see the results
View(gsea_results@result %>%
dplyr::arrange(dplyr::desc(NES))
)