Compute fgsea with my own database with R version 4.0.2
@a287dbc4
France

I am trying to run an fgsea analysis. I created my own database, which looks like this:

df_db <- read.csv('pathway_ecnumber.csv',sep=",")

df_db
path:map00010  3.2.1.86
path:map00500  3.2.1.86
path:map00010   4.1.1.1
path:map01100   4.1.1.1
path:map01110   4.1.1.1
path:map01130   4.1.1.1
path:map00010  4.1.1.32
path:map00020  4.1.1.32
path:map00620  4.1.1.32
path:map01100  4.1.1.32
path:map01110  4.1.1.32

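library(plyr)   # dlply(); load before dplyr
library(dplyr)  # %>%
df_db$enzyme <- gsub("ec:", "", df_db$enzyme)  # strip the "ec:" prefix
# One character vector of EC numbers per pathway; `[[` must be backquoted
db_final <- df_db %>% dlply("pathway", `[[`, "enzyme") %>% c
database_pathway <- db_final[!duplicated(names(db_final))]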

database_pathway

$`path:map05410`
[1] "2.7.11.11" "3.4.15.1"  "2.7.11.1"

$`path:map05414`
[1] "4.6.1.1"   "2.7.11.11" "2.7.11.1"

$`path:map05416`
[1] "3.4.22.56" "3.4.22.61" "3.4.22.62" "2.7.10.2"


And I created my ranked list like this:

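library(tidyr)  # drop_na(), unnest()
# Keep the EC annotation and the fold change, dropping rows with missing values
df_select <- df_data %>% dplyr::select(ECNUMBER, log2FoldChange)
df_na <- df_select %>% drop_na()
# One row per EC number: an annotation "a,b" yields two rows with the same log2FoldChange
df_split <- df_na %>% mutate(ECNUMBER = strsplit(as.character(ECNUMBER), ",")) %>% unnest(ECNUMBER)
df_split <- as.data.frame(df_split)
df_unique <- unique(df_split)
df_na <- na.omit(df_unique)
df1 <- dplyr::filter(df_na, log2FoldChange != 0)
# Named numeric vector of statistics, sorted in decreasing order
geneList <- df1[, 2]
names(geneList) <- as.character(df1[, 1])
geneList2 <- sort(geneList, decreasing = TRUE)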

geneList2
3.4.21.102  2.7.1.221   2.7.7.13    1.1.1.3   1.1.1.42  1.14.19.9
     3.217      3.217      3.217      3.217      3.217      3.217


In the end I get two warnings, and I don't know how to deal with them:

Warning messages:
1: In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
There are ties in the preranked stats (81.54% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
2: In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
There are duplicate gene names, fgsea may produce unexpected results.


I tried to resolve the second warning with this line, but it does not work:

df_database <- database_pathway[!duplicated(names(database_pathway))]

R fgsea • 133 views
assaron ▴ 200
@assaron
St Petersburg

I believe you should rethink your design, as these warnings point to problems in it.

It is very suspicious that you have multiple EC entries with exactly the same log2FC. This is either an error or a flaw in the design: the same gene can have multiple enzyme functions, so a single gene's logFC gets assigned to multiple EC numbers. However, GSEA assumes that the gene ranks are independent, as it tests whether a gene set looks randomly selected or not.
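To see how widespread the ties are before running fgsea, the ranked vector can be inspected directly. Below is a minimal sketch in base R, using a made-up `stats` vector in place of `geneList2` (the names and values are illustrative only). The fraction computed here is a simple "value occurs more than once" measure and need not match the percentage that fgsea reports.

```r
# Toy stand-in for geneList2: three EC numbers share one log2FC (invented values)
stats <- c("3.4.21.102" = 3.217, "2.7.1.221" = 3.217, "2.7.7.13" = 3.217,
           "1.1.1.3" = 1.5, "1.1.1.42" = -0.8)

# A value counts as tied if it occurs more than once in the vector
tied <- duplicated(stats) | duplicated(stats, fromLast = TRUE)
tied_fraction <- mean(tied)

# Group the names by shared statistic to see which entries are tied together
tie_groups <- split(names(stats), stats)
tie_groups <- tie_groups[lengths(tie_groups) > 1]
```

Large tie groups whose members all come from one multi-EC annotation are a hint that a single gene's logFC has been copied to several EC numbers.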

The second warning is similar: I expect that a single enzyme can be represented by multiple genes, so you have multiple entries with the same EC number in the ranked list, which triggers the warning.
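The duplicated names sit in the ranked vector, not in the pathway list, which would explain why de-duplicating `names(database_pathway)` has no effect on the warning. A small sketch with invented values showing how to find them:

```r
# Toy ranked vector: EC 2.7.11.1 appears twice, as if two genes mapped to it
stats <- c("2.7.11.1" = 2.1, "4.6.1.1" = 1.3, "2.7.11.1" = -0.4, "3.4.15.1" = 0.9)

# Names that occur more than once in the ranked vector
dup_names <- unique(names(stats)[duplicated(names(stats))])

# The pathway list already has unique names, so filtering it changes nothing
pathways <- list("path:map05410" = c("2.7.11.1", "3.4.15.1"),
                 "path:map05414" = c("4.6.1.1", "2.7.11.1"))
unchanged <- identical(pathways, pathways[!duplicated(names(pathways))])
```

Running the same `duplicated(names(...))` check on `geneList2` should list the EC numbers responsible for the warning.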


Thank you so much for your clear explanation!

I used an ortholog database (eggNOG), which explains my results: one logFC corresponds to multiple EC numbers.

Following your explanation, I will use only the first EC number listed in the annotation results, so that each logFC corresponds to a single EC number.
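Keeping only the first EC number of each annotation could look like the sketch below. The data frame is a made-up stand-in for `df_data`, and the `gene` and `first_ec` columns are hypothetical names. Note that two genes can still share the same first EC number, so duplicate names may remain after this step.

```r
# Invented example rows; ECNUMBER holds comma-separated annotations
df_data <- data.frame(
  gene = c("g1", "g2", "g3"),
  ECNUMBER = c("2.7.11.1,2.7.11.11", "4.6.1.1", "2.7.11.1,3.4.15.1"),
  log2FoldChange = c(2.1, 1.3, -0.4),
  stringsAsFactors = FALSE
)

# Drop everything from the first comma onward, keeping the first EC number
df_data$first_ec <- sub(",.*$", "", df_data$ECNUMBER)

# EC numbers that are still shared between genes after this step
remaining_dups <- df_data$first_ec[duplicated(df_data$first_ec)]
```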

For the construction of the database, is it possible for the same EC number to appear in several different pathways?


I'm not sure what the best course of action would be in your case, if there is one at all, but I'd suggest keeping the ranking at the gene level and constructing pathways that consist of genes. Genes can be associated with multiple pathways; there is no problem with that.
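One way to build gene-level pathways is to join a gene-to-EC table with the pathway-to-EC table and collect the genes per pathway. This is only a sketch in base R under assumed inputs: the gene IDs, the `gene2ec` table, and all values are hypothetical, and `path2ec` merely mimics the shape of `df_db`.

```r
# Hypothetical gene-to-EC annotation (e.g. parsed from the eggNOG results)
gene2ec <- data.frame(
  gene = c("g1", "g1", "g2", "g3"),
  enzyme = c("2.7.11.1", "2.7.11.11", "4.6.1.1", "2.7.11.1"),
  stringsAsFactors = FALSE
)

# Pathway-to-EC table in the same shape as df_db
path2ec <- data.frame(
  pathway = c("path:map05410", "path:map05410", "path:map05414", "path:map05414"),
  enzyme = c("2.7.11.11", "2.7.11.1", "4.6.1.1", "2.7.11.1"),
  stringsAsFactors = FALSE
)

# Join on EC number, then keep each (pathway, gene) pair once
gene2path <- unique(merge(gene2ec, path2ec, by = "enzyme")[, c("pathway", "gene")])

# Named list: one sorted character vector of gene IDs per pathway
gene_pathways <- lapply(split(gene2path$gene, gene2path$pathway), sort)

# One statistic per gene, so the names are unique by construction
gene_stats <- sort(c(g1 = 2.1, g2 = 1.3, g3 = -0.4), decreasing = TRUE)
```

`gene_pathways` and `gene_stats` have the shapes that `fgsea()` takes as its `pathways` and `stats` arguments. A gene such as `g1` appears in both pathways here, which is fine.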