I am trying to run an fgsea analysis. I created my own pathway database, which looks like this:
df_db <- read.csv('pathway_ecnumber.csv',sep=",")
df_db
path:map00010 3.2.1.86
path:map00500 3.2.1.86
path:map00010 4.1.1.1
path:map01100 4.1.1.1
path:map01110 4.1.1.1
path:map01130 4.1.1.1
path:map00010 4.1.1.32
path:map00020 4.1.1.32
path:map00620 4.1.1.32
path:map01100 4.1.1.32
path:map01110 4.1.1.32
df_db$enzyme<-gsub("ec:","",df_db$enzyme)
db_final<-df_db %>% dlply( "pathway", `[[`, "enzyme" ) %>% c
database_pathway <- db_final[!duplicated(names(db_final))]
database_pathway
$`path:map05410`
[1] "2.7.11.11" "3.4.15.1" "2.7.11.1"
$`path:map05414`
[1] "4.6.1.1" "2.7.11.11" "2.7.11.1"
$`path:map05416`
[1] "3.4.22.56" "3.4.22.61" "3.4.22.62" "2.7.10.2"
And I created my ranked list like this:
df_select <- df_data %>% dplyr::select(ECNUMBER, log2FoldChange)
df_na <- df_select %>% drop_na()
df_split <- df_na %>% mutate(ECNUMBER = strsplit(as.character(ECNUMBER), ",")) %>% unnest(ECNUMBER)
df_split <- as.data.frame(df_split)
df_unique <- unique(df_split)
df_na <- na.omit(df_unique)
df1 <- filter(df_na, log2FoldChange != 0)
geneList <- df1[,2]
names(geneList) <- as.character(df1[,1])
geneList2 <- sort(geneList, decreasing = TRUE)
geneList2
3.4.21.102 2.7.1.221 2.7.7.13 1.1.1.3 1.1.1.42 1.14.19.9
3.217 3.217 3.217 3.217 3.217 3.217
In the end I get two warnings, and I don't know how to deal with them:
Warning messages:
1: In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are ties in the preranked stats (81.54% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
2: In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are duplicate gene names, fgsea may produce unexpected results.
I tried to resolve the second warning with the line below, but it does not work:
df_database <- database_pathway[!duplicated(names(database_pathway))]
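A note on why that line has no effect: the duplicate-names warning appears to be raised for the names of the ranked vector passed as `stats`, not for the names of the pathway list, so de-duplicating `database_pathway` does not touch it. The sketch below uses a small made-up vector (the values and the "keep the largest absolute value" rule are assumptions for illustration, not an fgsea recommendation) to show how the duplicated names and the fraction of tied values can be inspected, and one possible way to collapse duplicates.

```r
# Toy ranked vector with a repeated EC number (hypothetical values).
stats <- c("4.1.1.1" = 3.217, "3.2.1.86" = 3.217,
           "4.1.1.1" = -1.5, "4.1.1.32" = 0.8)

# Names that occur more than once in the ranked vector.
dup_names <- unique(names(stats)[duplicated(names(stats))])

# Fraction of entries whose value is shared with at least one other entry.
tie_fraction <- mean(duplicated(stats) | duplicated(stats, fromLast = TRUE))

# One possible collapse (an assumption): for each name, keep the value
# with the largest absolute magnitude, then re-sort.
ord <- order(abs(stats), decreasing = TRUE)
stats_sorted <- stats[ord]
stats_unique <- stats_sorted[!duplicated(names(stats_sorted))]
stats_unique <- sort(stats_unique, decreasing = TRUE)
```

Collapsing duplicates removes the second warning, but the ties remain as long as many EC numbers inherit the same log2FoldChange from one gene.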
Thank you so much for your clear explanation!
I used an ortholog database (eggNOG), so this explains my results: one logFC corresponds to multiple EC numbers.
So, following your explanation, I will use only the first EC number in the list from the annotation results, so that each logFC corresponds to exactly one EC number.
For the construction of the database, is it possible to have multiple EC numbers corresponding to different pathways?
I'm not sure what the best course of action would be in your case, if there is one at all, but I'd suggest keeping the ranking at the gene level and constructing pathways that consist of genes. Genes can be associated with multiple pathways; there is no problem with that.
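A minimal sketch of that suggestion in base R, under stated assumptions: the `genes` and `ec2path` tables, the gene IDs and all values are made up for illustration, and the column names simply mirror the ones used earlier in the thread. The ranking keeps one value per gene, and the EC numbers are only used to map genes onto pathways.

```r
# Hypothetical gene-level table: one log2FoldChange per gene, plus the
# comma-separated EC numbers from the annotation.
genes <- data.frame(
  gene = c("g1", "g2", "g3"),
  log2FoldChange = c(2.1, -0.7, 1.3),
  ECNUMBER = c("4.1.1.1,4.1.1.32", "3.2.1.86", "4.1.1.1"),
  stringsAsFactors = FALSE
)

# Hypothetical EC-to-pathway table, same shape as the pathway database.
ec2path <- data.frame(
  pathway = c("path:map00010", "path:map00500", "path:map00010",
              "path:map01100", "path:map00020"),
  enzyme = c("3.2.1.86", "3.2.1.86", "4.1.1.1", "4.1.1.1", "4.1.1.32"),
  stringsAsFactors = FALSE
)

# Ranking stays at the gene level: one value per unique gene ID.
ranks <- sort(setNames(genes$log2FoldChange, genes$gene), decreasing = TRUE)

# Expand to one row per (gene, EC number) pair.
ec_list <- strsplit(genes$ECNUMBER, ",")
gene2ec <- data.frame(
  gene = rep(genes$gene, lengths(ec_list)),
  enzyme = unlist(ec_list),
  stringsAsFactors = FALSE
)

# Join genes to pathways via EC number, then collect unique genes per pathway.
gene2path <- merge(gene2ec, ec2path, by = "enzyme")
pathways <- lapply(split(gene2path$gene, gene2path$pathway), unique)
```

With this layout the names of `ranks` are unique by construction, and a gene such as `g1` can appear in several pathways. The two objects could then be passed as the `pathways` and `stats` arguments of `fgsea()`.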