Question

Compute fgsea with my own database with R version 4.0.2

0

julie.hardy • 0

@a287dbc4

Last seen 4.6 years ago

France

I am trying to run an fgsea analysis. I created my own database, which looks like this:

df_db <- read.csv('pathway_ecnumber.csv',sep=",")

df_db
path:map00010  3.2.1.86
path:map00500  3.2.1.86
path:map00010   4.1.1.1
path:map01100   4.1.1.1
path:map01110   4.1.1.1
path:map01130   4.1.1.1
path:map00010  4.1.1.32
path:map00020  4.1.1.32
path:map00620  4.1.1.32
path:map01100  4.1.1.32
path:map01110  4.1.1.32

df_db$enzyme<-gsub("ec:","",df_db$enzyme)
db_final<-df_db %>% dlply( "pathway", `[[`, "enzyme" ) %>% c
database_pathway <- db_final[!duplicated(names(db_final))]

database_pathway

$`path:map05410`
[1] "2.7.11.11" "3.4.15.1"  "2.7.11.1" 

$`path:map05414`
[1] "4.6.1.1"   "2.7.11.11" "2.7.11.1" 

$`path:map05416`
[1] "3.4.22.56" "3.4.22.61" "3.4.22.62" "2.7.10.2"

And I create my ranked list like this:

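    library(dplyr)
    library(tidyr)

    # Keep the EC annotation and the fold change, dropping rows with missing values
    df_select <- df_data %>% dplyr::select(ECNUMBER, log2FoldChange)
    df_na <- df_select %>% drop_na()
    # Split comma-separated EC numbers so that each one gets its own row
    df_split <- df_na %>% mutate(ECNUMBER = strsplit(as.character(ECNUMBER), ",")) %>% unnest(ECNUMBER)
    df_split <- as.data.frame(df_split)
    df_unique <- unique(df_split)
    df_na <- na.omit(df_unique)
    df1 <- dplyr::filter(df_na, log2FoldChange != 0)
    # Named vector of fold changes, sorted in decreasing order
    geneList <- df1[,2]
    names(geneList) <- as.character(df1[,1])
    geneList2 <- sort(geneList, decreasing = TRUE)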

geneList2
3.4.21.102   2.7.1.221    2.7.7.13     1.1.1.3    1.1.1.42   1.14.19.9 
      3.217       3.217       3.217       3.217       3.217       3.217

In the end I get two warnings, and I don't know how to deal with them:

Warning messages:
1: In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are ties in the preranked stats (81.54% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
2: In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are duplicate gene names, fgsea may produce unexpected results.

I tried to resolve the second warning with this line, but it does not work:

df_database <- database_pathway[!duplicated(names(database_pathway))]

R fgsea • 2.7k views

ADD COMMENT • link updated 4.6 years ago by alserg ▴ 280 • written 4.7 years ago by julie.hardy • 0

score 1 · Answer 1 · 2021-05-06

1

alserg ▴ 280

@assaron

Last seen 24 days ago

St Louis, MO

I believe you should rethink your design, as these warnings point to problems in it.

It is very suspicious that multiple EC entries have exactly the same log2FC. This is either an error or a flaw in the design: the same gene can have multiple enzyme functions, so a single gene's logFC is assigned to multiple EC numbers. However, GSEA assumes that the gene ranks are independent, as it tests whether a gene set looks randomly selected or not.
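
One way to see how widespread the ties are before calling fgsea is to count repeated values in the ranked vector. The sketch below uses base R only; `stats_example` is a small made-up vector standing in for `geneList2`, and the fraction it computes is not necessarily the same formula fgsea uses for the percentage in its warning.

```r
# Hypothetical ranked vector standing in for geneList2:
# three EC numbers inherit the same log2FoldChange from one gene.
stats_example <- c("3.4.21.102" = 3.217, "2.7.1.221" = 3.217,
                   "2.7.7.13"   = 3.217, "1.1.1.3"   = 1.5,
                   "1.1.1.42"   = -0.8)

# An entry is "tied" if its value occurs more than once in the vector
# (duplicated() compares the values and ignores the names).
is_tied <- duplicated(stats_example) | duplicated(stats_example, fromLast = TRUE)

# Fraction of entries that share their value with at least one other entry
tie_fraction <- mean(is_tied)
tie_fraction

# Names of the tied entries, useful for tracing them back to the annotation
names(stats_example)[is_tied]
```

A high fraction here suggests that many entries do not carry independent information, which is the concern raised above.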

The second warning is similar: I expect that a single enzyme can be represented by multiple genes, so you have multiple entries with the same EC number, which triggers the warning.
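
Since the repeated entries are in the ranked vector rather than in the pathway list, removing duplicated pathway names would not be expected to change this warning. The following base R sketch shows how the repeated names could be located. `stats_example` is again a made-up stand-in for `geneList2`, and the averaging step at the end is only one possible workaround, shown as an assumption rather than a recommendation.

```r
# Hypothetical ranked vector in which one EC number appears twice,
# because two different genes carry the same annotation.
stats_example <- c("4.1.1.1" = 2.0, "3.2.1.86" = 1.2,
                   "4.1.1.1" = -0.5, "4.1.1.32" = 0.3)

# EC numbers that occur more than once among the names
dup_names <- unique(names(stats_example)[duplicated(names(stats_example))])
dup_names

# Assumption for illustration: collapse repeated EC numbers to their mean.
# Whether averaging is biologically sensible depends on the data.
collapsed <- tapply(stats_example, names(stats_example), mean)
collapsed <- sort(setNames(as.numeric(collapsed), names(collapsed)),
                  decreasing = TRUE)
collapsed
```

After collapsing, every name in the vector is unique, but the values are no longer per-gene measurements, so the design question raised above still applies.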

ADD COMMENT • link 4.6 years ago alserg ▴ 280

0

Thank you so much for your clear explanation!

I used an ortholog database (eggNOG), so this explains my results: one logFC corresponds to multiple EC numbers.

Following your explanation, I will use only the first EC number in the list from the annotation results, so that each logFC corresponds to only one EC number.

For the construction of the database, is it possible to have multiple EC numbers corresponding to different pathways?

ADD REPLY • link 4.6 years ago julie.hardy • 0

1

I'm not sure what the best course of action would be in your case, if there is one at all, but I'd suggest keeping the ranking at the gene level and constructing pathways that consist of genes. Genes can be associated with multiple pathways; there is no problem with that.
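
A rough sketch of what a gene-level setup could look like, in base R. The tables `annot` and `db`, their column names, and the gene identifiers are all hypothetical. They only mimic the shape of the annotation and of `df_db` from the question, with EC numbers used as the link between genes and pathways.

```r
# Hypothetical annotation: one row per gene, possibly several EC numbers
annot <- data.frame(
  gene = c("g1", "g2", "g3"),
  ec = c("4.1.1.1,4.1.1.32", "3.2.1.86", "4.1.1.1"),
  log2FoldChange = c(2.1, -1.3, 0.4),
  stringsAsFactors = FALSE
)

# Hypothetical pathway-to-EC table, in the shape of df_db
db <- data.frame(
  pathway = c("path:map00010", "path:map00500", "path:map00010", "path:map00020"),
  enzyme = c("3.2.1.86", "3.2.1.86", "4.1.1.1", "4.1.1.32"),
  stringsAsFactors = FALSE
)

# Long gene-to-EC table: one row per (gene, EC number) pair
ec_list <- strsplit(annot$ec, ",")
gene_ec <- data.frame(gene = rep(annot$gene, lengths(ec_list)),
                      enzyme = unlist(ec_list),
                      stringsAsFactors = FALSE)

# Join on the EC number, then keep each (pathway, gene) pair once
gene_path <- unique(merge(gene_ec, db, by = "enzyme")[, c("pathway", "gene")])

# Pathways as a named list of gene identifiers
pathways_genes <- split(gene_path$gene, gene_path$pathway)

# One statistic per gene, so the names are unique by construction
gene_stats <- sort(setNames(annot$log2FoldChange, annot$gene), decreasing = TRUE)

# With fgsea installed, the call would then be along the lines of:
# fgsea::fgsea(pathways = pathways_genes, stats = gene_stats)
```

In this layout a gene may appear in several pathways, but each gene contributes exactly one value to the ranking, so the duplicate-name warning should not be triggered by the construction itself. The toy data here are far too small for a meaningful enrichment result.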

ADD REPLY • link 4.6 years ago alserg ▴ 280