Battery gene sets for CAMERA limma
brov.olia • 0

Hi everyone,

I'm confused by the results of my CAMERA analysis. To build the indices, I used the gene set collections from MSigDB: I converted the gmt files to lists and built the indices from them. The initial count matrices have HGNC symbols as row names, which include protein-coding genes as well as lncRNAs, miRNAs, etc. MSigDB lets users download the sets in two forms: Entrez IDs and HGNC symbols. When I convert the symbols to Entrez IDs and build the indices, the result is completely different from when I build the indices from the symbols directly.

Result with symbols used to build the indices:


                                                  NGenes Direction       PValue          FDR
RESPONSE_OF_EIF2AK4_GCN2_TO_AMINO_ACID_DEFICIENCY     98        Up 7.358697e-12 2.267950e-08
KEGG_RIBOSOME                                         85        Up 2.507014e-11 3.863308e-08
EUKARYOTIC_TRANSLATION_ELONGATION                     90        Up 2.086557e-10 1.862321e-07
SELENOAMINO_ACID_METABOLISM                          105        Up 2.417029e-10 1.862321e-07
CELLULAR_RESPONSE_TO_STARVATION                      149        Up 1.376982e-09 8.487719e-07

Result with Entrez IDs used to build the indices:


                                                      NGenes Direction      PValue       FDR
SIGNALLING_TO_RAS                                         20      Down 0.000617136 0.9982597
PLASMA_LIPOPROTEIN_REMODELING                             19      Down 0.001480280 0.9982597
ACTIVATION_OF_TRKA_RECEPTORS                               2      Down 0.002682050 0.9982597
NABA_ECM_AFFILIATED                                       85      Down 0.003973647 0.9982597

In the first case (with symbols) I had a larger list of pathways (3082); in the second there were 3041. Which result is more relevant? Do non-protein-coding RNAs really play such a crucial role in pathway significance?


Without providing code, it's not possible to say exactly why there are differences. That said, if you want the most accurate results you should use NCBI IDs (what used to be called Entrez Gene IDs) rather than symbols, as the gene IDs are way more likely to uniquely identify a given gene.


Dear James,

Thank you for the reply. Please find my code below.

#download geneset with symbol and entrez
hs.c2.cp.l <- gmt_to_list("Msig_entrez/c2.cp.v2022.1.Hs.entrez.gmt", cutoff = 0,
                               sep = "\t*?\t")
hs.c2.cp.symb.l <- gmt_to_list("Msig_symbols/c2.cp.v2023.1.Hs.symbols.gmt", cutoff = 0,
                      sep = "\t*?\t")
#transform symbols  to entrezID and create indexes
my_entrez<-mget(voom_out$genes$symbol, org.Hs.egSYMBOL2EG,ifnotfound=NA)    
entrez_ind <-  ids2indices(hs.c2.cp.l, my_entrez)
symbol_ind <- ids2indices(hs.c2.cp.symb.l, voom_out$genes$symbol)
camera_res1<- camera(voom_out$E, index = symbol_ind,
         weights = voom_out$weights,
         design = mydesign, contrast =mycontrast)

camera_res2<- camera(voom_out$E, index = entrez_ind,
             weights = voom_out$weights,
             design = mydesign, contrast =mycontrast)

Thank you for the suggestion about Entrez IDs. With that approach, however, I do not see any pathway with FDR < 0.1.


If you do this

my_entrez<-mget(voom_out$genes$symbol, org.Hs.egSYMBOL2EG,ifnotfound=NA)

The result will be a list of IDs, and ids2indices won't work as you expect. The second argument to that function is 'identifiers'. From the help page:


gene.sets: list of character vectors, each vector containing the gene
          identifiers for a set of genes.

identifiers: character vector of gene identifiers.

And a list is not a character vector. If you don't have NCBI IDs in your 'genes' data.frame, then you should use symbols.
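
To illustrate the list versus character vector point, here is a minimal sketch in base R. It uses a made-up environment (sym2eg, with hypothetical symbols and IDs) as a stand-in for org.Hs.egSYMBOL2EG, and the lapply/which line only approximates what ids2indices does, so treat it as an illustration under those assumptions rather than the real thing.

```r
# Hypothetical stand-in for org.Hs.egSYMBOL2EG: symbol -> Entrez ID(s).
# Symbols and IDs are invented for illustration only.
sym2eg <- new.env()
assign("GENE_A", "101", envir = sym2eg)
assign("GENE_B", c("102", "103"), envir = sym2eg)  # one symbol, two IDs

# GENE_C has no mapping, like lncRNA/miRNA symbols that lack an ID
symbols <- c("GENE_A", "GENE_B", "GENE_C")

# mget() returns a list (one element per symbol), not a character vector
as_list <- mget(symbols, envir = sym2eg, ifnotfound = NA)
is.list(as_list)
is.character(as_list)

# Keep the first ID per symbol so there is exactly one entry per row,
# in the same order as the rows of the expression matrix
entrez <- vapply(as_list, function(x) as.character(x[1]), character(1))

# Rough approximation of index building: for each set, the row positions
# whose identifier is a member of that set (unmapped rows never match)
gene_sets <- list(SET_1 = c("101", "102"), SET_2 = c("103", "999"))
index <- lapply(gene_sets, function(set) which(entrez %in% set))
```

With the real annotation package, something like AnnotationDbi::mapIds(org.Hs.eg.db, keys = voom_out$genes$symbol, keytype = "SYMBOL", column = "ENTREZID") should return a named character vector directly (by default the first match for each symbol, NA where there is none), which is the form ids2indices expects. Rows without an ID simply never match any set, so it is worth checking how many are NA before comparing the two results.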

