Dealing with duplicated genes in camera gene set testing
Entering edit mode
Last seen 21 months ago
Washington University in St. Louis


After conducting differential expression analysis on some RNA-Seq data using limma, I’m trying to apply the camera method for gene set testing. I want to use gene sets from MSigDB. Unfortunately, as far as I’m aware, those gene sets are only available as HGNC gene symbols (e.g., IDS) or Entrez identifiers. My RNA-Seq reads were mapped to ensembl gene identifiers (e.g., ENSG00000010404). I tried converting my ensembl gene identifiers to HGNC gene symbols using biomaRt, but it is not a 1:1 mapping, so I ended up with duplicate HGNC gene symbols (there were no duplicated ensembl identifiers to begin with). Additionally, some ensembl identifiers could not be mapped to HGNC symbols. I ended up leaving the duplicated HGNC symbols in my data, and I also included the ensembl identifiers of the genes that I could not map to HGNC symbols. In summary, all of my original genes (and their expression values) are still represented in my data, but there are duplicate identifiers now (HGNC symbols), and some genes are still identified by an ensembl identifier instead of a HGNC symbol (which I am aware cannot match to any of my gene sets). I downloaded some MSigDB gene sets (HGNC version) and ran camera. However, I am not sure if I can rely on the results. My primary questions are these:

1) Is it possible to get the MSigDB gene sets in ensembl format so that I can skip any conversions?

2) If not, do I need to alter my above approach (does including duplicated genes or genes that can’t possibly be in the gene sets interfere with camera?)? If so, how should I alter it?

Thank you!

limma camera gene set testing ensembl • 460 views
Entering edit mode

This seems inadvisable to me (and you as well, apparently), but, the broad says

Now using ENSEMBL as the platform annotation authority

Beginning in MSigDB 7.0, identifiers for genes are mapped to their HGNC approved Gene Symbol and NCBI Gene ID through annotations extracted from ENSEMBL's BioMart data service, and will be updated at each MSigDB release with the latest available version of ENSEMBL. This change mitigates a previous issue where retired gene symbols and symbol aliases that did not reflect the current annotation of the human genome were retained in MSigDB as a result of outdated microarray and transcriptome annotations. This issue resulted in symbols being excluded from some gene sets and GSEA analyses due to the potential presence of multiple symbols for the same gene in different gene sets as a result of differing source annotations for those gene sets, and mismatches between the symbols present in the user supplied dataset and those included in MSigDB.

Gene annotations supplied in the MSigDB 7.0 release are derived from ENSEMBL version 97 corresponding to GENCODE release 31 and reflect the HGNC Gene Symbols as of the GENCODE 31 freeze date of February 2019.

It's not clear from this wiki what they are really doing, and getting the NCBI Gene ID from a non-NCBI source seems suboptimal, but what do I know?

But you could probably get the GENCODE release 31 data and get the mappings from there?

Entering edit mode
Last seen 1 minute ago
WEHI, Melbourne, Australia

Is it possible to get the MSigDB gene sets in ensembl format so that I can skip any conversions?

MSigDB doesn't provide an Ensembl version, so no.

Even if they did, they would have to go through the same process that you have done with all the same problems.

If not, do I need to alter my above approach (does including duplicated genes or genes that can’t possibly be in the gene sets interfere with camera?)?

Duplicate gene symbols don't present any problems for camera. If a duplicated gene symbol belongs to an MSigDB set, then camera will simply include all the rows of your data with that symbol in that set. Given that Ensembl provides more gene IDs than there are accepted gene symbols, this is the best that can be done.

On the other hand, you should filter any rows from your data that can't be mapped to gene symbols before running camera. Having done this, you're ready to go.

If you're using voom you code might look like this:

v <- voom(y, design)
has.symbol <- !$genes$Symbol)
v.has.symbol <- v[has.symbol, ]
idx <- ids2indices(gene.sets=msidb, identifiers=v.has.symbol$genes$Symbol)
cam.res <- camera(v.has.symbol, idx, design)

In the above code, y is a DGEList object and msigdb is a list of gene sets using symbols.

As an aside, I personally avoid this problem for my own analyses by summarizing my RNA-seq data by Entrez Gene Ids in the first place.

Entering edit mode

Thank you for the informative discussion here. A quick follow-up question. If I am normalizing my data (e.g., TMM normalization), should I remove the rows that can't be mapped to gene symbols before or after this normalization?

My data after normalization is in a DGEList object (suppose it is called dge). If you recommend removing those rows after normalization, after modifying dge$counts, will I also need to modify dge$samples (and if so, what will I need to change?)?

Entering edit mode

Removing rows is only for the camera analysis. There is no need to redo the TMM normalization or dispersion estimation.

DGEList object handle subsetting automatically. You never need to subset components individually, just subset the whole object:

has.symbol <- !$genes$Symbol)
dge.filtered <- dge[has.symbol,]
Entering edit mode

Perfect. Thank you!!

Entering edit mode

Just to make sure I implement everything correctly:

1) Should I remove the rows before or after calling the voom function (location 1 or location 2 below)?

2) Does the subsetting below appear to be correct in that location?

3) To use HGNC IDs instead of Ensembl IDs, do I simply need to rename rownames(v) as shown below?

I just want to make sure that I correctly implement your suggestions for running camera and document it in case future users have the same question. Thank you so much!

y <- calcNormFactors(y); # perform TMM normalization (after filtering genes with low expression)

# Location 1
# y <- y[keep, , keep.lib.sizes=FALSE]

v <- voom(y, design, plot=F)

# Location 2
# v <- v[keep, ]

# Replace Ensembl IDs with HGNC IDs, ensuring that the correct HGNC ID goes  
# with the correct Ensembl ID after the additional filtering step (not shown)
rownames(v) <- hgnc.identifiers 

cam.res <- camera(v,idx,design,contrast=ncol(design)) # use last column of design matrix to specify comparison
Entering edit mode
  1. After. As I said in my last post, the subsetting is only for camera.

  2. I can't comment on your code because I don't know what your vector keep is. Anyway, the subsetting is only for camera.

  3. No, you do not need to change the row names -- use ids2indices instead.

I haved edited my answer above to show complete code that you might use.

Entering edit mode
Robert Castelo ★ 2.7k
Last seen 11 weeks ago
Barcelona/Universitat Pompeu Fabra


an alternative is to use the GSEABase sorcery. Let's say you downloaded the GMT file for the MSigDB C2 gene set collection using HGNC gene symbols from:

The Gene Matrix Transposed (GMT) file format can be imported into R using the getGmt() function from the GSEABase package, as follows:


c2sym <- getGmt("c2.all.v7.0.symbols.gmt", geneIdType=SymbolIdentifier())
  unique identifiers: ACSS2, GCK, ..., RNU1-1 (20690 total)
  types in collection:
    geneIdType: SymbolIdentifier (1 total)
    collectionType: NullCollection (1 total)

Note that i'm calling getGmt() with the argument geneIdType=SymbolIdentifier() because i know i downloaded HGNC gene symbols and this metadata is included in the returned object called c2sym of class GeneSetCollection. Now we can map the identifiers in this GeneSetCollection object from HGNC gene symbols to Ensembl identifiers using the GSEABase function mapIdentifiers(), as follows:

c2ensembl <- mapIdentifiers(c2sym, ENSEMBLIdentifier(""))
  unique identifiers: ENSG00000131069, ENSG00000106633, ..., ENSG00000206652 (21956 total)
  types in collection:
    geneIdType: ENSEMBLIdentifier (1 total)
    collectionType: NullCollection (1 total)

where i'm using the function ENSEMBLIdentifier("") as second argument to tell the mapIdentifiers() function that I want the map to Ensembl identifiers and that the mapping information can be found in the organism-level gene-centric package

finally, some Bioconductor software packages, such as GSVA, may take directly as input such a GeneSetCollection object, but for other packages as limma, where the camera() algorithm is implemented, you need those gene sets as a list object. You can extract a list object from a GeneSetCollection object using the geneIds() getter function:

c2ensembllist <- geneIds(c2ensembl)
[1] 5501
head(lapply(c2ensembllist, head))
[1] "ENSG00000131069" "ENSG00000106633" "ENSG00000170950" "ENSG00000102144"
[5] "ENSG00000168291" "ENSG00000131828"

[1] "ENSG00000101365" "ENSG00000119689" "ENSG00000100889" "ENSG00000285241"
[5] "ENSG00000062485" "ENSG00000168291"

[1] "ENSG00000197713" "ENSG00000153574" "ENSG00000169299" "ENSG00000101911"
[5] "ENSG00000130957" "ENSG00000152556"

[1] "ENSG00000242515" "ENSG00000242366" "ENSG00000197713" "ENSG00000244122"
[5] "ENSG00000167165" "ENSG00000135226"

[1] "ENSG00000178802" "ENSG00000140650" "ENSG00000100417" "ENSG00000130957"
[5] "ENSG00000152556" "ENSG00000112699"

[1] "ENSG00000106633" "ENSG00000108479" "ENSG00000170266" "ENSG00000117308"
[5] "ENSG00000086062" "ENSG00000169299"

note that this infrastructure already takes care of duplicated entries

any(sapply(lapply(c2ensembllist, duplicated), any))

and because you're converting from gene sets to gene sets, it's not so much of a concern whether your mapping is not 1-to-1, while you don't need to do anything with your molecular data.



Entering edit mode

For future readers of this post, I'd like to make the argument that on the basis of this answer the original question

1) Is it possible to get the MSigDB gene sets in ensembl format so that I can skip any conversions?

can be answered positively, by interpreting "to get the MSSigDB gene sets in ensembl format" as converting by yourself the gene identifiers in the MSigDB gene sets to Ensembl identifiers with the procedure specified above.

Entering edit mode

I would argue that OP doesn't actually need to do any conversions at all. It seems to me that anyone doing an RNA-seq analysis will want to get gene symbols because viewing a results table of DE genes with just Ensembl Ids and no gene names is not very informative. Having got a gene symbol for each row, the OP can use MSigDb directly without any identifier conversions.

There could well be a long-term convenience benefit for other users in making an Ensembl Id version of MSigDb, but it can't be better than the above procedure in terms of results. The way that camera treats duplicate symbols is already equivalent to expanding out the gene set in the way that you have suggested.

The way that OP has done it is actually safer than using an Ensembl version of MSigDb because it provides identification of which rows (Ensembl Ids) don't have a symbol and hence could never be in an MSigDb set. Using an Ensembl Id version of MSigDb and working purely with Ensembl Ids would make that invisible.

Have you checked that converting to the Ensembl Ids using is as comprehensive as using biomaRt? I haven't tested it myself, but I worry that, being Entrez Id based, might not include all the Ensembl to symbol mappings that biomaRt does. Have you tried making Ensembl gene sets both ways and confirmed that they give the same results?

Entering edit mode

Maybe others in this forum know better about these issues but my impression on the gene id business is that annotation sources, such as Entrez (Maglott et al., 2011) or Ensembl (Yates et al., 2020), take care about assigning the corresponding gene symbol in collaboration with the HGNC and these assignments are fully resolved only in within each specific release of their data. Being able to use a specific release of annotations allows one to resolve gene symbols and report them without having to deal with ambiguous id mappings.

Because RNA-seq pipelines end up summarizing expression values to a specific annotation, I'd say best practice is to adhere to that annotation all the way throughout the pipeline. This guarantees that you do not need to change your best guess on what are the expression profiles of your RNA-seq data based on that annotation.

Personally, if I have to choose between mapping gene ids from expression profiles or mapping gene ids from gene sets, I prefer to choose the latter because i avoid altering my expression profiles in the way i produced them.

Going to your specific question, i have not tried to compare those Ensembl mappings but i'd expect that is not as comprehensive as biomaRt because it is based on Entrez annotations that are known to be more conservative than Ensembl ones. This means to me that using Entrez you end up with fewer genes than using Ensembl, and therefore, fewer mappings. What's best, i guess it depends on the specific application and question you're working on.

For gene-level analyses that end up with interpretations based on protein function, i prefer to use Entrez annotations. For transcript-level analyses that aim to investigate RNA processing questions, I prefer to use Gencode/Ensembl, which is much more comprehensive at transcript level. In any case, I think any option should work smoothly as long as molecular and functional data can be mapped and retrieved within the same specific release of the data. Bioconductor annotation packages facilitate that goal.


Login before adding your answer.

Traffic: 311 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6