Hi everyone,
I'm running a GSEA procedure in R (in particular, I am using the GSVA package).
I have downloaded the lists of genes composing the gene sets from MSigDB using the msigdbr package.
However, when I extract the gene symbols for a gene set from the downloaded object, they are not unique. For example, a gene with a single symbol and Entrez ID can have several different Ensembl IDs, so it is listed once per Ensembl ID.
How should I deal with these when using the gsva function? My expression matrix has gene symbols as row names, so I can't match on the different Ensembl IDs. If I keep the duplicates, the affected genes would effectively get extra weight, which could slightly skew the distribution of the enrichment scores. Is it safe to delete the duplicates, or would I be losing relevant information?
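
To make the question concrete, here is a minimal sketch of the deduplication I have in mind. It uses a toy data frame rather than a real msigdbr download: the column names (gs_name, gene_symbol, entrez_gene, ensembl_gene) are assumed to mirror the msigdbr output and may differ between package versions, and the Ensembl IDs are made-up placeholders.

```r
# Toy stand-in for an msigdbr result (assumed column names, fake Ensembl IDs).
# GENE1 appears twice in SET_A: same symbol and Entrez ID, two Ensembl IDs.
msig <- data.frame(
  gs_name      = c("SET_A", "SET_A", "SET_A", "SET_B", "SET_B"),
  gene_symbol  = c("GENE1", "GENE1", "GENE2", "GENE2", "GENE3"),
  entrez_gene  = c(101, 101, 102, 102, 103),
  ensembl_gene = c("ENSG_X1", "ENSG_X2", "ENSG_Y1", "ENSG_Y1", "ENSG_Z1"),
  stringsAsFactors = FALSE
)

# Keep one row per (gene set, symbol) pair; the Ensembl column is dropped
# because the expression matrix is indexed by symbol only.
sym <- unique(msig[, c("gs_name", "gene_symbol")])

# Named list with one character vector of unique symbols per gene set.
gene_sets <- split(sym$gene_symbol, sym$gs_name)

# Sanity check: no symbol occurs more than once within any set.
any_dups <- any(vapply(gene_sets, anyDuplicated, integer(1)) > 0)
```

The resulting gene_sets list has one entry per gene set, and the same symbol can still appear in different sets. This is the form I would pass to GSVA as the gene set collection, if removing the duplicates is indeed the right thing to do.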