I started using mgeneSim function, and i passed an array of 11000+ EntrezID genes. It gave me a similarity matrix with some columns and rows all containing value "1". After mapping the EntrezID genes to GO, I noticed that some set of Go ID of two different genes have a GO ID in common. This should be the reason why I have these columns and rows set at 1.
How can I have consistent results from mgeneSim? Is it possible not to consider the GO ID two genes share?
A little example, with some of the gene pairs that make exception:
mgeneSim(c("3613", "83541", "5651", "23492", "157310"), semData=hsGO, measure="Wang", combine = "BMA", verbose = TRUE)
The output it returns is the following:
You could change how to combine results from BMA to other methods, but the goal of GOSemSim is to provide a measure between two genes based on the Gene Ontology. That means that the GO ID of each genes should be used and if shared they should be dealt with it somehow. There are other measures for semantic similarity of genes based on GO, I am not sure if you picked Wang for a reason or just because it is the default.
Or you could use some other distance metric (not semantic similarity). I created a package for pathways and gene sets that uses the Dice similarity. If you provided the GO terms it could be used to calculate the dice similarity with the GO terms).