I have used geneSim to do a pairwise comparison between between 5 genes (using Jiang and BMA). I have also done this using another package called Dnet (using their equivalent of Jiang and BMA). The results are:
Gene1 __ | Gene2__ | dnetjiang __ | gosemjiang |
---|---|---|---|
19290 __ | 18292 __ | 0.720816389695057 __ | 0.328 |
19290 __ | 320273 __ | 0.574659391641716 __ | NA |
19290 __ | 71754 __ | 0.304979369938494 __ | NA |
229512 __ | 18292 __ | 0.601915157555228 __ | 0.017 |
229512 __ | 19290 __ | 0.547589061848823 __ | 0.287 |
229512 __ | 320273 __ | 0.608984962607243 __ | NA |
229512 __ | 71754 __ | 0.449988765865462 __ | NA |
320273 __ | 18292 __ | 0.600484954352922 __ | NA |
320273 __ | 71754 __ | 0.589507141856847 __ | NA |
71754 __ | 18292 __ | 0.364414240306719 __ | NA |
(apologies for the formatting)
GoSemSim returns a lot of NAs compared to Dnet. What does NA mean and why does Dnet create a score when GoSemSim returns NA?
I know that Dnet propagates annotations from children to all parents and ensures that a general term has a lower IC than a specific term which has a higher IC. Does GoSemSim differ in this which leads to NAs?
Edit:
The code I have used to generate the above table results:
library("dnet")
library("org.Mm.eg.db")
library("reshape")
library("GOSemSim")
library("GenomicFeatures")
library(AnnotationHub)
library("org.Mm.eg.db")
##Dnet
ig.GOBP <-dRDataLoader(RData='ig.GOBP')
g <- ig.GOBP
org.Mm.egGOBP <- dRDataLoader(RData='org.Mm.egGOBP')
dag <- dDAGannotate(g, annotations=org.Mm.egGOBP,
path.mode="all_paths", verbose=TRUE)
allgenes <- unique(unlist(V(dag)$annotations))
genes <- sample(allgenes,5)
dnetsimjiang <- dDAGgeneSim(g=dag, genes=genes, method.gene="BM.average",
method.term="Jiang", parallel=FALSE, verbose=TRUE,force = FALSE)
dnetsimjiang=as.matrix(dnetsimjiang)
dnetsimjiang[upper.tri(dnetsimjiang,diag = T)] <- "top"
dnetsimjiang=reshape2::melt(dnetsimjiang)
dnetsimjiang=dnetsimjiang[dnetsimjiang$value != "top",]
colnames(dnetsimjiang)[colnames(dnetsimjiang)=="Var1"] <- "X2"
colnames(dnetsimjiang)[colnames(dnetsimjiang)=="Var2"] <- "X1"
colnames(dnetsimjiang)[colnames(dnetsimjiang)=="value"] <- "dnetjiang"
#GoSemSim
hub <- AnnotationHub()
q <- query(hub, "mus musculus")
id <- q$ah_id[length(q)]
mouse <- hub[[id]]
mmGO <- godata('org.Mm.eg.db', ont="BP", keytype = "ENTREZID")
calculategosim=function(y) {
out=geneSim(y[1], y[2],
semData=mmGO, measure="Jiang",combine="BMA")
if(length(out)==1){
c(y[1],y[2],out)
}else {
c(y[1],y[2],out$geneSim)
}
}
test=data.frame(t(combn(genes, 2, function(y) (calculategosim(y)))))
colnames(test)[colnames(test)=="X3"] <- "gosemjiang"
bothpackagesjiang=merge(dnetsimjiang, test, by=c("X1","X2"))
Unless there is some difference (bug or not) on the implementation of the algorithm the most probable difference is the source of data. Without knowing which functions and data you used I cannot see were come the differences. However, it can be that GOSemSim filtered the annotations with annotation IEA and Dnet not, that would explain the large amount of values from Dnet.
Thank you for your reply. I have edited my post to include the code I used to generate the example table. When GoSemSim returns NA does this mean that one or both genes have no GO terms? Does GoSemSim exclude IEA?
The data seems different the dag contains 4212625 vertex, while the mmGO data just 163654, probably because the dag annotates all the genes to all the parent GO terms. But mainly the differences come from having different source of data