I used the Rchemcpp package to find structural analogs of some compounds in the DrugBank database (using the latest structure-data file, SDF, available from the DrugBank website).
I used the sd2gramSpectrum function with the default settings.
For some of the input compounds I cannot find exact matches with a similarity score of 1, although the web service that searches against DrugBank reports such matches: http://shiny.bioinf.jku.at/Analoging/
Rchemcpp returns scores below 1.
There also seems to be a memory leak when the sd2gramSpectrum function is called inside a loop.
Thanks for posting this to the community. I will quickly summarize our email conversation:
a) Why do identical compounds have a score below 1?
The problem lies in the different preprocessing of the compounds: the SDF files from DrugBank still contained some explicitly coded "H" atoms. I used Open Babel with the flags "-d -b", which removes all hydrogens and makes sure that dative bonds are written consistently, e.g.:
obabel -isdf DB04149.sdf -osdf -O DB04149_obabel.sdf -d -b
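If many SDF files need the same treatment, the obabel call above can be scripted. The following is only a minimal sketch in Python: the helper names obabel_command and preprocess_all and the "_obabel" output suffix are my own choices for illustration, and it assumes the obabel executable is on the PATH.

```python
import subprocess
from pathlib import Path


def obabel_command(src, dst):
    """Build the obabel call quoted in this thread.

    -d and -b are the flags from the post: remove hydrogens and
    write dative bonds consistently.
    """
    return ["obabel", "-isdf", str(src), "-osdf", "-O", str(dst), "-d", "-b"]


def preprocess_all(in_dir, out_dir):
    """Run obabel on every *.sdf file in in_dir (hypothetical helper)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(in_dir).glob("*.sdf")):
        dst = out_dir / f"{src.stem}_obabel.sdf"
        # check=True raises CalledProcessError if obabel reports a failure
        subprocess.run(obabel_command(src, dst), check=True)
        written.append(dst)
    return written
```

Calling preprocess_all("sdf_in", "sdf_out") with two directory names of your choice would then write one cleaned file per input file.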
Another problem is the coding of bonds in aromatic rings: the double bonds can be placed at "arbitrary" positions in the aromatic ring, so we code aromatic bonds as bond type "4". The R package Rchemcpp does this automatically if you use "detectArom=TRUE".
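To illustrate why the placement of double bonds matters, here is a toy sketch in Python. It is not the Rchemcpp implementation: it simply counts labelled paths of up to two bonds and computes a normalized similarity, which is my simplified stand-in for a spectrum-type kernel. The molecule (a ring with one N and one methyl group, hydrogens omitted) and all function names are made up for the example.

```python
from collections import Counter
from math import sqrt

ATOMS = ["C", "N", "C", "C", "C", "C", "C"]  # atoms 0-5 form the ring, 6 is a methyl C
RING = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]


def molecule(ring_orders):
    """Bond list (i, j, type): ring bonds with the given types plus the methyl bond."""
    return [(i, j, o) for (i, j), o in zip(RING, ring_orders)] + [(0, 6, 1)]


def aromatize(bonds):
    """Recode every ring bond as type 4, leaving other bonds unchanged."""
    ring = {frozenset(pair) for pair in RING}
    return [(i, j, 4 if frozenset((i, j)) in ring else o) for i, j, o in bonds]


def path_features(atoms, bonds, max_bonds=2):
    """Count labelled simple paths with 1..max_bonds bonds (toy feature map)."""
    adj = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        adj[i].append((j, order))
        adj[j].append((i, order))
    counts = Counter()

    def extend(path, label):
        if len(path) > 1:
            # Use a direction-independent key; each path is seen once from
            # each end, so all counts are doubled consistently.
            counts[min(label, label[::-1])] += 1
        if len(path) - 1 == max_bonds:
            return
        for nxt, order in adj[path[-1]]:
            if nxt not in path:
                extend(path + [nxt], label + (str(order), atoms[nxt]))

    for start in range(len(atoms)):
        extend([start], (atoms[start],))
    return counts


def similarity(a, b):
    """Normalized similarity k(a,b) / sqrt(k(a,a) * k(b,b)) on feature counts."""
    def dot(x, y):
        return sum(v * y[k] for k, v in x.items())
    return dot(a, b) / sqrt(dot(a, a) * dot(b, b))


# The same molecule written with the two possible double-bond placements.
bonds_a = molecule([2, 1, 2, 1, 2, 1])
bonds_b = molecule([1, 2, 1, 2, 1, 2])

sim_kekule = similarity(path_features(ATOMS, bonds_a), path_features(ATOMS, bonds_b))
sim_aromatic = similarity(path_features(ATOMS, aromatize(bonds_a)),
                          path_features(ATOMS, aromatize(bonds_b)))
print(sim_kekule, sim_aromatic)
```

In this toy setting the two double-bond placements give different path labels next to the N and the methyl group, so the first similarity stays below 1. After recoding the ring bonds as type 4 both inputs become identical.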
I have tested your SDF files with this preprocessing, and the similarity between your test molecule and the molecule in the database is now "1" as expected. The web-service "Rchemcpp" does this preprocessing automatically.
b) Potential memory leaks?
Thanks for bringing this up - we will thoroughly check our package. Meanwhile, you should consider calling "sd2gramSpectrum" only once instead of in a loop, with one SDF file containing all query molecules as the first argument and another SDF file containing all molecules from the database as the second argument. This is very efficient w.r.t. computation time and memory.
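If the query molecules are currently stored one per file, they first have to be combined into a single SDF file. A minimal sketch in Python, relying only on the SDF convention that each record ends with a line containing "$$$$"; the function name merge_sdf and the file names in the usage note are hypothetical.

```python
from pathlib import Path


def merge_sdf(paths, out_path):
    """Concatenate SDF files into one multi-molecule SDF file.

    Returns the number of "$$$$" record terminators written.
    """
    records = 0
    with open(out_path, "w", encoding="utf-8", newline="\n") as out:
        for path in paths:
            text = Path(path).read_text(encoding="utf-8").replace("\r\n", "\n")
            # Strip only trailing whitespace: the first line of a record
            # (the title line) is allowed to be blank.
            text = text.rstrip()
            if not text:
                continue
            if not text.endswith("$$$$"):
                text += "\n$$$$"  # plain molfile without a record terminator
            out.write(text + "\n")
            records += sum(1 for line in text.split("\n") if line.strip() == "$$$$")
    return records
```

For example, merge_sdf(["q1.sdf", "q2.sdf"], "queries.sdf") would produce a single query file, which can then be passed as the first argument to sd2gramSpectrum in R, with the DrugBank SDF file as the second argument.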