Dear list,
I'm trying to run protein-protein interaction (PPI) network analysis from R using the package STRINGdb. The package allows me to run some 'minimal' analysis, but I have some more specific needs: when I extract the interactions between given proteins, only the combined score is returned. In the context of some specific projects, however, I do not want to use the max of all types of (sub-)scores (like 'Textmining', 'Experiments', 'Co-expression', etc.), but to ignore some of them (like 'Textmining').
Does anyone have a hint on how to access these specific scores (as is possible when running an analysis on the site https://string-db.org/ by manually clicking on 'Settings' and using the field 'active interaction sources:')?
Many thanks in advance, Wolfgang Raffelsberger
Here is a tiny example / minimal code:
library(BiocManager)
library("STRINGdb")
string_db <- STRINGdb::STRINGdb$new(version="11.5", species=9606, score_threshold=200, input_directory="") # most recent
data1 <- data.frame(Gene.name=c("tp53","atm","egfr"), STRING_id=c("9606.ENSP00000269305","9606.ENSP00000278616","9606.ENSP00000275493"))
netwInt1 <- string_db$get_interactions(data1$STRING_id)
head(netwInt1) # combined score only
STRINGdb$help("get_interactions") # thus, the function operates with single argument only, no way to specify which scores I'd like to use
## My sessionInfo gives:
sessionInfo( )
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 LC_MONETARY=French_France.1252 LC_NUMERIC=C LC_TIME=French_France.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] STRINGdb_2.4.1 BiocManager_1.30.16 EnsDb.Hsapiens.v79_2.99.0 ensembldb_2.16.4 AnnotationFilter_1.16.0 GenomicFeatures_1.44.1
[7] GenomicRanges_1.44.0 GenomeInfoDb_1.28.1 org.Hs.eg.db_3.13.0 AnnotationDbi_1.54.1 IRanges_2.26.0 S4Vectors_0.30.0
[13] Biobase_2.52.0 BiocGenerics_0.38.0
At the moment it's not possible.
The STRINGdb R package doesn't know about channel-specific scores. The file with sub-scores tended to time out during Bioconductor testing, due to its size.
For now the only way to get these scores is to use flat-files. Sorry for the inconvenience.
Also, the combined score is not calculated as the max of the sub-scores, so if we want to reproduce how the website works (and we would have to, for consistency's sake), it is much more involved (and slower). We are looking into it.
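For illustration, here is a minimal R sketch of how channel sub-scores from a flat file might be recombined while leaving out text mining. It is not part of the STRINGdb package. It assumes the scheme described in the STRING FAQ as I understand it: each channel score (0-1000) is scaled to 0-1, a prior of 0.041 is removed, the channels are combined as 1 - prod(1 - s), and the prior is added back once. The function name `combine_string_scores`, the prior value, the column names and the toy numbers are all assumptions of this sketch; the homology correction that STRING applies to some channels is left out, so results may differ from the website.

```r
# Hypothetical helper, NOT part of STRINGdb: recombine STRING channel scores.
# 'scores' holds one column per channel, on the 0-1000 scale of the flat files.
# 'prior' = 0.041 is an assumption taken from the STRING FAQ.
combine_string_scores <- function(scores, prior = 0.041) {
  s <- as.matrix(scores) / 1000             # rescale to 0-1
  s <- pmax(s, prior)                       # scores below the prior count as the prior
  s <- (s - prior) / (1 - prior)            # remove the prior from every channel
  combined <- 1 - apply(1 - s, 1, prod)     # probabilistic "or" across channels
  combined <- combined * (1 - prior) + prior  # add the prior back once
  round(combined * 1000)                    # back to 0-1000 (rounding is my choice)
}

# Toy sub-scores (made-up values) laid out like a 'detailed' links flat file
links <- data.frame(
  protein1     = c("9606.ENSP00000269305", "9606.ENSP00000269305"),
  protein2     = c("9606.ENSP00000278616", "9606.ENSP00000275493"),
  coexpression = c(0, 62),
  experimental = c(900, 300),
  database     = c(900, 0),
  textmining   = c(950, 800)
)

# Combined score from the selected channels only, i.e. ignoring text mining
keep <- c("coexpression", "experimental", "database")
links$score_no_textmining <- combine_string_scores(links[, keep])
links
```

With a single channel the function returns that channel's score unchanged, and a pair with no evidence in any selected channel ends up at the prior (41), so a score threshold still has to be applied afterwards.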
Dear Damian, thanks for the explanations. If one day it becomes possible to access the various sub-scores, I'd be glad to use them. You are right: when looking at some examples I realized the combined score is not a simple maximum...
Since one of my collaborators is very much interested in the co-expression part, I've identified some co-expression databases to use for this purpose, and I'm considering building some tools for this specific 'component' and making a small R package. However, some preliminary checks on a few proteins revealed that these scores/results may differ quite a bit from the co-expression component of STRING. Of course, at this point I can't tell which is more "representative". I'll post here if I have more...
Wolfgang
Dear Wolfgang,
Thanks for the update. STRING co-expression is built on all available array data from GEO, RNA-seq from Expression Atlas and proteomics data from ProteomeHD (Rappsilber lab). It should be quite comprehensive, with little noise for high-scoring edges.
The sub-scores will eventually come, but first there are other issues to tackle with the STRINGdb R package. In the meantime I would encourage you to use the download files or the API (Help -> API). An explanation of how to combine the scores, together with a Python script that does that, is available under Help -> FAQ.
Best, Damian.
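As a rough sketch of the API route mentioned above, the snippet below builds a request URL for the STRING 'network' endpoint in base R. The helper name `string_network_url` is hypothetical, and the endpoint layout, the parameter names and the sub-score column names (`escore`, `ascore`, `tscore`, ...) are assumptions based on my reading of the STRING API help page; please check them there. Note also that string-db.org serves the current STRING release, which may not be version 11.5.

```r
# Hypothetical helper: build a URL for the STRING 'network' API (TSV output).
# Endpoint and parameter names are assumptions taken from the STRING API help.
string_network_url <- function(ids, species = 9606, required_score = 200,
                               base = "https://string-db.org/api/tsv/network") {
  paste0(base,
         "?identifiers=", paste(ids, collapse = "%0d"),  # ids separated by %0d
         "&species=", species,
         "&required_score=", required_score)
}

ids <- c("9606.ENSP00000269305", "9606.ENSP00000278616", "9606.ENSP00000275493")
url <- string_network_url(ids)

# The request itself needs network access, so it is left commented out here.
# netw <- read.delim(url, stringsAsFactors = FALSE)
# head(netw[, c("preferredName_A", "preferredName_B", "score",
#               "escore", "ascore", "tscore")])
```

If the per-channel columns come back as expected, interactions could then be filtered on, for example, the experimental channel alone instead of the combined score.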