How to download all protein-protein interactions of an organism with Bioconductor
hkarakurt ▴ 20

Hello everyone,

I need to download the whole protein-protein interaction network of Mus musculus. I downloaded it from STRING, but the file is too big to manipulate (about 12 million lines). I need to remove the interactions with low confidence scores, but I cannot open this file with Matlab, R, Python or Excel.

I really need to find this network. Has anyone ever downloaded such a network with an R package?

ppi stringdb mus musculus • 422 views
damian.szk ▴ 20

You are right, it's impossible to open such a large file in Excel or any other visual editing tool.

However, R, Python or Matlab can easily parse the file, because for simple parsing the size does not matter. Just read it line by line and output only the lines you actually need. Here is a simple Python script that keeps only the high-confidence interactions (score of 700 or above):

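import gzip

in_path = "10090.protein.links.v10.5.txt.gz"

# Open the gzipped file in text mode and stream it one line at a time,
# so the whole file never has to fit in memory
with gzip.open(in_path, "rt") as fh_in, open("pruned_file.tsv", "w") as fh_out:
    header = True
    for line in fh_in:
        if header:  # skip the first line
            header = False
            continue

        row = line.split()  # split on any whitespace (tabs or spaces)
        score = int(row[-1])  # the score is the last column

        if score >= 700:  # only high confidence interactions
            fh_out.write(line)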

This will cut the size of the file considerably. It will probably still be too large to open in Excel, but small enough to load all of the remaining interactions into memory in just a few seconds.
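
As a rough sketch of that last step, the pruned file could be loaded into a dictionary that maps each protein to its interaction partners. The function name `load_network` is made up for illustration. The code assumes each line holds two protein identifiers followed by the score in the last column, and that the file is the `pruned_file.tsv` written by the script above.

```python
from collections import defaultdict


def load_network(path):
    """Read a pruned links file into a dict: protein -> {partner: score}.

    Assumes (not confirmed by the thread) that each line is
    'protein1 protein2 ... score', separated by tabs or spaces.
    """
    network = defaultdict(dict)
    with open(path) as fh:
        for line in fh:
            fields = line.split()  # split on any whitespace
            # Skip blank lines and any header line whose last field is not a number
            if len(fields) < 3 or not fields[-1].isdigit():
                continue
            protein1, protein2 = fields[0], fields[1]
            network[protein1][protein2] = int(fields[-1])
    return network
```

Calling `network = load_network("pruned_file.tsv")` would then give direct access to the partners of any protein, for example `network[some_protein_id]`, where `some_protein_id` is whatever identifier appears in your file.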

Hope this somehow helps.




