I am trying to use RpsiXML to parse human interaction data downloaded from IntAct with the goal of building a human PPI network. I have downloaded the file human.zip from ftp://ftp.ebi.ac.uk/pub/databases/intact/current/psi25/species/ and the archive unzips to 172 individual files.
For example, human_01.xml is a 43.2 MB file:
grep "<primaryRef db=\"uniprotkb\"" human_01.xml | wc -l
... gives me 1187 lines. This already seems low - but I don't even get that with RpsiXML:
library("RpsiXML")
intact01xml <- parsePsimi25Interaction("./human/human_01.xml",
                                       INTACT.PSIMI25,
                                       verbose=FALSE)
length(interactions(intact01xml))  # 2
intactGraph <- psimi25XML2Graph("./human/human_01.xml",
                                INTACT.PSIMI25,
                                type = "interaction",
                                verbose=FALSE)
length(nodes(intactGraph))  # 87
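length(edges(intactGraph))  # 87 ... |nodes| == |edges| ???
# (edges() returns one adjacency list per node, so its length matches the node count; numEdges() may be the call that is wanted here)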
table(degree(intactGraph))  
#           outDegree
# inDegree  0  1  2  3  5
#       0   0 37  5  0  0
#       1   9 12  4  1  0
#       2   4  4  4  0  0
#       3   1  2  0  0  0
#       4   0  0  1  0  0
#       5   1  0  0  0  0
#       7   0  0  0  0  1
#       16  1  0  0  0  0
87 interactions in 43.2 MB of data? Something seems amiss. I may be misunderstanding what to expect in this set of IntAct files, or how to parse them properly. Any help is appreciated.
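One cross-check that does not depend on RpsiXML would be to count opening tags directly in the XML. This is only a sketch: it assumes a grep that supports -o (GNU and BSD grep do) and that the file uses the PSI-MI 2.5 element names entry, interaction, interactor and participant. count_tag is a made-up helper name, not part of any tool.

```shell
#!/bin/sh
# Count occurrences of an opening tag such as <interaction id="..."> or <interaction>.
# The trailing [ >] keeps longer names (<interactionList>, <interactionType>,
# <entrySet>) out of the count. grep -o prints one match per line, so several
# tags on the same physical line are all counted.
count_tag() {
    grep -o "<$1[ >]" "$2" | wc -l | tr -d ' '
}

# File to inspect; defaults to the file discussed above (assumed to be in the
# current directory).
file="${1:-human_01.xml}"

if [ -f "$file" ]; then
    for tag in entry interaction interactor participant; do
        printf '%s\t%s\n' "$tag" "$(count_tag "$tag" "$file")"
    done
fi
```

If the interaction count from this differs a lot from what the R calls report, that would at least narrow down whether the file or the parsing is the surprising part.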
