[SOLVED] Networkx: How to create the incidence matrix from a huge graph

Issue

I am reading a very big graph specifically the wiki-topcats (http://snap.stanford.edu/data/wiki-topcats.html) and while I can read and create in acceptable time the graph with:

graph = nx.read_edgelist("C:/.../wiki-topcats.txt", nodetype=int)

Then I need to extract the incidence matrix in order to create the linegraph (that’s my goal) , but when I run:

C = nx.incidence_matrix(graph)

I am getting memory errors, and I cannot find any efficient way of dealing with this or maybe a way to work around this by creating the incidence matrix from the adjancency, but I must work with scipy matrices due to magnitude of the graph e.g. this adjacency matrix works.

A = nx.to_scipy_sparse_matrix(graph)

. Any help would be greatly appreciated.

Solution

NetworkX is clearly not designed to compute big graphs. It is not very performant either (for both memory consumption and speed). Graph databases are meant to solve this problem by storing graphs on a storage device. Neo4j is an example of such database available in Python. That being said, the graph you want to compute is not so big and you could try to use alternative Python libraries that are more efficient like graph-tool.


Assuming you really want/need to use NetworkX, then be aware that the graph takes about 12 GiB of RAM and many NetworkX function will return a full graph resulting in several dozen of GiB of RAM allocated which is a complete waste of resources. Not to mention that most machine does not have so much memory. The graph file take only about 400 MiB and compact representations should be able to store in RAM in less than 100 MiB. Thus, using 12 GiB for that require >120 times more resources than expected. Not to mention that NetworkX take a very long time to fill such memory space.

Loading the file directly in a sparse matrix is much more efficient assuming the sparse matrix type is carefully chosen. Indeed, CSR matrix can store the adjacency matrix pretty efficiently while a DOK matrix is very inefficient (it takes 6 GiB of RAM and is as slow as NetworkX).

Note that np.genfromtxt (which is intended to load such kind of file) goes crazy by using 45-50 GiB of RAM on my machine. This is insane as it is actually ~200 times bigger than what is strictly needed! Hopefully, df = pd.read_table('wiki-topcats.txt', dtype=np.int32, header=None, sep=' ') from Pandas is able to load the file using less than 400 MiB. Based on that, you can easily convert the dataframe to Numpy with edges = df.to_numpy() and then build a relatively fast CSR matrix.

Note that you can build directly the (sparse CSR) incidence matrix with Numpy. One solution is to use idx = np.arange(len(df), dtype=np.int32) so to set an ID to each edge. Then, you can directly know all the position of the 1 in the sparse matrix (m[edges[:,0], idx] and m[edges[:,1], idx]) which should be enough to build it (be careful about possible duplicates that could be due to a node connected to itself).

Answered By – Jérôme Richard

Answer Checked By – Pedro (BugsFixing Volunteer)

Leave a Reply

Your email address will not be published. Required fields are marked *