paint-brush
Karate Club a Python library for graph representation learningby@benitorosenberg
8,760 reads
8,760 reads

Karate Club a Python library for graph representation learning

by Benedek RozemberczkiMarch 7th, 2020
Read on Terminal Reader
Read this story w/o Javascript
tldt arrow

Too Long; Didn't Read

Karate Club is an unsupervised machine learning extension library for the NetworkX Python package. It provides network embedding techniques at the node and graph level. It includes a variety of overlapping and non-overlapping community detection methods. It is a Swiss Army knife for small-scale graph mining research. It makes the use of modern community detection techniques quite easy (see here for the accompanying tutorial). The code snippet below uses an overlapping community detection algorithm on a synthetic graph. In this section we discuss these ideas and their apparent advantages with appropriate illustrative examples in great detail.

Company Mentioned

Mention Thumbnail

Coin Mentioned

Mention Thumbnail
featured image - Karate Club a Python library for graph representation learning
Benedek Rozemberczki HackerNoon profile picture

Karate Club is an unsupervised machine learning extension library for the NetworkX Python package. See the documentation here.

Karate Club consists of state-of-the-art methods to do unsupervised learning on graph structured data. To put it simply it is a Swiss Army knife for small-scale graph mining research. First, it provides network embedding techniques at the node and graph level. Second, it includes a variety of overlapping and non-overlapping community detection methods. Implemented methods cover a wide range of network science (NetSci, Complenet), data mining (ICDM, CIKM, KDD), artificial intelligence (AAAI, IJCAI) and machine learning (NeurIPS, ICML, ICLR) conferences, workshops, and pieces from prominent journals.

A simple example

Karate Club makes the use of modern community detection techniques quite easy (see here for the accompanying tutorial). The snippet below uses an overlapping community detection algorithm on a synthetic graph.

import networkx as nx
from karateclub import EgoNetSplitter

g = nx.newman_watts_strogatz_graph(1000, 20, 0.05)

splitter = EgoNetSplitter(1.0)

splitter.fit(g)

print(splitter.get_memberships())

Design principles

When we created Karate Club we used an API oriented machine learning system design point of view in order to make an end-user friendly machine learning tool. This API oriented design principle entails a few simple ideas. In this section we discuss these ideas and their apparent advantages with appropriate illustrative examples in great detail.

Encapsulated model hyperparameters and inspection

An unsupervised Karate Club model instance is created by using the constructor of the appropriate Python object. This constructor has a default hyperparameter setting which allows for sensible out-of-the-box model usage. In simple terms this means that the end user does not need to understand the inner model mechanics in great detail to use the methods implemented in our framework.

We set these default hyperparameters to provide a reasonable learning and runtime performance. If needed, these model hyperparameters can be modified at the model creation time with the appropriate parametrization of the constructor. The hyperparameters are stored as public attributes to allow the inspection of model settings.

import networkx as nx
from karateclub import DeepWalk
   
graph = nx.gnm_random_graph(100, 1000)

model = DeepWalk()
print(model.dimensions)

model = DeepWalk(dimensions=64)
print(model.dimensions)

We demonstrate the encapsulation of hyperparameters by the code snippet above. First, we want to create an embedding for a NetworkX generated Erdos-Renyi graph with the standard hyperparameter settings.

When the model is constructed we do not change these default hyperparameters and we can print the standard setting of the dimensions hyperparameter. Second, we decided to set a different number of dimensions, so we created a new model and we can still access the dimensions hyperparameter publicly.

Consistency and non-proliferation of classes

Each unsupervised machine learning model in Karate Club is implemented as a separate class which inherits from the Estimator class. Algorithms implemented in our framework have a limited number of public methods as we do not assume that the end user is particularly interested in the algorithmic details related to a specific technique.

All models are fitted by the use of the fit() method which takes the inputs (graph, node features) and calls the appropriate private methods to learn an embedding or clustering. Node and graph embeddings are returned by the get_embedding() public method and cluster memberships are retrieved by calling get_memberships().

import networkx as nx
from karateclub import DeepWalk
   
graph = nx.gnm_random_graph(100, 1000)

model = DeepWalk()
model.fit(graph)
embedding = model.get_embedding()

In the snippet above we create a random graph, and DeepWalk model with the default hyperparameters, we fit this model using the public fit() method and return the embedding by calling the public get_embedding() method.

This example can be modified to create a Walklets embedding with minimal effort by changing the model import and the constructor - these modifications result in the snippet below.

import networkx as nx
from karateclub import Walklets
    
graph = nx.gnm_random_graph(100, 1000)

model = Walklets()
model.fit(graph)
embedding = model.get_embedding()

Looking at these two snippets the advantage of the API driven design is evident as we only needed to do a few modifications. First, one had to change the import of the embedding model. Second, we needed to change the model construction and the default hyperparameters were already set.

Third, the public methods provided by the DeepWalk and Walklets classes behave the same way. An embedding is learned with fit() and it is returned by get_embedding(). This allows for quick and minimal changes to the code when an upstream unsupervised model used for feature extraction performs poorly.

Standardized dataset ingestion
We designed Karate Club to use standardized dataset ingestion when a model is fitted. Practically this means that algorithms which have the same purpose use the same data types for model training. In detail:

  1. Neighbourhood based and structural node embedding techniques use a single NetworkX graph as input for the fit method.
  2. Attributed node embedding procedures take a NetworkX graph as input and the features are represented as a NumPy array or as a SciPy sparse matrix. In these matrices rows correspond to nodes and columns to features.
  3. Graph level embedding methods and statistical graph fingerprints take a list of NetworkX graphs as an input.
  4. Community detection methods use a NetworkX graph as an input.

High performance model mechanics

The underlying mechanics of the graph mining algorithms were implemented using widely available Python libraries which are not operation system dependent and do not require the presence of other external libraries like TensorFlow or PyTorch does. The internal graph representations in Karate Club use NetworkX.

Dense linear algebra operations are done with NumPy and their sparse counterparts use SciPy. Implicit matrix factorization techniques utilize the GenSim package and methods which rely on graph signal processing use PyGSP.

Standardized output generation and interfacing

The standardized output generation of Karate Club ensures that unsupervised learning algorithms which serve the same purpose always return the same type of output with a consistent data point ordering.

There is a very important consequence of this design principle. When a certain type of algorithm is replaced with the same type of algorithm, the downstream code which uses the output of the upstream unsupervised model does not have to be changed. Specifically the outputs generated with our framework use the following data structures:

  1. Node embedding algorithms (neighbourhood preserving, attributed and structural) always return a NumPy float array when the get_embedding() method is called. The number of rows in the array is the number of vertices and the row index always corresponds to the vertex index. Furthermore, the number of columns is the number of embedding dimensions.
  2. Whole graph embedding methods (spectral fingerprints, implicit matrix factorization techniques) return a Numpy float array when the get_embedding() method is called. The row index corresponds to the position of a single graph in the list of graphs inputted. In the same way, columns represent the embedding dimensions.
  3. Community detection procedures return a dictionary when the get_memberships() method is called. Node indices are keys and the values corresponding to the keys are the community memberships of vertices. Certain graph clustering techniques create a node embedding in order to find vertex clusters. These return a NumPy float array when the get_embedding() method is called. This array is structured like the ones returned by node embedding algorithms.

We demonstrate the standardized output generation and interfacing by the code fragment below. We create clusterings of a random graph and return dictionaries containing the cluster memberships. Using the external community library we can calculate the modularity of these clusterings.

This shows that the standardized output generation makes interfacing with external graph mining and machine learning libraries easy.

import community
import networkx as nx
from karateclub import LabelPropagation, SCD

graph = nx.gnm_random_graph(100, 1000)

model = SCD()
model.fit(graph)
scd_memberships = model.get_memberships()

model = LabelPropagation()
model.fit(graph)
lp_memberships = model.get_memberships()

print(community.modularity(scd_memberships, graph))
print(community.modularity(lp_memberships, graph))

Limitations

The current design of Karate Club has certain limitations and we make assumptions about the input. We assume that that the NetworkX graph is undirected and consists of a single strongly connected component. All algorithms assume that nodes are indexed with integers consecutively and the starting node index is 0. Moreover, we assume that the graph is not multipartite, nodes are homogeneous and the edges are unweighted (each edge has a unit weight).

In case of the whole graph embedding algorithms all graphs in the set of graphs must amend the previously listed requirements with respect to the input. The Weisfeiler-Lehman feature based embedding techniques allow nodes to have a single string feature which can be accessed with the feature key. Without the presence of this key these algorithms default to the use of degree centrality as a node feature.