Karate Club a Python library for graph representation learning by@benitorosenberg

March 7th 2020 7,897 reads

PhD candidate at The University of Edinburgh working on machine learning

**Karate Club*** is an unsupervised machine learning extension library for the NetworkX Python package. See the documentation ***here.**

*Karate Club* consists of state-of-the-art methods to do unsupervised learning on graph structured data. To put it simply it is a Swiss Army knife for small-scale graph mining research. First, it provides network embedding techniques at the node and graph level. Second, it includes a variety of overlapping and non-overlapping community detection methods. Implemented methods cover a wide range of network science *(NetSci, Complenet)*, data mining *(ICDM, CIKM, KDD)*, artificial intelligence *(AAAI, IJCAI)* and machine learning *(NeurIPS, ICML, ICLR)* conferences, workshops, and pieces from prominent journals.

*Karate Club* makes the use of modern community detection techniques quite easy (see here for the accompanying tutorial). The snippet below uses an overlapping community detection algorithm on a synthetic graph.

```
import networkx as nx
from karateclub import EgoNetSplitter
g = nx.newman_watts_strogatz_graph(1000, 20, 0.05)
splitter = EgoNetSplitter(1.0)
splitter.fit(g)
print(splitter.get_memberships())
```

When we created Karate Club we used an API oriented machine learning system design point of view in order to make an end-user friendly machine learning tool. This API oriented design principle entails a few simple ideas. In this section we discuss these ideas and their apparent advantages with appropriate illustrative examples in great detail.

**Encapsulated model hyperparameters and inspection**

An unsupervised Karate Club model instance is created by using the constructor of the appropriate Python object. This constructor has a default hyperparameter setting which allows for sensible out-of-the-box model usage. In simple terms this means that the end user does not need to understand the inner model mechanics in great detail to use the methods implemented in our framework.

We set these default hyperparameters to provide a reasonable learning and runtime performance. If needed, these model hyperparameters can be modified at the model creation time with the appropriate parametrization of the constructor. The hyperparameters are stored as public attributes to allow the inspection of model settings.

```
import networkx as nx
from karateclub import DeepWalk
graph = nx.gnm_random_graph(100, 1000)
model = DeepWalk()
print(model.dimensions)
model = DeepWalk(dimensions=64)
print(model.dimensions)
```

We demonstrate the encapsulation of hyperparameters by the code snippet above. First, we want to create an embedding for a *NetworkX *generated *Erdos-Renyi graph *with the standard hyperparameter settings.

When the model is constructed we do not change these default hyperparameters and we can print the standard setting of the dimensions hyperparameter. Second, we decided to set a different number of dimensions, so we created a new model and we can still access the dimensions hyperparameter publicly.

**Consistency and non-proliferation of classes**

Each unsupervised machine learning model in Karate Club is implemented as a separate class which inherits from the * Estimator class*. Algorithms implemented in our framework have a limited number of public methods as we do not assume that the end user is particularly interested in the algorithmic details related to a specific technique.

All models are fitted by the use of the* **fit()** *method which takes the inputs (graph, node features) and calls the appropriate private methods to learn an embedding or clustering. Node and graph embeddings are returned by the ** get_embedding()** public method and cluster memberships are retrieved by calling

```
import networkx as nx
from karateclub import DeepWalk
graph = nx.gnm_random_graph(100, 1000)
model = DeepWalk()
model.fit(graph)
embedding = model.get_embedding()
```

In the snippet above we create a random graph, and **DeepWalk*** *model with the default hyperparameters, we fit this model using the public

This example can be modified to create a **Walklets*** *embedding with minimal effort by changing the model import and the constructor - these modifications result in the snippet below.

```
import networkx as nx
from karateclub import Walklets
graph = nx.gnm_random_graph(100, 1000)
model = Walklets()
model.fit(graph)
embedding = model.get_embedding()
```

Looking at these two snippets the advantage of the* API driven design* is evident as we only needed to do a few modifications. First, one had to change the import of the embedding model. Second, we needed to change the model construction and the default hyperparameters were already set.

Third, the public methods provided by the * DeepWalk *and

**Standardized dataset ingestion**

We designed Karate Club to use standardized dataset ingestion when a model is fitted. Practically this means that algorithms which have the same purpose use the same data types for model training. In detail:

- Neighbourhood based and structural node embedding techniques use a single
as input for the fit method.**NetworkX graph** - Attributed node embedding procedures take a
as input and the features are represented as a**NetworkX graph**or as a**NumPy array**. In these matrices rows correspond to nodes and columns to features.**SciPy sparse matrix** - Graph level embedding methods and statistical graph fingerprints take a
as an input.**list of NetworkX graphs** - Community detection methods use a
as an input.**NetworkX graph**

**High performance model mechanics**

The underlying mechanics of the graph mining algorithms were implemented using widely available Python libraries which are not operation system dependent and do not require the presence of other external libraries like * TensorFlow *or

Dense linear algebra operations are done with * NumPy *and their sparse counterparts use

**Standardized output generation and interfacing**

The standardized output generation of * Karate Club* ensures that unsupervised learning algorithms which serve the same purpose always return the same type of output with a consistent data point ordering.

There is a very important consequence of this design principle. When a certain type of algorithm is replaced with the same type of algorithm, the downstream code which uses the output of the upstream unsupervised model does not have to be changed. Specifically the outputs generated with our framework use the following data structures:

(neighbourhood preserving, attributed and structural) always return a**Node embedding algorithms**when the**NumPy float array**method is called. The number of rows in the array is the number of vertices and the row index always corresponds to the vertex index. Furthermore, the number of columns is the number of embedding dimensions.**get_embedding()**(spectral fingerprints, implicit matrix factorization techniques) return a**Whole graph embedding methods**when the**Numpy float array**method is called. The row index corresponds to the position of a single graph in the list of graphs inputted. In the same way, columns represent the embedding dimensions.**get_embedding()**return a**Community detection procedures**when the**dictionary**method is called. Node indices are keys and the values corresponding to the keys are the community memberships of vertices. Certain graph clustering techniques create a node embedding in order to find vertex clusters. These return a**get_memberships()**when the**NumPy float array**method is called. This array is structured like the ones returned by node embedding algorithms.**get_embedding()**

We demonstrate the standardized output generation and interfacing by the code fragment below. We create clusterings of a random graph and return dictionaries containing the cluster memberships. Using the external community library we can calculate the modularity of these clusterings.

This shows that the standardized output generation makes interfacing with external graph mining and machine learning libraries easy.

```
import community
import networkx as nx
from karateclub import LabelPropagation, SCD
graph = nx.gnm_random_graph(100, 1000)
model = SCD()
model.fit(graph)
scd_memberships = model.get_memberships()
model = LabelPropagation()
model.fit(graph)
lp_memberships = model.get_memberships()
print(community.modularity(scd_memberships, graph))
print(community.modularity(lp_memberships, graph))
```

**Limitations**

The current design of Karate Club has certain limitations and we make assumptions about the input. We assume that that the * NetworkX *graph is undirected and consists of a single

In case of the whole graph embedding algorithms all graphs in the set of graphs must amend the previously listed requirements with respect to the input. The **Weisfeiler-Lehman** feature based embedding techniques allow nodes to have a single string feature which can be accessed with the * feature *key. Without the presence of this key these algorithms default to the use of degree centrality as a node feature.