PhD candidate at The University of Edinburgh working on machine learning
Karate Club consists of state-of-the-art methods for unsupervised learning on graph-structured data. Put simply, it is a Swiss Army knife for small-scale graph mining research. First, it provides network embedding techniques at the node and graph level. Second, it includes a variety of overlapping and non-overlapping community detection methods. The implemented methods come from a wide range of network science (NetSci, Complenet), data mining (ICDM, CIKM, KDD), artificial intelligence (AAAI, IJCAI) and machine learning (NeurIPS, ICML, ICLR) conferences and workshops, as well as from prominent journals.
Karate Club makes the use of modern community detection techniques quite easy (see here for the accompanying tutorial). The snippet below uses an overlapping community detection algorithm on a synthetic graph.
    import networkx as nx
    from karateclub import EgoNetSplitter

    g = nx.newman_watts_strogatz_graph(1000, 20, 0.05)

    splitter = EgoNetSplitter(1.0)
    splitter.fit(g)

    print(splitter.get_memberships())
When we created Karate Club, we took an API-oriented view of machine learning system design in order to make the tool friendly for end users. This design principle entails a few simple ideas. In this section we discuss these ideas and their advantages with illustrative examples.
Encapsulated model hyperparameters and inspection
An unsupervised Karate Club model instance is created by using the constructor of the appropriate Python object. This constructor has a default hyperparameter setting which allows for sensible out-of-the-box model usage. In simple terms this means that the end user does not need to understand the inner model mechanics in great detail to use the methods implemented in our framework.
We set these default hyperparameters to provide reasonable learning and runtime performance. If needed, they can be modified at model creation time through the arguments of the constructor. The hyperparameters are stored as public attributes to allow inspection of the model settings.
    import networkx as nx
    from karateclub import DeepWalk

    graph = nx.gnm_random_graph(100, 1000)

    model = DeepWalk()
    print(model.dimensions)

    model = DeepWalk(dimensions=64)
    print(model.dimensions)
The code snippet above demonstrates the encapsulation of hyperparameters. First, we want to create an embedding for an Erdős-Rényi graph generated with NetworkX, using the standard hyperparameter settings.
When the model is constructed we leave the default hyperparameters unchanged, and we can print the default value of the dimensions hyperparameter. Second, we set a different number of dimensions by creating a new model, and the dimensions hyperparameter remains publicly accessible.
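The pattern behind this behaviour can be sketched in a few lines. The class below is a hypothetical stand-in, not Karate Club's DeepWalk: the name `ToyEmbedder`, the `get_params()` helper and the default values are assumptions chosen purely for illustration.

```python
class ToyEmbedder:
    """Hypothetical model illustrating encapsulated hyperparameters."""

    def __init__(self, dimensions=128, walk_length=80, seed=42):
        # The defaults here are illustrative assumptions, not library values.
        # Every hyperparameter is stored as a public attribute for inspection.
        self.dimensions = dimensions
        self.walk_length = walk_length
        self.seed = seed

    def get_params(self):
        """Return the current hyperparameters as a plain dictionary."""
        return dict(vars(self))


default_model = ToyEmbedder()
custom_model = ToyEmbedder(dimensions=64)

print(default_model.dimensions)   # 128
print(custom_model.dimensions)    # 64
print(custom_model.get_params())
```

Because the settings live on the instance, a user can check what a model was built with without reading its source.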
Consistency and non-proliferation of classes
Each unsupervised machine learning model in Karate Club is implemented as a separate class which inherits from the Estimator class. Algorithms implemented in our framework have a limited number of public methods as we do not assume that the end user is particularly interested in the algorithmic details related to a specific technique.
All models are fitted by the use of the fit() method which takes the inputs (graph, node features) and calls the appropriate private methods to learn an embedding or clustering. Node and graph embeddings are returned by the get_embedding() public method and cluster memberships are retrieved by calling get_memberships().
    import networkx as nx
    from karateclub import DeepWalk

    graph = nx.gnm_random_graph(100, 1000)

    model = DeepWalk()
    model.fit(graph)
    embedding = model.get_embedding()
In the snippet above we create a random graph and a DeepWalk model with the default hyperparameters, fit the model using the public fit() method, and return the embedding by calling the public get_embedding() method.
This example can be modified to create a Walklets embedding with minimal effort by changing the model import and the constructor; these modifications result in the snippet below.
    import networkx as nx
    from karateclub import Walklets

    graph = nx.gnm_random_graph(100, 1000)

    model = Walklets()
    model.fit(graph)
    embedding = model.get_embedding()
Comparing these two snippets, the advantage of the API-driven design is evident: we only needed a few modifications. First, we changed the import of the embedding model. Second, we changed the model construction, with the default hyperparameters already set.
Third, the public methods provided by the DeepWalk and Walklets classes behave the same way. An embedding is learned with fit() and it is returned by get_embedding(). This allows for quick and minimal changes to the code when an upstream unsupervised model used for feature extraction performs poorly.
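To illustrate why a shared interface keeps downstream code stable, here is a minimal sketch in plain Python. It does not reproduce Karate Club's Estimator class: the two toy models, the `_degrees` helper and the `extract_features` function are hypothetical names, and graphs are plain edge lists rather than NetworkX objects.

```python
class Estimator:
    """Hypothetical base class exposing a small, uniform public interface."""

    def fit(self, edges):
        raise NotImplementedError

    def get_embedding(self):
        raise NotImplementedError


def _degrees(edges):
    """Count the degree of every node in an undirected edge list."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return degree


class DegreeEmbedder(Estimator):
    """Toy model: embeds each node as its degree."""

    def fit(self, edges):
        degree = _degrees(edges)
        # Row i belongs to node i, assuming nodes 0..n-1 all appear in edges.
        self._embedding = [[float(degree[n])] for n in sorted(degree)]

    def get_embedding(self):
        return self._embedding


class NormalizedDegreeEmbedder(Estimator):
    """Toy model: embeds each node as its degree over the maximum degree."""

    def fit(self, edges):
        degree = _degrees(edges)
        top = max(degree.values())
        self._embedding = [[degree[n] / top] for n in sorted(degree)]

    def get_embedding(self):
        return self._embedding


def extract_features(model, edges):
    """Downstream code depends only on fit() and get_embedding()."""
    model.fit(edges)
    return model.get_embedding()


edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
for model in (DegreeEmbedder(), NormalizedDegreeEmbedder()):
    print(type(model).__name__, extract_features(model, edges))
```

Swapping one model for the other changes a single constructor call, while `extract_features` stays untouched.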
Standardized dataset ingestion
We designed Karate Club to use standardized dataset ingestion when a model is fitted. Practically this means that algorithms which have the same purpose use the same data types for model training. In detail:
High performance model mechanics
The underlying mechanics of the graph mining algorithms were implemented using widely available Python libraries which are not operating system dependent and do not require other external libraries such as TensorFlow or PyTorch. The internal graph representations in Karate Club use NetworkX.
Dense linear algebra operations are done with NumPy and their sparse counterparts use SciPy. Implicit matrix factorization techniques utilize the Gensim package, and methods which rely on graph signal processing use PyGSP.
Standardized output generation and interfacing
The standardized output generation of Karate Club ensures that unsupervised learning algorithms which serve the same purpose always return the same type of output with a consistent data point ordering.
There is a very important consequence of this design principle. When a certain type of algorithm is replaced with the same type of algorithm, the downstream code which uses the output of the upstream unsupervised model does not have to be changed. Specifically the outputs generated with our framework use the following data structures:
We demonstrate the standardized output generation and interfacing by the code fragment below. We create clusterings of a random graph and return dictionaries containing the cluster memberships. Using the external community library we can calculate the modularity of these clusterings.
This shows that the standardized output generation makes interfacing with external graph mining and machine learning libraries easy.
    import community
    import networkx as nx
    from karateclub import LabelPropagation, SCD

    graph = nx.gnm_random_graph(100, 1000)

    model = SCD()
    model.fit(graph)
    scd_memberships = model.get_memberships()

    model = LabelPropagation()
    model.fit(graph)
    lp_memberships = model.get_memberships()

    print(community.modularity(scd_memberships, graph))
    print(community.modularity(lp_memberships, graph))
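To make concrete what a downstream consumer does with a membership dictionary, the sketch below computes the standard Newman modularity from a dictionary mapping node indices to community ids. It is a plain-Python illustration, not the implementation used by the community library, and it assumes an undirected, unweighted graph without self-loops given as an edge list.

```python
from collections import defaultdict


def modularity(memberships, edges):
    """Newman modularity of a partition of an undirected, unweighted graph.

    memberships: dict mapping node index -> community id
    edges: list of (u, v) pairs, each undirected edge listed once
    """
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # total degree of nodes in the community
    for u, v in edges:
        cu, cv = memberships[u], memberships[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_cluster = {node: 0 for node in range(6)}

print(round(modularity(two_clusters, edges), 4))  # 0.3571
print(modularity(one_cluster, edges))             # 0.0
```

Any clustering method that returns a dictionary of this shape can be scored by the same function without further changes.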
The current design of Karate Club has certain limitations, and we make assumptions about the input. We assume that the NetworkX graph is undirected and consists of a single connected component. All algorithms assume that nodes are indexed consecutively with integers, starting from 0. Moreover, we assume that the graph is not multipartite, that nodes are homogeneous, and that the edges are unweighted (each edge has a unit weight).
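Input data often violates the indexing and connectivity assumptions, so a small preprocessing step can help. The sketch below works on plain edge lists; `relabel_consecutive` and `is_connected` are hypothetical helper names written for this example and are not part of Karate Club.

```python
from collections import deque


def relabel_consecutive(edges):
    """Map arbitrary hashable node ids to 0..n-1 in order of first appearance."""
    mapping = {}
    for u, v in edges:
        for node in (u, v):
            if node not in mapping:
                mapping[node] = len(mapping)
    relabeled = [(mapping[u], mapping[v]) for u, v in edges]
    return relabeled, mapping


def is_connected(num_nodes, edges):
    """Breadth-first search check that an undirected graph is one component."""
    if num_nodes == 0:
        return False
    neighbors = [[] for _ in range(num_nodes)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    seen = {0}
    queue = deque([0])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == num_nodes


raw_edges = [("a", "b"), ("b", "c"), ("x", "y")]
edges, mapping = relabel_consecutive(raw_edges)

print(mapping)                             # {'a': 0, 'b': 1, 'c': 2, 'x': 3, 'y': 4}
print(edges)                               # [(0, 1), (1, 2), (3, 4)]
print(is_connected(len(mapping), edges))   # False
```

When working with NetworkX directly, `nx.convert_node_labels_to_integers` and `nx.is_connected` cover the same two steps.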
For the whole graph embedding algorithms, every graph in the set of graphs must meet the input requirements listed above. The Weisfeiler-Lehman feature based embedding techniques allow nodes to have a single string feature which can be accessed with the feature key. Without the presence of this key these algorithms default to the use of degree centrality as a node feature.
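The fallback from the feature key to the node degree can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: the graph is an adjacency dictionary, the fallback is applied whenever any node lacks the key, and the refinement step is one common formulation of a Weisfeiler-Lehman round, with SHA-256 chosen arbitrarily as the hash.

```python
import hashlib


def initial_labels(neighbors, node_attributes):
    """Use the "feature" string if every node has one, otherwise the degree."""
    if all("feature" in node_attributes.get(n, {}) for n in neighbors):
        return {n: str(node_attributes[n]["feature"]) for n in neighbors}
    return {n: str(len(neighbors[n])) for n in neighbors}


def wl_round(neighbors, labels):
    """One refinement: hash a node's label with its sorted neighbour labels."""
    new_labels = {}
    for node, adjacent in neighbors.items():
        parts = [labels[node]] + sorted(labels[a] for a in adjacent)
        signature = "_".join(parts)
        new_labels[node] = hashlib.sha256(signature.encode()).hexdigest()
    return new_labels


# A path graph 0 - 1 - 2 stored as an adjacency dictionary.
neighbors = {0: [1], 1: [0, 2], 2: [1]}

# No "feature" key present: labels fall back to the node degree.
print(initial_labels(neighbors, {}))   # {0: '1', 1: '2', 2: '1'}

# With string features, the two end nodes stay indistinguishable.
with_features = {0: {"feature": "C"}, 1: {"feature": "O"}, 2: {"feature": "C"}}
labels = initial_labels(neighbors, with_features)
refined = wl_round(neighbors, labels)
print(refined[0] == refined[2])   # True
print(refined[0] == refined[1])   # False
```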