Karate Club is an unsupervised machine learning extension library for the NetworkX Python package. See the documentation here. consists of state-of-the-art methods to do unsupervised learning on graph structured data. To put it simply it is a Swiss Army knife for small-scale graph mining research. First, it provides network embedding techniques at the node and graph level. Second, it includes a variety of overlapping and non-overlapping community detection methods. Implemented methods cover a wide range of network science , data mining , artificial intelligence and machine learning conferences, workshops, and pieces from prominent journals. Karate Club (NetSci, Complenet) (ICDM, CIKM, KDD) (AAAI, IJCAI) (NeurIPS, ICML, ICLR) A simple example makes the use of modern community detection techniques quite easy (see for the accompanying tutorial). The snippet below uses an on a synthetic graph. Karate Club here overlapping community detection algorithm networkx nx karateclub EgoNetSplitter g = nx.newman_watts_strogatz_graph( , , ) splitter = EgoNetSplitter( ) splitter.fit(g) print(splitter.get_memberships()) import as from import 1000 20 0.05 1.0 Design principles When we created Karate Club we used an API oriented machine learning system design point of view in order to make an end-user friendly machine learning tool. This API oriented design principle entails a few simple ideas. In this section we discuss these ideas and their apparent advantages with appropriate illustrative examples in great detail. Encapsulated model hyperparameters and inspection An unsupervised Karate Club model instance is created by using the constructor of the appropriate Python object. This constructor has a default hyperparameter setting which allows for sensible out-of-the-box model usage. In simple terms this means that the end user does not need to understand the inner model mechanics in great detail to use the methods implemented in our framework. We set these default hyperparameters to provide a reasonable learning and runtime performance. If needed, these model hyperparameters can be modified at the model creation time with the appropriate parametrization of the constructor. The hyperparameters are stored as public attributes to allow the inspection of model settings. networkx nx karateclub DeepWalk graph = nx.gnm_random_graph( , ) model = DeepWalk() print(model.dimensions) model = DeepWalk(dimensions= ) print(model.dimensions) import as from import 100 1000 64 We demonstrate the encapsulation of hyperparameters by the code snippet above. First, we want to create an embedding for a generated with the standard hyperparameter settings. NetworkX Erdos-Renyi graph When the model is constructed we do not change these default hyperparameters and we can print the standard setting of the dimensions hyperparameter. Second, we decided to set a different number of dimensions, so we created a new model and we can still access the dimensions hyperparameter publicly. Consistency and non-proliferation of classes Each unsupervised machine learning model in Karate Club is implemented as a separate class which inherits from the . Algorithms implemented in our framework have a limited number of public methods as we do not assume that the end user is particularly interested in the algorithmic details related to a specific technique. Estimator class All models are fitted by the use of the method which takes the inputs (graph, node features) and calls the appropriate private methods to learn an embedding or clustering. Node and graph embeddings are returned by the public method and cluster memberships are retrieved by calling fit() get_embedding() get_memberships() . networkx nx karateclub DeepWalk graph = nx.gnm_random_graph( , ) model = DeepWalk() model.fit(graph) embedding = model.get_embedding() import as from import 100 1000 In the snippet above we create a random graph, and model with the default hyperparameters, we fit this model using the public method and return the embedding by calling the public method. DeepWalk fit() get_embedding() This example can be modified to create a embedding with minimal effort by changing the model import and the constructor - these modifications result in the snippet below. Walklets networkx nx karateclub Walklets graph = nx.gnm_random_graph( , ) model = Walklets() model.fit(graph) embedding = model.get_embedding() import as from import 100 1000 Looking at these two snippets the advantage of the is evident as we only needed to do a few modifications. First, one had to change the import of the embedding model. Second, we needed to change the model construction and the default hyperparameters were already set. API driven design Third, the public methods provided by the and classes behave the same way. An embedding is learned with and it is returned by This allows for quick and minimal changes to the code when an upstream unsupervised model used for feature extraction performs poorly. DeepWalk Walklets fit() get_embedding(). We designed Karate Club to use standardized dataset ingestion when a model is fitted. Practically this means that algorithms which have the same purpose use the same data types for model training. In detail: Standardized dataset ingestion Neighbourhood based and structural node embedding techniques use a single as input for the fit method. NetworkX graph Attributed node embedding procedures take a as input and the features are represented as a or as a . In these matrices rows correspond to nodes and columns to features. NetworkX graph NumPy array SciPy sparse matrix Graph level embedding methods and statistical graph fingerprints take a as an input. list of NetworkX graphs Community detection methods use a as an input. NetworkX graph High performance model mechanics The underlying mechanics of the graph mining algorithms were implemented using widely available Python libraries which are not operation system dependent and do not require the presence of other external libraries like or does. The internal graph representations in Karate Club use . TensorFlow PyTorch NetworkX Dense linear algebra operations are done with and their sparse counterparts use . Implicit matrix factorization techniques utilize the package and methods which rely on graph signal processing use NumPy SciPy GenSim PyGSP. Standardized output generation and interfacing The standardized output generation of ensures that unsupervised learning algorithms which serve the same purpose always return the same type of output with a consistent data point ordering. Karate Club There is a very important consequence of this design principle. When a certain type of algorithm is replaced with the same type of algorithm, the downstream code which uses the output of the upstream unsupervised model does not have to be changed. Specifically the outputs generated with our framework use the following data structures: (neighbourhood preserving, attributed and structural) always return a when the method is called. The number of rows in the array is the number of vertices and the row index always corresponds to the vertex index. Furthermore, the number of columns is the number of embedding dimensions. Node embedding algorithms NumPy float array get_embedding() (spectral fingerprints, implicit matrix factorization techniques) return a when the method is called. The row index corresponds to the position of a single graph in the list of graphs inputted. In the same way, columns represent the embedding dimensions. Whole graph embedding methods Numpy float array get_embedding() return a when the method is called. Node indices are keys and the values corresponding to the keys are the community memberships of vertices. Certain graph clustering techniques create a node embedding in order to find vertex clusters. These return a when the method is called. This array is structured like the ones returned by node embedding algorithms. Community detection procedures dictionary get_memberships() NumPy float array get_embedding() We demonstrate the standardized output generation and interfacing by the code fragment below. We create clusterings of a random graph and return dictionaries containing the cluster memberships. Using the external community library we can calculate the modularity of these clusterings. This shows that the standardized output generation makes interfacing with external graph mining and machine learning libraries easy. community networkx nx karateclub LabelPropagation, SCD graph = nx.gnm_random_graph( , ) model = SCD() model.fit(graph) scd_memberships = model.get_memberships() model = LabelPropagation() model.fit(graph) lp_memberships = model.get_memberships() print(community.modularity(scd_memberships, graph)) print(community.modularity(lp_memberships, graph)) import import as from import 100 1000 Limitations The current design of Karate Club has certain limitations and we make assumptions about the input. We assume that that the graph is undirected and consists of a single . All algorithms assume that nodes are and the starting node index is 0. Moreover, we assume that the graph is not multipartite, nodes are homogeneous and the edges are unweighted (each edge has a unit weight). NetworkX strongly connected component indexed with integers consecutively In case of the whole graph embedding algorithms all graphs in the set of graphs must amend the previously listed requirements with respect to the input. The feature based embedding techniques allow nodes to have a single string feature which can be accessed with the key. Without the presence of this key these algorithms default to the use of degree centrality as a node feature. Weisfeiler-Lehman feature