paint-brush
Predicting Links in Graphs with Graph Neural Networks and DGL.aiby@andrei9735

Predicting Links in Graphs with Graph Neural Networks and DGL.ai

by AndreiNovember 6th, 2024
Read on Terminal Reader

Too Long; Didn't Read

Learn to build a model for predicting new links in a social graph, demonstrating the use of Graph Neural Networks and DGL for link prediction.
featured image - Predicting Links in Graphs with Graph Neural Networks and DGL.ai
Andrei HackerNoon profile picture


Graphs are a powerful way to represent data, uniquely capturing relationships between entities across a wide variety of applications. Whether you're modeling social networks, protein interactions, transportation systems, or recommendation engines, graphs naturally represent and analyze these complex interdependencies. In today’s data-driven world, understanding relationships between entities is often just as important as understanding the entities themselves - this is where graphs truly shine.


Link prediction is one of the fundamental tasks in graph analytics, involving the prediction of connections (or links) between nodes (the entities represented in a graph). Imagine recommending new friends in a social network, predicting potential collaborations in an academic citation graph, or forecasting future interactions between users and products in an e-commerce setting - these are all examples of link prediction in action. This task helps expand networks, infer missing information, and detect anomalies. For applications ranging from enhancing user experience to improving fraud detection, link prediction is a key ingredient for success.


To illustrate link prediction, we will use the Twitch Social Network dataset from the Stanford Network Analysis Project (SNAP). This dataset captures the social connections between users on the Twitch streaming platform, where nodes represent Twitch users, and edges represent friendships between them. The dataset is well-structured, making it easy to preprocess and work with.


By following along, you'll learn how to set up a project, preprocess data, build a model, and evaluate it for link prediction on a real-world dataset.

Graph Neural Networks and Deep Graph Library

Working with graph-structured data presents unique challenges, and this is where Graph Neural Networks (GNNs) come in. GNNs are a type of neural network specifically designed to work with graph data. Unlike traditional neural networks that operate on fixed-size input, GNNs can handle arbitrary graph sizes and leverage the connectivity patterns within the data. By aggregating information from a node's neighbors, GNNs learn representations that capture both the node attributes and the structure of the graph, making them highly effective for tasks like node classification, link prediction, and graph classification.


The Deep Graph Library (DGL.ai) is a powerful toolkit for building GNNs with ease and efficiency. With DGL, developers can leverage state-of-the-art GNN architectures to tackle a variety of tasks, including link prediction. DGL provides a range of utilities for working with both homogeneous and heterogeneous graphs, making it a versatile tool for researchers and practitioners alike. By simplifying the implementation of GNNs, DGL allows you to focus more on developing innovative solutions rather than getting bogged down in the underlying technical complexities.


With this foundation in mind, let’s dive into building a link prediction model using GNNs and DGL.ai.

Setting up the project

The first step is to set up our project by importing the required libraries:

import json
import numpy as np
import pandas as pd
import dgl
from dgl.data import DGLDataset
from dgl.nn import SAGEConv
import torch
import torch.nn as nn
from torch.nn.functional import binary_cross_entropy_with_logits, relu, dropout
from torch.nn.utils import clip_grad_norm_
from torch.optim.lr_scheduler import ReduceLROnPlateau
import itertools
import scipy.sparse as sp
from sklearn.metrics import roc_auc_score

Data preprocessing and train-test split

To prepare the data for training, we’ll first load the Twitch dataset, represent it as a graph, and then split it into training and testing sets. We’ll create a custom class that inherits from DGLDataset, which will help structure the data-loading process and streamline graph-related operations.


  1. Loading the Graph Data: In the process function, we start by reading the list of edges from the dataset. Since the edges in the original dataset are undirected (representing mutual friendships), we add reverse edges to ensure that connections are bidirectional in our graph.
  2. Loading Node Features: Next, we load node features from a JSON file. Each node has a list of features that may vary in length, so we pad shorter feature lists with zeros to maintain a consistent length across all nodes.


Here’s the code to create and preprocess the dataset:

# create a dataset that inherits DGLDataset
class SocialNetworkDataset(DGLDataset):
   def __init__(self):
       super().__init__(name='social_network')

   def process(self):
       # load edges
       edges_df = pd.read_csv('./twitch/ENGB/musae_ENGB_edges.csv')
       # ensure edges are bidirectional
       edges_df_rev = edges_df.copy()
       edges_df_rev.columns = ['to', 'from']
       edges_df_rev = edges_df_rev[['from', 'to']]
       edges_df = pd.concat([edges_df, edges_df_rev], ignore_index=True)
       edges_df.drop_duplicates(inplace=True)
       # create a graph using DGL
       max_node_id = max(edges_df['from'].max(), edges_df['to'].max())
       edges_src = torch.from_numpy(edges_df['from'].to_numpy())
       edges_dst = torch.from_numpy(edges_df['to'].to_numpy())
       self.graph = dgl.graph(
           (edges_src, edges_dst), num_nodes=max_node_id + 1,
       )
       # load and node features
       with open('./twitch/ENGB/musae_ENGB_features.json') as f:
           node_features_dict = json.load(f)
           # feature lists have various lengths, pad them with zeros
           max_feature_list_len = max([len(l) for l in node_features_dict.values()])
           for k in node_features_dict:
               if len(node_features_dict[k]) < max_feature_list_len:
                   node_features_dict[k] += [0] * (max_feature_list_len - len(node_features_dict[k]))
           # set node features in graph
           node_features_df = pd.DataFrame.from_dict(node_features_dict).T.astype('float64')
           node_features_np = node_features_df.to_numpy()
           self.graph.ndata['feat'] = torch.from_numpy(node_features_np).float()

   def __len__(self):
       return 1  # only the whole graph is returned

   def __getitem__(self, idx):
       return self.graph


We now initialize our dataset to load the graph data.

# init the dataset
dataset = SocialNetworkDataset()
g = dataset[0]


The next step is to create the training and testing sets. We'll split the edges into an 80/20 ratio for training and testing. We generate both positive (existing edges) and negative samples (nonexistent edges) for both sets. For larger graphs, DGL's dgl.sampling utilities can be helpful, but here, the entire graph fits in memory. Here’s the code to create training and testing sets:


# pick edges for train and test sets (80/20 split)
# (for larger graphs, we can use dgl.sampling.negative etc)
u, v = g.edges()

edge_ids = np.random.permutation(g.num_edges())
test_set_size = int(len(edge_ids) * 0.2)
train_set_size = len(edge_ids) - test_set_size

# positive samples: existing edges
test_positive_u, test_positive_v = u[edge_ids[:test_set_size]], v[edge_ids[:test_set_size]]
train_positive_u, train_positive_v = u[edge_ids[test_set_size:]], v[edge_ids[test_set_size:]]

# negative samples: nonexistent edges
adj = sp.coo_matrix((np.ones(len(u)), (u.numpy(), v.numpy())))
adj_negative = 1 - adj.todense() - np.eye(g.num_nodes())
negative_u, negative_v = np.where(adj_negative != 0)

negative_edge_ids = np.random.choice(len(negative_u), g.num_edges())
test_negative_u, test_negative_v = (
    negative_u[negative_edge_ids[:test_set_size]],
    negative_v[negative_edge_ids[:test_set_size]],
)
train_negative_u, train_negative_v = (
    negative_u[negative_edge_ids[test_set_size:]],
    negative_v[negative_edge_ids[test_set_size:]],
)

# create a training graph by copying the original graph and removing test edges
train_g = dgl.remove_edges(g, edge_ids[:test_set_size])

# define positive and negative graphs for train and test sets
train_positive_g = dgl.graph((train_positive_u, train_positive_v), num_nodes=g.num_nodes())
train_negative_g = dgl.graph((train_negative_u, train_negative_v), num_nodes=g.num_nodes())
test_positive_g = dgl.graph((test_positive_u, test_positive_v), num_nodes=g.num_nodes())
test_negative_g = dgl.graph((test_negative_u, test_negative_v), num_nodes=g.num_nodes())

Model: GraphSAGE Convolutional Layers and MLP Predictor

We’ll use a Graph Sample and Aggregate (GraphSAGE) convolutional neural network to learn node representations, also known as embeddings, that capture both the structure and features of each node within the graph. GraphSAGE operates by aggregating feature information from each node's neighbors to create a meaningful representation for each node. This process, known as neighbor aggregation, enables the model to learn rich, localized patterns in the graph.


In each GraphSAGE layer, the model applies an aggregation function (in this case, the "mean" function) to collect information from neighboring nodes, which is then combined with the node’s own features. Stacking multiple convolutional layers allows the model to capture information from increasingly distant nodes, effectively expanding each node’s view within the graph.


To enhance model performance and reduce overfitting, we’ll apply dropout after each layer.


Let’s now build the GraphSAGE model with three convolutional layers, along with a forward function to define how data flows through it:


# define the GraphSAGE model with 3 convolutional layers
class GraphSAGE(nn.Module):
    def __init__(
            self, 
            input_features, 
            hidden_features, 
            output_features, 
            dropout_probability=0.3,
        ):
        super(GraphSAGE, self).__init__()
        self.conv_layer_1 = SAGEConv(input_features, hidden_features, "mean")
        self.conv_layer_2 = SAGEConv(hidden_features, hidden_features, "mean")
        self.conv_layer_3 = SAGEConv(hidden_features, output_features, "mean")
        self.dropout_probability = dropout_probability

    def forward(self, graph, input_features):
        # first layer with ReLU activation and dropout
        h = relu(self.conv_layer_1(graph, input_features))
        h = dropout(h, p=self.dropout_probability)
        # second layer with ReLU activation and dropout
        h = relu(self.conv_layer_2(graph, h))
        h = dropout(h, p=self.dropout_probability)
        # third layer without dropout
        h = self.conv_layer_3(graph, h)
        return h


The output after the third layer (h) contains the node embeddings. To predict the likelihood of an edge (or link) between any two nodes, we’ll use a Multi-Layer Perceptron (MLP) predictor. This MLP takes the embeddings of two nodes as input and computes a score indicating the probability of an edge existing between them.


# define the MLP predictor
class MLPPredictor(nn.Module):
    def __init__(self, hidden_features):
        super().__init__()
        # first linear layer to combine node embeddings
        self.W1 = nn.Linear(hidden_features * 2, hidden_features)
        # second linear layer to produce a single score output
        self.W2 = nn.Linear(hidden_features, 1)

    def apply_edges(self, edges):
        # concatenate source and destination node embeddings
        h = torch.cat([edges.src["h"], edges.dst["h"]], dim=1)
        # pass through MLP layers to get the edge score
        score = self.W2(relu(self.W1(h))).squeeze(1)
        return {'score': score}

    def forward(self, g, h):
        with g.local_scope():
            g.ndata["h"] = h
            g.apply_edges(self.apply_edges)
            return g.edata["score"]


The MLP predictor works as follows:

  • Input Size: The predictor’s input size is based on the learned node embeddings. Since each edge connects two nodes, we concatenate their embeddings (each of size hidden_features), resulting in an input size of hidden_features * 2.
  • Layer 1 (W1): This layer processes the concatenated embeddings and reduces the output dimension back to hidden_features, preserving important relational information between the nodes.
  • Layer 2 (W2): This final layer produces a single scalar score for each pair of nodes, which represents the likelihood of an edge existing between them.


This layered approach allows the predictor to capture complex relationships between pairs of nodes and compute an edge score that can be interpreted as the probability of an edge's existence.

Loss function and AUC

To train our model effectively, we need a loss function that can quantify the model’s performance on link prediction. Since this task is a binary classification problem - where each link either exists or doesn’t - we use binary cross-entropy (BCE) as our loss function. Binary cross-entropy measures the discrepancy between the model’s predicted scores and the actual labels (1 for an existing link, 0 for no link). We use the _with_logits version because our model outputs raw scores (logits) rather than probabilities. This version of BCE is more stable when working with logits, as it combines the sigmoid function and cross-entropy into one step.


Here’s the code that computes loss:

def compute_loss(positive_logits, negative_logits):
    # concatenate positive and negative scores
    y_predicted = torch.cat([positive_logits, negative_logits])
    # create true labels (1 for existing links, 0 for nonexistent)
    y_true = torch.cat([torch.ones(positive_logits.shape[0]), torch.zeros(negative_logits.shape[0])])
    return binary_cross_entropy_with_logits(y_predicted, y_true)


To evaluate the model, we use the Area Under the ROC Curve (AUC) metric. AUC is well-suited for link prediction because it handles imbalanced data effectively, where negative samples (nonexistent edges) are much more common than positive samples. The AUC score gives us a sense of how well the model ranks existing links higher than nonexistent ones.


Here’s the code to compute AUC:

def compute_auc(positive_logits, negative_logits):
    y_predicted = torch.cat([positive_logits, negative_logits]).detach().numpy()
    y_true = torch.cat([torch.ones(positive_logits.shape[0]), torch.zeros(negative_logits.shape[0])]).detach().numpy()
    return roc_auc_score(y_true, y_predicted)

Note: We use detach() to remove the tensors from the computation graph, allowing us to calculate the AUC without affecting gradients.

Training the model

Now we're ready to train the model. To start, we’ll instantiate the model, the predictor, and the optimizer, and define the training loop. We’ll also specify the learning rate, hidden layer size, and dropout rate, among other hyperparameters. While we won’t cover hyperparameter optimization here, we’ll use a learning rate scheduler to adjust the learning rate if the loss plateaus—meaning it stops decreasing for a set number of epochs (in this case, 25). The scheduler halves the learning rate when this happens, which can help the model converge more effectively.


Here’s the code to initialize the model and set up training:

# init the model
num_hidden_features = 32
model = GraphSAGE(
    train_g.ndata['feat'].shape[1], 
    num_hidden_features, 
    num_hidden_features,
)
predictor = MLPPredictor(num_hidden_features)

# create an optimizer and a learning rate scheduler
learning_rate = 0.01
optimizer = torch.optim.Adam(
    itertools.chain(model.parameters(), predictor.parameters()),
    lr=learning_rate,
)
lr_scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=25)

# train the model
num_epochs = 1000
for epoch in range(num_epochs + 1):
    # forward
    h = model(train_g, train_g.ndata['feat'])
    positive_logits = predictor(train_positive_g, h)
    negative_logits = predictor(train_negative_g, h)
    loss = compute_loss(positive_logits, negative_logits)
    # backward
    optimizer.zero_grad()
    loss.backward()
    # clip gradients
    clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    # adjust learning rate based on the loss
    lr_scheduler.step(loss)
    # print loss, current learning rate, and AUC
    if epoch % 100 == 0:
        last_lr = lr_scheduler.get_last_lr()[0]
        train_auc = compute_auc(positive_logits, negative_logits)
        print(f'Epoch: {epoch}, learning rate: {last_lr}, loss: {loss}, AUC: {train_auc}')


Let's run the code and examine the output:

Epoch: 0, learning rate: 0.01, loss: 262.4156188964844, AUC: 0.4934097124994463
Epoch: 100, learning rate: 0.01, loss: 0.5642552375793457, AUC: 0.7473735298706314
Epoch: 200, learning rate: 0.01, loss: 0.4622882306575775, AUC: 0.8431058751115716
Epoch: 300, learning rate: 0.01, loss: 0.40566185116767883, AUC: 0.8777374138645864
Epoch: 400, learning rate: 0.01, loss: 0.38118976354599, AUC: 0.8944719038039551
Epoch: 500, learning rate: 0.01, loss: 0.3690297603607178, AUC: 0.9039401673234729
Epoch: 600, learning rate: 0.005, loss: 0.3579995930194855, AUC: 0.9112366798940639
Epoch: 700, learning rate: 0.005, loss: 0.3557407557964325, AUC: 0.9128097572016495
Epoch: 800, learning rate: 0.005, loss: 0.3510144352912903, AUC: 0.9152937255697913
Epoch: 900, learning rate: 0.00125, loss: 0.3425179123878479, AUC: 0.9202487786553115
Epoch: 1000, learning rate: 0.00015625, loss: 0.3432360589504242, AUC: 0.9198250134354529


As we can see, the model achieves an AUC of around 0.92, indicating strong predictive performance. Notice that the learning rate is reduced between epochs 500 and 600 when the loss stabilizes. This adjustment allows for finer tuning, leading to a small decrease in loss. After a certain point, the loss and AUC stabilize, indicating the model has converged.


Let's evaluate the model on the test data (which was not used during training) and see if it generalizes well:

# evaluate the model on the test data
with torch.no_grad():
    test_positive_scores = predictor(test_positive_g, h)
    test_negative_scores = predictor(test_negative_g, h)
    test_loss = compute_loss(test_positive_scores, test_negative_scores)
    test_auc = compute_auc(test_positive_scores, test_negative_scores)
    print(f'Test loss: {test_loss}, Test AUC: {test_auc}')


The result is:

Test loss: 0.675215482711792, Test AUC: 0.866213400711374


The test AUC is slightly lower than the training AUC, indicating minor overfitting. However, the AUC of 0.866 shows that the model still performs well on unseen data. Further tuning of hyperparameters could improve generalization, especially if overfitting is a concern.

With our trained model, we can now predict the likelihood of links between nodes in the graph. We’ll generate predictions for all possible pairs of nodes, allowing us to identify potential new connections.

  1. Generating Candidate Node Pairs: We first generate all possible pairs of nodes, excluding self-loops (connections from a node to itself). This gives us the candidate node pairs for which we want to predict link probabilities.
  2. Creating a Candidate Graph: We then create a DGL graph with these candidate edges, and use the model with node embeddings from the original graph. These embeddings, stored in the candidate graph’s ndata['h'] attribute, will serve as inputs for link prediction.


Here’s the code for these steps:

# build node pairs, avoid self-loops (with_replacement=False)
node_pairs = torch.combinations(torch.arange(g.num_nodes()), r=2, with_replacement=False)
candidate_u = node_pairs[:, 0]
candidate_v = node_pairs[:, 1]
# build a graph with all node pairs
candidate_graph = dgl.graph((candidate_u, candidate_v))
candidate_graph_node_embeddings = model(g, g.ndata['feat'])  # we use embeddings from the original graph
candidate_graph.ndata['h'] = candidate_graph_node_embeddings
# use the predictor to predict the existence of links between nodes
predicted_scores = predictor(candidate_graph, candidate_graph_node_embeddings)


Now that we have predictions for all candidate pairs, we can check the link probability between any specific nodes. For example, let’s examine the score and probability of a link between nodes 1773 and 7005, which are not directly connected in the initial dataset:

# find the index of the node pair (1773, 7005)
pair_index = torch.where((candidate_u == 1773) & (candidate_v == 7005))[0]
print(f'Pair index: {pair_index}')
# get the logit score for this pair and compute probability of link existence
pair_link_score = predicted_scores[pair_index].item()  # logit score
print(f'Pair link score: {pair_link_score}')
link_probability = torch.sigmoid(torch.tensor(pair_link_score)).item() # apply sigmoid to convert score into probability
print(f'Link probability: {link_probability * 100}%')


This is the result:

Pair index: tensor([11066978])
Pair link score: 0.7675977945327759
Link probability: 68.30010414123535%


According to our model, there is a 68.3% probability of a link between user 1773 and 7005.

Conclusion

In this post, we successfully built a model for predicting new links in a social graph, demonstrating the use of Graph Neural Networks and DGL for link prediction. Using a relatively small dataset allowed us to work efficiently on a local machine. However, as graphs scale up to millions or billions of nodes and edges, handling them requires more advanced solutions, such as distributed training on GPU clusters.


As a next step, we’ll explore approaches for handling large-scale graphs and implementing link prediction in cloud environments, enabling you to apply these techniques to production-level datasets.