Predicting Links in Graphs with Graph Neural Networks and DGL.ai

Graphs are a powerful way to represent data, uniquely capturing relationships between entities across a wide variety of applications. Whether you're modeling social networks, protein interactions, transportation systems, or recommendation engines, graphs naturally represent and analyze these complex interdependencies. In today’s data-driven world, understanding relationships between entities is often just as important as understanding the entities themselves - this is where graphs truly shine. Link prediction is one of the fundamental tasks in graph analytics, involving the prediction of connections (or links) between nodes (the entities represented in a graph). Imagine recommending new friends in a social network, predicting potential collaborations in an academic citation graph, or forecasting future interactions between users and products in an e-commerce setting - these are all examples of link prediction in action. This task helps expand networks, infer missing information, and detect anomalies. For applications ranging from enhancing user experience to improving fraud detection, link prediction is a key ingredient for success. To illustrate link prediction, we will use the Twitch Social Network dataset from the Stanford Network Analysis Project (SNAP). This dataset captures the social connections between users on the Twitch streaming platform, where nodes represent Twitch users, and edges represent friendships between them. The dataset is well-structured, making it easy to preprocess and work with. By following along, you'll learn how to set up a project, preprocess data, build a model, and evaluate it for link prediction on a real-world dataset. Graph Neural Networks and Deep Graph Library Working with graph-structured data presents unique challenges, and this is where Graph Neural Networks (GNNs) come in. GNNs are a type of neural network specifically designed to work with graph data. Unlike traditional neural networks that operate on fixed-size input, GNNs can handle arbitrary graph sizes and leverage the connectivity patterns within the data. By aggregating information from a node's neighbors, GNNs learn representations that capture both the node attributes and the structure of the graph, making them highly effective for tasks like node classification, link prediction, and graph classification. The Deep Graph Library (DGL.ai) is a powerful toolkit for building GNNs with ease and efficiency. With DGL, developers can leverage state-of-the-art GNN architectures to tackle a variety of tasks, including link prediction. DGL provides a range of utilities for working with both homogeneous and heterogeneous graphs, making it a versatile tool for researchers and practitioners alike. By simplifying the implementation of GNNs, DGL allows you to focus more on developing innovative solutions rather than getting bogged down in the underlying technical complexities. With this foundation in mind, let’s dive into building a link prediction model using GNNs and DGL.ai. Setting up the project The first step is to set up our project by importing the required libraries: import json import numpy as np import pandas as pd import dgl from dgl.data import DGLDataset from dgl.nn import SAGEConv import torch import torch.nn as nn from torch.nn.functional import binary_cross_entropy_with_logits, relu, dropout from torch.nn.utils import clip_grad_norm_ from torch.optim.lr_scheduler import ReduceLROnPlateau import itertools import scipy.sparse as sp from sklearn.metrics import roc_auc_score Data preprocessing and train-test split To prepare the data for training, we’ll first load the Twitch dataset, represent it as a graph, and then split it into training and testing sets. We’ll create a custom class that inherits from DGLDataset, which will help structure the data-loading process and streamline graph-related operations. Loading the Graph Data: In the process function, we start by reading the list of edges from the dataset. Since the edges in the original dataset are undirected (representing mutual friendships), we add reverse edges to ensure that connections are bidirectional in our graph. Loading Node Features: Next, we load node features from a JSON file. Each node has a list of features that may vary in length, so we pad shorter feature lists with zeros to maintain a consistent length across all nodes. Here’s the code to create and preprocess the dataset: # create a dataset that inherits DGLDataset class SocialNetworkDataset(DGLDataset): def __init__(self): super().__init__(name='social_network') def process(self): # load edges edges_df = pd.read_csv('./twitch/ENGB/musae_ENGB_edges.csv') # ensure edges are bidirectional edges_df_rev = edges_df.copy() edges_df_rev.columns = ['to', 'from'] edges_df_rev = edges_df_rev[['from', 'to']] edges_df = pd.concat([edges_df, edges_df_rev], ignore_index=True) edges_df.drop_duplicates(inplace=True) # create a graph using DGL max_node_id = max(edges_df['from'].max(), edges_df['to'].max()) edges_src = torch.from_numpy(edges_df['from'].to_numpy()) edges_dst = torch.from_numpy(edges_df['to'].to_numpy()) self.graph = dgl.graph( (edges_src, edges_dst), num_nodes=max_node_id + 1, ) # load and node features with open('./twitch/ENGB/musae_ENGB_features.json') as f: node_features_dict = json.load(f) # feature lists have various lengths, pad them with zeros max_feature_list_len = max([len(l) for l in node_features_dict.values()]) for k in node_features_dict: if len(node_features_dict[k]) < max_feature_list_len: node_features_dict[k] += [0] * (max_feature_list_len - len(node_features_dict[k])) # set node features in graph node_features_df = pd.DataFrame.from_dict(node_features_dict).T.astype('float64') node_features_np = node_features_df.to_numpy() self.graph.ndata['feat'] = torch.from_numpy(node_features_np).float() def __len__(self): return 1 # only the whole graph is returned def __getitem__(self, idx): return self.graph We now initialize our dataset to load the graph data. # init the dataset dataset = SocialNetworkDataset() g = dataset[0] The next step is to create the training and testing sets. We'll split the edges into an 80/20 ratio for training and testing. We generate both positive (existing edges) and negative samples (nonexistent edges) for both sets. For larger graphs, DGL's dgl.sampling utilities can be helpful, but here, the entire graph fits in memory. Here’s the code to create training and testing sets: # pick edges for train and test sets (80/20 split) # (for larger graphs, we can use dgl.sampling.negative etc) u, v = g.edges() edge_ids = np.random.permutation(g.num_edges()) test_set_size = int(len(edge_ids) * 0.2) train_set_size = len(edge_ids) - test_set_size # positive samples: existing edges test_positive_u, test_positive_v = u[edge_ids[:test_set_size]], v[edge_ids[:test_set_size]] train_positive_u, train_positive_v = u[edge_ids[test_set_size:]], v[edge_ids[test_set_size:]] # negative samples: nonexistent edges adj = sp.coo_matrix((np.ones(len(u)), (u.numpy(), v.numpy()))) adj_negative = 1 - adj.todense() - np.eye(g.num_nodes()) negative_u, negative_v = np.where(adj_negative != 0) negative_edge_ids = np.random.choice(len(negative_u), g.num_edges()) test_negative_u, test_negative_v = ( negative_u[negative_edge_ids[:test_set_size]], negative_v[negative_edge_ids[:test_set_size]], ) train_negative_u, train_negative_v = ( negative_u[negative_edge_ids[test_set_size:]], negative_v[negative_edge_ids[test_set_size:]], ) # create a training graph by copying the original graph and removing test edges train_g = dgl.remove_edges(g, edge_ids[:test_set_size]) # define positive and negative graphs for train and test sets train_positive_g = dgl.graph((train_positive_u, train_positive_v), num_nodes=g.num_nodes()) train_negative_g = dgl.graph((train_negative_u, train_negative_v), num_nodes=g.num_nodes()) test_positive_g = dgl.graph((test_positive_u, test_positive_v), num_nodes=g.num_nodes()) test_negative_g = dgl.graph((test_negative_u, test_negative_v), num_nodes=g.num_nodes()) Model: GraphSAGE Convolutional Layers and MLP Predictor We’ll use a Graph Sample and Aggregate (GraphSAGE) convolutional neural network to learn node representations, also known as embeddings, that capture both the structure and features of each node within the graph. GraphSAGE operates by aggregating feature information from each node's neighbors to create a meaningful representation for each node. This process, known as neighbor aggregation, enables the model to learn rich, localized patterns in the graph. In each GraphSAGE layer, the model applies an aggregation function (in this case, the "mean" function) to collect information from neighboring nodes, which is then combined with the node’s own features. Stacking multiple convolutional layers allows the model to capture information from increasingly distant nodes, effectively expanding each node’s view within the graph. To enhance model performance and reduce overfitting, we’ll apply dropout after each layer. Let’s now build the GraphSAGE model with three convolutional layers, along with a forward function to define how data flows through it: # define the GraphSAGE model with 3 convolutional layers class GraphSAGE(nn.Module): def __init__( self, input_features, hidden_features, output_features, dropout_probability=0.3, ): super(GraphSAGE, self).__init__() self.conv_layer_1 = SAGEConv(input_features, hidden_features, "mean") self.conv_layer_2 = SAGEConv(hidden_features, hidden_features, "mean") self.conv_layer_3 = SAGEConv(hidden_features, output_features, "mean") self.dropout_probability = dropout_probability def forward(self, graph, input_features): # first layer with ReLU activation and dropout h = relu(self.conv_layer_1(graph, input_features)) h = dropout(h, p=self.dropout_probability) # second layer with ReLU activation and dropout h = relu(self.conv_layer_2(graph, h)) h = dropout(h, p=self.dropout_probability) # third layer without dropout h = self.conv_layer_3(graph, h) return h The output after the third layer (h) contains the node embeddings. To predict the likelihood of an edge (or link) between any two nodes, we’ll use a Multi-Layer Perceptron (MLP) predictor. This MLP takes the embeddings of two nodes as input and computes a score indicating the probability of an edge existing between them. # define the MLP predictor class MLPPredictor(nn.Module): def __init__(self, hidden_features): super().__init__() # first linear layer to combine node embeddings self.W1 = nn.Linear(hidden_features * 2, hidden_features) # second linear layer to produce a single score output self.W2 = nn.Linear(hidden_features, 1) def apply_edges(self, edges): # concatenate source and destination node embeddings h = torch.cat([edges.src["h"], edges.dst["h"]], dim=1) # pass through MLP layers to get the edge score score = self.W2(relu(self.W1(h))).squeeze(1) return {'score': score} def forward(self, g, h): with g.local_scope(): g.ndata["h"] = h g.apply_edges(self.apply_edges) return g.edata["score"] The MLP predictor works as follows: Input Size: The predictor’s input size is based on the learned node embeddings. Since each edge connects two nodes, we concatenate their embeddings (each of size hidden_features), resulting in an input size of hidden_features * 2. Layer 1 (W1): This layer processes the concatenated embeddings and reduces the output dimension back to hidden_features, preserving important relational information between the nodes. Layer 2 (W2): This final layer produces a single scalar score for each pair of nodes, which represents the likelihood of an edge existing between them. This layered approach allows the predictor to capture complex relationships between pairs of nodes and compute an edge score that can be interpreted as the probability of an edge's existence. Loss function and AUC To train our model effectively, we need a loss function that can quantify the model’s performance on link prediction. Since this task is a binary classification problem - where each link either exists or doesn’t - we use binary cross-entropy (BCE) as our loss function. Binary cross-entropy measures the discrepancy between the model’s predicted scores and the actual labels (1 for an existing link, 0 for no link). We use the _with_logits version because our model outputs raw scores (logits) rather than probabilities. This version of BCE is more stable when working with logits, as it combines the sigmoid function and cross-entropy into one step. Here’s the code that computes loss: def compute_loss(positive_logits, negative_logits): # concatenate positive and negative scores y_predicted = torch.cat([positive_logits, negative_logits]) # create true labels (1 for existing links, 0 for nonexistent) y_true = torch.cat([torch.ones(positive_logits.shape[0]), torch.zeros(negative_logits.shape[0])]) return binary_cross_entropy_with_logits(y_predicted, y_true) To evaluate the model, we use the Area Under the ROC Curve (AUC) metric. AUC is well-suited for link prediction because it handles imbalanced data effectively, where negative samples (nonexistent edges) are much more common than positive samples. The AUC score gives us a sense of how well the model ranks existing links higher than nonexistent ones. Here’s the code to compute AUC: def compute_auc(positive_logits, negative_logits): y_predicted = torch.cat([positive_logits, negative_logits]).detach().numpy() y_true = torch.cat([torch.ones(positive_logits.shape[0]), torch.zeros(negative_logits.shape[0])]).detach().numpy() return roc_auc_score(y_true, y_predicted) Note: We use detach() to remove the tensors from the computation graph, allowing us to calculate the AUC without affecting gradients. Training the model Now we're ready to train the model. To start, we’ll instantiate the model, the predictor, and the optimizer, and define the training loop. We’ll also specify the learning rate, hidden layer size, and dropout rate, among other hyperparameters. While we won’t cover hyperparameter optimization here, we’ll use a learning rate scheduler to adjust the learning rate if the loss plateaus—meaning it stops decreasing for a set number of epochs (in this case, 25). The scheduler halves the learning rate when this happens, which can help the model converge more effectively. Here’s the code to initialize the model and set up training: # init the model num_hidden_features = 32 model = GraphSAGE( train_g.ndata['feat'].shape[1], num_hidden_features, num_hidden_features, ) predictor = MLPPredictor(num_hidden_features) # create an optimizer and a learning rate scheduler learning_rate = 0.01 optimizer = torch.optim.Adam( itertools.chain(model.parameters(), predictor.parameters()), lr=learning_rate, ) lr_scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=25) # train the model num_epochs = 1000 for epoch in range(num_epochs + 1): # forward h = model(train_g, train_g.ndata['feat']) positive_logits = predictor(train_positive_g, h) negative_logits = predictor(train_negative_g, h) loss = compute_loss(positive_logits, negative_logits) # backward optimizer.zero_grad() loss.backward() # clip gradients clip_grad_norm_(model.parameters(), 1.0) optimizer.step() # adjust learning rate based on the loss lr_scheduler.step(loss) # print loss, current learning rate, and AUC if epoch % 100 == 0: last_lr = lr_scheduler.get_last_lr()[0] train_auc = compute_auc(positive_logits, negative_logits) print(f'Epoch: {epoch}, learning rate: {last_lr}, loss: {loss}, AUC: {train_auc}') Let's run the code and examine the output: Epoch: 0, learning rate: 0.01, loss: 262.4156188964844, AUC: 0.4934097124994463 Epoch: 100, learning rate: 0.01, loss: 0.5642552375793457, AUC: 0.7473735298706314 Epoch: 200, learning rate: 0.01, loss: 0.4622882306575775, AUC: 0.8431058751115716 Epoch: 300, learning rate: 0.01, loss: 0.40566185116767883, AUC: 0.8777374138645864 Epoch: 400, learning rate: 0.01, loss: 0.38118976354599, AUC: 0.8944719038039551 Epoch: 500, learning rate: 0.01, loss: 0.3690297603607178, AUC: 0.9039401673234729 Epoch: 600, learning rate: 0.005, loss: 0.3579995930194855, AUC: 0.9112366798940639 Epoch: 700, learning rate: 0.005, loss: 0.3557407557964325, AUC: 0.9128097572016495 Epoch: 800, learning rate: 0.005, loss: 0.3510144352912903, AUC: 0.9152937255697913 Epoch: 900, learning rate: 0.00125, loss: 0.3425179123878479, AUC: 0.9202487786553115 Epoch: 1000, learning rate: 0.00015625, loss: 0.3432360589504242, AUC: 0.9198250134354529 As we can see, the model achieves an AUC of around 0.92, indicating strong predictive performance. Notice that the learning rate is reduced between epochs 500 and 600 when the loss stabilizes. This adjustment allows for finer tuning, leading to a small decrease in loss. After a certain point, the loss and AUC stabilize, indicating the model has converged. Let's evaluate the model on the test data (which was not used during training) and see if it generalizes well: # evaluate the model on the test data with torch.no_grad(): test_positive_scores = predictor(test_positive_g, h) test_negative_scores = predictor(test_negative_g, h) test_loss = compute_loss(test_positive_scores, test_negative_scores) test_auc = compute_auc(test_positive_scores, test_negative_scores) print(f'Test loss: {test_loss}, Test AUC: {test_auc}') The result is: Test loss: 0.675215482711792, Test AUC: 0.866213400711374 The test AUC is slightly lower than the training AUC, indicating minor overfitting. However, the AUC of 0.866 shows that the model still performs well on unseen data. Further tuning of hyperparameters could improve generalization, especially if overfitting is a concern. Using the trained model to predict new links With our trained model, we can now predict the likelihood of links between nodes in the graph. We’ll generate predictions for all possible pairs of nodes, allowing us to identify potential new connections. Generating Candidate Node Pairs: We first generate all possible pairs of nodes, excluding self-loops (connections from a node to itself). This gives us the candidate node pairs for which we want to predict link probabilities. Creating a Candidate Graph: We then create a DGL graph with these candidate edges, and use the model with node embeddings from the original graph. These embeddings, stored in the candidate graph’s ndata['h'] attribute, will serve as inputs for link prediction. Here’s the code for these steps: # build node pairs, avoid self-loops (with_replacement=False) node_pairs = torch.combinations(torch.arange(g.num_nodes()), r=2, with_replacement=False) candidate_u = node_pairs[:, 0] candidate_v = node_pairs[:, 1] # build a graph with all node pairs candidate_graph = dgl.graph((candidate_u, candidate_v)) candidate_graph_node_embeddings = model(g, g.ndata['feat']) # we use embeddings from the original graph candidate_graph.ndata['h'] = candidate_graph_node_embeddings # use the predictor to predict the existence of links between nodes predicted_scores = predictor(candidate_graph, candidate_graph_node_embeddings) Now that we have predictions for all candidate pairs, we can check the link probability between any specific nodes. For example, let’s examine the score and probability of a link between nodes 1773 and 7005, which are not directly connected in the initial dataset: # find the index of the node pair (1773, 7005) pair_index = torch.where((candidate_u == 1773) & (candidate_v == 7005))[0] print(f'Pair index: {pair_index}') # get the logit score for this pair and compute probability of link existence pair_link_score = predicted_scores[pair_index].item() # logit score print(f'Pair link score: {pair_link_score}') link_probability = torch.sigmoid(torch.tensor(pair_link_score)).item() # apply sigmoid to convert score into probability print(f'Link probability: {link_probability * 100}%') This is the result: Pair index: tensor([11066978]) Pair link score: 0.7675977945327759 Link probability: 68.30010414123535% According to our model, there is a 68.3% probability of a link between user 1773 and 7005. Conclusion In this post, we successfully built a model for predicting new links in a social graph, demonstrating the use of Graph Neural Networks and DGL for link prediction. Using a relatively small dataset allowed us to work efficiently on a local machine. However, as graphs scale up to millions or billions of nodes and edges, handling them requires more advanced solutions, such as distributed training on GPU clusters. As a next step, we’ll explore approaches for handling large-scale graphs and implementing link prediction in cloud environments, enabling you to apply these techniques to production-level datasets. Graphs are a powerful way to represent data, uniquely capturing relationships between entities across a wide variety of applications. Whether you're modeling social networks, protein interactions, transportation systems, or recommendation engines, graphs naturally represent and analyze these complex interdependencies. In today’s data-driven world, understanding relationships between entities is often just as important as understanding the entities themselves - this is where graphs truly shine. Link prediction is one of the fundamental tasks in graph analytics, involving the prediction of connections (or links) between nodes (the entities represented in a graph). Imagine recommending new friends in a social network, predicting potential collaborations in an academic citation graph, or forecasting future interactions between users and products in an e-commerce setting - these are all examples of link prediction in action. This task helps expand networks, infer missing information, and detect anomalies. For applications ranging from enhancing user experience to improving fraud detection, link prediction is a key ingredient for success. Link prediction To illustrate link prediction, we will use the Twitch Social Network dataset from the Stanford Network Analysis Project (SNAP). This dataset captures the social connections between users on the Twitch streaming platform, where nodes represent Twitch users, and edges represent friendships between them. The dataset is well-structured, making it easy to preprocess and work with. Twitch Social Network dataset By following along, you'll learn how to set up a project, preprocess data, build a model, and evaluate it for link prediction on a real-world dataset. Graph Neural Networks and Deep Graph Library Working with graph-structured data presents unique challenges, and this is where Graph Neural Networks (GNNs) come in. GNNs are a type of neural network specifically designed to work with graph data. Unlike traditional neural networks that operate on fixed-size input, GNNs can handle arbitrary graph sizes and leverage the connectivity patterns within the data. By aggregating information from a node's neighbors, GNNs learn representations that capture both the node attributes and the structure of the graph, making them highly effective for tasks like node classification, link prediction, and graph classification. Graph Neural Networks (GNNs) The Deep Graph Library (DGL.ai) is a powerful toolkit for building GNNs with ease and efficiency. With DGL, developers can leverage state-of-the-art GNN architectures to tackle a variety of tasks, including link prediction. DGL provides a range of utilities for working with both homogeneous and heterogeneous graphs, making it a versatile tool for researchers and practitioners alike. By simplifying the implementation of GNNs, DGL allows you to focus more on developing innovative solutions rather than getting bogged down in the underlying technical complexities. Deep Graph Library ( DGL.ai ) DGL.ai With this foundation in mind, let’s dive into building a link prediction model using GNNs and DGL.ai . DGL.ai Setting up the project The first step is to set up our project by importing the required libraries: import json import numpy as np import pandas as pd import dgl from dgl.data import DGLDataset from dgl.nn import SAGEConv import torch import torch.nn as nn from torch.nn.functional import binary_cross_entropy_with_logits, relu, dropout from torch.nn.utils import clip_grad_norm_ from torch.optim.lr_scheduler import ReduceLROnPlateau import itertools import scipy.sparse as sp from sklearn.metrics import roc_auc_score import json import numpy as np import pandas as pd import dgl from dgl.data import DGLDataset from dgl.nn import SAGEConv import torch import torch.nn as nn from torch.nn.functional import binary_cross_entropy_with_logits, relu, dropout from torch.nn.utils import clip_grad_norm_ from torch.optim.lr_scheduler import ReduceLROnPlateau import itertools import scipy.sparse as sp from sklearn.metrics import roc_auc_score Data preprocessing and train-test split To prepare the data for training, we’ll first load the Twitch dataset, represent it as a graph, and then split it into training and testing sets. We’ll create a custom class that inherits from DGLDataset , which will help structure the data-loading process and streamline graph-related operations. DGLDataset Loading the Graph Data: In the process function, we start by reading the list of edges from the dataset. Since the edges in the original dataset are undirected (representing mutual friendships), we add reverse edges to ensure that connections are bidirectional in our graph. Loading Node Features: Next, we load node features from a JSON file. Each node has a list of features that may vary in length, so we pad shorter feature lists with zeros to maintain a consistent length across all nodes. Loading the Graph Data : In the process function, we start by reading the list of edges from the dataset. Since the edges in the original dataset are undirected (representing mutual friendships), we add reverse edges to ensure that connections are bidirectional in our graph. Loading the Graph Data Loading Node Features : Next, we load node features from a JSON file. Each node has a list of features that may vary in length, so we pad shorter feature lists with zeros to maintain a consistent length across all nodes. Loading Node Features Here’s the code to create and preprocess the dataset: # create a dataset that inherits DGLDataset class SocialNetworkDataset(DGLDataset): def __init__(self): super().__init__(name='social_network') def process(self): # load edges edges_df = pd.read_csv('./twitch/ENGB/musae_ENGB_edges.csv') # ensure edges are bidirectional edges_df_rev = edges_df.copy() edges_df_rev.columns = ['to', 'from'] edges_df_rev = edges_df_rev[['from', 'to']] edges_df = pd.concat([edges_df, edges_df_rev], ignore_index=True) edges_df.drop_duplicates(inplace=True) # create a graph using DGL max_node_id = max(edges_df['from'].max(), edges_df['to'].max()) edges_src = torch.from_numpy(edges_df['from'].to_numpy()) edges_dst = torch.from_numpy(edges_df['to'].to_numpy()) self.graph = dgl.graph( (edges_src, edges_dst), num_nodes=max_node_id + 1, ) # load and node features with open('./twitch/ENGB/musae_ENGB_features.json') as f: node_features_dict = json.load(f) # feature lists have various lengths, pad them with zeros max_feature_list_len = max([len(l) for l in node_features_dict.values()]) for k in node_features_dict: if len(node_features_dict[k]) < max_feature_list_len: node_features_dict[k] += [0] * (max_feature_list_len - len(node_features_dict[k])) # set node features in graph node_features_df = pd.DataFrame.from_dict(node_features_dict).T.astype('float64') node_features_np = node_features_df.to_numpy() self.graph.ndata['feat'] = torch.from_numpy(node_features_np).float() def __len__(self): return 1 # only the whole graph is returned def __getitem__(self, idx): return self.graph # create a dataset that inherits DGLDataset class SocialNetworkDataset(DGLDataset): def __init__(self): super().__init__(name='social_network') def process(self): # load edges edges_df = pd.read_csv('./twitch/ENGB/musae_ENGB_edges.csv') # ensure edges are bidirectional edges_df_rev = edges_df.copy() edges_df_rev.columns = ['to', 'from'] edges_df_rev = edges_df_rev[['from', 'to']] edges_df = pd.concat([edges_df, edges_df_rev], ignore_index=True) edges_df.drop_duplicates(inplace=True) # create a graph using DGL max_node_id = max(edges_df['from'].max(), edges_df['to'].max()) edges_src = torch.from_numpy(edges_df['from'].to_numpy()) edges_dst = torch.from_numpy(edges_df['to'].to_numpy()) self.graph = dgl.graph( (edges_src, edges_dst), num_nodes=max_node_id + 1, ) # load and node features with open('./twitch/ENGB/musae_ENGB_features.json') as f: node_features_dict = json.load(f) # feature lists have various lengths, pad them with zeros max_feature_list_len = max([len(l) for l in node_features_dict.values()]) for k in node_features_dict: if len(node_features_dict[k]) < max_feature_list_len: node_features_dict[k] += [0] * (max_feature_list_len - len(node_features_dict[k])) # set node features in graph node_features_df = pd.DataFrame.from_dict(node_features_dict).T.astype('float64') node_features_np = node_features_df.to_numpy() self.graph.ndata['feat'] = torch.from_numpy(node_features_np).float() def __len__(self): return 1 # only the whole graph is returned def __getitem__(self, idx): return self.graph We now initialize our dataset to load the graph data. # init the dataset dataset = SocialNetworkDataset() g = dataset[0] # init the dataset dataset = SocialNetworkDataset() g = dataset[0] The next step is to create the training and testing sets . We'll split the edges into an 80/20 ratio for training and testing. We generate both positive (existing edges) and negative samples (nonexistent edges) for both sets. For larger graphs, DGL's dgl.sampling utilities can be helpful, but here, the entire graph fits in memory. Here’s the code to create training and testing sets: create the training and testing sets 80/20 ratio positive (existing edges) (nonexistent edges) dgl.sampling # pick edges for train and test sets (80/20 split) # (for larger graphs, we can use dgl.sampling.negative etc) u, v = g.edges() edge_ids = np.random.permutation(g.num_edges()) test_set_size = int(len(edge_ids) * 0.2) train_set_size = len(edge_ids) - test_set_size # positive samples: existing edges test_positive_u, test_positive_v = u[edge_ids[:test_set_size]], v[edge_ids[:test_set_size]] train_positive_u, train_positive_v = u[edge_ids[test_set_size:]], v[edge_ids[test_set_size:]] # negative samples: nonexistent edges adj = sp.coo_matrix((np.ones(len(u)), (u.numpy(), v.numpy()))) adj_negative = 1 - adj.todense() - np.eye(g.num_nodes()) negative_u, negative_v = np.where(adj_negative != 0) negative_edge_ids = np.random.choice(len(negative_u), g.num_edges()) test_negative_u, test_negative_v = ( negative_u[negative_edge_ids[:test_set_size]], negative_v[negative_edge_ids[:test_set_size]], ) train_negative_u, train_negative_v = ( negative_u[negative_edge_ids[test_set_size:]], negative_v[negative_edge_ids[test_set_size:]], ) # create a training graph by copying the original graph and removing test edges train_g = dgl.remove_edges(g, edge_ids[:test_set_size]) # define positive and negative graphs for train and test sets train_positive_g = dgl.graph((train_positive_u, train_positive_v), num_nodes=g.num_nodes()) train_negative_g = dgl.graph((train_negative_u, train_negative_v), num_nodes=g.num_nodes()) test_positive_g = dgl.graph((test_positive_u, test_positive_v), num_nodes=g.num_nodes()) test_negative_g = dgl.graph((test_negative_u, test_negative_v), num_nodes=g.num_nodes()) # pick edges for train and test sets (80/20 split) # (for larger graphs, we can use dgl.sampling.negative etc) u, v = g.edges() edge_ids = np.random.permutation(g.num_edges()) test_set_size = int(len(edge_ids) * 0.2) train_set_size = len(edge_ids) - test_set_size # positive samples: existing edges test_positive_u, test_positive_v = u[edge_ids[:test_set_size]], v[edge_ids[:test_set_size]] train_positive_u, train_positive_v = u[edge_ids[test_set_size:]], v[edge_ids[test_set_size:]] # negative samples: nonexistent edges adj = sp.coo_matrix((np.ones(len(u)), (u.numpy(), v.numpy()))) adj_negative = 1 - adj.todense() - np.eye(g.num_nodes()) negative_u, negative_v = np.where(adj_negative != 0) negative_edge_ids = np.random.choice(len(negative_u), g.num_edges()) test_negative_u, test_negative_v = ( negative_u[negative_edge_ids[:test_set_size]], negative_v[negative_edge_ids[:test_set_size]], ) train_negative_u, train_negative_v = ( negative_u[negative_edge_ids[test_set_size:]], negative_v[negative_edge_ids[test_set_size:]], ) # create a training graph by copying the original graph and removing test edges train_g = dgl.remove_edges(g, edge_ids[:test_set_size]) # define positive and negative graphs for train and test sets train_positive_g = dgl.graph((train_positive_u, train_positive_v), num_nodes=g.num_nodes()) train_negative_g = dgl.graph((train_negative_u, train_negative_v), num_nodes=g.num_nodes()) test_positive_g = dgl.graph((test_positive_u, test_positive_v), num_nodes=g.num_nodes()) test_negative_g = dgl.graph((test_negative_u, test_negative_v), num_nodes=g.num_nodes()) Model: GraphSAGE Convolutional Layers and MLP Predictor We’ll use a Graph Sample and Aggregate (GraphSAGE) convolutional neural network to learn node representations, also known as embeddings , that capture both the structure and features of each node within the graph. GraphSAGE operates by aggregating feature information from each node's neighbors to create a meaningful representation for each node. This process, known as neighbor aggregation , enables the model to learn rich, localized patterns in the graph. Graph Sample and Aggregate (GraphSAGE) embeddings neighbor aggregation In each GraphSAGE layer, the model applies an aggregation function (in this case, the "mean" function) to collect information from neighboring nodes, which is then combined with the node’s own features. Stacking multiple convolutional layers allows the model to capture information from increasingly distant nodes , effectively expanding each node’s view within the graph. aggregation function (in this case, the "mean" function) multiple convolutional layers capture information from increasingly distant nodes To enhance model performance and reduce overfitting, we’ll apply dropout after each layer. dropout Let’s now build the GraphSAGE model with three convolutional layers , along with a forward function to define how data flows through it: three convolutional layers forward # define the GraphSAGE model with 3 convolutional layers class GraphSAGE(nn.Module): def __init__( self, input_features, hidden_features, output_features, dropout_probability=0.3, ): super(GraphSAGE, self).__init__() self.conv_layer_1 = SAGEConv(input_features, hidden_features, "mean") self.conv_layer_2 = SAGEConv(hidden_features, hidden_features, "mean") self.conv_layer_3 = SAGEConv(hidden_features, output_features, "mean") self.dropout_probability = dropout_probability def forward(self, graph, input_features): # first layer with ReLU activation and dropout h = relu(self.conv_layer_1(graph, input_features)) h = dropout(h, p=self.dropout_probability) # second layer with ReLU activation and dropout h = relu(self.conv_layer_2(graph, h)) h = dropout(h, p=self.dropout_probability) # third layer without dropout h = self.conv_layer_3(graph, h) return h # define the GraphSAGE model with 3 convolutional layers class GraphSAGE(nn.Module): def __init__( self, input_features, hidden_features, output_features, dropout_probability=0.3, ): super(GraphSAGE, self).__init__() self.conv_layer_1 = SAGEConv(input_features, hidden_features, "mean") self.conv_layer_2 = SAGEConv(hidden_features, hidden_features, "mean") self.conv_layer_3 = SAGEConv(hidden_features, output_features, "mean") self.dropout_probability = dropout_probability def forward(self, graph, input_features): # first layer with ReLU activation and dropout h = relu(self.conv_layer_1(graph, input_features)) h = dropout(h, p=self.dropout_probability) # second layer with ReLU activation and dropout h = relu(self.conv_layer_2(graph, h)) h = dropout(h, p=self.dropout_probability) # third layer without dropout h = self.conv_layer_3(graph, h) return h The output after the third layer ( h ) contains the node embeddings. To predict the likelihood of an edge (or link) between any two nodes, we’ll use a Multi-Layer Perceptron (MLP) predictor . This MLP takes the embeddings of two nodes as input and computes a score indicating the probability of an edge existing between them. h Multi-Layer Perceptron (MLP) predictor MLP takes the embeddings of two nodes as input and computes a score # define the MLP predictor class MLPPredictor(nn.Module): def __init__(self, hidden_features): super().__init__() # first linear layer to combine node embeddings self.W1 = nn.Linear(hidden_features * 2, hidden_features) # second linear layer to produce a single score output self.W2 = nn.Linear(hidden_features, 1) def apply_edges(self, edges): # concatenate source and destination node embeddings h = torch.cat([edges.src["h"], edges.dst["h"]], dim=1) # pass through MLP layers to get the edge score score = self.W2(relu(self.W1(h))).squeeze(1) return {'score': score} def forward(self, g, h): with g.local_scope(): g.ndata["h"] = h g.apply_edges(self.apply_edges) return g.edata["score"] # define the MLP predictor class MLPPredictor(nn.Module): def __init__(self, hidden_features): super().__init__() # first linear layer to combine node embeddings self.W1 = nn.Linear(hidden_features * 2, hidden_features) # second linear layer to produce a single score output self.W2 = nn.Linear(hidden_features, 1) def apply_edges(self, edges): # concatenate source and destination node embeddings h = torch.cat([edges.src["h"], edges.dst["h"]], dim=1) # pass through MLP layers to get the edge score score = self.W2(relu(self.W1(h))).squeeze(1) return {'score': score} def forward(self, g, h): with g.local_scope(): g.ndata["h"] = h g.apply_edges(self.apply_edges) return g.edata["score"] The MLP predictor works as follows: MLP predictor Input Size: The predictor’s input size is based on the learned node embeddings. Since each edge connects two nodes, we concatenate their embeddings (each of size hidden_features), resulting in an input size of hidden_features * 2. Layer 1 (W1): This layer processes the concatenated embeddings and reduces the output dimension back to hidden_features, preserving important relational information between the nodes. Layer 2 (W2): This final layer produces a single scalar score for each pair of nodes, which represents the likelihood of an edge existing between them. Input Size: The predictor’s input size is based on the learned node embeddings. Since each edge connects two nodes, we concatenate their embeddings (each of size hidden_features), resulting in an input size of hidden_features * 2. Input Size: we concatenate their embeddings Layer 1 (W1): This layer processes the concatenated embeddings and reduces the output dimension back to hidden_features, preserving important relational information between the nodes. Layer 1 (W1): reduces the output dimension Layer 2 (W2): This final layer produces a single scalar score for each pair of nodes, which represents the likelihood of an edge existing between them. Layer 2 (W2): scalar score likelihood of an edge This layered approach allows the predictor to capture complex relationships between pairs of nodes and compute an edge score that can be interpreted as the probability of an edge's existence. Loss function and AUC To train our model effectively, we need a loss function that can quantify the model’s performance on link prediction. Since this task is a binary classification problem - where each link either exists or doesn’t - we use binary cross-entropy (BCE) as our loss function. Binary cross-entropy measures the discrepancy between the model’s predicted scores and the actual labels (1 for an existing link, 0 for no link). We use the _with_logits version because our model outputs raw scores (logits) rather than probabilities. This version of BCE is more stable when working with logits, as it combines the sigmoid function and cross-entropy into one step. quantify the model’s performance binary classification problem binary cross-entropy (BCE) _with_logits Here’s the code that computes loss: def compute_loss(positive_logits, negative_logits): # concatenate positive and negative scores y_predicted = torch.cat([positive_logits, negative_logits]) # create true labels (1 for existing links, 0 for nonexistent) y_true = torch.cat([torch.ones(positive_logits.shape[0]), torch.zeros(negative_logits.shape[0])]) return binary_cross_entropy_with_logits(y_predicted, y_true) def compute_loss(positive_logits, negative_logits): # concatenate positive and negative scores y_predicted = torch.cat([positive_logits, negative_logits]) # create true labels (1 for existing links, 0 for nonexistent) y_true = torch.cat([torch.ones(positive_logits.shape[0]), torch.zeros(negative_logits.shape[0])]) return binary_cross_entropy_with_logits(y_predicted, y_true) To evaluate the model, we use the Area Under the ROC Curve (AUC) metric. AUC is well-suited for link prediction because it handles imbalanced data effectively, where negative samples (nonexistent edges) are much more common than positive samples. The AUC score gives us a sense of how well the model ranks existing links higher than nonexistent ones. Area Under the ROC Curve (AUC) handles imbalanced data Here’s the code to compute AUC: def compute_auc(positive_logits, negative_logits): y_predicted = torch.cat([positive_logits, negative_logits]).detach().numpy() y_true = torch.cat([torch.ones(positive_logits.shape[0]), torch.zeros(negative_logits.shape[0])]).detach().numpy() return roc_auc_score(y_true, y_predicted) def compute_auc(positive_logits, negative_logits): y_predicted = torch.cat([positive_logits, negative_logits]).detach().numpy() y_true = torch.cat([torch.ones(positive_logits.shape[0]), torch.zeros(negative_logits.shape[0])]).detach().numpy() return roc_auc_score(y_true, y_predicted) Note: We use detach() to remove the tensors from the computation graph, allowing us to calculate the AUC without affecting gradients. detach() Training the model Now we're ready to train the model. To start, we’ll instantiate the model, the predictor, and the optimizer, and define the training loop . We’ll also specify the learning rate, hidden layer size, and dropout rate, among other hyperparameters. While we won’t cover hyperparameter optimization here, we’ll use a learning rate scheduler to adjust the learning rate if the loss plateaus—meaning it stops decreasing for a set number of epochs (in this case, 25). The scheduler halves the learning rate when this happens, which can help the model converge more effectively. instantiate the model, the predictor, and the optimizer, and define the training loop we won’t cover hyperparameter optimization we’ll use a learning rate scheduler Here’s the code to initialize the model and set up training: # init the model num_hidden_features = 32 model = GraphSAGE( train_g.ndata['feat'].shape[1], num_hidden_features, num_hidden_features, ) predictor = MLPPredictor(num_hidden_features) # create an optimizer and a learning rate scheduler learning_rate = 0.01 optimizer = torch.optim.Adam( itertools.chain(model.parameters(), predictor.parameters()), lr=learning_rate, ) lr_scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=25) # train the model num_epochs = 1000 for epoch in range(num_epochs + 1): # forward h = model(train_g, train_g.ndata['feat']) positive_logits = predictor(train_positive_g, h) negative_logits = predictor(train_negative_g, h) loss = compute_loss(positive_logits, negative_logits) # backward optimizer.zero_grad() loss.backward() # clip gradients clip_grad_norm_(model.parameters(), 1.0) optimizer.step() # adjust learning rate based on the loss lr_scheduler.step(loss) # print loss, current learning rate, and AUC if epoch % 100 == 0: last_lr = lr_scheduler.get_last_lr()[0] train_auc = compute_auc(positive_logits, negative_logits) print(f'Epoch: {epoch}, learning rate: {last_lr}, loss: {loss}, AUC: {train_auc}') # init the model num_hidden_features = 32 model = GraphSAGE( train_g.ndata['feat'].shape[1], num_hidden_features, num_hidden_features, ) predictor = MLPPredictor(num_hidden_features) # create an optimizer and a learning rate scheduler learning_rate = 0.01 optimizer = torch.optim.Adam( itertools.chain(model.parameters(), predictor.parameters()), lr=learning_rate, ) lr_scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=25) # train the model num_epochs = 1000 for epoch in range(num_epochs + 1): # forward h = model(train_g, train_g.ndata['feat']) positive_logits = predictor(train_positive_g, h) negative_logits = predictor(train_negative_g, h) loss = compute_loss(positive_logits, negative_logits) # backward optimizer.zero_grad() loss.backward() # clip gradients clip_grad_norm_(model.parameters(), 1.0) optimizer.step() # adjust learning rate based on the loss lr_scheduler.step(loss) # print loss, current learning rate, and AUC if epoch % 100 == 0: last_lr = lr_scheduler.get_last_lr()[0] train_auc = compute_auc(positive_logits, negative_logits) print(f'Epoch: {epoch}, learning rate: {last_lr}, loss: {loss}, AUC: {train_auc}') Let's run the code and examine the output: Epoch: 0, learning rate: 0.01, loss: 262.4156188964844, AUC: 0.4934097124994463 Epoch: 100, learning rate: 0.01, loss: 0.5642552375793457, AUC: 0.7473735298706314 Epoch: 200, learning rate: 0.01, loss: 0.4622882306575775, AUC: 0.8431058751115716 Epoch: 300, learning rate: 0.01, loss: 0.40566185116767883, AUC: 0.8777374138645864 Epoch: 400, learning rate: 0.01, loss: 0.38118976354599, AUC: 0.8944719038039551 Epoch: 500, learning rate: 0.01, loss: 0.3690297603607178, AUC: 0.9039401673234729 Epoch: 600, learning rate: 0.005, loss: 0.3579995930194855, AUC: 0.9112366798940639 Epoch: 700, learning rate: 0.005, loss: 0.3557407557964325, AUC: 0.9128097572016495 Epoch: 800, learning rate: 0.005, loss: 0.3510144352912903, AUC: 0.9152937255697913 Epoch: 900, learning rate: 0.00125, loss: 0.3425179123878479, AUC: 0.9202487786553115 Epoch: 1000, learning rate: 0.00015625, loss: 0.3432360589504242, AUC: 0.9198250134354529 Epoch: 0, learning rate: 0.01, loss: 262.4156188964844, AUC: 0.4934097124994463 Epoch: 100, learning rate: 0.01, loss: 0.5642552375793457, AUC: 0.7473735298706314 Epoch: 200, learning rate: 0.01, loss: 0.4622882306575775, AUC: 0.8431058751115716 Epoch: 300, learning rate: 0.01, loss: 0.40566185116767883, AUC: 0.8777374138645864 Epoch: 400, learning rate: 0.01, loss: 0.38118976354599, AUC: 0.8944719038039551 Epoch: 500, learning rate: 0.01, loss: 0.3690297603607178, AUC: 0.9039401673234729 Epoch: 600, learning rate: 0.005, loss: 0.3579995930194855, AUC: 0.9112366798940639 Epoch: 700, learning rate: 0.005, loss: 0.3557407557964325, AUC: 0.9128097572016495 Epoch: 800, learning rate: 0.005, loss: 0.3510144352912903, AUC: 0.9152937255697913 Epoch: 900, learning rate: 0.00125, loss: 0.3425179123878479, AUC: 0.9202487786553115 Epoch: 1000, learning rate: 0.00015625, loss: 0.3432360589504242, AUC: 0.9198250134354529 As we can see, the model achieves an AUC of around 0.92, indicating strong predictive performance . Notice that the learning rate is reduced between epochs 500 and 600 when the loss stabilizes. This adjustment allows for finer tuning, leading to a small decrease in loss. After a certain point, the loss and AUC stabilize, indicating the model has converged . AUC of around 0.92, indicating strong predictive performance converged Let's evaluate the model on the test data (which was not used during training) and see if it generalizes well: evaluate the model on the test data # evaluate the model on the test data with torch.no_grad(): test_positive_scores = predictor(test_positive_g, h) test_negative_scores = predictor(test_negative_g, h) test_loss = compute_loss(test_positive_scores, test_negative_scores) test_auc = compute_auc(test_positive_scores, test_negative_scores) print(f'Test loss: {test_loss}, Test AUC: {test_auc}') # evaluate the model on the test data with torch.no_grad(): test_positive_scores = predictor(test_positive_g, h) test_negative_scores = predictor(test_negative_g, h) test_loss = compute_loss(test_positive_scores, test_negative_scores) test_auc = compute_auc(test_positive_scores, test_negative_scores) print(f'Test loss: {test_loss}, Test AUC: {test_auc}') The result is: Test loss: 0.675215482711792, Test AUC: 0.866213400711374 Test loss: 0.675215482711792, Test AUC: 0.866213400711374 The test AUC is slightly lower than the training AUC, indicating minor overfitting. However, the AUC of 0.866 shows that the model still performs well on unseen data . Further tuning of hyperparameters could improve generalization, especially if overfitting is a concern. The test AUC is slightly lower than the training AUC, indicating minor overfitting. However, the AUC of 0.866 shows that the model still performs well on unseen data Using the trained model to predict new links With our trained model, we can now predict the likelihood of links between nodes in the graph. We’ll generate predictions for all possible pairs of nodes , allowing us to identify potential new connections. we can now predict the likelihood of links predictions for all possible pairs of nodes Generating Candidate Node Pairs: We first generate all possible pairs of nodes, excluding self-loops (connections from a node to itself). This gives us the candidate node pairs for which we want to predict link probabilities. Creating a Candidate Graph: We then create a DGL graph with these candidate edges, and use the model with node embeddings from the original graph. These embeddings, stored in the candidate graph’s ndata['h'] attribute, will serve as inputs for link prediction. Generating Candidate Node Pairs : We first generate all possible pairs of nodes, excluding self-loops (connections from a node to itself). This gives us the candidate node pairs for which we want to predict link probabilities. Generating Candidate Node Pairs Creating a Candidate Graph: We then create a DGL graph with these candidate edges, and use the model with node embeddings from the original graph. These embeddings, stored in the candidate graph’s ndata['h'] attribute, will serve as inputs for link prediction. Creating a Candidate Graph: ndata['h'] Here’s the code for these steps: # build node pairs, avoid self-loops (with_replacement=False) node_pairs = torch.combinations(torch.arange(g.num_nodes()), r=2, with_replacement=False) candidate_u = node_pairs[:, 0] candidate_v = node_pairs[:, 1] # build a graph with all node pairs candidate_graph = dgl.graph((candidate_u, candidate_v)) candidate_graph_node_embeddings = model(g, g.ndata['feat']) # we use embeddings from the original graph candidate_graph.ndata['h'] = candidate_graph_node_embeddings # use the predictor to predict the existence of links between nodes predicted_scores = predictor(candidate_graph, candidate_graph_node_embeddings) # build node pairs, avoid self-loops (with_replacement=False) node_pairs = torch.combinations(torch.arange(g.num_nodes()), r=2, with_replacement=False) candidate_u = node_pairs[:, 0] candidate_v = node_pairs[:, 1] # build a graph with all node pairs candidate_graph = dgl.graph((candidate_u, candidate_v)) candidate_graph_node_embeddings = model(g, g.ndata['feat']) # we use embeddings from the original graph candidate_graph.ndata['h'] = candidate_graph_node_embeddings # use the predictor to predict the existence of links between nodes predicted_scores = predictor(candidate_graph, candidate_graph_node_embeddings) Now that we have predictions for all candidate pairs, we can check the link probability between any specific nodes . For example, let’s examine the score and probability of a link between nodes 1773 and 7005, which are not directly connected in the initial dataset: we have predictions for all candidate pairs, we can check the link probability between any specific nodes # find the index of the node pair (1773, 7005) pair_index = torch.where((candidate_u == 1773) & (candidate_v == 7005))[0] print(f'Pair index: {pair_index}') # get the logit score for this pair and compute probability of link existence pair_link_score = predicted_scores[pair_index].item() # logit score print(f'Pair link score: {pair_link_score}') link_probability = torch.sigmoid(torch.tensor(pair_link_score)).item() # apply sigmoid to convert score into probability print(f'Link probability: {link_probability * 100}%') # find the index of the node pair (1773, 7005) pair_index = torch.where((candidate_u == 1773) & (candidate_v == 7005))[0] print(f'Pair index: {pair_index}') # get the logit score for this pair and compute probability of link existence pair_link_score = predicted_scores[pair_index].item() # logit score print(f'Pair link score: {pair_link_score}') link_probability = torch.sigmoid(torch.tensor(pair_link_score)).item() # apply sigmoid to convert score into probability print(f'Link probability: {link_probability * 100}%') This is the result: Pair index: tensor([11066978]) Pair link score: 0.7675977945327759 Link probability: 68.30010414123535% Pair index: tensor([11066978]) Pair link score: 0.7675977945327759 Link probability: 68.30010414123535% According to our model, there is a 68.3% probability of a link between user 1773 and 7005 . 68.3% probability of a link between user 1773 and 7005 Conclusion In this post, we successfully built a model for predicting new links in a social graph, demonstrating the use of Graph Neural Networks and DGL for link prediction. Using a relatively small dataset allowed us to work efficiently on a local machine. However, as graphs scale up to millions or billions of nodes and edges, handling them requires more advanced solutions, such as distributed training on GPU clusters. we successfully built a model for predicting new links Using a relatively small dataset allowed us to work efficiently on a local machine. As a next step, we’ll explore approaches for handling large-scale graphs and implementing link prediction in cloud environments, enabling you to apply these techniques to production-level datasets.