**Unveiling the Limits of Learned Local Search Heuristics: Are You the Mightiest of the Meek?**

by Aiding in the focused exploration of potential solutions.April 12th, 2024

**Authors:**

(1) Ankur Nath, Department of Computer Science and Engineering, Texas A&M University;

(2) Alan Kuhnle, Department of Computer Science and Engineering, Texas A&M University.

Summary and Outlook, References

In this subsection, we briefly discuss datasets included in our analysis. Our datasets as well as the experimental code can be found at this link [1].

**GSET** Stanford GSET dataset (Ye, 2003) is extensively used to benchmark SOTA heuristics (Benlic and Hao, 2013; Leleu et al., 2019) for MaxCut. The dataset comprises three types of weighted and unweighted random graphs: Erd˝os-R´enyi graphs (Erd˝os et al., 1960) with uniform edge probabilities, skew graphs with decaying connectivity, and regular toroidal graphs.

Synthetic graphs We incorporate the dataset published by Barrett et al. (2020), featuring Erd˝os-´enyi (Erd˝os et al., 1960) and Barab´asi-Albert (Albert and Barab´asi, 2002) graphs (referred to as ER and BA, respectively). These graphs involve edge weights wij ∈ {0, ±1} and include up to 500 vertices. Various optimization approaches (CPLEX, 2023; Leleu et al.,2019; Tiunov et al., 2019) are applied to each graph, and the best solution found by any of these approaches is considered the optimum solution.

**Physics** We investigate a real-world dataset known as the “Physics” dataset, comprising ten graphs with 125 vertices representing Ising models. In this dataset, each vertex has six connections, and the edge weights are defined as wij ∈ {0, ±1}. These graphs served as a generalization benchmark in Khalil et al. (2017).

In this subsection, we will discuss the two learned heuristics to be analyzed and explain why we believe it is important to reevaluate these learned heuristics.

**S2V-DQN** Khalil et al. (2017) demonstrated that S2V-DQN outperforms standard greedy, CPLEX, and semidefinite programming (Goemans and Williamson, 1995) for the Max-Cut problem. Despite the wellknown poor performance of the standard greedy approach for this problem (Fujishige, 2005), the standard greedy approach was the second-best competitor to their approach.

**ECO-DQN** Barrett et al. (2020) introduced a RL framework that allows reversible actions and provided seven handcrafted observations per node to represent the state space. Here, we want to highlight two observations closely related to the Tabu Search: i) immediate change of objective value (marginal gain) if vertex state is changed (a vertex is added to or removed from the solution set) and ii) steps since the vertex state was last changed to prevent short looping trajectories. Empirically, ECO-DQN demonstrated superior performance compared to S2V-DQN while demonstrating competitive performance with SOTA heuristics (Tiunov et al., 2019; Leleu et al., 2019). ECO-DQN allows both greedy and non-greedy actions to perform an in-depth local search, striking a balance between exploration and exploitation. We refer to the original paper for more details on the algorithm.

Despite the comprehensive comparison with SOTA heuristics, we observed an absence of a direct comparison with its simple heuristic counterpart, Tabu Search. Incorporating such a comparison can help understand whether ECO-DQN indeed learns a more evolved version of TS or if its superior performance might be attributed to its integration with a local search heuristic.

In this subsection, we will provide a brief overview of baseline heuristics for comparison.

**Maxcut Approximation (MCA)** This is a simple greedy algorithm that starts with a random solution and iteratively improves the solution by choosing the optimal move at each step until no further improvement is possible. This differs from the standard greedy approach, which starts with an empty solution set and does not allow reversible actions.

**Tabu Search (TS)** Glover (1990) proposed a local search optimization algorithm that starts with a random solution and explores the best neighboring solutions. It involves a parameter known as tabu tenure (denoted as γ), acting as short-term memory, restricting the repetition of the same actions for γ number of timesteps. This prevents revisiting recent actions and allows TS to escape from local minima and explore the search space. Although there are various improved versions of this algorithm mentioned in the original paper Glover (1990), we chose the standard Tabu search as our baseline.

We propose a simple RL framework based on TS. Within this framework, we replace the GNN component of ECO-DQN with linear regression. Additionally, we exclusively utilize a specific subset of node features from ECO-DQN that are tied to TS. This substitution enables us to conduct an empirical evaluation of the role of GNN in learning heuristics.

Given an undirected graph G(V, E) and a solution S at time step t, we formally define the state, action, and reward within our RL framework as follows:

State We define the state of the environment denoted as s (t) by providing two observations per vertex. These observations comprise: i) the marginal gain resulting if the vertex state is changed; and ii) the number of steps since the vertex state is changed. Notably, we intentionally omit all other features utilized in ECO-DQN.

**Transition** Transition is deterministic here and corresponds to changing the state of the vertex v ∈ V that was selected as the last action.

We evaluate the algorithms using the average approximation ratio as a performance metric. Given an instance, the approximation ratio for an algorithm (agent) is the ratio of the objective value of the best solution found by the agent to the objective value of the best-known solution for the instance. Unless specifically mentioned, for each graph G(V, E), we run each stochastic reversible agent for 50 randomly initialized episodes with 2|V | timesteps per episode and select the best outcome from these runs. We would like to stress that we run and present all empirical evaluations of ECO-DQN to ensure a fair comparison with it and are able to reproduce its performance on the synthetic datasets (Barrett et al., 2020) at the level reported in the original work. Due to scalability issues with GNNs, we refrain from training agents on distributions with 500 vertices.

In Table 1, we present the performance of agents trained and tested on the same distribution in synthetic datasets (Barrett et al., 2020) up to 200 vertices. The results underscore the noteworthy observation that MCA, a relatively simple heuristic with no hyperparameters, frequently outperforms S2V-DQN for the Max-Cut problem. Looking ahead, if future research on learned heuristics utilizes S2V-DQN as a benchmark and outperforms S2V-DQN, it raises two distinct possibilities. Firstly, it could signify genuine progress in enhancing performance for solving CO problems. Alternatively, it might suggest that all these learned heuristics are weak, and none of them truly excel.

This leads to our initial concern: comparing a weak heuristic against another weak heuristic does not provide meaningful insights into their performance. Benchmarking against weak heuristics may set a low standard, giving a false sense of accomplishment if the evaluated heuristic performs better. Proper baseline selection will ensure a better grasp of the extent of generalizability and the effectiveness of learned heuristics in tackling diverse CO problems.

Moreover, we note that there is no discernible difference in the performance of ECO-DQN and SoftTabu. This can imply three things: i) The problem instances are relatively simple, making it difficult for the added features unrelated to TS and GNN used in ECO-DQN to provide significant advantages over SoftTabu. ii) The additional features and GNN utilized in ECODQN may not be the primary factors contributing to its superior performance or iii) ECO-DQN and SoftTabu learn similar heuristics for solving the problem, leading to comparable performance despite differences in their architectures and approaches. This is proven to be incorrect. The probability of taking greedy actions at any timestamp greatly varies between the two approaches, as demonstrated in Figure 1.

This ability of learned heuristics to perform well on a wide range of distributions, even if these distributions are not represented during training, is a highly desirable characteristic for practical CO problems. ECODQN and S2V-DQN exhibited promising performance across a diverse range of graph structures, including those not present in their training data. We run similar experiments for TS and SoftTabu agents to see if they exhibit weaker generalization performance compared to ECO-DQN and S2V-DQN. In our empirical evaluation, we discover that SoftTabu and TS display similar or even superior performance when compared to all learned heuristics, as illustrated in Figure 2. This result is important because it raises the possibility that the generalization of learned heuristics over small instances may not be as novel as the related literature (Khalil et al., 2017; Barrett et al., 2020) suggests. If even simple local search heuristics perform well, then the learned policy, which somewhat incorporates these heuristics, may also perform well.

Our empirical evaluation, up to this point, has been somewhat inconclusive regarding the performance of ECO-DQN. This lack of clarity might be attributed, at least in part, to the lack of harder instances. Specifically, these small instances may be too simple, making it possible for SoftTabu and ECO-DQN to solve them without requiring any special efforts. Consequently, it is important to evaluate the agents on harder instances that possess the ability to differentiate performance levels. To this end, we test agents on the GSET and Physics datasets. We apply agents that are trained on ER graphs with |V | = 200, following the experimental setup by Barrett et al. (2020). They limited their experiments to ER graphs, so we extend the experiments to include all three types of graphs in GSET. Due to the high computational demands, we restrict the number of episodes per graph to 5 for GSET graphs. As these are harder instances, we let agents run for 4|V | timesteps to optimize (Barrett et al. (2020) suggested that simply increasing the number of timesteps in an episode increases the average approximation ratio). Notably in Table 2, we observe a substantial decline in performance when the ECO-DQN agents are tested on graph distributions other than ER graphs. Even on ER graphs, ECODQN demonstrates only marginal improvement. This outcome may be anticipated. When machine learning models are trained on specific datasets, they may learn patterns and heuristics that are tailored to that particular data. However, when presented with unseen or different data (i.e., during testing or deployment), these learned heuristics may not generalize well and could lead to suboptimal performance or poor outcomes.

To investigate further, we generate synthetic skew graphs using the implementation provided in Rudy, a machine-independent graph generator written by G. Rinaldi (Ye, 2003), with descriptions sourced from (Helmberg and Rendl, 2000). We train the ECODQN on small instances of similar distributions of skew graphs with 200 vertices, given that ECO-DQN performs worst on skew graphs. This yields an approximate 10% to 15% increase in performance, with the average approximation ratio improving from 0.940 to 0.955 for |V | = 800, and from 0.864 to 0.945 for | V |= 2000, while still remaining suboptimal compared to SofTabu trained on ER graphs. We provide more details about this experiment in the supplementary material.

Our experiments indicate that ECO-DQN shows promising generalization to previously unseen graph structures, especially in small instances where simple local search heuristics also perform well. However, when dealing with challenging benchmarks, ECODQN may not perform as optimally as simple heuristics or their simple learned counterparts. While it is always possible to find instances where an optimization algorithm seems to perform poorly (Wolpert and Macready, 1997), our intuition suggests that the reason behind the poor performance of ECO-DQN is that the agent can easily become trapped and cease to explore the search space. Even though the novelty of ECO-DQN lies in the capacity for continuous solution enhancement through exploration, the ECO-DQN agent tends to excessively revisit specific vertices, as shown in Figure 3, thus curbing the exploration of the search space. It becomes clearer when comparing Figure 3b and Figure 3d. When the ECO-DQN agent is trained on skewed graphs, it learns to explore more, resulting in better performance on skewed graphs.

We find out that SoftTabu exhibits a more even distribution of flips in an episode (details provided in the supplementary material) and finds better solutions while exploring. This is an interesting outcome in the sense that GNNs seem like a natural choice to solve combinatorial graph problems. However, if the underlying principles of integrated heuristics are something simple, such as exploring to avoid local minima, simpler machine learning models can often adequately learn these principles. Simpler machine learning models can save computational resources and lead to similar or even better results.

[1] Code Repository: https://tinyurl.com/52ykxtaj

This paper is available on arxiv under CC 4.0 DEED license.

L O A D I N G

. . . comments & more!

. . . comments & more!