Authors: (1) Talya Eden, Bar-Ilan University (talyaa01@gmail.com); (2) Omri Ben-Eliezer, MIT (omrib@mit.edu); (3) C. Seshadhri, UC Santa Cruz (sesh@ucsc.edu). Authors: Authors: (1) Talya Eden, Bar-Ilan University (talyaa01@gmail.com); (2) Omri Ben-Eliezer, MIT (omrib@mit.edu); (3) C. Seshadhri, UC Santa Cruz (sesh@ucsc.edu). Table of Links Abstract and 1. Introduction Abstract and 1. Introduction 1.1 Our Contribution 1.1 Our Contribution 1.2 Setting 1.2 Setting 1.3 The algorithm 1.3 The algorithm Related Work Algorithm 3.1 The Structural Decomposition Phase 3.2 The Routing Phase 3.3 Variants of WormHole Theoretical Analysis 4.1 Preliminaries 4.2 Sublinearity of Inner Ring 4.3 Approximation Error 4.4 Query Complexity Experimental Results 5.1 WormHole𝐸, WormHole𝐻 and BiBFS 5.2 Comparison with index-based methods 5.3 WormHole as a primitive: WormHole𝑀 Related Work Related Work Related Work Algorithm 3.1 The Structural Decomposition Phase 3.2 The Routing Phase 3.3 Variants of WormHole Algorithm 3.1 The Structural Decomposition Phase 3.1 The Structural Decomposition Phase 3.2 The Routing Phase 3.2 The Routing Phase 3.3 Variants of WormHole 3.3 Variants of WormHole Theoretical Analysis 4.1 Preliminaries 4.2 Sublinearity of Inner Ring 4.3 Approximation Error 4.4 Query Complexity Theoretical Analysis 4.1 Preliminaries 4.1 Preliminaries 4.2 Sublinearity of Inner Ring 4.2 Sublinearity of Inner Ring 4.3 Approximation Error 4.3 Approximation Error 4.4 Query Complexity 4.4 Query Complexity Experimental Results 5.1 WormHole𝐸, WormHole𝐻 and BiBFS 5.2 Comparison with index-based methods 5.3 WormHole as a primitive: WormHole𝑀 Experimental Results Experimental Results 5.1 WormHole𝐸, WormHole𝐻 and BiBFS 5.1 WormHole𝐸, WormHole𝐻 and BiBFS 5.2 Comparison with index-based methods 5.2 Comparison with index-based methods 5.3 WormHole as a primitive: WormHole𝑀 5.3 WormHole as a primitive: WormHole𝑀 References References ABSTRACT Computing distances and finding shortest paths in massive realworld networks is a fundamental algorithmic task in network analysis. There are two main approaches to solving this task. On one hand are traversal-based algorithms like bidirectional breadth-first search (BiBFS), which have no preprocessing step but are slow on individual distance inquiries. On the other hand are indexing-based approaches, which create and maintain a large index. This allows for answering individual inquiries very fast; however, index creation is prohibitively expensive even for moderately large networks. For a graph with 30 million edges, the index created by the state-of-the-art is about 40 gigabytes. We seek to bridge these two extremes: quickly answer distance inquiries without the need for costly preprocessing. WormHole offers several additional advantages over existing methods: (i) it does not require reading the whole graph and can thus be used in settings where access to the graph is ratelimited; (ii) unlike the vast majority of index-based algorithms, it returns paths, not just distances; and (iii) for faster inquiry times, it can be combined effectively with other index-based solutions, by running them only on the sublinear core. 1 INTRODUCTION Scalable computation of distances and shortest paths in a large network is one of the most fundamental algorithmic challenges in graph mining and graph learning tasks, with applications across science and engineering. Examples of such applications include the identification of important genes or species in biological and ecological networks [18], driving directions in road networks [1–3], redistribution of task processing from mobile devices to cloud [59], computer network design and security [24, 30, 31, 58], and identifying a set of users with the maximum influence in a social network [32, 56], among many others. Thus, a long line of work [5, 21, 25, 28, 62] has developed over the years, constructing scalable algorithms for distance computation for a variety of real-life tasks. The simplest methods for answering a shortest path inquiry (𝑠,𝑡) use traversals, among which the most basic is a breadth first search (BFS) starting from 𝑠 until we reach 𝑡. However, the inquiry time for BFS is linear in the network size, which is much too slow for real-world networks.1 A popular modification, Bidirectional BFS(BiBFS), runs BFS fromboth𝑠 and𝑡, alternatingbetween the two, until both ends meet.It has well been observed in the literature that BiBFS performs surprisingly well for shortest path inquiries on a wide range of networks (see, e.g., [8, 12, 62] and the many references within). Because BiBFS does not require any prior knowledge on the network structure, it is suitable when the number of shortest path inquiries being made is relatively small. However, pure traversal-based approaches do not scale well when one is required to answer a large number of shortest pair inquiries. As we show in Figure 1, BiBFS ends up seeing the whole graph within just a few hundred inquiries. Alongline ofmodern approaches tackles the distance computation problem in a fundamentally different manner, by preprocessing the network and creating an index. The index, in turn, supports extremely fast real-time computation of distances. This line of work has been been investigated extensively in recent years, with Pruned Landmark Labeling (PLL, Akiba et al. [5]) being perhaps the most influential approach. In virtually allindex-based methods, pre-processinginvolves choosing a subset L of nodes, called landmarks; computing all shortest paths among them; and keeping an index of the distance of every node in the network to every landmark. Thus, the space requirement for the index is at least of order 𝑛· |L|, where 𝑛 is the total number of nodes. Naively, this memory requirement can be as bad as quadratic in 𝑛. Despite several improvements to beat the quadratic footprint, existing hub and landmark-based approaches methods continue to have high cost, and can become infeasible even for moderately-sized graphs [40]. Notably, most index-based approaches only return distances in the graph, and not the paths themselves. The first concrete systematic investigation of solutions outputting shortest paths was made by Zhang et al. [62]. Their work points out that while existing index-based solutions can be adapted to also output shortest paths, these adaptations incur a very high additional space cost on top of that required for distance computations. The authors of [62] then proposed a new approach called monotonic landmark labelling (MLL) for saving on the index construction space cost. While their algorithm is the current state of the art for this problem, it is again index-based, meaning that the preprocessing cost is still rather expensive. Improving the computational complexity of the construction phase remains a fundamental challenge. Beyond the computational constraints, it is sometimes simply unrealistic to assume access to the whole network; examples of scenarios where access is only given via limited query access include, e.g., social network analysis through external APIs [10], page downloads in web graphs [14], modern lightweight monitoring solutions in enterprise security [26, 60], and state space exploration in software testing, reinforcement learning and robotics [27], among many others. Existing indexing-based approaches are unsuitable for these scenarios since they require reading the whole graph as a prerequisite. Traversal-based methods such as BiBFS are suitable, but as mentioned they do not scale well if one requires multiple distance computations. The limitations of indexing-based and traversal-based methods give rise to a natural question of whether there is a middle ground solution, with preprocessing that is more efficient than in index-based approaches and inquiry time that is faster than in traversal-based approaches. Namely, we ask: Is it possible to answer shortest-path inquiries in large networks very quickly, without constructing an expensive index, or even seeing the whole graph? Is it possible to answer shortest-path inquiries in large networks very quickly, without constructing an expensive index, or even seeing the whole graph? A general solution for any arbitrary graph is perhaps impossible; however, real-world social and information networks admit order and structure that can be exploited. In this work we address this question positively for a slightly relaxed version of the shortest path problem on such networks. Inspired by the core-periphery structure of large networks [43, 50, 52, 63], we provide a solution which constructs a sublinearly-sized index and answers inquiries by querying a strictly sublinear subset of vertices. In particular, our solution does not need to access the whole network. The algorithm returns approximate shortest paths, where the approximation error is additive and very small (almost always zero or one). In practice, the setup time is negligible (a few minutes for billion-edges graphs), and inquiry times improve on those of BiBFS. Moreover, it can be easily combined with existing index-based solutions, to further improve on the inquiry times. We also include theoretical results that shed light on the empirical performance. This paper is available on arxiv under CC BY 4.0 license. This paper is available on arxiv under CC BY 4.0 license. available on arxiv available on arxiv [1] To avoid confusion, throughout this paper we use the term inquiry to indicate a request (arriving as an input in real-time to our data structure) to compute a short path SP(𝑠,𝑡) between 𝑠 and 𝑡. The term query refers to the act, taken by the algorithm itself, of retrieving information about a specific node. For more details on the query model we consider, see §1.2.)