Authors:
- Thomas Feltin
- Léo Marchó
- Juan-Antonio Cordero-Fuertes
- Frank Brockners
- Thomas H. Clausen
Abstract:
Deep neural network (DNN) inference on streaming data requires computing resources to satisfy inference throughput requirements. However, latency and privacy sensitive deep learning applications cannot afford to offload computation to remote clouds because of the implied transmission cost and lack of trust in third-party cloud providers. Among solutions to increase performance while keeping computation on a constrained environment, hardware acceleration can be onerous, and model optimization requires extensive design efforts while hindering accuracy. DNN partitioning is a third complementary approach, and consists of distributing the inference workload over several available edge devices, taking into account the edge network properties and the DNN structure, with the objective of maximizing the inference throughput (number of inferences per second). This paper introduces a method to predict inference and transmission latencies for multi-threaded distributed DNN deployments, and defines an optimization process to maximize the inference throughput. A branch and bound solver is then presented and analyzed to quantify the achieved performance and complexity. This analysis has led to the definition of the acceleration region, which describes deterministic conditions on the DNN and network properties under which DNN partitioning is beneficial. Finally, experimental results confirm the simulations and show inference throughput improvements in sample edge deployments.
DNN partitioning at the edge enables acceleration of the inference throughput while lowering the bandwidth usage and latency to centralized servers, e.g., in the cloud.
Introduction
Connected devices generated an estimated 2.5 quintillion bytes of data every day in 2020.
AI applications can have strong latency constraints, e.g., in autonomous driving, manufacturing monitoring, or any real-time inference involving a real world interaction. For example, average response times in autonomous driving are required to be under 100 milliseconds
In comparison with public clouds, the edge is a resource-constrained environment with important limitations
There are two main available methods to meet performance requirements on constrained edge networks: model compression or hardware acceleration. Model compression consists of reducing and optimizing neural networks to a light-weight, often under-performing, version of the model. Standard model compression methods include pruning, quantization, and knowledge distillation.
DNN partitioning is a complementary method for accelerating inference by leveraging the multiplicity of existing devices on edge networks to distribute the inference computation; it can be used alone, or in conjunction with the two other methods. DNN partitioning consists of considering a neural network as a pipeline to segment into partitions, and distributing these partitions on edge devices. The placement of these partitions is based on both the DNN and the underlying network characteristics. DNN partitioning relies on the identification of split points, which are points in the model graph where the model is separated into partitions. During run-time, partitions are run sequentially, each sending intermediary inference results to the next partition. This allows each partition to start computing the next input data while the other devices continue processing the offloaded one, hereby improving the inference throughput, as shown in
A. Related Work
Methods for DNN partitioning, derived from mobile edge-cloud offloading, seek to optimize varying performance indicators, such as computing latency, energy consumption, resource utilization, cost, or throughput. DNN partitioning relies on the association of one or several of these metrics to define an optimization goal, and a method which exploits this metric to find partitioning schemes. For example, Neurosurgeon
These methods are either interacting with remote clouds, or are limited in the nature of the partitioning, e.g., required to split in exactly two parts, with or without multi-threading. None of the listed DNN partitioning studies considers both multiple split points and multi-threading.
B. Contributions
This paper presents a distributed inference framework to maximize the inference throughput of real-time DNN computation on streaming data, with multiple split points and multi-threaded partitioning. The contributions of this paper are the following:
- A model for computing and transmission latencies of a distributed DNN, through which the expected inference throughput of a given partitioning scheme can be estimated.
- Formulation of the optimization problem for DNN partitioning, implementation of a branch and bound solver for this problem, and evaluation of its complexity.
- Simulations to explore the possible performance improvements with varying network and DNN properties.
- The identification of deterministic regions in the network and DNN properties leading to the existence of optimal partitionings and the cost to compute such solutions, i.e., the conditions under which DNN partitioning is beneficial.
- Experimental results illustrating the regions defined in the simulations, as well as the prediction accuracy and final inference throughput acceleration for homogeneous and heterogeneous environments.
For the remainder of this paper, to avoid confusion, the partitioned deep neural network will be referenced as DNN, and the term network will denote the underlying physical communications network.
C. Paper Outline
The remainder of this paper is organized as follows.
SECTION II.
Background: Deep Learning
Within the field of Machine Learning, Deep Learning (DL) refers to methods relying on the use of DNNs, i.e., artificial neural networks (or related machine learning methods) containing more than one hidden layer. One of the earliest references to deep learning architectures was published in 1967
There are two main DNN structures: feed-forward neural networks and recurrent neural networks. Both types of DNNs can be represented as computation graphs, with the main difference being that feed-forward networks can be represented as directed acyclic graphs (DAGs), while recurrent neural networks may contain cycles. Recurrent neural networks are built for sequences of data, where each data point has a potential dependency with previous data points in the sequence, while feed-forward neural networks do not consider interactions between points in a data sequence.
The remainder of this paper will consider the case of feed-forward DNNs.
SECTION III.
Distributed Inference Modeling
This Section presents a model to represent DNN and network properties in order to predict computation latencies, transmission latencies, and the final inference throughput of a given partitioning solution.
A. DNN and Network Representation
A feed-forward DNN of L layers is modeled as a DAG GA=(L,E) with L=(L1,…,LL) the layers of the DNN. Edges (Li,Lj)∈E are the connections between layers Li and Lj. Each layer Li has an associated compute consumption ci, measured in the number of floating-point operations required to compute a forward pass through the layer. Edges (Li,Lj) are assigned a weight si,j corresponding to the size of the data transiting between layers Li and Lj, in bytes.
The physical network is modeled as a fully connected graph GN=(N,V) where N=(N1,…,NN) is the set of compute nodes and V is the set of links between nodes. It is assumed that compute nodes N1,…,NN have processing rates η1,…,ηN, respectively, measured in floating-point operations per second. Finally, the link throughput between two adjacent nodes Na and Nb is denoted as θa,b, measured in bytes per second. Every node Ni is connected to itself with infinite throughput to represent the loopback link, and links are assumed to be symmetrical, i.e., θa,b=θb,a.
Partitionings are defined as maps P:L→N, i.e., a partitioning assigns a node number to each layer in the DNN. A partitioning can be described as a matrix P of dimension (N×L), N being the number of nodes on the network and L the number of DNN layers, with Pa,i=1 if layer Li is placed on node Na, 0 otherwise. With the example given in
Given a partitioning, a thread is defined as a group of consecutive layers between two split points, run sequentially on the same node. In the example above, node N1 will run two threads, the first containing layers {L1,L2,L3} and the second one containing {L6,L7} .
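The thread construction above can be sketched in a few lines. This is an illustrative helper (names and the example assignment are ours, not from the paper): given a layer-to-node assignment, maximal runs of consecutive layers on the same node become threads.

```python
def threads(assignment):
    """Group consecutive layers placed on the same node into threads."""
    groups = []
    for layer, node in enumerate(assignment):
        if groups and groups[-1][0] == node:
            groups[-1][1].append(layer)
        else:
            groups.append((node, [layer]))
    return groups

# Example from the text: node 1 runs {L1,L2,L3} and {L6,L7} as two threads
# (layers are 0-indexed here).
p = [1, 1, 1, 2, 2, 1, 1]   # layer -> node assignment
print(threads(p))           # [(1, [0, 1, 2]), (2, [3, 4]), (1, [5, 6])]
```

Note that the same node may host several threads, as in the example, so a node's workload is the union of its threads' layers.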
With these notations defined, the remainder of this Section derives a closed expression of the inference throughput achieved by a given partitioning, as a function of the inference (
B. Inference Latency Prediction
Given a partitioning, this Section looks at inference latency prediction on an isolated node. The latency induced by the computation of a layer Li with consumption ci , to be computed on node Na , with processing rate ηa is expressed as Tci(Na)=ci/ηa . Given a set of layers L′⊂L across all threads running on node Na with processing rate ηa , the inference latency can be expressed as:
This expression is a first order approximation of an inference latency estimation based on the number of floating-point operations (FLOPs) required to process the DNN, i.e., the number of multiply-add operations in the model.
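The first-order estimate amounts to dividing the FLOPs placed on a node by its processing rate. A minimal sketch, with illustrative FLOP counts and rates (not measurements from the paper):

```python
def node_latency(assignment, flops, rate, node):
    """Seconds to compute all layers mapped to `node` at `rate` FLOPs/s:
    T_c(N_a) = sum of c_i over layers on N_a, divided by eta_a."""
    return sum(c for layer, c in enumerate(flops) if assignment[layer] == node) / rate

flops = [2e9, 1e9, 3e9]   # per-layer multiply-add counts (illustrative)
p = [0, 0, 1]             # layers 0,1 on node 0; layer 2 on node 1
print(node_latency(p, flops, 5e9, 0))   # (2e9 + 1e9) / 5e9 = 0.6 s
```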
Estimation of inference latency on edge devices is complicated by run-time optimizations. Existing solutions such as nn-Meter
Across all nodes, the thread with highest latency sets the inference throughput for the distributed DNN. Omitting transmission latencies, this implies that once this limiting thread is done computing a data frame, it directly starts processing the next one. Other threads will have idle time before processing the next data frame because their inference latency is lower than this maximum latency, as shown on
Given a partitioning matrix P , the inference latencies can be expressed as a single vector Tc of size N where the a -th component is the highest thread inference latency on node Na , i.e., Tca=Tc(Na) :
where H is the diagonal matrix of inverse node processing rates diag(η1^-1,…,ηN^-1) and c is the column vector of individual layer consumptions (c1,…,cL).
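In vector form, the per-node computing latencies follow directly from the definitions of H, P, and c, as Tc = H P c. A small numerical sketch with illustrative values:

```python
import numpy as np

eta = np.array([5e9, 2e9])        # node processing rates (FLOPs/s), illustrative
c = np.array([2e9, 1e9, 3e9])     # per-layer consumptions (FLOPs), illustrative
P = np.array([[1, 1, 0],          # layers 0 and 1 placed on node 0
              [0, 0, 1]])         # layer 2 placed on node 1

H = np.diag(1.0 / eta)            # diagonal matrix of inverse processing rates
Tc = H @ P @ c                    # per-node computing latencies, in seconds
print(Tc)                         # [0.6 1.5]
```

Here P @ c aggregates the FLOPs placed on each node, and H scales each total by the node's inverse rate.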
C. Transmission Latency Prediction
The transmission latency can be predicted as follows: the time it takes to send the amount of data si,j between layers Li and Lj on edge (Na,Nb) over a link with measured throughput θa,b is Tti,j(Na,Nb)=si,j/θa,b . The time to achieve data transfers si,j∈E′⊂E over link (Na,Nb) is:
Similarly to the inference latency prediction, this implies that the links are shared evenly between data transfers from Na to Nb . The transmission latency prediction relies on the estimation of the link throughput between nodes, more precisely the goodput, i.e., the amount of useful data transmitted per second.
Similarly to inference latency estimation, this is a first order approximation, which nonetheless offers good prediction results, as shown in
Given a partitioning matrix P, the transmission latency matrix Tt can be defined as a square matrix of size N, with Tta,b=Tt(Na,Nb):
where ∘ is the term-by-term product, or Hadamard product, Θ is the inverse throughput matrix, i.e., a square matrix of size N with Θa,b=θa,b^-1, and S is the transmission size matrix, i.e., a square matrix of size L with Si,j=si,j if Li sends data to Lj, 0 otherwise.
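One plausible assembly of this matrix from the stated definitions (the paper's exact equation is not reproduced here) is to aggregate inter-node transfer sizes as P S Pᵀ and take the Hadamard product with Θ. A sketch with illustrative sizes and throughputs:

```python
import numpy as np

theta = np.array([[np.inf, 1e7],     # loopback links have infinite throughput
                  [1e7, np.inf]])    # 10 MB/s symmetric link between the nodes
Theta = 1.0 / theta                  # inverse throughputs; 1/inf = 0 zeroes out
                                     # intra-node (loopback) transfers

S = np.zeros((3, 3))                 # transmission size matrix (bytes)
S[0, 1] = 4e5                        # layer 0 sends 0.4 MB to layer 1
S[1, 2] = 2e5                        # layer 1 sends 0.2 MB to layer 2

P = np.array([[1, 1, 0],             # layers 0,1 on node 0; layer 2 on node 1
              [0, 0, 1]], dtype=float)

Tt = Theta * (P @ S @ P.T)           # Hadamard product with aggregated sizes
print(Tt)                            # only the node-0 -> node-1 entry is non-zero
```

With this placement, only the 0.2 MB transfer between layers 1 and 2 crosses the link, giving Tt[0,1] = 2e5 / 1e7 = 0.02 s.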
SECTION IV.
DNN Partitioning
The objective of DNN partitioning in this study is to maximize the inference throughput, denoted by C, which corresponds to the inverse of the maximum among all the computation and transmission latencies. Given a partitioning P, the inference throughput is defined as:
Maximizing the inference throughput amounts to finding the joint min-max between all the terms of Tc and Tt . This can be defined as a discrete optimization problem:
with Ji,j being the all-ones matrix of size i×j , conditions
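Numerically, the objective reduces to inverting the bottleneck latency across all computing and transmission terms. A minimal sketch reusing latencies computed elsewhere (the input values below are illustrative):

```python
def inference_throughput(compute_latencies, transmission_latencies):
    """Inferences per second: the pipeline is limited by its slowest stage,
    whether a computing thread or a data transfer over a link."""
    worst = max(list(compute_latencies) + list(transmission_latencies))
    return 1.0 / worst

# A 1.5 s compute stage dominates a 0.6 s stage and a 0.02 s transfer.
print(inference_throughput([0.6, 1.5], [0.02]))   # 0.666... inferences/s
```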
This optimization problem is a Mixed Integer Non-linear Programming (MINLP) problem. The partitioning matrix can be seen as the one-hot encoded version of a vector of size (1×L) with values in [[1,N]]. MINLP problems are NP-hard, and the complexity of finding an optimal DNN partitioning is O(N^L), since each of the L layers can be placed on any of the N nodes. There are several heuristics to obtain optimal or sub-optimal solutions to this problem in less time than a brute force algorithm, e.g., Genetic Algorithms (GA)
A. Branch and Bound Algorithm
This Section presents the branch and bound (B&B) adaptation to the DNN partitioning optimization problem defined in
In B&B algorithms, the solution space is represented by a tree. The process consists of eliminating entire branches in the tree based on the evaluation of the score of the root node. This implies the existence of a tree topology which creates a relationship between the score of a node in the tree and the bounds on the scores of all of its leaves.
This tree representation of the solution space favors the use of the B&B algorithm in this study. Assuming ps is a partial list of size s , the children of the node ps in the tree topology are all lists which start with ps , i.e., all partitionings where the first s layers are placed according to ps . The leaf node below ps corresponding to ps padded with the last value of ps is labeled ps~ . For example, if L=5 and p2=(2,1) , then p2~=(2,1,1,1,1) . Given Ttmax=maxTt(ps~) the maximum value of the transmission latency matrix for partitioning ps~ , all leaf nodes p below ps will have a higher maximum transmission latency than ps~ , i.e., ∀p,maxTt(p)≥Ttmax . This is explained by the fact that given a partial partitioning, any displacement in the other layers will only add data transfers between nodes.
During the process of looking for an optimal partitioning p , which maximizes the inference throughput C(p,GA,GN) , and given a current best achievable throughput best_throughput , the B&B optimization consists of eliminating entire branches of the tree representing the solution space by evaluating transmission times in partial partitionings. This process allows the B&B optimization to quickly eliminate numerous cases compared to a brute force algorithm. For example, if the throughput between nodes is low compared to the compute capacity, e.g., with nodes connected through a low throughput wireless connection, the optimal solution is often to keep all computation on a single node. The advantage of the B&B algorithm is that it terminates after N operations in such simple cases by eliminating all partial partitionings in the first stage of the tree. Conversely, in cases where links have higher throughputs, the complexity of the B&B can be higher since it will be unable to eliminate large branches.
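The pruning idea described above can be sketched for a chain-structured DNN. This is a hedged illustration, not the paper's Algorithm 1: a partial placement is bounded by the transmission latencies it already commits to, and a branch is cut when that lower bound on the bottleneck latency is no better than the best complete solution found so far. All inputs (FLOP counts, transfer sizes, rates, throughputs) are illustrative.

```python
def solve(flops, sizes, eta, theta):
    """B&B sketch for a chain DNN: flops[i] is layer i's cost, sizes[i] the
    bytes sent from layer i to i+1, eta[a] node a's rate, theta[a][b] the
    link throughput. Returns (placement, achieved throughput)."""
    L, N = len(flops), len(eta)
    best = {"lat": float("inf"), "p": None}

    def bottleneck(p):
        # Full evaluation: per-node compute latency plus per-split transfer latency.
        comp = [sum(flops[i] for i in range(L) if p[i] == a) / eta[a] for a in range(N)]
        trans = [sizes[i] / theta[p[i]][p[i + 1]]
                 for i in range(L - 1) if p[i] != p[i + 1]]
        return max(comp + trans)

    def expand(partial):
        if len(partial) == L:
            lat = bottleneck(partial)
            if lat < best["lat"]:
                best["lat"], best["p"] = lat, list(partial)
            return
        # Lower bound: transfers already fixed by the partial placement can only
        # grow as more layers are placed, so they bound every leaf below.
        lb = max([sizes[i] / theta[partial[i]][partial[i + 1]]
                  for i in range(len(partial) - 1) if partial[i] != partial[i + 1]],
                 default=0.0)
        if lb >= best["lat"]:
            return                       # prune the whole branch
        for node in range(N):
            expand(partial + [node])

    expand([0])                          # fix the first layer on node 0 (symmetry)
    return best["p"], 1.0 / best["lat"]

# Two identical 5 GFLOPs/s nodes over a fast 1 GB/s link: splitting a
# four-layer chain in half nearly doubles the throughput.
p, throughput = solve([4e9] * 4, [1e5] * 3, [5e9] * 2,
                      [[float("inf"), 1e9], [1e9, float("inf")]])
print(p, throughput)                     # [0, 0, 1, 1] 0.625
```

On slow links the transfer-based bound exceeds the single-node latency after the first split, so the search collapses quickly, matching the behavior described in the text.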
B. Branch and Bound Complexity
The overhead caused by the total number of partitions in a real world implementation is neglected in this approach. This added latency is correlated with the total number of partitions. With each partition, the data transfer from Network Interface Controller (NIC) to memory causes an additional latency, related to the PCIe throughput, memory throughput and size of the transferred data. Assuming that the data transfer throughput is limited by the memory throughput, a simplistic evaluation of this latency would change the computation latency in
In this expression, ν is the memory throughput, i.e., the read/write speed to and from the memory, and (i,j)∈C are all the layers Li and Lj such that edge (Li,Lj) is a split point with associated data transfer si,j .
1) Early Stopping
With an increasing number of partitions, both the performance and the inference latency prediction accuracy will decrease. For this reason, and because of the increased complexity in cases with high available network throughput, an early stopping mechanism is added to limit the number of split points in the final partitioning, i.e., B&B will only explore partial partitionings with a maximum of S split points, further limiting the size of the explored tree. The complete B&B algorithm with limited split points is shown in
Algorithm 1 Branch and Bound Algorithm
Limiting the number of split points lowers the complexity of this MINLP problem. For N the number of nodes, and L the number of DNN layers, the number of possible partitionings considered by the algorithm is now lowered to:
instead of O(N^L) for a brute force algorithm.
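The paper's exact count (equation 9) is not reproduced here; one plausible enumeration, assuming s split points chosen among the L-1 edges of a chain and adjacent partitions placed on distinct nodes, reproduces the orders of magnitude quoted in Section V:

```python
from math import comb

def partitionings(N, L, S):
    """Candidate placements with at most S split points, under the stated
    assumptions: sum over s of C(L-1, s) * N * (N-1)^s."""
    return sum(comb(L - 1, s) * N * (N - 1) ** s for s in range(S + 1))

# YOLOv2-scale chain with L = 49 layers, N = 4 nodes, S = 3 split points:
print(partitionings(4, 49, 3))   # 1909156, i.e., ~1.9e6 as quoted in Section V
print(f"{4 ** 49:.1e}")          # ~3.2e29, the O(N^L) brute force count
```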
SECTION V.
Simulations
This Section presents simulations to evaluate the impact of the algorithm parameters on the B&B algorithm performance and complexity. The worst case complexity for B&B is described in
- The relationship between number of nodes N and the maximum allowed number of split points S , which determines the required number of B&B iterations to reach the optimal solution.
- The relationship between link throughput θ and node processing rate η , which determines the existence of a partitioning which improves the inference throughput.
These interactions are evaluated via the following metrics, which can be used to compare the proposed method to the literature:
- The achieved inference throughput, measured in inferences per second.
- The number of B&B iterations to reach the solutions, i.e., the number of explored partitionings in the solutions tree. This metric is used instead of computation time in order to abstract the used device performance. For reference, when running B&B on a desktop computer with an Intel Core i7 processor, it took 17 milliseconds to run through 10^2 iterations, 1.269 seconds to run through 10^4 iterations, and 5 minutes to run through 10^6 iterations.
To isolate each variable contribution, the simulations are run in a homogeneous network scenario, i.e., all nodes have an identical processing rate, and all links have an identical throughput.
A. Number of Nodes and Split Points
The set-up consists of a homogeneous network with processing rates set at 5GHz (effective processing rate of a standard edge device, e.g., a Raspberry Pi 4), and link throughputs at 10MBps (effective throughputs for connections through 802.11). The number of compute nodes N on the network varies between 1 and 6, and the maximum number of split points S varies from 0 (keeping all computation on a single node) to 5 (for a total of 6 partitions, spread across the network).
This limitation can be explained as follows: with the given DNN and network properties, once the optimal placement is found on N=3 nodes with a maximum number of splits S=3, adding a new node to the network, or allowing another split point, will not lead to a better partitioning. In a homogeneous network, this implies that any displacement of layers to another node will imply transmission latencies higher than the maximum computing latency of the current partitioning. This leads to a bound on the value of S, after which point the optimal partitioning is the best possible achievable partitioning. This bound can be expressed as:
with Tcmono the computing latency when keeping all computation on a single node, Ttdisp the transmission latency caused by a layer displacement, η and θ the node processing rate and link throughput values in the homogeneous network, ∑_{Li∈L} ci the sum of all layer consumptions in the DNN, and smin the minimal inter-layer data transfer size. This expression can be used to define an upper bound on the value of S, and limit the B&B computation time.
For the remainder of this paper, experiments and simulations will be run with a maximum number of splits S=3 to limit the complexity of the algorithm, with near optimal partitioning in the described set-up, which corresponds to a typical edge scenario.
B. Processing Rate and Link Throughput
The set-up consists of a standard implementation of YOLOv2
- Processing rates between 0.1GHz and 100GHz cover CPUs of systems on a chip, e.g., the Qualcomm Snapdragon suite with processing rates between 500MHz and 1GHz, and specialized AI embedded systems, e.g., the NVIDIA Jetson TX2 module, with a measured processing rate of 30.7 GHz in the experiments of Section VI.
- Link throughputs between 1MBps and 10GBps cover 802.11g connections at several MBps, and Gigabit Ethernet links.
Results show that achieved inference throughput and B&B iterations increase with both processing rate and link throughput. This is expected, since faster processors yield faster inferences, and better link throughput creates possibilities for improved partitionings.
Additionally, these simulations highlight the existence of distinguishable regions in the two heatmaps of
1) Partitioning Improvement Boundary
The green dotted line separates the top-left region of
with η and θ the node processing rate and link throughput values in the homogeneous network, ∑_{Li∈L} ci the sum of all layer consumptions in the DNN, and smin the minimal inter-layer data transfer size. This expression is a particular case of
This result is essential in order to understand DNN partitioning. It defines a criterion on the region where distributing inference can improve the overall performance. In a homogeneous scenario, for every DNN, the link throughput to node processing rate ratio θ/η (which is a property of the network) needs to exceed a deterministic value described in
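The criterion can be checked before deployment. A hedged sketch: the exact constant of equation 11 is not reproduced here, so the boundary is taken as equality between the network ratio θ/η and the DNN ratio smin/∑ci; all numbers are illustrative.

```python
def can_improve(theta, eta, s_min, total_flops):
    """Partitioning improvement criterion (homogeneous network): partitioning
    can only help when theta/eta exceeds s_min / sum of layer consumptions."""
    return theta / eta >= s_min / total_flops

# Edge-like scenario: 10 MB/s links, 5 GFLOPs/s nodes, 0.2 MB smallest
# inter-layer transfer, 17.5 GFLOPs total model cost (illustrative values).
print(can_improve(10e6, 5e9, 0.2e6, 17.5e9))   # True: partitioning may pay off
```

Since both sides of the inequality are ratios, the check is independent of network size or layer count, as noted in the Discussion below.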
2) Optimal Partitioning Boundary
The red dotted line in
3) Maximum Complexity Boundary
The blue dotted line represents the boundary of the region in which the number of partitionings evaluated by B&B is maximal, and corresponds to θ/η ≈ 0.168. On the bottom right part of this boundary, increasing the throughput to node processing rate ratio will not impact the complexity of B&B, which has reached the maximum number of evaluations it runs through to reach the optimal solution. Notably, the maximum number of partitionings evaluated by B&B in the worst case scenario is around 1.7×10^5, which remains below the complexity of B&B with a limited number of split points S=3 (around 1.9×10^6 iterations in this context), and orders of magnitude below the brute force scenario, which would need to evaluate N^L ≈ 10^29 partitionings (for the given YOLOv2 implementation with L=49 layers).
C. Discussion
Results of the performed simulations allow the identification of three boundaries which delimit the scope of validity for DNN partitioning in a homogeneous network. These boundaries depend on the DNN and network properties and delimit the conditions under which a partitioning can improve the unpartitioned inference throughput. The partitioning improvement criterion defines these conditions on the link throughput to processing rate ratio, and can be anticipated prior to the deployment.
While the optimal partitioning boundary and maximum complexity boundary depend on the number of nodes N and the maximum number of split points S, the location of the partitioning improvement boundary is independent of these variables, and of the number of layers. The scope of validity of DNN partitioning is independent of network or DNN size, and only depends on the relationship between (i) θ/η, which is a property of the network, and (ii) smin/∑_{Li∈L} ci, which is a property of the DNN.
Regarding complexity, it has to be noted that the optimization process is a one-time operation that decides on a partitioning which will remain relevant for long periods, i.e., longer than the order of magnitude of B&B computation times depicted in
In more constrained use-cases, the maximum number of split points S can further act as a tuning parameter for the optimization complexity, e.g., if the computation of an optimal partitioning is more frequent, at the expense of the achieved inference throughput.
Compared to works cited in
SECTION VI.
Experiments
This Section presents experimental results evaluating the accuracy of the model and the achieved inference throughput improvement.
A. Homogeneous Network
Four set-ups are presented in
- Set-up 1 corresponds to conditions above the partitioning improvement boundary in figure 6 (section V-B1), where partitioning the DNN does not improve the inference throughput (i.e., the set-up does not fulfill the partitioning improvement criterion of equation 11).
- Set-up 2 corresponds to conditions between the partitioning improvement boundary and the optimal partitioning boundary (section V-B2). DNN partitioning is expected to find partitionings which improve the inference throughput.
- Set-up 3 corresponds to conditions between the optimal partitioning boundary and the maximum complexity boundary (section V-B3). This implies that B&B is expected to find the best achievable partitioning for a given model.
- Set-up 4 corresponds to conditions below the maximum complexity boundary, i.e., the achieved partitioning is optimal and the number of B&B iterations is maximal.
The values in
The results correspond to the simulations and confirm the existence of the four identified regions of
B. Heterogeneous Network
In order to better understand DNN partitioning performance in heterogeneous networks, this Section presents an experiment which considers the case of adding a single device with varying capacities to a fixed set-up with a single device. This experiment aims to both show results on simple heterogeneous set-ups, and to illustrate the case of an additional device being added to a network to improve the overall performance, e.g., adding a processor in proximity to a smart camera to increase its inference throughput.
This experiment shows that under good link conditions, i.e., above the partitioning improvement boundary (equation 11), the achieved inference throughput can be close to the maximum achievable throughput. Notably, the expression of the partitioning improvement boundary differs from its expression in the homogeneous case (
for a heterogeneous case with N=2 nodes, with ηmax being the maximum processing rate between the two nodes.
This expression implies that for link throughputs below the fixed value smin·ηmax/∑_{Li∈L} ci, optimal partitioning keeps all computation on the fastest node, i.e., following the minimum throughput line, as is the case when the nodes are connected via a Wi-Fi connection with θ = 10MBps in
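This two-node rule of thumb can be sketched directly from the threshold above. The function and all numbers are illustrative (the Jetson-class 30.7 GFLOPs/s rate is from Section VI; the transfer size and total model cost are assumptions for the example):

```python
def keep_on_fastest(theta, eta_a, eta_b, s_min, total_flops):
    """Two-node heterogeneous case: below the threshold s_min * eta_max / sum(c_i),
    the optimal partitioning keeps all computation on the fastest node."""
    eta_max = max(eta_a, eta_b)
    return theta < s_min * eta_max / total_flops

# Jetson-class node (30.7 GFLOPs/s) next to a Pi-class node (5 GFLOPs/s) over a
# slow 0.1 MB/s link, with a 0.2 MB smallest transfer and 17.5 GFLOPs total cost.
print(keep_on_fastest(0.1e6, 30.7e9, 5e9, 0.2e6, 17.5e9))   # True: do not split
```

The threshold grows with ηmax: the faster the best node, the better the link must be before offloading any layer away from it becomes worthwhile.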
SECTION VII.
Results, Scope, and Limitations
This Section describes key results from simulations and experiments on the DNN partitioning approach, as well as a discussion of their scope, limitations and ways to generalize them.
A. Conditions for Homogeneous Networks
Through simulations (
- The partitioning improvement boundary (section V-B1) describes the conditions under which there are partitionings that improve the monolithic inference throughput. This is extended to the heterogeneous case in equation 12, and allows estimation of the link bandwidth and node processing rate values necessary to achieve an inference throughput improvement.
- The upper bound on the number of split points necessary to achieve an optimal solution (equation 10) is the maximal number of split points worth considering, to minimize the B&B computing time while accessing maximal inference throughput.
- The maximal B&B complexity, with a chosen maximum number of split points (equation 9).
These results allow the derivation, under conditions of perfect distribution of workload over homogeneous nodes and links, of an upper bound on the achievable inference throughput of the partitioning strategy:
The validity of these expressions is subject to the network homogeneity assumption. For non-homogeneous networks, the derivation of similar performance and complexity bounds is, of course, specific to the characteristics of the considered network.
B. Scope and Limitations
This Section discusses the limitations of this work, and the presented assumptions, to illustrate their scope of applicability.
1) Input Data
The study has used the DNN partitioning framework on video stream data applications. Although the study has not proven its applicability to other data types, there are no assumptions in the modeling preventing this partitioning method from applying to other applications, e.g., audio, text, or telemetry data. The only limiting assumption in this study is that the model is a feed-forward DNN, with data of constant size across requests.
2) Optimization Problem Definition
This study focuses on use-cases which require a maximal inference throughput, omitting optimization objectives such as the ones included in the related work of
- jointly optimize several metrics, e.g., throughput, latency, energy consumption, monetary cost, drop rate, node up-time, link usage, etc.
- dynamically adapt what metric to optimize depending on the received data. For example, DNN partitioning application on video streams can choose to optimize throughput to avoid missing detections, and switch to latency when an event occurs in the system, to enable low response times.
- add other constraints to the MINLP problem, e.g., a minimal number of split points, or a partial node-to-layer mapping, for instance in order to keep sensitive computation on dedicated nodes.
All these assumptions can be expressed as optimization objectives, or constraints in the discrete optimization problem of
3) Limits of Experimental Results
As described in
Completely exploring the influence of other parameters on the DNN partitioning performance and complexity requires additional experiments, following the pattern and methodology of those presented in this paper. Such parameters include the number of compute nodes and split points in large networks, the complexity of DNN structures, the heterogeneity of layer consumptions and data transfer sizes, and the heterogeneity of link bandwidths and processing rates in large networks. The same applies to a more thorough understanding of how the system behaves in cases where (i) other processes are dynamically allocated to compute nodes, (ii) links are dynamically used by other processes, or (iii) faults occur on the system (e.g., node failure, routing change, packet drops).
SECTION VIII.
Conclusion
Deploying DL applications in the cloud is convenient because it allows on-demand easy access to computing resources. However, latency or privacy sensitive applications may not be able to exchange data and models with the cloud, while still requiring the same inference throughput to run with good performance. In such cases, DNN partitioning can offer a complementary alternative to hardware acceleration and model optimization to increase inference throughput. This paper has described such an approach to DNN partitioning, which extends previous works by allowing for multiple split points and multiple threads, and shown to achieve higher inference throughputs than single split point DNN partitioning.
In this context, this paper has quantified the limitations of inference and transmission latency prediction in edge environments. With these assumptions, the DNN partitioning problem is defined as an optimization process, with the objective of maximizing the overall inference throughput. This paper has then introduced a branch and bound algorithm to find optimal DNN partitionings, with a theoretical analysis of its complexity and achieved inference throughput results.
This analysis has led to the identification of the partitioning improvement boundary, a deterministic bound on the network and DNN properties under which a performance improvement can be achieved by partitioning, as well as the cost to compute such solutions, and their expected performance, in a homogeneous network context. This result is essential in understanding DNN partitioning because it defines the scope of validity of this approach, and only depends on (i) the DNN data transfer size to layer consumption ratio, and (ii) the link throughput to processing rate ratio of the underlying network.
Inference throughput accelerations and defined theoretical boundaries are evaluated through experimental set-ups under varying network conditions. The experimental results also illustrate the behavior of DNN partitioning under heterogeneous network conditions, highlighting the use-case of incrementally adding processing capacity to accelerate inference throughput. These results enable sizing of both DNN and underlying network properties to achieve inference throughput improvements, even prior to the deployment, with deterministic conditions on the necessary link throughputs to enable a maximal inference throughput acceleration through partitioning.
REFERENCES
1. S. Mochizuki, K. Matsubara, K. Matsumoto, C. L. P. Nguyen, T. Shibayama, K. Iwata, K. Mizumoto, T. Irita, H. Hara, and T. Hattori, “A 197 mW 70 ms-latency full-HD 12-channel video-processing SoC in 16 nm CMOS for in-vehicle information systems,” IEICE Trans. Fundam. Electron., Commun. Comput. Sci., vol. E100-A, no. 12, pp. 2878–2887, 2017.
2. Y.-L. Lee, P.-K. Tsung, and M. Wu, “Technology trend of edge AI,” in Proc. Int. Symp. Design, Autom. Test (VLSI-DAT), Apr. 2018, pp. 1–2.
3. Z. Duan, Z. Yang, R. Samoilenko, D. S. Oza, A. Jagadeesan, M. Sun, H. Ye, Z. Xiong, G. Zussman, and Z. Kostic, “Smart city traffic intersection: Impact of video quality and scene complexity on precision and inference,” in Proc. IEEE 23rd Int. Conf. High Perform. Comput. Commun., 7th Int. Conf. Data Sci. Syst., 19th Int. Conf. Smart City, 7th Int. Conf. Dependability Sensor, Cloud Big Data Syst. Appl. (HPCC/DSS/SmartCity/DependSys), Dec. 2021, pp. 1521–1528.
4. S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and connections for efficient neural networks,” in Proc. Adv. Neural Inf. Process. Syst., Dec. 2015, pp. 1135–1143.
5. Y. Gong, L. Liu, M. Yang, and L. Bourdev, “Compressing deep convolutional networks using vector quantization,” CoRR, vol. abs/1412.6115, pp. 1–10, Dec. 2014.
6. G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” 2015, arXiv:1503.02531.
7. X. Wang, Y. Han, V. C. M. Leung, D. Niyato, X. Yan, and X. Chen, “Convergence of edge computing and deep learning: A comprehensive survey,” 2019, arXiv:1907.08349.
8. Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” ACM SIGARCH Comput. Archit. News, vol. 45, no. 1, pp. 615–629, Apr. 2017.
9. X. Chen, J. Zhang, B. Lin, Z. Chen, K. Wolter, and G. Min, “Energy-efficient offloading for DNN-based smart IoT systems in cloud-edge environments,” IEEE Trans. Parallel Distrib. Syst., vol. 33, no. 3, pp. 683–697, Mar. 2022.
10. M. Xue, H. Wu, G. Peng, and K. Wolter, “DDPQN: An efficient DNN offloading strategy in local-edge-cloud collaborative environments,” IEEE Trans. Serv. Comput., vol. 15, no. 2, pp. 640–655, Mar./Apr. 2022.
11. T. Mohammed, C. Joe-Wong, R. Babbar, and M. D. Francesco, “Distributed inference acceleration with adaptive DNN partitioning and offloading,” in Proc. IEEE Conf. Comput. Commun. (IEEE INFOCOM), Jul. 2020, pp. 854–863.
12. C. Hu, W. Bao, D. Wang, and F. Liu, “Dynamic adaptive DNN surgery for inference acceleration on the edge,” in Proc. IEEE Conf. Comput. Commun. (IEEE INFOCOM), Apr./May 2019, pp. 1423–1431.
13. J. Li, W. Liang, Y. Li, Z. Xu, X. Jia, and S. Guo, “Throughput maximization of delay-aware DNN inference in edge computing by exploring DNN model partitioning and inference parallelism,” IEEE Trans. Mobile Comput., vol. 22, no. 5, pp. 3017–3030, May 2021.
14. E. Li, Z. Zhou, and X. Chen, “Edge intelligence: On-demand deep learning model co-inference with device-edge synergy,” in Proc. Workshop Mobile Edge Commun., Aug. 2018, pp. 31–36.
15. H.-J. Jeong, H.-J. Lee, C. H. Shin, and S.-M. Moon, “IONN: Incremental offloading of neural network computations from mobile devices to edge servers,” in Proc. ACM Symp. Cloud Comput., Oct. 2018, pp. 401–411.
16. A. G. Ivakhnenko and V. G. Lapa, Cybernetics and Forecasting Techniques, vol. 8. New York, NY, USA: Elsevier, 1967.
17. S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 4th ed. Hoboken, NJ, USA: Prentice-Hall, 2020.
18. L. L. Zhang, S. Han, J. Wei, N. Zheng, T. Cao, Y. Yang, and Y. Liu, “nn-Meter: Towards accurate latency prediction of deep-learning model inference on diverse edge devices,” in Proc. 19th Annu. Int. Conf. Mobile Syst., Appl., Services, Jun. 2021, pp. 81–93.
19. L. Dudziak, T. Chau, M. S. Abdelfattah, R. Lee, H. Kim, and N. D. Lane, “BRP-NAS: Prediction-based NAS using GCNs,” in Proc. Adv. Neural Inf. Process. Syst., Dec. 2020, pp. 10480–10490.
20. J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” CoRR, vol. abs/1612.08242, pp. 1–9, Dec. 2016.
21. J. H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis With Applications to Biology, Control, and Artificial Intelligence. Cambridge, MA, USA: MIT Press, 1992.
22. R. Eberhart and J. Kennedy, “Particle swarm optimization,” in Proc. Int. Conf. Neural Netw., vol. 4, Nov./Dec. 1995, pp. 1942–1948.
23. A. H. Land and A. G. Doig, “An automatic method of solving discrete programming problems,” Econometrica, vol. 28, no. 3, pp. 497–520, Jul. 1960. [Online]. Available: https://www.jstor.org/stable/1910129
This paper is available on ieee.org under a CC BY 4.0 Deed (Attribution 4.0 International) license.
