Apparate: Early-Exit Models for ML Latency and Throughput Optimization - Implementation by @textmodels



Too Long; Didn't Read

Apparate is implemented as a layer atop TensorFlow-Serving [39] and Clockwork [22]. Original models are ingested in the ONNX format [6] and compiled for performance. Ramp training (during bootstrapping) uses the first 10% of each dataset, following a 1:9 split for training and validation.

Authors:

(1) Yinwei Dai, Princeton University (Equal contributions);

(2) Rui Pan, Princeton University (Equal contributions);

(3) Anand Iyer, Georgia Institute of Technology;

(4) Ravi Netravali, Georgia Institute of Technology.

Abstract and 1 Introduction

2 Background and Motivation and 2.1 Model Serving Platforms

2.2 Early-Exit Models

2.3 Challenges

3 Design

3.1 Preparing Models with Early Exits

3.2 Accuracy-Aware Threshold Tuning

3.3 Latency-Focused Ramp Adjustments

4 Implementation

5 Evaluation and 5.1 Methodology

5.2 Overall Results

5.3 Comparison with Existing EE Strategies

5.4 Microbenchmarks

6 Additional Related Work

7 Conclusion, References, Appendix

4 IMPLEMENTATION

Apparate is implemented as a layer atop TensorFlow-Serving [39] and Clockwork [22] (using PyTorch [7]) and includes the components described in §3, written as Python modules in ∼7,500 lines of code. Although we chose these platforms for our current implementation, we note that Apparate is not limited to them, and its techniques can be implemented in any inference platform. Importantly, Apparate entirely leverages the scheduling and queuing mechanisms of the underlying framework. Original models are ingested in the ONNX format [6] and compiled for performance. Ramp training (during bootstrapping) uses the first 10% of each dataset, following a 1:9 split for training and validation; the remaining 90% of each dataset is used for evaluation.
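The data-partitioning scheme above can be sketched as follows. This is an illustrative reconstruction, not code from Apparate; the function and variable names are hypothetical, and the snippet assumes samples are ordered as in the original dataset.

```python
def split_dataset(samples):
    """Partition a dataset per the bootstrapping scheme described above:
    the first 10% of samples is reserved for ramp training, itself split
    1:9 into training and validation portions; the remaining 90% of the
    dataset is held out for evaluation. Names here are illustrative."""
    n = len(samples)
    bootstrap = samples[: n // 10]                 # first 10% of the dataset
    cut = len(bootstrap) // 10                     # 1:9 split point
    train = bootstrap[:cut]                        # 1 part for ramp training
    validation = bootstrap[cut:]                   # 9 parts for validation
    evaluation = samples[n // 10 :]                # remaining 90% for evaluation
    return train, validation, evaluation
```

For example, on a 1,000-sample dataset this yields 10 training, 90 validation, and 900 evaluation samples.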


This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.