What Is Edge AI?
Engineering Manager/Lead, Edge Platform Team,
Edge AI—also referred to as on-device AI—commonly refers to the components required to run an AI algorithm locally on a hardware device.
Of late, it means running deep learning algorithms on-device, and most articles tend to focus on only one component, i.e. inference.
This series of articles will shed some light on the other components and challenges of Edge AI.
This series is divided as follows:
- Part 1: Why Run AI Algorithms on Edge
- Part 2: Edge AI Components - Sensor Data Capture
- Part 3: Edge AI Components - Pre-processing
- Part 4: Edge AI Components - Inference
- Part 5: Edge AI Components - Performance Evaluation
- Part 6: Edge AI Components - Real Time Metrics
- Part 7: Edge AI Components - Scheduling & System Architecture
- Part 8: Edge AI Components - Bridging the gap between Edge AI & Cloud AI
Edge devices are very diverse in their cost/capabilities. To make the discussion more concrete, here’s the experimental setup used in this series:
* Qualcomm Snapdragon 855 Development Kit.
* Object detection as the deep learning model to be run on an edge device. There are a lot of good articles describing the state of the art in object detection [survey paper]. We will use the Mobilenet SSD model for object detection in this series.
* tensorflowjs, to quickly run the object detection model in a nodejs environment.
Why run AI algorithms on Edge
Why can’t we rely on the cloud to run AI algorithms? After all, scaling resources to run an AI/deep learning model to match your performance needs is easier on the cloud. So why should one worry about running them on an edge device with compute and power constraints? To answer this question, let’s consider two scenarios:
a) Cloud based architecture, where inference happens on cloud.
b) Edge based architecture, where inference happens locally on device.
(To keep the comparison as fair as possible, in both cases a nodejs webserver along with tensorflowjs (CPU only) is used. The only difference is that in case a) the webserver runs on an EC2 instance and in case b) the webserver runs locally on an edge device. The goal here is NOT to have an optimized implementation for a platform (cloud or edge) but rather to have a framework for a fair comparison.)
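As a rough illustration of the "same server, different host" idea, here is a minimal sketch of a `/detect` endpoint. It is not the server used for the measurements in this article: it uses Node's built-in `http` module instead of express, expects the request body to be a plain base64 string rather than a multipart upload, and takes a hypothetical `detect` function in place of the real model call. The 224x224 size and port 3000 in the last comment are assumptions as well.

```javascript
const http = require('http');

// Decodes a base64 string into raw RGB bytes and checks it matches width x height x 3.
function decodeRgbImage(base64Text, width, height) {
  const pixels = Buffer.from(base64Text, 'base64');
  if (pixels.length !== width * height * 3) {
    throw new Error(`expected ${width * height * 3} bytes, got ${pixels.length}`);
  }
  return pixels;
}

// Builds (but does not start) an HTTP server exposing POST /detect.
// detect is a hypothetical async function standing in for the model call.
function createDetectServer(detect, width, height) {
  return http.createServer((req, res) => {
    if (req.method !== 'POST' || req.url !== '/detect') {
      res.writeHead(404);
      res.end();
      return;
    }
    const chunks = [];
    req.on('data', (chunk) => chunks.push(chunk));
    req.on('end', async () => {
      try {
        const pixels = decodeRgbImage(Buffer.concat(chunks).toString('utf8'), width, height);
        const detections = await detect(pixels, width, height);
        res.writeHead(200, { 'Content-Type': 'application/json' });
        res.end(JSON.stringify(detections));
      } catch (err) {
        // Any failure is reported as a 400 to keep the sketch short.
        res.writeHead(400, { 'Content-Type': 'application/json' });
        res.end(JSON.stringify({ error: err.message }));
      }
    });
  });
}

// The same file can run on the cloud instance or on the device; only the host differs.
// createDetectServer(async () => [], 224, 224).listen(3000);
```

Keeping the server code identical in both scenarios means any latency difference comes from the network path and the hardware, not from the software stack.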
Cloud based architecture
Here’s what a cloud based setup would look like; it involves the steps detailed below:
Cloud only Architecture for Inference. (image references at end).
Step 1: Request with input image
There are two possible options here:
* We can send the raw image (RGB or YUV) from the edge device as it’s captured from a camera. Raw images are larger than their compressed counterparts and take longer to send to the cloud.
* We can encode the raw image to JPEG/PNG or some other compressed format before sending, and decode it back to a raw image on the cloud before running inference. This approach involves an additional step to decode the compressed image, as most deep learning models are trained with raw images. We will cover some more ground on different raw image formats in future articles in this series.
To keep the setup simple, the first approach [RGB image] is used. HTTP is used as the communication protocol to POST an image to a REST endpoint (http://<ip-address>:<port>/detect).
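To get a feel for the payload sizes involved, here is a small sketch of the arithmetic for a raw RGB frame and its base64 expansion. The 224x224 resolution is an assumption picked for illustration (the input size is not stated here), and the helper names are hypothetical.

```javascript
// Size in bytes of a raw, uncompressed RGB frame: 3 bytes (R, G, B) per pixel.
function rawRgbBytes(width, height) {
  return width * height * 3;
}

// Base64 emits 4 output characters for every 3 input bytes (rounded up with padding).
function base64Length(byteCount) {
  return 4 * Math.ceil(byteCount / 3);
}

// Assumed 224x224 input, for illustration only.
const rawBytes = rawRgbBytes(224, 224);
const encodedChars = base64Length(rawBytes);
console.log(`raw: ${rawBytes} bytes, base64: ${encodedChars} characters`);
```

Base64 adds roughly a third to the bytes on the wire, so it is worth keeping in mind when reading the upload side of the latency numbers.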
Step 2: Run inference on cloud
* tensorflowjs is used to run inference on an EC2 (t2.micro) instance; only a single nodejs worker instance (no load balancing, no failover, etc.) is used.
* Mobilenet version used is hosted here
* Apache Bench (ab) is used to collect latency numbers for HTTP requests. In order to use ab, the RGB image is base64 encoded and POSTed to an endpoint. express-fileupload is used to handle the POSTed image.
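Since ab sends the file given with `-p` verbatim as the request body, that file has to contain a complete multipart body whose boundary matches the one in the `-T` header. The following is a minimal sketch of how such a file could be generated; the field name `image` and file name `frame.rgb` are hypothetical and would need to match whatever the upload handler on the server expects.

```javascript
// Builds a multipart/form-data body containing one base64-encoded file part.
// The field name and file name are hypothetical; they must match whatever
// the server-side upload handler expects.
function buildMultipartBody(boundary, fieldName, fileName, base64Data) {
  return [
    `--${boundary}`,
    `Content-Disposition: form-data; name="${fieldName}"; filename="${fileName}"`,
    'Content-Type: text/plain',
    '',
    base64Data,
    `--${boundary}--`,
    '',
  ].join('\r\n');
}

// Example with a tiny stand-in buffer instead of a real camera frame.
const fakeFrame = Buffer.alloc(12, 0x7f); // 2x2 RGB pixels
const body = buildMultipartBody('1234567890', 'image', 'frame.rgb', fakeFrame.toString('base64'));

// To produce the file that ab reads via -p:
// require('fs').writeFileSync('post_data.txt', body);
```

Each delimiter line is the boundary prefixed with two dashes, and the closing delimiter carries two more dashes at the end; ab will not add any of this for you.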
Total latency (RGB) = HTTP request + inference time + HTTP response
ab -k -c 1 -n 250 -g out_aws.tsv -p post_data.txt -T "multipart/form-data; boundary=1234567890" http://<ip-address>:<port>/detect
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking <ip-address> (be patient)
Completed 100 requests
Completed 200 requests
Finished 250 requests
Server Hostname:        <ip-address>
Server Port:            <port>
Document Path:          /detect
Document Length:        22610 bytes
Concurrency Level:      1
Time taken for tests:   170.875 seconds
Complete requests:      250
Failed requests:        0
Keep-Alive requests:    250
Total transferred:      5705000 bytes
Total body sent:        50267500
HTML transferred:       5652500 bytes
Requests per second:    1.46 [#/sec] (mean)
Time per request:       683.499 [ms] (mean)
Time per request:       683.499 [ms] (mean, across all concurrent requests)
Transfer rate:          32.60 [Kbytes/sec] received
                        287.28 kb/s sent
                        319.89 kb/s total
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   5.0      0      79
Processing:   530  683 258.2    606    2751
Waiting:      437  513 212.9    448    2512
Total:        530  683 260.7    606    2771
Percentage of the requests served within a certain time (ms)
100% 2771 (longest request)
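The total latency that ab reports lumps together the HTTP request, the inference, and the HTTP response. One way to separate out the inference share is to time the model call on the server and subtract it from the end-to-end number. Below is a minimal sketch of such a timer; `fakeInference` is a hypothetical stand-in that simply sleeps, not the real model call.

```javascript
// Runs fn (sync or async) and reports how long it took in milliseconds.
async function timed(fn) {
  const start = process.hrtime.bigint();
  const result = await fn();
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  return { result, elapsedMs };
}

// Hypothetical stand-in for the model call; a real handler would invoke the
// detector here instead of sleeping.
function fakeInference() {
  return new Promise((resolve) => setTimeout(() => resolve([]), 20));
}

timed(fakeInference).then(({ elapsedMs }) => {
  console.log(`inference took ${elapsedMs.toFixed(1)} ms`);
});
```

Whatever remains after subtracting the inference time is an estimate of the network and HTTP handling overhead for that request.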
Histogram of end to end Inference Latencies for Cloud based architecture (bucket size of 1s). It shows the inference latencies for requests generated by Apache Bench (ab) in a given second.
End to End Inference Latencies for Cloud based architecture sorted by response time (ms). This article explains the difference between the two plots.
As we can see here, the 95th percentile request latency is around 1084 ms.
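Both views of the data, and the percentile itself, can be derived from the tab-separated file that ab writes with `-g`. The sketch below assumes the usual column layout of that file (`starttime`, `seconds`, `ctime`, `dtime`, `ttime`, `wait`) and uses the nearest-rank definition of a percentile, which may differ slightly from how other tools interpolate. The sample rows are synthetic and are not the measurements from this article.

```javascript
// Parses the tab-separated text written by `ab -g` into row objects.
// Assumes the usual header: starttime, seconds, ctime, dtime, ttime, wait.
function parseAbGnuplot(tsvText) {
  const [headerLine, ...rowLines] = tsvText.trim().split('\n');
  const columns = headerLine.split('\t');
  return rowLines.map((line) => {
    const cells = line.split('\t');
    const row = {};
    columns.forEach((name, i) => {
      // starttime is a human-readable date string; every other column is numeric.
      row[name] = name === 'starttime' ? cells[i] : Number(cells[i]);
    });
    return row;
  });
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Groups total request times (ttime, in ms) by the second in which each request started.
function bucketBySecond(rows) {
  const buckets = new Map();
  for (const row of rows) {
    if (!buckets.has(row.seconds)) buckets.set(row.seconds, []);
    buckets.get(row.seconds).push(row.ttime);
  }
  return buckets;
}

// Synthetic rows for illustration only; these are not the article's measurements.
const sample = [
  'starttime\tseconds\tctime\tdtime\tttime\twait',
  'Mon Jan 01 00:00:00 2024\t1704067200\t0\t530\t530\t437',
  'Mon Jan 01 00:00:00 2024\t1704067200\t0\t606\t606\t448',
  'Mon Jan 01 00:00:01 2024\t1704067201\t0\t700\t700\t500',
  'Mon Jan 01 00:00:02 2024\t1704067202\t1\t899\t900\t650',
].join('\n');

const rows = parseAbGnuplot(sample);
const p95 = percentile(rows.map((r) => r.ttime), 95);
const perSecond = bucketBySecond(rows);
```

Bucketing by `seconds` gives the time-ordered view, while sorting `ttime` gives the distribution view from which percentiles are read.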
Edge based architecture
The web server (which runs tensorflowjs) is running locally on an edge device (Qualcomm Snapdragon 855 Development Kit [4]). We repeat the same steps using Apache Bench (with HTTP requests to localhost this time instead of a remote server) and the results are as follows.
Histogram of end to end Inference Latencies for Edge based architecture (bucket size of 1s). It shows the inference latencies for requests generated by Apache Bench (ab) in a given second.
End to End Inference Latencies for Edge based architecture sorted by response time (ms). This article explains the difference between the two plots.
As we can see here, the 95th percentile request latency is around 357 ms.
As you can see, the latency numbers are fairly high. The numbers we obtained here are closer to upper bound latencies, and there are many optimization opportunities, some of which are detailed below:
Cloud based architecture:
* Have multiple nodejs worker instances and load balance between them.
* Have multiple deployments (us-east, us-west, etc.) and route the request to the closest deployment.
* Batch multiple input images and run batched inference on cloud.
* Use a different communication protocol like MQTT, which is geared more towards IoT / cloud connectivity, to avoid the overheads of HTTP.
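To illustrate the batching idea from the list above, here is a minimal sketch of a micro-batcher that groups incoming inputs before handing them to the model. `runBatch` is a hypothetical stand-in for a batched model call, and the dummy "model" in the usage part only reports each input's length.

```javascript
// Collects inputs and hands them to runBatch in groups of up to maxBatchSize.
// runBatch is a hypothetical stand-in for a batched model call: it takes an
// array of inputs and returns an array of results in the same order.
class MicroBatcher {
  constructor(runBatch, maxBatchSize) {
    this.runBatch = runBatch;
    this.maxBatchSize = maxBatchSize;
    this.pending = [];
  }

  // Queues one input; onResult is called with that input's result once its batch runs.
  add(input, onResult) {
    this.pending.push({ input, onResult });
    if (this.pending.length >= this.maxBatchSize) this.flush();
  }

  // Runs whatever is queued, e.g. from a timer so a partial batch never waits forever.
  flush() {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    const results = this.runBatch(batch.map((item) => item.input));
    batch.forEach((item, i) => item.onResult(results[i]));
  }
}

// Usage with a dummy "model" that just reports each input's length.
const batchSizes = [];
const batcher = new MicroBatcher((inputs) => {
  batchSizes.push(inputs.length);
  return inputs.map((input) => input.length);
}, 2);

const outputs = [];
batcher.add('aa', (r) => outputs.push(r));
batcher.add('bbbb', (r) => outputs.push(r)); // second add fills the batch and triggers a run
batcher.add('c', (r) => outputs.push(r));
batcher.flush(); // flush the leftover partial batch
```

Batching generally trades a little per-request latency (waiting for the batch to fill) for better throughput, so the batch size and flush interval are worth tuning against your latency budget.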
Edge based architecture:
* Have an optimized implementation for your Edge device. In this case, for the Qualcomm Snapdragon 855 Development Kit [4], inference would be accelerated on the GPU / DSP or its NPU.
* Most likely, the implementation on device would depend on native libraries through vendor frameworks like SNPE.
* Optimize the data path consisting of image capture from camera to feeding the deep learning models to run inference.
We looked in detail at one of the factors in deciding whether you need Edge based solutions: latency. As we saw, if your application is tolerant to cloud latencies, then cloud based inference would be the quickest way to get going. However, if your application is latency sensitive, then you can consider Edge based solutions. Be sure to benchmark your particular use case before picking one over the other. In addition to latency, these are some of the other reasons to consider Edge based solutions:
* You already have an existing deployment of Edge devices and want to leverage it to save on cloud compute costs.
* Privacy: you don’t want data to ever leave an edge device.
* Your devices are not fully connected or have poor connectivity to the cloud, in which case Edge based solutions become inevitable.
Future articles in the series will cover different components involved in an Edge based solution. Stay tuned!