The main goal is to detect faces and masks in the browser instead of relying on a Python implementation at the back end. The application is a simple single-page web app that contains only JS code and can send data to a backend for further processing, but the initial face and mask detection is done entirely on the browser side; no Python backend is needed.
At the moment, the app works only in Google Chrome.
In future articles, I will describe more technical details and the implementation of our investigation results.
There are two approaches to implementing this in the browser:
Both runtimes support WASM, WebGL, and CPU backends, but we will compare only WASM and WebGL, because CPU performance is too low for production use.
View the demo here
On its official website, TensorFlow.js offers pre-trained, ready-to-use models that include the appropriate JS post-processing. For real-time face detection, the recommended option is the BlazeFace model, for which an online demo is available.
More info about BlazeFace can be found here.
We created a demo to experiment with the runtime and model and identify any existing issues. The links to run the app with the different backends are below:
WASM (face detection image size: 160x120px; mask detection image size: 64x64px)
WebGL (face detection image size: 160x120px; mask detection image size: 64x64px)
We can grab frames via the appropriate HTML APIs, but processing the frames takes time as well, so we need to understand how much time such activities consume. The timing metrics are below.
Our goal is to detect faces as fast as possible, so we should grab and process every frame of the live stream. For this, we can use requestAnimationFrame, whose callback fires once per frame (about every 16.6 ms at 60 FPS).
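A minimal sketch of such a per-frame loop (the processFrame callback and the injectable raf scheduler are hypothetical names, chosen so the loop can be exercised outside a browser):

```javascript
// Minimal per-frame processing loop built on requestAnimationFrame.
// The scheduler is injectable; in the app it would be
// window.requestAnimationFrame.bind(window).
function createFrameLoop(processFrame, raf) {
  let running = false;

  function tick(timestamp) {
    if (!running) return;
    processFrame(timestamp); // e.g. grab the current frame and run detection
    raf(tick);               // schedule the next frame (~16.6 ms at 60 FPS)
  }

  return {
    start() { running = true; raf(tick); },
    stop() { running = false; },
  };
}
```

In the real app, processFrame would grab the current video frame and feed it to BlazeFace; stop() halts the loop, e.g. when the tab loses focus.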
The grabFrame() method of the ImageCapture API takes a snapshot of the live video in a MediaStreamTrack and returns a promise that resolves with an ImageBitmap containing the snapshot.
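A sketch of grabbing a snapshot this way (grabSnapshot and pickVideoTrack are illustrative names; ImageCapture is currently Chrome-only, so the helper returns null where it is unavailable):

```javascript
// Pick the first video track from a list of MediaStreamTracks.
function pickVideoTrack(tracks) {
  return tracks.find((t) => t.kind === 'video') || null;
}

// Take a snapshot of the live video; resolves with an ImageBitmap,
// or null when no video track / ImageCapture support is available.
async function grabSnapshot(stream) {
  const track = pickVideoTrack(stream.getTracks ? stream.getTracks() : []);
  if (!track || typeof ImageCapture === 'undefined') return null;
  const capture = new ImageCapture(track);
  return capture.grabFrame(); // promise resolving with an ImageBitmap
}
```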
Single mode shows the maximum performance of grabbing a frame alone; sync mode shows how often we can grab a frame together with face detection.
Color scheme: ≤ 6 fps red, 7-12 fps orange, 13-18 fps yellow, 19+ fps green
Results:
We excluded timing metrics from the application start: during startup and the first runs of the model, the app obviously consumes more resources and time. Performance metrics are only worth collecting once the app is warm. In our case, warm mode simply means letting the app work for 5-10 seconds before gathering metrics.
The gathered timing metrics may be inaccurate by up to 50 ms.
The BlazeFace model was developed specifically for mobile devices and achieves good performance with TFLite inference on the Android and iOS platforms (~50-200 FPS).
More info is here.
Meanwhile, the dataset for retraining the model from scratch is not available (the Google Research team did not share it).
So, there are two model types:
This means that preparing a detector model with three classes (a clear face, a face with a mask, and background) can be a time-consuming task.
The original image can be of any size, depending on the camera settings and business needs, but when we process the frames for face and mask detection, each frame is resized to the input size required by the model.
BlazeFace expects a 128x128 px input image for face detection; the original frame is resized to this size while preserving its proportions. Mask detection uses a 64x64 px image.
We chose minimal resolutions for both images based on performance requirements and test results: these minimal images showed the best performance on PCs and mobile devices. We use 64x64 px thumbnails for mask detection because 32x32 px is not enough for sufficient accuracy.
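The proportional resize can be sketched as a pure size computation (the function name is illustrative; the actual resize would be done via a canvas or tf.image ops):

```javascript
// Compute the target dimensions for a model input (e.g. 128 for BlazeFace,
// 64 for mask detection), preserving the original frame's aspect ratio.
function fitToModelInput(width, height, target) {
  const scale = target / Math.max(width, height);
  return {
    width: Math.round(width * scale),
    height: Math.round(height * scale),
  };
}
```

For a 640x480 camera frame and a 128 px BlazeFace input this yields 128x96; the remaining area can then be padded to a square before inference.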
With TensorFlow.js, we have the following options for selecting the best images for later use by the application:
Such checks can be the following:
The images selected according to the rules above are sent to the next stage of the pipeline: mask detection. This logic provides faster face detection results up to the moment when we are sure the detected face is of good quality.
For mask detection, we use trained MobileNetV2 and MobileNetV3 models with different types and multipliers.
We prefer light or ultra-light models (< 3 MB) with TensorFlow.js in the browser. The main reason is that the WASM backend is faster with such models, as stated in the official documentation and confirmed by our performance tests.
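Backend selection looks roughly like this (the tf namespace is passed in here only to make the sketch testable; in the app it is the global TensorFlow.js object with the WASM backend script loaded):

```javascript
// Try the WASM backend first; fall back to WebGL if WASM is unavailable.
// tf.setBackend resolves to true when the backend initialized successfully.
async function initBackend(tf, preferred = 'wasm') {
  const ok = await tf.setBackend(preferred);
  if (!ok) await tf.setBackend('webgl');
  return tf.getBackend();
}
```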
For the web app, the time to interactive (TTI) with ~3.5 MB of JS plus binary and JSON assets of 1.5-6 MB (everything served from the internal network) will be over 10 s in cold mode; in warm mode, the expected TTI is 4-5 s.
If web workers are used, and OpenCV.js is loaded only in the workers, the size of the main app shrinks significantly to 800-900 KB of JS; TTI will be 7-8 s in cold mode and under 5 s in warm mode.
Possible approaches to run neural network models in browser:
Single-thread implementation:
This is the default implementation for browsers. We run both the face and mask detection models in one thread. The crucial point is to achieve good performance with both models running without issues in a single thread. This is the approach used to collect the performance metrics above.
This approach has limitations: if we add more models to the pipeline, they will run in the same thread asynchronously but in a sequential flow, which will decrease the overall frame-processing performance.
Using web worker(s) to run models in different contexts and gain parallelism in the browser:
The main JS thread runs the BlazeFace model for face detection, while mask detection runs in a separate thread via a web worker. This implementation separates the two models and introduces parallel processing in the browser, which positively affects the general UX of the application. The web workers load the TF.js and OpenCV libraries; the main JS thread loads TF.js only.
This means the main app starts much faster, significantly reducing TTI in the browser. Face detection runs more often, increasing its FPS; as a result, mask detection also runs more often and its FPS increases too. The expected improvement is up to ~20%, meaning the FPS and millisecond figures given above can improve by that amount.
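The worker side of this split can be sketched as a small message handler (the message shape, handleFrameMessage, and runMaskModel are assumptions for illustration, not the app's actual protocol):

```javascript
// Worker-side handler: run mask detection on a frame message and
// build the reply for the main thread.
function handleFrameMessage(msg, detectMask) {
  if (msg.type !== 'frame') return null; // ignore unrelated messages
  return {
    type: 'mask-result',
    frameId: msg.frameId,
    hasMask: detectMask(msg.pixels),
  };
}

// In the actual worker script (browser only). runMaskModel stands in
// for the hypothetical TF.js inference call.
if (typeof self !== 'undefined' && typeof importScripts === 'function') {
  self.onmessage = (e) => {
    const reply = handleFrameMessage(e.data, (pixels) => runMaskModel(pixels));
    if (reply) self.postMessage(reply);
  };
}
```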
With this approach of running different models in separate contexts via web workers, we can run more neural network models in the browser with good performance. The main limiting factor will be the hardware characteristics of the all-in-one device supporting such a load.
We have implemented this approach in the app, and it works. However, we hit a technical issue with the postMessage callback when a web worker sends a message back to the main thread: for some reason it introduces an additional delay (up to 200 ms on mobile devices) that kills the performance improvement achieved through parallelism. (This issue applies only to the vanilla JS implementation; after reimplementing in React, the issue was gone.)
Our idea is to use web workers (WW) to run every model in a separate worker/thread/context and achieve model-processing parallelism in the browser. However, we noticed that the WW callback function is invoked with some delay. Below are our measurements and conclusions about which factors the delay depends on.
Test params:
mobileNetVersion = V3
mobileNetVersionMultiplier = 0.75
mobileNetVersionType = float16
thumbnailSize = 32px
backend = wasm
In the app, we run BlazeFace in the main thread and the mask detection model in a web worker.
Here are some measurements for WW processing time on different devices:
The results above demonstrate that, for some reason, the time to deliver a callback from a web worker to the main thread depends on whether a model is running or the TensorFlow.js method browser.fromPixels is being used in that web worker.
When a model is running, the callback on macOS takes ~27 ms to arrive; when no model is running, it takes ~5 ms. This 22 ms difference on macOS can translate into 100-300 ms delays on weaker devices and will affect overall app performance when web workers are used.
We currently don't understand why this happens, but we know for a fact that it does.
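The delay itself can be measured by stamping each message with performance.now() on the sending side (the message shape and helper names here are assumptions):

```javascript
// Sending side: attach a send timestamp to each message.
function stampMessage(payload, now = () => performance.now()) {
  return { ...payload, sentAt: now() };
}

// Receiving side: how long the message spent in transit
// (plus any worker compute time not subtracted out).
function transitDelay(message, now = () => performance.now()) {
  return now() - message.sentAt;
}
```

The now parameter is injectable so the measurement can be tested with a fake clock; in the app both sides would use the default performance.now().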
We need more "context" to make the right decision about reporting face detection and mask removal. This context is the previously processed frames and their states (face present or not, mask present or not). We implemented a "flying" segment to manage this additional context. The length of the segment is configurable and depends on the current FPS, but in any case it should not exceed 200-300 ms, to avoid introducing visible delays.
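Such a segment can be sketched as a time-bounded sliding window with a simple majority vote (class and field names are illustrative, not the app's actual implementation):

```javascript
// Sliding window ("flying" segment) of recent frame states. Entries older
// than maxAgeMs (200-300 ms per the text above) are evicted on each push.
class FlyingSegment {
  constructor(maxAgeMs = 250) {
    this.maxAgeMs = maxAgeMs;
    this.entries = []; // { t, hasFace, hasMask }
  }

  push(t, hasFace, hasMask) {
    this.entries.push({ t, hasFace, hasMask });
    this.entries = this.entries.filter((e) => t - e.t <= this.maxAgeMs);
  }

  // Majority vote over the current window.
  decide() {
    const n = this.entries.length;
    if (n === 0) return { face: false, mask: false };
    const faces = this.entries.filter((e) => e.hasFace).length;
    const masks = this.entries.filter((e) => e.hasMask).length;
    return { face: faces > n / 2, mask: masks > n / 2 };
  }
}
```

The vote smooths out single-frame detection flickers while the time bound keeps the reported state from lagging the live video.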
Below is a scheme that describes the idea:
As the metrics demonstrate, the stronger the device, the better the performance results.
For PCs, we obtained the following metrics:
For mobile devices, we obtained the following metrics: