In the previous article, I discussed whether it is possible to use machine learning (in particular, face and mask detection) in the browser, the approaches to detection, and how to optimize the whole pipeline.
Today I want to go through the technical details of the implementation.
The primary language for development is TypeScript. The client application is written in React.js.
The application uses several neural networks to detect different events: face detection and mask detection. Each model runs in a separate thread (a Web Worker). The neural networks are run with TensorFlow.js, using WebAssembly or WebGL as the backend, which allows the code to execute at close to native speed. The choice of backend depends on the size of the model (small models run faster on WebAssembly), but you should always test and pick whichever is faster for your particular model.
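As a rough sketch, the backend can be chosen through the public TensorFlow.js API (the pickBackend helper and the WebAssembly-first preference are illustrations only; the wasm backend additionally requires the @tensorflow/tfjs-backend-wasm package):

import * as tf from '@tensorflow/tfjs';
// Registers the 'wasm' backend; 'webgl' ships with the core package.
import '@tensorflow/tfjs-backend-wasm';

// Prefer WebAssembly and fall back to WebGL if it cannot be initialized.
// Always benchmark with your own model: the faster backend depends on its size.
async function pickBackend(): Promise<string> {
  if (await tf.setBackend('wasm')) {
    await tf.ready();
    return tf.getBackend(); // 'wasm'
  }
  await tf.setBackend('webgl');
  await tf.ready();
  return tf.getBackend(); // 'webgl'
}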
The video stream is received and displayed via WebRTC, and the OpenCV.js library is used for image processing.
The following approach was implemented:
The main thread only orchestrates all the processes. It does not load the heavy OpenCV library and does not use TensorFlow.js; it grabs images from the video stream and sends them to the web workers for processing.
A new image is not sent to a worker until the worker informs the main thread that it is free and can process the next image. This way no queue builds up, and we always process the most recent image.
First, the image is sent for face detection; only if a face is detected is it then sent for mask detection. Each worker's result is saved and can be displayed in the UI.
In total, within ~105 ms we have all the information from the image.
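A minimal sketch of the scheduling described above (the grabImage helper and the message shapes are assumptions; the point is the busy flag and keeping only the latest frame):

// Hypothetical helper that grabs the current frame from the video stream.
declare function grabImage(): ImageBitmap;

const faceWorker = new Worker('face-detection.worker.js');

let workerBusy = false;
let lastImage: ImageBitmap | null = null;

// Keep only the most recent frame instead of building a queue.
function onNewFrame(): void {
  lastImage = grabImage();
  sendIfPossible();
}

function sendIfPossible(): void {
  if (workerBusy || lastImage === null) {
    return;
  }
  workerBusy = true;
  // ImageBitmap is transferable, so the frame is handed over without copying.
  faceWorker.postMessage({ type: 'detectFace', image: lastImage }, [lastImage]);
  lastImage = null;
}

faceWorker.onmessage = (event) => {
  if (event.data.type === 'faceResults') {
    workerBusy = false; // the worker is free again
    sendIfPossible();   // immediately send the latest frame, if there is one
  }
};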
Each model (face detection and mask detection) runs in a separate web worker, which loads the libraries it needs (OpenCV.js, TensorFlow.js, and the models).
We have 3 web workers:
Web workers and how to work with them
A web worker is a way to run a script in a separate thread.
Web workers allow heavy processes to run in parallel with the main thread without blocking the UI. The main thread executes the orchestration logic; all heavy computation runs in the web workers. Web workers are supported in almost all browsers.
Features available inside a worker:
- the navigator object
- the location object
- XMLHttpRequest
- setTimeout() / clearTimeout() and setInterval() / clearInterval()
Limitations:
- no direct access to the DOM
- no access to the window and document objects
Communication between the main thread and a web worker is provided by the postMessage method and the onmessage event handler. If you look at the specification of the postMessage() method, you will notice that it accepts not only data but also a second argument, a list of transferable objects:
worker.postMessage(message, [transfer]);
Let's see how using it will help us.
Transferable is an interface for objects that can be passed between different execution contexts, such as the main thread and web workers.
This interface is implemented by ArrayBuffer, MessagePort, ImageBitmap, and OffscreenCanvas, among others.
If we want to transfer 500 MB of data to a worker, we can do it without the second argument, but the difference shows up in transfer time and memory usage.
Sending the data without the transfer argument takes 149 ms and 1042 MB in Google Chrome, and even more in other browsers.
With the transfer argument it takes 1 ms and cuts memory consumption in half!
Since images are transferred from the main thread to the web workers very often, it is important for us to do this as quickly and as memory-efficiently as possible, and this feature helps a lot.
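For example (a minimal sketch; the underlying ArrayBuffer of an ImageData is transferable, and after the transfer it is no longer usable in the main thread):

const worker = new Worker('face-detection.worker.js');

// imageData comes from a 2D canvas context, e.g. ctx.getImageData(...).
function sendImage(imageData: ImageData): void {
  // Without the second argument the pixel buffer is copied (structured clone).
  // With it, ownership of the underlying ArrayBuffer is transferred: almost
  // instant, and the buffer is no longer accessible from the main thread.
  worker.postMessage({ type: 'detectFace', image: imageData }, [imageData.data.buffer]);
}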
The web worker does not have access to the DOM, so you cannot use a canvas element directly. OffscreenCanvas comes to the rescue: it is available inside web workers, and rendering to it is decoupled from the DOM, so it does not block the main thread.
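A minimal sketch of a worker-side drawing surface (the 640x480 size and the incoming ImageBitmap message shape are assumptions):

// Inside a web worker: no document and no <canvas> element are needed.
const canvas = new OffscreenCanvas(640, 480);
const ctx = canvas.getContext('2d')!;

self.onmessage = (event: MessageEvent<{ frame: ImageBitmap }>) => {
  ctx.drawImage(event.data.frame, 0, 0);
  const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height);
  // ...run detection on imageData...
};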
requestAnimationFrame allows you to grab images from the stream with maximum performance (up to 60 FPS). In practice it is limited only by the camera, and not all cameras deliver video at that frame rate. Its main advantages are that it is synchronized with the browser's rendering cycle and is paused in background tabs, so no unnecessary work is done.
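A minimal sketch of such a grab loop on the main thread (the videoElement and the sendToWorker callback are assumptions; createImageBitmap produces a transferable frame):

// Assumed to exist: a <video> element already playing the WebRTC stream,
// and a callback that forwards the frame to a worker when it is free.
declare const videoElement: HTMLVideoElement;
declare function sendToWorker(frame: ImageBitmap): void;

async function grabFrame(): Promise<void> {
  if (videoElement.readyState >= HTMLMediaElement.HAVE_CURRENT_DATA) {
    // ImageBitmap is transferable, so it can be handed to a worker cheaply.
    const frame = await createImageBitmap(videoElement);
    sendToWorker(frame);
  }
  // Schedule the next grab; the effective rate is capped by the display and the camera.
  requestAnimationFrame(() => void grabFrame());
}

requestAnimationFrame(() => void grabFrame());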
At first, stats.js seemed like a good choice for displaying application metrics, but once the number of metrics grew past 20, the main thread of the application began to slow down because of how the browser handles it. Each metric draws a graph on its own canvas (and data arrive there very often), so the browser re-renders them at a high frequency, which hurts the application. As a result, the reported metrics end up understated.
To avoid this problem, it is better to give up the eye candy and simply display the current value and the average over the whole run as text. Updating a value in the DOM is much cheaper than rendering graphics.
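For example, a metric can be reduced to a running average rendered as plain text (a minimal sketch; the element id is an assumption):

// Keep a running average and show it as plain text instead of drawing a chart.
let count = 0;
let sum = 0;
const metricElement = document.getElementById('face-detection-time')!;

function reportMetric(valueMs: number): void {
  count += 1;
  sum += valueMs;
  const average = sum / count;
  // Updating a text node is far cheaper than re-rendering a canvas graph.
  metricElement.textContent = `${valueMs.toFixed(1)} ms (avg ${average.toFixed(1)} ms)`;
}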
Quite often during development we ran into memory leaks on mobile devices, while on a desktop the application could run for a very long time.
In a web worker it is impossible to know how much memory it actually consumes (performance.memory does not work in web workers). Because of this, we made it possible to run the application either through web workers or entirely on the main thread. By running all the detection models on the main thread, we can collect memory consumption metrics, see where the memory leaks, and fix it.
Now that we have covered the main tricks used in the application, let's look at the implementation.
For working with web workers, comlink-loader was used initially. It is a very handy library that lets you treat a worker as a class instance, without using the onmessage and postMessage methods directly, and control the asynchronous code with async/await. Everything was convenient until the application was launched on a tablet (Samsung Galaxy Tab S7) and suddenly crashed after 2 minutes. After analyzing all the code, no memory leaks were found, except for the black box of this worker library: for some reason, the loaded TensorFlow.js models were not released and were kept somewhere inside it.
It was decided to switch to worker-loader, which lets you work with web workers from plain JS without extra layers. This solved the problem; the application now runs for days without crashing.
Create web worker
this.faceDetectionWorker = workers.FaceRgbDetectionWorkerFactory.createWebWorker();
Creating a handler in the main thread for messages from the worker
this.faceDetectionWorker.onmessage = async (event) => {
if (event.data.type === 'load') {
this.faceDetectionWorker.postMessage({
type: 'init',
backend,
streamSettings,
faceDetectionSettings,
imageRatio: this.imageRatio,
});
} else if (event.data.type === 'init') {
this.isFaceWorkerInit = event.data.status;
// Only when both workers are initialized do we start grabbing and processing frames
if (this.isFaceWorkerInit && this.isMaskWorkerInit) {
await this.grabFrame();
}
} else if (event.data.type === 'faceResults') {
this.onFaceDetected(event);
} else {
throw new Error(`Type=${event.data.type} is not supported by RgbVideo for FaceRgbDetectionWorker`);
}
};
Sending an image for face processing
this.faceDetectionWorker.postMessage(
{
type: 'detectFace',
originalImageToProcess: this.lastImage,
lastIndex: lastItem!.index,
},
[this.lastImage], // transferable object
);
Face detection web worker code
The init method initializes all the models, libraries, and canvases needed for the work.
export const init = async (data) => {
const { backend, streamSettings, faceDetectionSettings, imageRatio } = data;
flipHorizontal = streamSettings.flipHorizontal;
faceMinWidth = faceDetectionSettings.faceMinWidth;
faceMinWidthConversionFactor = faceDetectionSettings.faceMinWidthConversionFactor;
predictionIOU = faceDetectionSettings.predictionIOU;
recommendedLocation = faceDetectionSettings.useRecommendedLocation ? faceDetectionSettings.recommendedLocation : null;
detectedFaceThumbnailSize = faceDetectionSettings.detectedFaceThumbnailSize;
srcImageRatio = imageRatio;
await tfc.setBackend(backend);
await tfc.ready();
const [blazeModel] = await Promise.all([
blazeface.load({
// The maximum number of faces returned by the model
maxFaces: faceDetectionSettings.maxFaces,
// The width of the input image
inputWidth: faceDetectionSettings.faceDetectionImageMinWidth,
// The height of the input image
inputHeight: faceDetectionSettings.faceDetectionImageMinHeight,
// The threshold for deciding whether boxes overlap too much
iouThreshold: faceDetectionSettings.iouThreshold,
// The threshold for deciding when to remove boxes based on score
scoreThreshold: faceDetectionSettings.scoreThreshold,
}),
isOpenCvLoaded(),
]);
faceDetection = new FaceDetection();
originalImageToProcessCanvas = new OffscreenCanvas(srcImageRatio.videoWidth, srcImageRatio.videoHeight);
originalImageToProcessCanvasCtx = originalImageToProcessCanvas.getContext('2d');
resizedImageToProcessCanvas = new OffscreenCanvas(
srcImageRatio.faceDetectionImageWidth,
srcImageRatio.faceDetectionImageHeight,
);
resizedImageToProcessCanvasCtx = resizedImageToProcessCanvas.getContext('2d');
return blazeModel;
};
The isOpenCvLoaded method waits for OpenCV to load:
export const isOpenCvLoaded = () => {
let timeoutId;
const resolveOpenCvPromise = (resolve) => {
if (timeoutId) {
clearTimeout(timeoutId);
}
try {
// eslint-disable-next-line no-undef
if (cv && cv.Mat) {
return resolve();
} else {
timeoutId = setTimeout(() => {
resolveOpenCvPromise(resolve);
}, OpenCvLoadedTimeoutInMs);
}
} catch {
timeoutId = setTimeout(() => {
resolveOpenCvPromise(resolve);
}, OpenCvLoadedTimeoutInMs);
}
};
return new Promise((resolve) => {
resolveOpenCvPromise(resolve);
});
};
Face detection method
export const detectFace = async (data, faceModel) => {
let { originalImageToProcess, lastIndex } = data;
const facesThumbnailsImageData = [];
// Resize original image to the recommended BlazeFace resolution
resizedImageToProcessCanvasCtx.drawImage(
originalImageToProcess,
0,
0,
srcImageRatio.faceDetectionImageWidth,
srcImageRatio.faceDetectionImageHeight,
);
// Getting resized image
let resizedImageDataToProcess = resizedImageToProcessCanvasCtx.getImageData(
0,
0,
srcImageRatio.faceDetectionImageWidth,
srcImageRatio.faceDetectionImageHeight,
);
// Detect faces by BlazeFace
let predictions = await faceModel.estimateFaces(
// The image to classify. Can be a tensor, DOM element image, video, or canvas
resizedImageDataToProcess,
// Whether to return tensors as opposed to values
returnTensors,
// Whether to flip/mirror the facial keypoints horizontally. Should be true for videos that are flipped by default (e.g. webcams)
flipHorizontal,
// Whether to annotate bounding boxes with additional properties such as landmarks and probability. Pass in `false` for faster inference if annotations are not needed
annotateBoxes,
);
// Normalize predictions
predictions = faceDetection.normalizePredictions(
predictions,
returnTensors,
annotateBoxes,
srcImageRatio.faceDetectionImageRatio,
);
// Filter the initial predictions by the criterion that all landmarks must be inside the area of interest
predictions = faceDetection.filterPredictionsByFullLandmarks(
predictions,
srcImageRatio.videoWidth,
srcImageRatio.videoHeight,
);
// Filters predictions by min face width
predictions = faceDetection.filterPredictionsByMinWidth(predictions, faceMinWidth, faceMinWidthConversionFactor);
// Filters predictions by recommended location
predictions = faceDetection.filterPredictionsByRecommendedLocation(predictions, predictionIOU, recommendedLocation);
// If there are any predictions, face thumbnail extraction starts according to the configured size
if (predictions && predictions.length > 0) {
// Draw initial original image
originalImageToProcessCanvasCtx.drawImage(originalImageToProcess, 0, 0);
const originalImageDataToProcess = originalImageToProcessCanvasCtx.getImageData(
0,
0,
originalImageToProcess.width,
originalImageToProcess.height,
);
// eslint-disable-next-line no-undef
let srcImageData = cv.matFromImageData(originalImageDataToProcess);
try {
for (let i = 0; i < predictions.length; i++) {
const prediction = predictions[i];
const facesOriginalLandmarks = JSON.parse(JSON.stringify(prediction.originalLandmarks));
if (flipHorizontal) {
for (let j = 0; j < facesOriginalLandmarks.length; j++) {
facesOriginalLandmarks[j][0] = srcImageRatio.videoWidth - facesOriginalLandmarks[j][0];
}
}
// eslint-disable-next-line no-undef
let dstImageData = new cv.Mat();
try {
// eslint-disable-next-line no-undef
let thumbnailSize = new cv.Size(detectedFaceThumbnailSize, detectedFaceThumbnailSize);
let transformation = getOneToOneFaceTransformationByTarget(detectedFaceThumbnailSize);
// eslint-disable-next-line no-undef
let similarityTransformation = getSimilarityTransformation(facesOriginalLandmarks, transformation);
// eslint-disable-next-line no-undef
let similarityTransformationMatrix = cv.matFromArray(3, 3, cv.CV_64F, similarityTransformation.data);
try {
// eslint-disable-next-line no-undef
cv.warpPerspective(
srcImageData,
dstImageData,
similarityTransformationMatrix,
thumbnailSize,
cv.INTER_LINEAR,
cv.BORDER_CONSTANT,
new cv.Scalar(127, 127, 127, 255),
);
facesThumbnailsImageData.push(
new ImageData(
new Uint8ClampedArray(dstImageData.data, dstImageData.cols, dstImageData.rows),
detectedFaceThumbnailSize,
detectedFaceThumbnailSize,
),
);
} finally {
similarityTransformationMatrix.delete();
similarityTransformationMatrix = null;
}
} finally {
dstImageData.delete();
dstImageData = null;
}
}
} finally {
srcImageData.delete();
srcImageData = null;
}
}
return { resizedImageDataToProcess, predictions, facesThumbnailsImageData, lastIndex };
};
The input is an image and an index that is later used to match the face with its mask detection result.
Since BlazeFace accepts images with a maximum size of 128 px, the image from the camera must be downscaled.
Calling the faceModel.estimateFaces method runs the image analysis with BlazeFace, and the predictions, containing the coordinates of the face as well as of the nose, ears, eyes, and mouth, are returned to the main thread. Before working with them, you need to map the coordinates back onto the original image, because we downscaled it to 128 px.
Now you can use these data to decide whether the face is in the desired area and whether it meets the minimum face size required for subsequent identification.
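A minimal sketch of that rescaling step (the Prediction shape and the single ratio factor are simplifications; in the real code this is handled by faceDetection.normalizePredictions):

interface Prediction {
  topLeft: [number, number];
  bottomRight: [number, number];
  landmarks: Array<[number, number]>;
}

// Map coordinates predicted on the downscaled 128 px image back onto the original frame.
function scalePrediction(prediction: Prediction, ratio: number): Prediction {
  const scalePoint = ([x, y]: [number, number]): [number, number] => [x * ratio, y * ratio];
  return {
    topLeft: scalePoint(prediction.topLeft),
    bottomRight: scalePoint(prediction.bottomRight),
    landmarks: prediction.landmarks.map(scalePoint),
  };
}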
The code above also cuts the face out of the image and aligns it using OpenCV methods, so it can be passed on for mask identification.
Model initialization and the WebAssembly backend for mask detection:
export const init = async (data) => {
const { backend, streamSettings, maskDetectionsSettings, imageRatio } = data;
flipHorizontal = streamSettings.flipHorizontal;
detectedMaskThumbnailSize = maskDetectionsSettings.detectedMaskThumbnailSize;
srcImageRatio = imageRatio;
await tfc.setBackend(backend);
await tfc.ready();
const [maskModel] = await Promise.all([
tfconv.loadGraphModel(
`/rgb_mask_classification_first/MobileNetV${maskDetectionsSettings.mobileNetVersion}_${maskDetectionsSettings.mobileNetWeight}/${maskDetectionsSettings.mobileNetType}/model.json`,
),
]);
detectedMaskThumbnailCanvas = new OffscreenCanvas(detectedMaskThumbnailSize, detectedMaskThumbnailSize);
detectedMaskThumbnailCanvasCtx = detectedMaskThumbnailCanvas.getContext('2d');
return maskModel;
};
Mask detection requires the coordinates of the eyes, ears, nose, and mouth, and the aligned image returned by the face detection worker.
this.maskDetectionWorker.postMessage({
type: 'detectMask',
prediction: lastItem!.data.predictions[0],
imageDataToProcess,
lastIndex: lastItem!.index,
});
Detection method
export const detectMask = async (data, maskModel) => {
let { prediction, imageDataToProcess, lastIndex } = data;
const masksScores = [];
const maskLandmarks = JSON.parse(JSON.stringify(prediction.landmarks));
if (flipHorizontal) {
for (let j = 0; j < maskLandmarks.length; j++) {
maskLandmarks[j][0] = srcImageRatio.faceDetectionImageWidth - maskLandmarks[j][0];
}
}
// Draw thumbnail with mask
detectedMaskThumbnailCanvasCtx.putImageData(imageDataToProcess, 0, 0);
// Detect mask via NN
let predictionTensor = tfc.tidy(() => {
let maskDetectionSnapshotFromPixels = tfc.browser.fromPixels(detectedMaskThumbnailCanvas);
let maskDetectionSnapshotFromPixelsFloat32 = tfc.cast(maskDetectionSnapshotFromPixels, 'float32');
let expandedDims = maskDetectionSnapshotFromPixelsFloat32.expandDims(0);
return maskModel.predict(expandedDims);
});
// Put mask detection result into the returned array
try {
masksScores.push(predictionTensor.dataSync()[0].toFixed(4));
} finally {
predictionTensor.dispose();
predictionTensor = null;
}
return {
masksScores,
lastIndex,
};
};
The result of the neural network is the probability that a mask is present, which is returned from the worker. This makes it easy to raise or lower the mask detection threshold. Using lastIndex, we can match a face with the presence of a mask and display information about a specific person on the screen.
I hope this article helps you learn about the possibilities of working with ML in the browser and the ways to optimize it. Most applications can be optimized using the tricks described above.