
A Prototype That Leverages Facial Expressions to Facilitate Non-vocal Communication

by karamvir singh, March 30th, 2023

Too Long; Didn't Read

The prototype aims to emulate the assistive technology used by Dr. Stephen Hawking during the advanced stages of ALS (Amyotrophic Lateral Sclerosis). It will detect eye blinks and use them as inputs to select letters displayed on a screen, and it can readily be expanded by integrating external APIs for next-word prediction, as well as by incorporating a text-to-speech feature.

Being a passionate follower of technology, I have been captivated by numerous technological advancements. The one that has particularly piqued my interest, however, is the assistive technology used by Dr. Stephen Hawking during the advanced stages of ALS (Amyotrophic Lateral Sclerosis), which enabled him to communicate despite his inability to speak. Driven by curiosity, I researched the underlying technology, and only recently was I motivated enough to build a very simple prototype that emulates it.


Technology

Through a combination of twitching his cheek muscle and blinking his eyes, Dr. Hawking utilized a scanning method to select characters on a screen. Despite its slow pace, this technology proved to be a crucial tool in facilitating his communication and sharing his invaluable ideas and knowledge with the world.


While Dr. Hawking’s system was notably more intricate, incorporating an adaptive word predictor after the selection of each letter, the prototype we intend to develop will operate on a simpler premise. Specifically, it will detect eye blinks and use them as inputs to select letters displayed on a screen. Subsequently, we can readily expand upon this by integrating external APIs for next-word prediction, as well as by incorporating a text-to-speech feature. You can read more about Intel’s technology that Dr. Hawking used here.
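
To make that premise concrete, here is a minimal sketch of the scanning idea in plain JavaScript: the letters are highlighted one at a time on a timer, and a deliberate blink (detected later in this article) selects whichever letter is currently highlighted. The render and onBlink functions are illustrative placeholders, not code from the prototype.

const LETTERS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ '.split('')
let highlighted = 0      // index of the letter currently highlighted by the scanner
let message = ''         // text composed so far

// Placeholder UI update; a real app would highlight the letter on screen instead.
function render() {
    console.log('highlighted:', LETTERS[highlighted], '| message so far:', message)
}

// Advance the highlight on a fixed interval (the "scanning" part).
setInterval(() => {
    highlighted = (highlighted + 1) % LETTERS.length
    render()
}, 800)

// Called whenever a deliberate blink is detected (see the blink detection section below).
function onBlink() {
    message += LETTERS[highlighted]
    render()
}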


Research

If you possess even a modicum of interest in facial recognition or object detection, it is highly likely that you are familiar with OpenCV. Having previously gained some experience working with OpenCV on a project pertaining to lane detection and vehicle collision prevention, I began to delve more deeply into how I could leverage its capabilities for my current undertaking.

I perused numerous pre-existing APIs, such as Kairos and the AWS image recognition services. A comprehensive list of these APIs can be found here.


Some of the options I encountered failed to provide the desired functionality or proved cumbersome to implement; my aim was to identify a user-friendly and easily extendable solution.

During the initial stages of my research, I came across face-api.js, which is a highly advanced package geared towards facial recognition and detection. Intrigued by the examples I had seen of others utilizing this package, I delved into the process of coding. However, I soon encountered a challenge, as blink detection was not included in the package’s features. In order to detect blinks, a high frame rate was required, which ultimately led to my browser crashing. Consequently, I abandoned face-api.js and sought out alternative solutions. Nevertheless, it is important to note that despite not being suited to my project, the package contains several noteworthy features that can be leveraged to create some truly impressive applications.


From JS to Python

During the course of my research, I frequently encountered Python as a potential solution. Given my previous experience working with OpenCV in C#, I was intrigued by the prospect of exploring Python. Ultimately, I determined that Python OpenCV with Dlib held particular promise, which I read about in greater depth on pyimagesearch.com.


While I initially experienced some challenges installing the requisite dependencies on my Mac, I eventually succeeded in setting up my development environment and commenced coding. Rapidly, I was able to create functional code for detecting eye blinks, albeit with certain limitations. Specifically, the blink detection was not especially efficient, and constructing the UI in Python proved cumbersome. Nonetheless, the solution was more efficient and expeditious than face-api.js, and did not result in any instances of my computer hanging or crashing.


However, an additional constraint arose with Python, as constructing the user interface proved to be a rather challenging task. As I continued my research, I stumbled upon an intriguing alternative in Google Mediapipe. What truly excited me about this solution was the fact that it could be integrated into mobile applications, a possibility that had not initially crossed my mind, but which nevertheless sparked my interest.


NOTE: Official Mediapipe documentation is moving to this link starting April 3, 2023.

Back to JS

In light of the aforementioned limitations with Python, I found myself compelled to revert back to Javascript. Furthermore, the engaging and accessible demos available through Mediapipe were another factor that drew me towards using JS. With some experimentation and tinkering in JS, I became convinced that this was the ideal solution for my project.


Getting Started

Although I initially intended to construct a React Native app utilizing the Mediapipe face-detection, time constraints compelled me to opt for a vanilla JS project instead. Nevertheless, adapting this code to a React or Vue JS app should be a relatively straightforward process for anyone interested in doing so. Additionally, it is worth noting that by the time you are reading this article, I may have already developed a React app.


MediaPipe has a significant advantage over alternatives such as face-api.js and Python OpenCV: it can estimate 468 3D face landmarks in real time, compared to only 68 landmarks in the model used with Python OpenCV and face-api.js. This matters because the denser mesh allows for more accurate facial expression detection.


Facial landmarks are crucial features of a human face that enable us to differentiate between different faces. The following are the facial landmarks that I utilized with Python and face-api.js:


source: https://pyimagesearch.com/wp-content/uploads/2017/04/facial_landmarks_68markup.jpg


Here are the ones used by MediaPipe:

source: https://github.com/google/mediapipe/blob/a908d668c730da128dfa8d9f6bd25d519d006692/mediapipe/modules/face_geometry/data/canonical_face_model_uv_visualization.png


Here is a list of all the landmark points:
https://github.com/tensorflow/tfjs-models/blob/838611c02f51159afdd77469ce67f0e26b7bbb23/face-landmarks-detection/src/mediapipe-facemesh/keypoints.ts


And here is an example of a real face with the facemesh:

Face landmarks: the red box indicates the cropped area as input to the landmark model, the red dots represent the 468 landmarks in 3D, and the green lines connecting landmarks illustrate the contours around the eyes, eyebrows, lips and the entire face. Source: https://mediapipe.dev/images/mobile/face_mesh_android_gpu.gif


It is evident that the Mediapipe landmark model is far superior to the other models, and I could effortlessly notice the difference in detection accuracy. You can read more about landmarks here.


Blink Detection

Out of the box, the sample code gives you a face mesh superimposed on your face when you look at the camera. But any facial expression detection has to be done by you. There were 2 approaches I could follow to detect a blink:

  1. Watch for the iris to disappear when the eye is closed.

  2. Detect when the upper eyelid meets the lower eyelid.


I decided to go with the second option. Observe the landmark points on the right eye below. When the eye is closed, the distance between the vertical landmark points 27 and 23 decreases significantly (almost to 0), while the distance between the horizontal points 130 and 243 remains nearly constant. The function below applies the same idea using the analogous mesh points 159/145 (upper and lower eyelid) and 33/133 (eye corners).


Here is the function for blink detection:

function blinkRatio(landmarks) {
    // RIGHT EYE
    // horizontal line (eye corners)
    let rh_right = landmarks[33]
    let rh_left = landmarks[133]
    // vertical line (upper and lower eyelid)
    let rv_top = landmarks[159]
    let rv_bottom = landmarks[145]

    // LEFT EYE
    // horizontal line (eye corners)
    let lh_right = landmarks[362]
    let lh_left = landmarks[263]
    // vertical line (upper and lower eyelid; 386 pairs with 374 on the mesh)
    let lv_top = landmarks[386]
    let lv_bottom = landmarks[374]

    // Finding distances for the right eye
    let rhDistance = euclideanDistance(rh_right, rh_left)
    let rvDistance = euclideanDistance(rv_top, rv_bottom)
    // Finding distances for the left eye
    let lvDistance = euclideanDistance(lv_top, lv_bottom)
    let lhDistance = euclideanDistance(lh_right, lh_left)

    // Width-to-height ratio of each eye: it grows sharply when the eye closes
    let reRatio = rhDistance / rvDistance
    let leRatio = lhDistance / lvDistance
    let ratio = (reRatio + leRatio) / 2
    return ratio
}
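
The euclideanDistance helper is not shown above. A minimal version, assuming it operates on the normalized x and y coordinates that MediaPipe returns for each landmark, could look like this:

// Straight-line distance between two landmarks using their normalized x/y coordinates.
function euclideanDistance(pointA, pointB) {
    let dx = pointA.x - pointB.x
    let dy = pointA.y - pointB.y
    return Math.sqrt(dx * dx + dy * dy)
}

Because blinkRatio divides one distance by another, the normalization cancels out and the ratio stays roughly the same regardless of how far the face is from the camera.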


We check both eyes because we want both eyes closed for a blink. Another way to utilize eye detection is by identifying winks, which can be used as a different user input. However, detecting a wink can be challenging since when you close one eye, the other eye also slightly narrows its vertical height. Although I attempted to detect winks using the 68 landmark model when working with Python, the results were hit and miss. With the MediaPipe model having more landmarks on the eye, I anticipate better results for wink detection.
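
The article stops at blinks, but one plausible way to attempt wink detection, reusing the per-eye ratios computed inside blinkRatio (it would need to return them individually), is to require a large gap between the two eyes. The function and thresholds below are an untested illustration, not code from the prototype.

// Illustrative wink check: treat it as a wink only when one eye is clearly closed
// while the other is clearly open. The thresholds are assumed and would need tuning.
const WINK_CLOSED_RATIO = 5.0
const WINK_OPEN_RATIO = 3.5

function detectWink(reRatio, leRatio) {
    if (reRatio > WINK_CLOSED_RATIO && leRatio < WINK_OPEN_RATIO) return 'right'
    if (leRatio > WINK_CLOSED_RATIO && reRatio < WINK_OPEN_RATIO) return 'left'
    return null   // no wink (both open, both closed, or an ambiguous in-between state)
}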


Returning to the topic of blinks, I became aware that utilizing a natural blink as a user input was resulting in unintended inputs. Consequently, I had to devise an alternative approach. My solution involved registering an input only if the user kept their eyes closed for a slightly longer duration than a regular blink. Specifically, I classified a blink as a user input only if the eyes remained closed for five successive video frames.


let CLOSED_EYES_FRAME = 5


Varying the lengths of eye closure allowed for additional user options such as the ability to delete an input or restart the prompts.


let BACKSPACE_EYES_FRAME = 20
let RESTART_EYES_FRAME = 40
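
The snippet below also references CLOSED_EYE_RATIO, the width-to-height ratio above which the eyes are treated as closed. Its value is not shown in the article; an assumed starting point, to be tuned against your own camera and lighting, might be:

// Assumed threshold: an open eye gives a ratio of roughly 3 to 4,
// while a closed eye produces a much larger value.
let CLOSED_EYE_RATIO = 5.0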


Below is a code snippet that detects eye closure on each frame and, based on how long the eyes stayed closed, classifies it as a letter input, a deletion, or a restart.

// finalratio is the averaged blinkRatio for the current video frame.
if (finalratio > CLOSED_EYE_RATIO) {
    // Eyes are closed this frame: keep counting consecutive closed frames.
    CEF_COUNTER += 1
} else {
    // Eyes are open again: classify the closure that just ended by its length.
    if (CEF_COUNTER > CLOSED_EYES_FRAME) {
        TOTAL_BLINKS += 1
        blinkDetected = true
        document.getElementById('blinks').innerHTML = TOTAL_BLINKS
        document.getElementById('ratio').innerHTML = finalratio
    }

    if (CEF_COUNTER > RESTART_EYES_FRAME) {
        console.log('restart frames detected')
        restart()
    } else if (CEF_COUNTER > BACKSPACE_EYES_FRAME) {
        console.log('backspace frames detected')
        backspace()
        restart()
    }

    CEF_COUNTER = 0
}
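
To tie the pieces together, here is a rough sketch of how the blinkRatio function and the frame classification above can be wired to the MediaPipe Face Mesh JavaScript solution. It follows the pattern of the official demos; the input_video element ID and the handleBlinkFrame wrapper (which would contain the snippet above) are assumptions, not code from the prototype.

// Assumes the @mediapipe/face_mesh and @mediapipe/camera_utils scripts are loaded on the page.
const videoElement = document.getElementById('input_video')   // assumed <video> element ID

const faceMesh = new FaceMesh({
    locateFile: (file) => `https://cdn.jsdelivr.net/npm/@mediapipe/face_mesh/${file}`
})
faceMesh.setOptions({ maxNumFaces: 1, minDetectionConfidence: 0.5, minTrackingConfidence: 0.5 })

faceMesh.onResults((results) => {
    if (!results.multiFaceLandmarks || results.multiFaceLandmarks.length === 0) return
    // Each landmark has x and y normalized to [0, 1] plus a relative z depth.
    let finalratio = blinkRatio(results.multiFaceLandmarks[0])
    handleBlinkFrame(finalratio)   // hypothetical wrapper around the frame-classification snippet above
})

// Feed webcam frames to the model.
const camera = new Camera(videoElement, {
    onFrame: async () => { await faceMesh.send({ image: videoElement }) },
    width: 640,
    height: 480
})
camera.start()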


A List of Future Improvements

  1. Add more expressions — look left/right detection (can use this for deletion and auto completion; a rough sketch follows this list), raised eyebrow detection.
  2. A training interface for the user, before they start using the app.
  3. Next word prediction using an API.
  4. A mobile app.
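
On the first item, the article does not show how gaze direction might be detected, but MediaPipe's refineLandmarks option adds iris landmarks (indices 468 to 477) to the mesh, and a rough, untested sketch is to check where the iris centre sits between the two eye corners. The thresholds below are arbitrary, and whether a low value means looking left or right depends on whether the video is mirrored.

// Rough gaze-direction sketch; requires faceMesh.setOptions({ refineLandmarks: true }).
// Thresholds are illustrative only, and whether "left"/"right" match the user's view
// depends on camera mirroring.
function gazeDirection(landmarks) {
    const irisCenter = landmarks[468]        // iris centre of one eye in the refined model
    const outerCorner = landmarks[33]        // corners of the same eye
    const innerCorner = landmarks[133]
    // Position of the iris between the two corners: 0 at the outer corner, 1 at the inner corner.
    const t = (irisCenter.x - outerCorner.x) / (innerCorner.x - outerCorner.x)
    if (t < 0.35) return 'left'
    if (t > 0.65) return 'right'
    return 'center'
}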


Lead image source: Tumisu Pixabay