Not in the way many of us think!
Let us start with some history.
The “Hey Siri!” feature was launched with iOS 8 in September 2014 as an addition to the pre-installed voice assistant, Siri. In iOS 9 (September 2015), it was upgraded with the capability to recognize only the personalized voice of the user.
On Android, Google’s voice assistant has had a similar hotword feature since around 2013, but it did not work while the screen was off (and still doesn’t on many phones).
Comparing the user experience-
Normal vs “Hey Siri!”
In the normal method, the user grabs the phone -> long-presses the home button -> Siri opens. With “Hey Siri!”, the user can simply speak the phrase to activate Siri.
This is of great benefit, as the user can access basic features while physically engaged in something else, and it can even help in an emergency (e.g., a car crash).
A coprocessor can be understood as a secondary processor with limited functionality and low power consumption, used to enable “always-on” features that remain accessible even when your phone is idle (screen off).
The M9 motion coprocessor, the third generation in Apple’s coprocessor family, launched on board the iPhone 6s in September 2015. Embedded directly into the A9 64-bit ARM-based system-on-chip, it combines ample processing power with meager battery consumption, which enabled the famous “Raise to Wake” feature. It is also sometimes described as an “Always On Processor (AOP)” embedded in the motion coprocessor.
When you enable this feature for the first time, it prompts you to say “Hey Siri!” a certain number of times. Your iPhone saves these recordings in the form of what you could call a “Trigger Key”.
This personalized trigger key is saved in the coprocessor, and even when your phone is idle, the coprocessor is listening to (but not “hearing”) every sound that falls on the mic.
So when the sound falling on the mic matches the “trigger key”, the coprocessor activates (or triggers) the main processor to start recording (just as it would if we opened Siri by long-pressing the home button). That recording is then sent to the servers and interpreted, in a process similar to that of every other voice assistant out there.
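To make that flow concrete, here is a minimal Python sketch of the gating loop. The helpers it relies on (mic.read_frame, small_detector.score, wake_main_processor) are hypothetical stand-ins for the real hardware and firmware, not Apple’s actual APIs:

```python
import collections

FRAME_SECONDS = 0.01   # each audio frame is about 10 ms
WINDOW_FRAMES = 20     # roughly 0.2 s of audio is scored at a time

def coprocessor_loop(mic, small_detector, wake_main_processor, threshold=0.85):
    """Run forever on the low-power coprocessor: keep listening, score the
    latest window, and only wake the main processor when the score for
    the trigger phrase is high enough."""
    window = collections.deque(maxlen=WINDOW_FRAMES)
    while True:
        frame = mic.read_frame(FRAME_SECONDS)       # raw audio, never interpreted here
        window.append(frame)
        if len(window) < WINDOW_FRAMES:
            continue
        score = small_detector.score(list(window))  # probability the window holds the trigger
        if score >= threshold:
            # hand off: the main processor records the full request and
            # sends it to the servers for interpretation
            wake_main_processor()
```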
Think of this process as having a bundle of thousands of keys and trying to find the one that fits the lock you want to open.
The important thing to note here is that the Always On Processor (AOP) always “listens” but never “hears” (interprets) the sound. It is like a baby who is always listening to a language but cannot fully process it, and reacts only when its name is called.
The M9 motion coprocessor was launched in September 2015 along with the iPhone 6s. But as stated at the start, the “Hey Siri!” feature was launched in September 2014. So how did the earlier iPhone (the 6) have the ability to listen passively?
Well, if you happen to know someone who has an iPhone 6, you can check that the “Hey Siri!” feature works while the phone is idle (screen off) only when it is CHARGING. As one can deduce, the phone spends the small amount of extra energy needed for passive listening only while it is plugged in. Look at the screenshot from an iPhone 6 below-
Small snippets of sound, 0.01 seconds each, are recorded, and a window of 20 of these (0.2 seconds) at a time is continuously fed to a Deep Neural Network (DNN), which converts them into a probability score; when that score crosses a threshold, the main processor is activated.
The threshold is not fixed; it varies according to the background noise. So, for a clearer picture, you can say that the system is also recomputing the threshold at each moment.
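The exact adaptation rule is not public, so here is a minimal sketch under a simple assumption: estimate the background level with an exponential moving average and demand a higher DNN score in noisy conditions. All names and constants are illustrative, not Apple’s:

```python
import numpy as np

def rms_level(window):
    """Rough loudness (root mean square) of one 0.2 s window of samples."""
    return float(np.sqrt(np.mean(np.square(window))))

class AdaptiveThreshold:
    """Illustrative only: track background loudness over time and raise the
    trigger threshold as the environment gets noisier."""

    def __init__(self, base=0.85, sensitivity=0.10, smoothing=0.95):
        self.base = base              # threshold in quiet conditions
        self.sensitivity = sensitivity
        self.smoothing = smoothing
        self.noise = 0.0              # running estimate of background level

    def current(self, window):
        self.noise = self.smoothing * self.noise + (1 - self.smoothing) * rms_level(window)
        return min(0.99, self.base + self.sensitivity * self.noise)

def should_wake(dnn_score, threshold):
    """Wake the main processor only when the DNN's score clears the bar."""
    return dnn_score >= threshold
```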
Additionally, the first time your sample voice was recorded and the “Trigger Key” was generated, that data was actually used to train the DNN and define the weights used for calculating the probability.
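What that enrollment step looks like internally is not spelled out, but as a heavily simplified sketch, one plausible reading is: turn each enrollment recording into a feature vector and combine them into a per-user template (the “Trigger Key”) that later scores are compared against. Everything below (the function names, the averaging step) is an assumption for illustration:

```python
import numpy as np

def extract_features(utterance_audio):
    """Placeholder for a real acoustic front end (e.g. filter-bank features);
    here we just return the raw samples as a dummy feature vector."""
    return np.asarray(utterance_audio, dtype=float)

def build_trigger_key(enrollment_utterances):
    """Average the features of the enrollment recordings into a single
    per-user template ("Trigger Key")."""
    feats = [extract_features(u) for u in enrollment_utterances]
    min_len = min(len(f) for f in feats)    # crude length alignment
    feats = [f[:min_len] for f in feats]
    return np.mean(feats, axis=0)
```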
The DNN is trained differently for different accents. For example, in American English the pronunciation of “Hey Siri” is a bit like the start of “Serious”, run together without the pause (the punctuation), whereas other speakers say “Hey, Siri!” with a clear pause and with different lengths for both ‘i’ sounds.
To all the machine learning enthusiasts :) , here is the model -
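As an illustrative stand-in, here is a rough numpy sketch of the kind of network described in Apple’s public write-up: a stack of about five small fully connected sigmoid layers ending in a softmax over roughly 20 phonetic classes. The exact sizes, the 20×40 input layout, and the class count are assumptions, not Apple’s actual code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def init_detector(input_dim=20 * 40, hidden_units=32,
                  num_hidden_layers=5, num_classes=20, seed=0):
    """Random weights standing in for a trained model. hidden_units=32 would
    match the low-power coprocessor model, 192 the larger second-stage check
    on the main processor. input_dim assumes 20 frames x 40 filter-bank
    features per 0.2 s window (an assumption, not Apple's spec)."""
    rng = np.random.default_rng(seed)
    sizes = [input_dim] + [hidden_units] * num_hidden_layers + [num_classes]
    return [(rng.normal(0.0, 0.1, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(layers, window_features):
    """One window of stacked frame features in, a distribution over phonetic
    classes (pieces of the trigger phrase plus silence/other) out."""
    x = window_features
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        x = softmax(x) if i == len(layers) - 1 else sigmoid(x)
    return x

small_model = init_detector(hidden_units=32)    # runs on the coprocessor (AOP)
large_model = init_detector(hidden_units=192)   # second check on the main processor
```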
P.S.- If someone is willing to replicate and play with it, ping me :)
The overall probability (trigger score) is calculated using a recursive accumulation function.
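Based on Apple’s public description of the detector and the s(i)/m(i) discussion that follows, the scoring recursion is approximately (treat this reconstruction as an assumption, not an exact formula):

$$F(i,t) = \max\big(F(i,t-1) + s(i),\ F(i-1,t-1) + m(i-1)\big) + q(i,t)$$

where-
F(i, t)- the accumulated score of the best path that ends in state i of the “Hey Siri” phrase model at frame t
q(i, t)- the DNN’s output: the (log) score of the phonetic class associated with state i, given the audio around time t
s(i)- the cost of staying in state i for one more frame
m(i)- the cost of moving on from state i to the next state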
Here, s(i) and m(i) are related to the weights trained while defining the “Trigger Key”, and they can be hypothetically interpreted as-
s(i)- Depends on a single frame of the “Trigger Key”, through parameters like pitch, volume, etc.
m(i)- Depends on the frequency, or in simple words the speed, of the “Trigger Key”: how much and how often the parameters behind s(i) change.
For example, m(i) and s(i) will be very different for Eminem and Adele, since he sings faster (much faster) with fewer immediate variations, whereas she sings slower with considerable variation.
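To make the recursion concrete, here is a small numpy sketch of the accumulation, assuming the (approximate) equation above; q would come from the DNN outputs, while s and m are the stay/move costs:

```python
import numpy as np

def accumulate_scores(q, s, m):
    """Dynamic-programming accumulation of the trigger score.

    q : (num_states, num_frames) per-frame scores from the DNN for each phrase state
    s : (num_states,) cost of staying in a state for one more frame
    m : (num_states,) cost of moving on from a state
    Returns the best accumulated score ending in the final state, i.e. the
    quantity that gets compared against the trigger threshold.
    """
    num_states, num_frames = q.shape
    F = np.full((num_states, num_frames), -np.inf)
    F[0, 0] = q[0, 0]
    for t in range(1, num_frames):
        for i in range(num_states):
            stay = F[i, t - 1] + s[i]
            move = F[i - 1, t - 1] + m[i - 1] if i > 0 else -np.inf
            F[i, t] = max(stay, move) + q[i, t]
    return F[-1].max()
```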
The size of the DNN’s layers also differs between the coprocessor (32 units per hidden layer) and the main processor (192 units per hidden layer), keeping in mind their processing power and battery consumption.
The “Hey Siri!” feature, although not publicized much, is a revolutionary step towards automation and towards enhancing the ease of use of mobile phones. It is also a great example of how a small change can enormously impact the user experience, and of how much research is sometimes required for these small revolutionary changes.
Any queries/comments are really appreciated.
I write articles on various products, services, and features, analyzing their technical, business, and consumer aspects, and try to put them in simple words :). For any queries and potential collaborations, you can learn more about me and reach out to me on LinkedIn.