This is the first part of a trilogy in which I will discuss some high-level ideas for measuring end-user performance for a conversational voice assistant. In this article, I will discuss how to think about metrics conceptually for a conversational voice assistant, along with some pitfalls, from an engineer's point of view.
The aim is to keep the discussion simple and fairly high level for everyone working in this domain.
I am a senior engineer and tech lead for the Google Assistant. Here is my LinkedIn profile in case you want to know more about me.
I also hold 2 patents in performance evaluation infrastructure for voice assistants:
And yes I am a big fan of the “Matrix” trilogy :)
Metrics should be classified into two top-level categories: User Perceived Metrics and User Reported Metrics.
Note that the word "user" is present in both, which should be expected. The golden rule for any conversational product, or any product in general, is that the user should always be given the top priority.
Before going into the definitions of the above, note that a user-facing metric usually consists of two aspects: reliability and latency.
Note that high latency can also mean low reliability: no user will wait 20 seconds for an action to complete successfully.
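One way to operationalize this link between latency and reliability is to treat completions that exceed a patience threshold as failures. A minimal sketch, where the 20-second cutoff comes from the example above and the function name and signature are my own illustration, not any real assistant's API:

```python
# Illustrative patience threshold, taken from the 20-second example above.
# A real system would tune this per action type.
TIMEOUT_MS = 20_000

def is_successful(action_completed: bool, latency_ms: int) -> bool:
    """An action that completed, but too slowly for the user to wait,
    still counts as a reliability failure."""
    return action_completed and latency_ms < TIMEOUT_MS

print(is_successful(True, 1_200))   # True: fast and completed
print(is_successful(True, 25_000))  # False: completed, but too slow
```

This folds latency into the reliability metric rather than tracking the two aspects entirely independently.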
These are metrics that capture the reliability and latency of actions directly observable by the user, e.g. whether a requested action (such as playing music) actually completed, and how long the response took.
These metrics are similar to system-level metrics and hence have high coverage but comparatively lower confidence (by virtue of being derived).
These metrics are reported directly by users through feedback channels, internal testing, surveys, etc. Since they come explicitly from the user, they have low coverage but comparatively higher confidence.
A rule of thumb I follow for metrics is to swallow the red pill, which means never fully trust a metric: always question both negative and positive movements in a metric.
The reason I say this is that metrics often cannot be measured explicitly and instead need to be derived from different signals.
We might not be able to capture the true latency that the user experiences in this case with 100% accuracy. But we can observe different components, such as how long specific backends took to respond, and use heuristics to come up with a metric value that approximates the user-perceived latency to the best of our knowledge. This is where system-level component metrics come into the picture.
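The derivation described above can be sketched as follows. The component names, the sequential-execution assumption, and the fixed network-overhead heuristic are all illustrative assumptions of mine, not real Google Assistant internals:

```python
# Hypothetical sketch: approximating user-perceived latency from
# component-level measurements plus a heuristic overhead term.

def approximate_user_latency_ms(component_latencies_ms, network_overhead_ms=50):
    """Approximate the end-to-end latency the user perceives.

    component_latencies_ms: dict of component name -> measured latency (ms)
    network_overhead_ms: heuristic padding for client/network time that
    is not directly observable on the server side.
    """
    # Simplifying heuristic: assume components run sequentially.
    # Parallel stages would instead need max() over each parallel group.
    backend_total = sum(component_latencies_ms.values())
    return backend_total + network_overhead_ms

measured = {
    "speech_recognition": 320,  # illustrative numbers, in milliseconds
    "intent_parsing": 45,
    "music_backend": 180,
    "tts_synthesis": 90,
}
print(approximate_user_latency_ms(measured))  # 320+45+180+90+50 = 685
```

The point is not the exact formula but that the output is an approximation built from observable pieces, which is exactly why such a metric deserves the red-pill skepticism mentioned earlier.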
This is difficult to determine based on User Perceived Metrics. The simple rationale is that if the voice assistant's intelligence had been able to detect that the user wanted to play music, then it would have started playing music. So, in practice, this metric will always reflect 100% reliability, which is incorrect. This is where User Reported Metrics come into the picture. However, since user-reported metrics can be detached from the system-level components, it is important to connect them with system-level metrics or User Perceived Metrics to make them actionable.
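Connecting the two metric families usually amounts to a join: attaching system-level context to each user report so that a complaint can be traced to a concrete component behavior. A minimal sketch, where the field names and the `interaction_id` join key are hypothetical:

```python
# Hypothetical sketch: making user-reported metrics actionable by joining
# feedback reports with system-level logs for the same interaction.

def join_reports_with_logs(user_reports, system_logs):
    """Attach system-level context to each user report.

    user_reports: list of dicts like {"interaction_id": ..., "complaint": ...}
    system_logs:  list of dicts like {"interaction_id": ..., "intent": ...,
                                       "latency_ms": ...}
    """
    logs_by_id = {log["interaction_id"]: log for log in system_logs}
    joined = []
    for report in user_reports:
        # A report without a matching log gets system_context=None,
        # flagging a gap in system-level coverage.
        joined.append({**report, "system_context": logs_by_id.get(report["interaction_id"])})
    return joined

reports = [{"interaction_id": "abc123", "complaint": "played wrong song"}]
logs = [{"interaction_id": "abc123", "intent": "play_music", "latency_ms": 640}]
print(join_reports_with_logs(reports, logs))
```

A report enriched this way tells an engineer which intent was detected and how slow the interaction was, which is what turns low-coverage user feedback into a debuggable signal.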
Both User Perceived Metrics and User Reported Metrics complement each other and form the foundation of a highly reliable conversational assistant. Enabling these metrics at a later stage may not be trivial and can require significant bandwidth, so metric evaluation should be given high priority when designing conversational assistant products. It is the only way to objectively evaluate the user experience, and it may influence some design decisions. It is therefore worthwhile to understand the metric requirements in the early stages of product development.
There is a need to invest in infrastructure that makes these metrics easy to use, accurate, and actionable. We will discuss how to do that in the second part of this trilogy, so stay tuned!