This brings us to the end of the trilogy on performance evaluation metrics for conversational assistants.
I would highly encourage reading the first two articles of this series to get a more detailed understanding.
This article focuses on how we scale the metric instrumentation framework to accommodate new use cases across different form factors: phone, tablet, headset, car, and so on.
It is important to note that different form factors may have different product features. However, for conversational assistants, the high-level mental model should remain the same.
As a result, the framework needs to collect metrics from different log types, along with the customizations required for additional form-factor-specific features.
The metric collection framework should work out of the box for new form factors, without requiring any changes to the framework itself.
This is essential to ensure that the metrics required to determine the success of the conversational AI agent are available from an early stage and can be used to plan the roadmap for the next milestones. It also saves engineering bandwidth and ensures that the metrics are measured in an objective manner.
I will discuss a simple solution that can work for most use cases: transform the device-specific logs into a uniform representation.
I am also the first-named inventor on a patent in this area, “Standardizing analysis metrics across multiple devices.” I will not discuss the low-level details of the solution, only the high-level approach.
Implementations relate to generating standardized metrics from device-specific metrics that are generated during an interaction between a user and an automated assistant.
The solution consists of the following high-level steps (a minimal code sketch follows the list):
1. Collect the device-specific logs generated during the interaction between the user and the automated assistant.
2. Transform the device-specific logs into a uniform, standardized representation.
3. Compute the evaluation metrics from the standardized representation, so the same metric pipeline applies to every form factor.
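To make the idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical - the `StandardEvent` schema, the per-device transformers, and the raw log field names are illustrative only and are not the patented implementation or any real framework's API; they simply show the pattern of converting device-specific logs into one uniform schema that the metric pipeline consumes.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical uniform representation consumed by the metric pipeline.
@dataclass
class StandardEvent:
    timestamp_ms: int      # event time in milliseconds
    event_type: str        # e.g. "query_start", "response_end"
    form_factor: str       # "phone", "tablet", "headset", "car", ...
    payload: dict = field(default_factory=dict)

# Device-specific transformers: each one knows only its own raw log schema.
def transform_phone_log(raw: dict) -> StandardEvent:
    return StandardEvent(
        timestamp_ms=raw["ts_ms"],
        event_type=raw["event"],
        form_factor="phone",
        payload=raw.get("extras", {}),
    )

def transform_car_log(raw: dict) -> StandardEvent:
    # Car logs (hypothetically) report seconds, so convert to milliseconds.
    return StandardEvent(
        timestamp_ms=int(raw["timestamp_sec"] * 1000),
        event_type=raw["type"],
        form_factor="car",
        payload=raw.get("metadata", {}),
    )

# Registry: supporting a new form factor only means adding a transformer here.
TRANSFORMERS: dict[str, Callable[[dict], StandardEvent]] = {
    "phone": transform_phone_log,
    "car": transform_car_log,
}

def standardize(form_factor: str, raw_logs: list[dict]) -> list[StandardEvent]:
    """Convert raw device logs into the uniform representation."""
    transform = TRANSFORMERS[form_factor]
    return [transform(raw) for raw in raw_logs]

def response_latency_ms(events: list[StandardEvent]) -> int:
    """Example metric computed purely on the uniform representation."""
    start = next(e for e in events if e.event_type == "query_start")
    end = next(e for e in events if e.event_type == "response_end")
    return end.timestamp_ms - start.timestamp_ms
```

Because `response_latency_ms` only ever sees `StandardEvent` objects, the same metric code runs unchanged for a phone, a car, or any new form factor; onboarding a new device only requires writing and registering its transformer.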
It is important to note that “ground truth” metrics should continue to be based on the device-specific logs. Comparing them against the metrics computed from the standardized representation can help uncover issues with the metric instrumentation framework; a simple parity check along these lines is sketched below.
“Ground truth” metrics should always be the metrics we hold with the highest confidence. It is also crucial to remember never to fully trust metrics - “swallow the red pill”.
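As one possible way to use the device-specific ground truth for validation, the sketch below compares a metric computed directly from device logs against the same metric computed through the standardized pipeline and flags any drift beyond a tolerance. The helper names here (`check_metric_parity`, `latency_from_phone_logs`) are hypothetical and assume the sketch from the previous section.

```python
def check_metric_parity(
    ground_truth_value: float,
    standardized_value: float,
    tolerance: float = 0.01,
) -> bool:
    """Flag a potential instrumentation issue when the standardized metric
    drifts from the device-specific ground truth by more than the tolerance."""
    drift = abs(standardized_value - ground_truth_value)
    if drift > tolerance * max(abs(ground_truth_value), 1e-9):
        print(f"Parity check failed: drift={drift:.4f}")
        return False
    return True

# Hypothetical usage: latency from raw phone logs vs. the standardized pipeline.
# raw_latency = latency_from_phone_logs(raw_logs)            # device-specific path
# std_latency = response_latency_ms(standardize("phone", raw_logs))
# assert check_metric_parity(raw_latency, std_latency)
```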
As AI applications become more integrated into our daily lives, it is crucial to ensure that these AI agents are reliable for the end user. AI agents should be useful and usable. As a result, end-user performance evaluation infrastructure provides a foundational backbone for the development of conversational agents. I hope this trilogy provides a good starting point for understanding and instrumenting these metrics at a high level.