In Data We Trust: Defining Metrics and KPIs

Imagine that you’re on a critical mission to space and your team needs to simulate all the aspects of that mission. Some of the aspects you’ll know from previous attempts; some will be based on the needs of the current mission. In this two-part blog, we’ll look at lessons learned from the Genesys simulation of “outer space” and defining metrics.

The Journey Is as Important as the Destination

As cliché as it sounds, it’s true. If you’re embarking on a journey to collect data for your bot, you shouldn’t have to wait until bots are rolled out to gain metrics and insights.

Prepping Your Bots

Authoring bots requires making decisions on how to convert use cases to intents, what information is required from the user in the form of entities and slots, and how to converse with a wide variety of user personas. Here is what we’ve learned from our prior rollouts and the present data collection exercise.

Reduce author bias: Individual preferences and biases make it nearly impossible to create an experience for every user persona. Authoring bot responses requires multiple iterations of feedback from internal teams.
Humility and empathy: Authoring empathetic and humble responses is the hardest part of bot authoring. Human agents rely on soft skills acquired over decades to respond to an irate, upset or frustrated customer. Bot authors, on the other hand, need to imagine scenarios in which a user will get frustrated. Then they have to think of bot responses to de-escalate the situation. To-the-point prompts aren’t always the right choice for a good user experience.
Confirm the obvious: Human interactions with a bot can swing between two extremes: some users expect a fancy version of IVR, while others expect near-human capabilities. Dealing with the latter is tricky; adding a confirmation prompt for every low-confidence interaction can become counterproductive.
Determine a “Plan B”: User experience when interacting with a bot goes beyond intent and entity recognition. It also means knowing what your bot can’t do (yet) and what the stop-loss strategy should be. For example, when the user experience starts to go south, determine whether a human agent should intervene as soon as possible or whether a service ticket should be opened.

Defining Metrics

Businesses are all about data and insights. But deriving insights from data is only half the job. To define the data needed — and how to structure that data for analysis — first define the metrics that are important to your specific business. Here are some commonly tracked metrics:

Intent and entity recognition rates – In the case of a banking bot that recognizes the intent of the customer, collects data and passes the flow to a human agent, intent recognition and entity recognition rates define the bot’s success. They are defined as the percentage of intents and entities the bot recognizes correctly out of the total number of intent and entity recognition opportunities presented to the bot. For our data collection exercise, the intent and entity recognition rates were the primary metrics for gauging performance.
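As a rough illustration of that definition, here is a minimal Python sketch. The `RecognitionEvent` type, its field names and the sample intents are hypothetical and only stand in for whatever your logs actually contain.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RecognitionEvent:
    # Hypothetical record: one recognition opportunity presented to the bot.
    expected: str              # ground-truth intent or entity label
    predicted: Optional[str]   # label the bot produced, or None if it recognized nothing

def recognition_rate(events):
    """Percentage of opportunities where the bot's label matched the expected label."""
    if not events:
        return 0.0
    hits = sum(1 for e in events if e.predicted == e.expected)
    return 100.0 * hits / len(events)

# Made-up sample data for a banking bot.
intent_events = [
    RecognitionEvent("check_balance", "check_balance"),
    RecognitionEvent("transfer_funds", "transfer_funds"),
    RecognitionEvent("report_lost_card", "check_balance"),
    RecognitionEvent("check_balance", None),
]
intent_rate = recognition_rate(intent_events)
```

The same function can be applied to entity events, as long as each event pairs an expected value with what the bot captured.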

Task completion rate – If the goal of a banking bot is to fulfill the customer’s request, task completion rates are the most important metric. They’re defined as the percentage of the total conversations the bot handles and successfully fulfills.

Turns – A turn is a single interaction between a customer and an agent. An entire conversation between a customer and an agent can take several turns to complete. Turns indicate whether the bot understood the user intent without much clarification — and whether entities were captured the first time they were uttered. A lower number of turns means a better user experience. We used average turns, along with the intent recognition rates. This helped us understand if the bot recognized a low-performing intent better when clarifying questions were asked.
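One way to look at average turns alongside intents is to group turns by conversation and then average per intent. The sketch below assumes a hypothetical turn log with one `(conversation_id, expected_intent)` entry per turn; it is not the format we used internally.

```python
from collections import defaultdict

# Hypothetical turn log: one (conversation_id, expected_intent) entry per turn.
turn_log = [
    ("c1", "check_balance"),
    ("c2", "transfer_funds"),
    ("c2", "transfer_funds"),
    ("c3", "transfer_funds"),
    ("c3", "transfer_funds"),
    ("c3", "transfer_funds"),
]

def average_turns_by_intent(log):
    """Average number of turns per conversation, grouped by expected intent."""
    turns = defaultdict(int)
    intent_of = {}
    for conversation_id, intent in log:
        turns[conversation_id] += 1
        intent_of[conversation_id] = intent
    per_intent = defaultdict(list)
    for conversation_id, count in turns.items():
        per_intent[intent_of[conversation_id]].append(count)
    return {intent: sum(counts) / len(counts) for intent, counts in per_intent.items()}

averages = average_turns_by_intent(turn_log)
```

An intent with a high average here is a candidate for a closer look at its recognition rate and its clarifying prompts.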

Containment – Containment metrics show how many automation opportunities the bot tapped from the entire chat corpus. This helps businesses understand ROI metrics and set goals for bot rollouts.
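Task completion and containment can both be computed from conversation-level outcome flags. The sketch below is only one possible reading: it assumes containment means the share of conversations that never reached a human agent, and the `ConversationOutcome` type and its fields are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ConversationOutcome:
    # Hypothetical per-conversation summary.
    conversation_id: str
    task_completed: bool    # the bot fulfilled the customer's request
    handed_to_agent: bool   # the conversation was escalated to a human agent

def percentage(count, total):
    return 100.0 * count / total if total else 0.0

def task_completion_rate(outcomes):
    """Share of handled conversations that the bot successfully fulfilled."""
    return percentage(sum(o.task_completed for o in outcomes), len(outcomes))

def containment_rate(outcomes):
    """Assumed definition: share of conversations with no handoff to a human agent."""
    return percentage(sum(not o.handed_to_agent for o in outcomes), len(outcomes))

# Made-up sample data; c4 was abandoned without completion or handoff.
outcomes = [
    ConversationOutcome("c1", task_completed=True, handed_to_agent=False),
    ConversationOutcome("c2", task_completed=False, handed_to_agent=True),
    ConversationOutcome("c3", task_completed=True, handed_to_agent=False),
    ConversationOutcome("c4", task_completed=False, handed_to_agent=False),
]
```

Note that an abandoned conversation counts as contained under this assumed definition but not as completed, which is one reason to track the two rates separately.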

Feedback scores – Bot automation improves operational efficiencies and reduces operational costs. However, user experience shouldn’t suffer as a result. Learn whether the user experience has been positive, negative or neutral. This ensures that savings from bot automation don’t result in a lost customer. In our initiative, we included icons for users to indicate satisfaction levels with the bot interaction.
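If feedback is collected as a simple positive, neutral or negative label per interaction, a summary could look like the following sketch. The label names and sample values are assumptions made for illustration.

```python
from collections import Counter

FEEDBACK_LABELS = ("positive", "neutral", "negative")  # assumed label set

def feedback_summary(labels):
    """Percentage of interactions per feedback label."""
    counts = Counter(labels)
    total = sum(counts[label] for label in FEEDBACK_LABELS)
    if total == 0:
        return {}
    return {label: 100.0 * counts[label] / total for label in FEEDBACK_LABELS}

# Made-up sample: one label per rated interaction.
feedback = ["positive", "positive", "neutral", "negative", "positive"]
summary = feedback_summary(feedback)
```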

Structuring the Data Capture

To compute the above metrics, log as much context about the interaction as possible.

Capture ground truth – In a real-world scenario, it can be difficult to capture ground truth. But it’s easily done in a controlled experimental setup. Business analysts and consultants who monitor bot rollout experiments should capture the expected use case, that is, the ground truth for intents and entities. This ensures that the intents and entities recognized by the bot can be compared against it in a structured way. In our data collection exercise, we stored information on the expected intent from the scenario that was presented to the user. This avoided the need for annotation, which can be time-consuming and expensive.

Determining one turn and one conversation – You’ll need enough identifiers to identify one turn from another as well as the number of turns in a conversation. It’s also important to distinguish between a turn within a conversation and one in a new conversation.

User session – Understand if the user closed the conversation after fulfillment or if it was closed part way through out of frustration. Did the user make another attempt and was it successful? Capture a unique ID for each user and their session to determine this.

Decoding the chaos with a confusion matrix – A confusion matrix helps derive insights on intent performance and confusable intents. It also gives you a picture of how changes in the bot affect precision, recall and f1 scores. The next blog in this series will explore this in detail.
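As a small preview, a confusion matrix can be built by counting (expected, predicted) intent pairs, and precision, recall and F1 follow from those counts. The sketch below uses only the Python standard library; the intent names and sample pairs are made up.

```python
from collections import Counter

def confusion_matrix(pairs):
    """Count how often each (expected, predicted) intent pair occurred."""
    return Counter(pairs)

def precision_recall_f1(matrix, intent):
    """Per-intent precision, recall and F1 from the pair counts."""
    tp = matrix[(intent, intent)]
    fp = sum(n for (exp, pred), n in matrix.items() if pred == intent and exp != intent)
    fn = sum(n for (exp, pred), n in matrix.items() if exp == intent and pred != intent)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up sample: (expected intent, intent predicted by the bot).
pairs = [
    ("check_balance", "check_balance"),
    ("check_balance", "check_balance"),
    ("check_balance", "transfer_funds"),
    ("transfer_funds", "transfer_funds"),
    ("transfer_funds", "check_balance"),
    ("transfer_funds", "transfer_funds"),
]
matrix = confusion_matrix(pairs)
scores = precision_recall_f1(matrix, "check_balance")
```

Off-diagonal entries such as `matrix[("check_balance", "transfer_funds")]` point to confusable intents.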

In the second blog in the series, we discussed the importance of repeating some scenarios to the same customer to understand user behavior on repeat scenarios. In addition to ground truth and the user session identifier, we also flagged scenarios that were repeated to the user on purpose.
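The identifiers and flags discussed in this section could be captured as one structured record per turn. The sketch below is only an assumed shape: every field name in `TurnRecord` is hypothetical, and the record is serialized as a JSON line for later analysis.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class TurnRecord:
    # All field names are hypothetical; adapt them to your own logging schema.
    user_id: str                     # unique ID for the user
    session_id: str                  # distinguishes repeat attempts by the same user
    conversation_id: str             # separates one conversation from the next
    turn_index: int                  # position of this turn within the conversation
    utterance: str                   # what the user said
    expected_intent: str             # ground truth from the scenario shown to the user
    predicted_intent: Optional[str]  # what the bot recognized, if anything
    repeated_scenario: bool = False  # scenario was repeated to the user on purpose

record = TurnRecord(
    user_id="u42",
    session_id="s1",
    conversation_id="c7",
    turn_index=0,
    utterance="How much money do I have?",
    expected_intent="check_balance",
    predicted_intent="check_balance",
)

# One JSON object per line keeps the log easy to append to and to parse.
log_line = json.dumps(asdict(record))
```

With `conversation_id` and `turn_index` on every record, turns per conversation can be counted directly, and `session_id` makes it possible to tell whether a user came back for another attempt.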

Automation of Metrics

Two steps for automating insights are defining metrics and structuring data to capture them. If the data is structured properly, you can calculate metrics using assumptions and a formula. However, not all metrics can be automated without compromising accuracy.

Error tolerance is an important parameter in determining whether to capture a metric using automation. Manual annotation is the expensive, time-consuming alternative to automated metrics. So it’s a business call: balance the precision you need from each metric against the time and effort of the human annotation that an extremely high level of accuracy would require.

The next blog post in this series will examine metrics and insights gained in our specific data collection exercise as it relates to the banking domain.

This blog was co-authored by Aravind Ganapathiraju.

Harshali Desai

Harshali is a Genesys Product Manager in artificial intelligence (AI), with a focus on innovation and research in Conversational AI. She works closely with the AI Applied Research team at...