Introduction
In the evolving landscape of contact centers, Large Language Models (LLMs) have emerged as pivotal tools. However, the effectiveness of these models is often obscured by traditional evaluation benchmarks that fail to capture the nuances of real-world interactions. One critical aspect often overlooked is the generative nature of LLM outputs, which presents unique challenges for evaluation. Unlike discriminative or classification tasks, where response structures are more predictable and standardized, the creative and varied responses generated by LLMs demand a distinct approach to assessment. This blog delves into the intricacies of evaluating LLMs in contact center environments, highlighting the shortcomings of generic benchmarks and emphasizing the need for evaluation techniques that specifically address the generative and dynamic outputs of these models.
The Shortcomings of Generic Benchmarks
In evaluating Large Language Models (LLMs) for contact centers, we recognize that standard benchmarks like BLEU or ROUGE, which focus on lexical (word) overlap, are insufficient. They fall short because they don't capture the semantic context or the true meaning behind the generated responses. In contact center interactions such as sales, customer service, or technical support, understanding and accurately responding to the nuanced and varied needs of customers goes beyond mere word matching. This highlights the need for more sophisticated evaluation methods tailored to the unique demands of contact center communications.
Call Example:
Agent: How may I help?
Customer: I wanted to know if my current plan is active and can I get multi-user feature and premium access? I really felt the need for them.
Agent: Sure, I can see that a plan of $249 a month was active till last month on your account. If you need our new multi-user feature, we are bundling the same at the existing tier of $249 a month for which you need to re-activate the plan from your account. Additionally, premium access is priced at another $49.99 a month.
Customer: Oh I see... you mean it was active last month. I need to think about premium access as $49 is high.
Agent: Sure, please let us know. I will check with the team if we can waive the first month for you.
Customer: Sure thanks, bye.
- Query: How did the agent reply to the question asked by the customer: “Is multi-user feature and premium access active on my account?”
- LLM-A Response: The agent mentioned that the feature is already included in the payment plan of $249 that was active last month. A top-up plan needs to be added for premium access at another $50 per month.
- LLM-B Response: The reply by the agent was that the feature can be enabled in the payment plan of $249 per month, by re-activating the plan. The customer can also top it up by $49.99 a month for premium access.
Scenario 1: Inconsistency and Hallucinations
In the example above, responses A and B exhibit significant lexical similarity, using very similar words while conveying substantially divergent interpretations. Response A is inaccurate: it erroneously suggests that the multi-user feature is included in the customer's existing plan, a claim contradicted by the conversation, and it cites the cost of premium access as $50 rather than the actual $49.99, a hallucination error. Response B, by contrast, accurately reflects the details of the conversation and is the factual response.
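To make this concrete, here is a minimal sketch using the open-source rouge-score package. It treats LLM-B's factually correct answer as the reference and scores LLM-A's response against it; the lexical-overlap metric assigns high scores even though LLM-A hallucinates details.

```python
# A minimal sketch using the open-source rouge-score package (pip install rouge-score).
# We treat LLM-B's factually correct response as the reference and score LLM-A against it.
from rouge_score import rouge_scorer

reference = (
    "The reply by the agent was that the feature can be enabled in the payment plan "
    "of $249 per month, by re-activating the plan. The customer can also top it up "
    "by $49.99 a month for premium access."
)  # LLM-B: factually correct
candidate = (
    "The agent mentioned that the feature is already included in the payment plan "
    "of $249 that was active last month. A top-up plan needs to be added for premium "
    "access at another $50 per month."
)  # LLM-A: contains hallucinated details

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
for name, score in scorer.score(reference, candidate).items():
    # High F-measures here, even though LLM-A misstates the plan status and the price.
    print(f"{name}: f={score.fmeasure:.2f}")
```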
Scenario 2: Incompleteness and Low Coverage
Additionally, language models have to understand the semantics and specifics of the domain's language, including the terminology used in a support call, while also picking up on the significant events that establish context. For instance, in a sales call, the model should be able to summarize not only what the customer and sales agent said but also capture the customer's hesitations or preferences.
Using the same example as above, while LLM-B's response is factually correct, it fails to mention the customer's apprehension about purchasing premium access and the agent's offer to look into a possible discount. Thus, LLM-B's response falls short on the “Completeness” aspect.
Traditional benchmarks fail to notice when these important details are missing or when the summary doesn't fully capture the essence of the conversation. We have observed such missing details degrade the coverage quality of generated summaries in ways that ROUGE scores fail to reflect: what these metrics report and what humans perceive in real-world scenarios can be starkly different. This disconnect underscores the need for more sophisticated, context-aware evaluation metrics that accurately reflect the nuanced requirements of contact center interactions.
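One way to surface such gaps, shown here as an illustrative sketch rather than our production metric, is to list the key points a complete summary should cover and check each one against the summary with sentence embeddings (using the sentence-transformers library; the key points and threshold below are hand-picked assumptions for this example).

```python
# An illustrative semantic coverage check (not our production metric): each key point
# of the call is matched to its closest summary sentence by embedding similarity, and
# points below a threshold count as missed coverage. Key points and the threshold are
# hand-picked assumptions for this example.
from sentence_transformers import SentenceTransformer, util

key_points = [
    "A $249/month plan was active until last month and must be re-activated.",
    "Premium access costs an extra $49.99 per month.",
    "The customer is hesitant about the price of premium access.",
    "The agent offered to check whether the first month can be waived.",
]
summary_sentences = [
    "The multi-user feature can be enabled in the $249 per month plan by re-activating it.",
    "Premium access is available as a $49.99 per month add-on.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
point_emb = model.encode(key_points, convert_to_tensor=True)
summary_emb = model.encode(summary_sentences, convert_to_tensor=True)

similarity = util.cos_sim(point_emb, summary_emb)  # shape: (key points, summary sentences)
covered = similarity.max(dim=1).values > 0.6       # 0.6 is a tuning choice, not a fixed rule
print(f"Coverage: {covered.sum().item()} of {len(key_points)} key points mentioned")
```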
Towards Domain-Specific Evaluation
Considering the challenges outlined in the previous section, it's imperative to adopt a more comprehensive and nuanced approach that encompasses various aspects of evaluation, as elaborated below:
Evaluating Task-Specific Competence: In a contact center, summarization tasks vary widely, from condensing customer complaints, to outlining the outcomes of sales calls, to extracting the values of entities mentioned in the call. Evaluating an LLM's performance in these specific tasks involves assessing its ability to capture key issues, solutions offered, and customer sentiments accurately. This would test the model's functional competence in understanding and conveying the essence of diverse conversations.
Handling Data Complexity: In the realm of contact center call transcripts, Large Language Models (LLMs) grapple with two notable challenges: incomplete calls and conversations spanning multiple topics. When dealing with incomplete calls, LLMs tend to add details, such as next steps, that make sense logically but were never actually mentioned in the conversation. This phenomenon, known as hallucination, can disrupt the consistency of the generated responses; a simple entailment-based check for it is sketched after these evaluation aspects. On the other hand, in calls discussing multiple subjects, LLMs may unintentionally leave out important information, impacting the completeness of the summary. Balancing the need to fill gaps in incomplete calls without hallucinating and ensuring that multi-topic discussions are summarized comprehensively is crucial for the effectiveness of LLMs in handling complex contact center communications.
Alignment with Human-Centric Metrics: In tasks such as Summarization, Knowledge AI, and AutoQA, it is crucial to ensure that the evaluation of these generative models is closely aligned with human-centric metrics. This alignment is necessary to accurately measure the real-world practicality of the models' outputs. For instance, in business applications, what makes a summary valuable is often its ability to inform and prompt effective follow-up actions, so an evaluation metric that specifically assesses how well a summary covers actionable next steps can be highly beneficial. Likewise, for Knowledge AI systems, the inclusion of reference citations in generated responses is essential for trustworthiness and utility; a system that fails to provide such citations may be considered less reliable, even if the accuracy of the generated response is very high.
Adaptability to Varied Conversational Contexts: The context in which a conversation occurs in a contact center can vary significantly. For instance, summarizing a technical support call requires a different understanding and focus than summarizing a product inquiry call. Evaluating an LLM's ability to discern and adapt to these varied contexts, and to reflect them in its summaries, is crucial. The model should identify not only the main topic of the conversation but also the subtleties that differentiate various types of customer interactions.
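Returning to the hallucination issue raised under “Handling Data Complexity”, the sketch below checks whether each summary sentence is entailed by the transcript using an off-the-shelf NLI model. The model choice, threshold, and example sentences are illustrative assumptions, not the framework described in the next section.

```python
# A rough sketch (not our production pipeline) of an entailment-based consistency check:
# each summary sentence is scored against the transcript with an off-the-shelf NLI model,
# and low-entailment sentences are flagged as potential hallucinations.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that the premise (transcript) entails the hypothesis (summary sentence)."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    return probs[2].item()  # roberta-large-mnli label index 2 = ENTAILMENT

transcript = (
    "Agent: A plan of $249 a month was active till last month on your account. "
    "The multi-user feature is bundled at $249 a month if you re-activate the plan. "
    "Premium access is priced at another $49.99 a month."
)
summary_sentences = [
    "The multi-user feature is already included in the customer's active plan.",  # hallucinated
    "Premium access costs an additional $49.99 per month.",                       # supported
]

for sentence in summary_sentences:
    p = entailment_prob(transcript, sentence)
    status = "supported" if p > 0.5 else "possible hallucination"  # illustrative threshold
    print(f"{p:.2f}  {status}: {sentence}")
```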
The Path Forward - A Demonstration of a Summarization Evaluation Framework
Combining the observations and situations discussed above, we demonstrate how we have put together an evaluation framework for one of several generative AI applications: Summarization. We sample calls across different distributions (long and short calls, incomplete and complete calls, sales and support calls) and run them through an evaluation framework as represented in the diagram below.
In this framework, we evaluate the summaries across different criteria. Our evaluation metric consists of three main components:
1. Faithfulness: Summaries must factually align with the original conversation, including accurate details of entity values, speaker attribution, and action status.
2. Completeness: Effective summaries encompass all critical aspects of an interaction, extending beyond the customer's main concern to include specific requests and agent actions and commitments.
3. Coherence: Summaries should present information in a clear, logical order that reflects the conversation's natural flow.
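To make these criteria concrete, here is an illustrative sketch of a rubric-style, LLM-as-judge scorer. The prompt wording, the 1-5 scale, and the judge callable are assumptions for this example, not the exact rubric used in our framework.

```python
# An illustrative LLM-as-judge rubric for the three criteria. The prompt wording, the
# 1-5 scale, and the `judge` callable are assumptions for this sketch, not the exact
# rubric used in our framework.
import json
from typing import Callable

RUBRIC_PROMPT = """You are grading a contact-center call summary.

Conversation transcript:
{transcript}

Candidate summary:
{summary}

Score the summary from 1 (poor) to 5 (excellent) on each criterion:
- faithfulness: every claim is supported by the transcript (no hallucinations)
- completeness: key issues, requests, agent actions and commitments are all covered
- coherence: the summary follows the conversation's logical flow

Respond with JSON only: {{"faithfulness": n, "completeness": n, "coherence": n}}"""

def score_summary(transcript: str, summary: str, judge: Callable[[str], str]) -> dict:
    """Ask a judge model for rubric scores; `judge` maps a prompt string to raw model text."""
    prompt = RUBRIC_PROMPT.format(transcript=transcript, summary=summary)
    return json.loads(judge(prompt))

# Usage with any LLM client that exposes a text-in/text-out call, e.g.:
#   scores = score_summary(transcript, candidate_summary, judge=my_llm_client.complete)
```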
Once the evaluation metrics are obtained along with user feedback, the Calibrated Quality Score (CQS) is computed as a combination of the three metrics above. We collect and analyze end-user feedback to estimate and refine this score, aligning our evaluation algorithms with user needs and ensuring that the score correlates with user perception. This dynamic, iterative process minimizes the gap between model output and user perception, making the CQS a continually evolving indicator of quality. The evaluation scores and user feedback are fed back into the summarization model in the form of a feedback loop, allowing it to continuously improve by learning from its weaknesses.
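As a simplified illustration of how such a calibration might work (an assumption for this sketch, not the exact CQS formulation), the example below fits weights for the three metric scores against normalized user ratings and then uses the fitted weights to score new summaries.

```python
# A simplified illustration (not the exact CQS formulation) of calibrating a combined
# quality score against user feedback: fit weights for the three metric scores so the
# weighted score tracks normalized user ratings. The numbers are made up for the sketch.
import numpy as np

# Per-summary automatic scores: columns = faithfulness, completeness, coherence.
metric_scores = np.array([
    [0.90, 0.60, 0.80],
    [0.70, 0.90, 0.70],
    [0.50, 0.40, 0.60],
    [0.95, 0.80, 0.90],
])
user_ratings = np.array([0.7, 0.8, 0.4, 0.9])  # normalized end-user feedback

# Least-squares fit of metric weights to user ratings; re-run as new feedback arrives.
weights, *_ = np.linalg.lstsq(metric_scores, user_ratings, rcond=None)

def calibrated_quality_score(faithfulness: float, completeness: float, coherence: float) -> float:
    return float(np.dot(weights, [faithfulness, completeness, coherence]))

print("fitted weights:", np.round(weights, 2))
print("CQS for a new summary:", round(calibrated_quality_score(0.8, 0.7, 0.85), 2))
```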
In conclusion, the nuanced and multifaceted approach to evaluating Large Language Models (LLMs) in contact centers, as outlined in this blog, emphasizes the importance of moving beyond generic benchmarks towards domain-specific evaluation metrics. The framework demonstrated here for summarization, focused on the facets of faithfulness, completeness, and coherence, provides a more accurate reflection of an LLM's performance in real-world settings. The principles and methodologies discussed can be extended to other Generative AI applications, such as AutoQA and Knowledge AI. This shift towards refined evaluation frameworks represents a crucial advancement in harnessing the extensive capabilities of LLMs, ensuring their alignment with the nuanced requirements of specific applications.