All LLMs Aren’t Equal
The promise of LLMs is undeniable. They analyze interactions at scale, summarize conversations, uncover insights, and even guide agent responses. But let’s bust two common myths:
- There isn’t a one-size-fits-all LLM.
- Bigger doesn’t always mean better.
While large LLMs tend to be more capable, they also come with massive costs. They require more computation, more time, and a larger budget, especially when teams end up prompting the model multiple times just to obtain a valid result.
Enter domain-specific or small language models (SLMs). These models are cost-effective and trained for specific tasks, like analyzing contact center conversations. But they have limitations, too: they don’t self-evaluate or refine their responses the way larger models can.
The gap becomes even more noticeable when you feed each model type the same set of prompts. This brings us to the second problem, which lies not just in the models themselves but in how we interact with them.
Prompting Is Hard – And That’s The Point
Prompting a model may sound simple, but it’s deceptively complex. Humans naturally ask questions in varied ways. And as we've seen, different models interpret these prompts differently. Ask the same question twice—or ask it slightly differently—and you’ll often get wildly different responses. The result? Inconsistent answers, frustration, and wasted time.
When you’re analyzing millions of customer interactions daily, or relying on AI to assist agents in real time, inconsistency isn’t just frustrating—it’s unsustainable. You can’t afford trial-and-error prompting, nor should your results hinge on who wrote the prompt or how they phrased it.
This phenomenon, known as AI variance, poses a serious problem in high-volume environments like contact centers. You need a system to make every prompt work—every time.
So what’s the solution? It starts with optimizing how you prompt the AI.
And it ends with the flexibility to use different types of LLMs based on the use case and desired outcome, not just large, costly ones, without sacrificing quality.
Introducing PROPEL: Observe.AI’s Prompt Optimization Framework
That’s precisely where PROPEL comes in. Whether you’re using a small domain-specific model or a large general-purpose one, PROPEL tailors each prompt to the strengths and constraints of that LLM, so you’re never locked into a single model. Instead, you can mix and match the best tools for each job, all while ensuring consistent, high-quality outcomes.
PROPEL takes the art of prompting and turns it into a repeatable science. It systematically refines prompts through an automated loop of generation, evaluation, and improvement, yielding high-quality, consistent outputs from any LLM, without relying on expensive models or expert prompt engineers.
How PROPEL Works
PROPEL utilizes a collaborative AI system made up of three LLM roles:
- Responder LLM: A cost-efficient model that generates the final response based on an optimized prompt.
- Judge LLM: A powerful model that evaluates the output using predefined quality criteria (e.g., clarity, completeness, relevance).
- Optimizer LLM: A model that uses feedback from the Judge and “Expert Priors” to improve the prompt for the next iteration.

This loop continues until the system generates the best possible prompt for the given task and model.
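To make this concrete, here is a minimal Python sketch of how the three roles might fit together; the callables, scoring scheme, and stopping rule are illustrative assumptions rather than Observe.AI’s production implementation.

```python
from typing import Callable, Dict, List, Tuple

def propel_optimize(
    task: str,
    seed_prompt: str,
    responder: Callable[[str, str], str],                 # (prompt, task) -> response
    judge: Callable[[str, str, str], Tuple[float, str]],  # (task, prompt, response) -> (score, feedback)
    optimizer: Callable[[str, str, Dict[str, List[str]]], str],  # (prompt, feedback, priors) -> new prompt
    expert_priors: Dict[str, List[str]],
    max_iters: int = 10,
    target_score: float = 0.9,
) -> Tuple[str, float]:
    """Iteratively refine a prompt for a specific Responder LLM."""
    prompt = seed_prompt
    best_prompt, best_score = seed_prompt, float("-inf")

    for _ in range(max_iters):
        # 1. Responder: the cost-efficient model drafts a response to the current prompt.
        response = responder(prompt, task)

        # 2. Judge: a stronger model scores the response against predefined quality
        #    criteria (clarity, completeness, relevance) and explains what fell short.
        score, feedback = judge(task, prompt, response)
        if score > best_score:
            best_prompt, best_score = prompt, score
        if score >= target_score:
            break

        # 3. Optimizer: rewrites the prompt using the Judge's feedback plus the
        #    Expert Priors (do/avoid rules tuned to this Responder model).
        prompt = optimizer(prompt, feedback, expert_priors)

    return best_prompt, best_score
```

The expert_priors argument is where the next ingredient comes in.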
The Secret Weapon: Expert Priors
Expert Priors are essentially a cheat sheet for each model. They guide the AI to craft better prompts by identifying what each model responds to best and what to avoid. For example:
- Do: Use direct instructions, examples, and simple language.
- Don’t: Use ambiguous language, overly abstract tasks, or long-winded phrasing.
PROPEL starts from a library of prompt-design principles, such as "Use direct instructions," "Provide examples," "Ask the model to think step-by-step (Chain-of-Thought)," and "Avoid ambiguity," and systematically tests how well the target Responder LLM follows dozens of them:
- Principles the model follows well become Emphasis Rules (things to DO) for the Optimizer.
- Principles the model struggles with become Avoidance Rules (things to AVOID).
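In code, that sorting step might look like the sketch below; the principle list is truncated, and apply_principle and judge_score are hypothetical helpers standing in for “rewrite the task prompt to follow this principle” and “run the Responder and have the Judge score its output.”

```python
from statistics import mean
from typing import Callable, Dict, List

# A handful of the prompt-design principles PROPEL probes; the real library covers dozens.
PRINCIPLES = [
    "Use direct instructions",
    "Provide examples",
    "Ask the model to think step-by-step (Chain-of-Thought)",
    "Avoid ambiguity",
]

def build_expert_priors(
    benchmark_tasks: List[str],
    apply_principle: Callable[[str, str], str],  # (task, principle) -> prompt that follows the principle
    judge_score: Callable[[str], float],         # run the Responder on a prompt; Judge scores its output
    threshold: float = 0.7,
) -> Dict[str, List[str]]:
    """Sort principles into Emphasis (DO) and Avoidance (AVOID) rules for one Responder LLM."""
    emphasis, avoidance = [], []
    for principle in PRINCIPLES:
        scores = [judge_score(apply_principle(task, principle)) for task in benchmark_tasks]
        if mean(scores) >= threshold:
            emphasis.append(principle)   # the model follows this principle well: DO
        else:
            avoidance.append(principle)  # the model struggles with it: AVOID
    return {"emphasis": emphasis, "avoidance": avoidance}
```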
For example, for a specific summarization task using a Llama 1B model, the Expert Priors might include:
- Emphasis Rules (DO): Use direct instructions; Provide examples; Use affirmative directives ("Do this"); Use simple words; Mention desired output format.
- Avoidance Rules (AVOID): Do not rewrite prompts at a high level; Do not decompose tasks; Do not use complex terms; Do not leave instructions ambiguous.
These priors allow the Optimizer LLM to intelligently craft prompts that play to the Responder LLM's strengths while sidestepping its weaknesses, leading to significantly better results.
In our testing, using Expert Priors improved response quality by up to 21%.
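As an illustration, the Llama 1B priors above can be written down as plain data and folded into the Optimizer LLM’s instructions alongside the Judge’s feedback; the optimizer_instructions helper and its meta-prompt wording below are assumptions for the sketch, not PROPEL’s actual prompt.

```python
from typing import Dict, List

# The example Expert Priors for the Llama 1B summarization task, as plain data.
LLAMA_1B_SUMMARIZATION_PRIORS: Dict[str, List[str]] = {
    "emphasis": [
        "Use direct instructions",
        "Provide examples",
        "Use affirmative directives ('Do this')",
        "Use simple words",
        "Mention the desired output format",
    ],
    "avoidance": [
        "Do not rewrite prompts at a high level",
        "Do not decompose tasks",
        "Do not use complex terms",
        "Do not leave instructions ambiguous",
    ],
}

def optimizer_instructions(current_prompt: str, judge_feedback: str, priors: Dict[str, List[str]]) -> str:
    """Build the Optimizer LLM's instructions from Judge feedback and Expert Priors."""
    do_rules = "\n".join(f"- {rule}" for rule in priors["emphasis"])
    avoid_rules = "\n".join(f"- {rule}" for rule in priors["avoidance"])
    return (
        "Rewrite the prompt below so a small summarization model can follow it.\n\n"
        f"Judge feedback on the last attempt:\n{judge_feedback}\n\n"
        f"When rewriting, DO:\n{do_rules}\n\n"
        f"When rewriting, AVOID:\n{avoid_rules}\n\n"
        f"Current prompt:\n{current_prompt}"
    )
```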
PROPEL in Action
Before:
“Summarize the agent’s performance, focusing on professionalism, adaptability, and communication skills.”
This sounds fine, but it’s vague. An LLM might return a generic summary or skip key areas. Even a more detailed, hand-written starting point still leaves a lot open to interpretation:
Initial Prompt:
Summarize the agent’s performance, focusing on positive attributes or skills that were shown throughout their engagement. The summary should include an examination of professionalism and adaptability in communication, problem-solving techniques, emotional intelligence, and active listening skills, strict adherence to guidelines, and the ability to communicate clearly. The summary should follow these formatting rules: 1. Begin each point with a verb, limited to one or two sentences for each point. 2. A maximum of five key points that holistically cover the agent’s engagement. 3. Each point must be distinct and offer unique, context-related observations.
After (via PROPEL):
A structured, detailed prompt that sets clear expectations, formatting rules, and evaluation criteria, enabling the LLM to produce sharper, more actionable insights.
Optimized Prompt:
Identify and describe up to five positive actions displayed by a customer service agent in a hypothetical interaction. These actions should encompass the following key performance areas:
- Professional language and tone: Evaluate the agent’s selection of words and how the tone was adapted to fit the situation.
- Effective problem-solving: Inspect the agent’s logical reasoning, prioritization ability, and use of proactive strategies.
- Active listening and empathy: Identify instances where the agent showed understanding, validated emotions, or gave personalized responses.
- Procedure compliance: Assess if the agent adhered to protocols yet exhibited flexibility when needed.
- Clear communication: Judge how clearly the agent communicated, and check if the explanations provided were suitable for the customer.
Required output format:
- Commence each point with an action verb, framing the agent’s action in one to two precise sentences.
- Your points should collectively provide a thorough coverage of the interaction.
- Keep your insights tailored to the specific scenario, making sure to avoid repeating information.
Penalties will be imposed in cases of deviation from prescribed instructions. Utilize simplistic language and strictly observe the outlined format. For instance, a point could be: "Providing reassurance, the agent skillfully varied his tone, effectively solved the problem by prioritizing steps, expressed genuine empathy by acknowledging the customer’s frustration, adhered to protocols with room for flexibility, and conveyed clear, easy-to-understand explanations."
Strive to provide a detailed critique of the agent’s performance. Remember to highlight unique instances that demonstrate the agent’s proficiency across different areas.
This transformation ensures the AI understands not just what to do, but how to do it effectively.
Real Results: 10-24% Gains in Output Quality
We tested PROPEL across core contact center tasks like summarization and entity extraction. The impact was clear:
- Summarization: 10-24% quality improvement.
- Entity Extraction: 5-16% quality improvement.
Across models from 1B to 8B parameters, PROPEL consistently outperformed manual and other automated prompt strategies.
Why This Matters for Contact Centers
With PROPEL, contact centers can:
- Achieve consistent, high-quality results without relying on prompt engineers or massive LLMs
- Maximize ROI from smaller, domain-specific models
- Automate workflows like Auto QA, sentiment analysis, and coaching insights
- Future-proof performance as models evolve and expand
In short, prompt optimization isn’t a nice-to-have; it’s the linchpin for scalable, reliable AI in the contact center and beyond.
Let’s Unlock the Full Power of AI, Together
At Observe.AI, we’re not just building powerful AI—we’re making it practical and performant for the real-world contact center. PROPEL is a testament to that commitment.
By embedding automated prompt optimization directly into our platform, we help you:
- Drive better decisions through consistent, accurate insights
- Scale AI adoption without scaling complexity
- Empower every user, not just technical experts, to get more from AI
Ready to see how optimized prompting can revolutionize your contact center operations? Explore Observe.AI’s capabilities and reach out for a consultation on how PROPEL can help your team unlock the full potential of AI, intelligently prompted.