

Why Qualitative Research Experts Are the Secret Weapon for Generative AI Evaluation

To my colleagues in digital humanities, anthropology, and experts in qualitative research: you're sitting on a goldmine and don't even know it.

While everyone's trying to learn "prompt engineering," you are the true experts because you have been doing this for your entire career.

You know how to write precise, nuanced questions that extract meaningful responses.

You understand context, subtext, and how framing shapes answers. You're experts in context analysis, iterative inquiry, and grounded theory.

You evaluate sources critically, find biases and inconsistencies, and know that the most interesting insights often lie in what's not being said.

You understand that interpretation requires both rigor and creativity.

These aren't just transferable skills; they're the skills that matter most when designing, evaluating, and aligning LLMs or AI agents.

The AI world is busy teaching engineers to think like qualitative researchers.

Meanwhile, you already think like AI researchers. You just need to learn the syntax. When this clicks for our field, we won't just be catching up; we'll be leading the conversation on responsible AI evaluation and human-centered AI systems design.

The skills to evaluate AI agents are not that different from evaluating humans.

As I've specialized more in AI agent evaluation, I can't stop thinking about how my colleagues and I, as engineers, are 'discovering' methodologies that you have known for ages.

You are so well equipped to help design evaluation or fine-tuning datasets and you don't even know it.

The future of AI isn't just engineering. It is more human than ever. And that's your territory.

My Journey with Evaluation-Driven Development

Over the past year, I've been working on building generative AI systems that actually solve real problems. What I've learned is that the biggest challenge isn't getting an AI to generate responses—it's ensuring those responses are reliable, useful, and continuously improving based on real user feedback.

This led me to develop what I call an "evaluation-driven development" approach. Instead of building first and evaluating later, I've structured my entire workflow around feedback and evaluation from day one. Here's what I've discovered works.

GenAI Evals Maturity Pyramid

The Framework I Use

My evaluation-driven development process consists of seven interconnected phases that I've refined through multiple projects:

Knowledge Base Understanding: I start by identifying requirements and assessing data (taxonomy, metadata) to understand the knowledge base segments I'm working with.

Foundation Building: Next, I establish a baseline RAG (Retrieval-Augmented Generation) setup as my technical foundation.

Enhanced Processing: I implement optimized data processing, search enhancements, response generation, and prompt engineering.

Observability: I deploy comprehensive logging, dashboards, monitoring, and user feedback systems to gain visibility into performance.

Review Evaluation: My multi-level evaluation includes unit tests, human + model evaluation, and A/B testing to measure effectiveness.

Optimization: Based on feedback and prioritization, I analyze data, improve metadata (particularly domain-specific), and review agent architecture.

Iterative Improvement: Finally, I implement deployment and monitoring improvements, trace analysis and curation, and integrate data updates.

AI Augmented Evaluation Workflow

Who I Collaborate With

This process has taught me the importance of cross-functional collaboration. I work closely with:

  • Product Managers and UX Engineers who focus on interface design
  • Domain Experts who provide specialized knowledge
  • Data Engineers who manage infrastructure
  • Client Data Engineers who bridge domain and technical needs
  • Beta testers and early adopters who provide real-world validation

My Core Goals

Through this work, I've focused on four key objectives:

  1. Create simple, intuitive UIs with clear feedback visualization
  2. Aggregate and prioritize user feedback for targeted iterations
  3. Understand knowledge base segments for comprehensive coverage
  4. Establish reliable measurement baselines for ongoing improvements

This systematic approach has transformed my standard observability practices into actionable insights, ensuring my solutions evolve in response to real user needs and quantifiable metrics.

Phase 1: Understanding the Knowledge Base

Goal: Understand KB Segments

This initial phase has become the foundation of my approach. I've learned that comprehensive knowledge base analysis before any implementation is crucial for success.

What I Focus On:

Discovery & Requirements
  • Identifying the specific knowledge domains needed for the solution
  • Mapping user query patterns to determine information needs
  • Documenting domain-specific terminology and concepts

Data Assessment
  • Taxonomy development to categorize information hierarchically
  • Metadata creation to enhance searchability and relationships
  • Segmentation of knowledge into functional areas

Smart Data Ingestion
  • Building custom pipelines that ingest domain-specific data sources
  • Creating intelligent metadata automatically
  • Preparing baseline content (v0)

I've found that this foundational phase ensures I have properly organized and understood the knowledge segments before proceeding to build the RAG baseline. By thoroughly analyzing the knowledge requirements upfront, I create a more effective foundation for all subsequent phases.
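To make the taxonomy and metadata work concrete, here is a minimal sketch of how a knowledge-base chunk could be represented with Pydantic. The model and its field names are illustrative assumptions, not a schema I prescribe; your domains and metadata fields will look different.

```python
# A minimal sketch of a knowledge-base chunk with taxonomy and metadata.
# Field names (domain, subdomain, source_system, ...) are illustrative only.
from datetime import date
from typing import Optional

from pydantic import BaseModel, Field


class KBChunk(BaseModel):
    chunk_id: str
    text: str
    # Taxonomy: a hierarchical classification such as "HR / Benefits / Pension"
    domain: str
    subdomain: Optional[str] = None
    # Metadata that improves retrieval filtering and traceability
    source_system: str = Field(description="e.g. SharePoint, Confluence, CRM")
    document_title: str
    last_updated: date
    language: str = "en"


chunk = KBChunk(
    chunk_id="hr-0042",
    text="Employees become eligible for the pension plan after six months.",
    domain="HR",
    subdomain="Benefits",
    source_system="SharePoint",
    document_title="Employee Benefits Handbook",
    last_updated=date(2025, 1, 15),
)
print(chunk.model_dump_json(indent=2))
```

Having even this small amount of structure pays off later, because retrieval can filter on domain and source, and evaluation results can be sliced per knowledge segment.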

Phase 2: Foundation Building - RAG Baseline Setup

Goal: Setup RAG Baseline

This second phase creates the technical foundation upon which all my subsequent enhancements are built.

Key Components I Implement:

Synthetic Testing Framework
  • Utilizing RAGAs and DeepEvals for standardized evaluation
  • Implementing synthetic data validation with Argilla or custom UI
  • Validating test questions with domain experts to ensure relevance

Streamlined Testing Approach
  • Implementing simple testing using pytest
  • Focusing on basic result generation rather than complex metrics initially
  • Using straightforward logging and response analysis techniques

Experimentation Tracking Infrastructure
  • Setting up MLFlow for systematic experiment management
  • Establishing capabilities to track experiment history
  • Creating version control for all test configurations

Automated Evaluation with LLMs
  • Implementing LLM-as-a-Judge evaluation methodology
  • Beginning with standard RAGAS metrics as baseline measurements
  • Building capabilities to generate evaluation reports and insights
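To make the "simple testing using pytest" point concrete, here is a minimal sketch of what a baseline test file might look like. `my_rag_app` and `rag_answer` are hypothetical names for whatever module wraps your pipeline, and the assertions are deliberately cheap structural checks that avoid calling an LLM judge.

```python
# test_rag_baseline.py -- minimal pytest sketch for cheap baseline checks.
# `rag_answer` is a hypothetical wrapper around the RAG pipeline that returns
# the generated answer plus the retrieved source chunks.
import pytest

from my_rag_app import rag_answer  # hypothetical module name


SMOKE_QUESTIONS = [
    "What is the pension eligibility period?",
    "How do I request parental leave?",
]


@pytest.mark.parametrize("question", SMOKE_QUESTIONS)
def test_answer_is_grounded_in_retrieved_sources(question):
    result = rag_answer(question)

    # Cheap structural assertions: no LLM-as-a-judge calls at this level.
    assert result["answer"].strip(), "Answer should not be empty"
    assert len(result["sources"]) > 0, "At least one chunk should be retrieved"
    assert len(result["answer"]) < 4000, "Answer should stay within length budget"


def test_refuses_out_of_scope_question():
    result = rag_answer("What is the weather in Lisbon today?")
    # Baseline behaviour: out-of-scope questions should not fabricate sources.
    assert result["sources"] == [] or "I don't know" in result["answer"]
```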

This phase prioritizes creating a solid, measurable foundation rather than advanced features. By establishing clear baselines and testing methodologies, I create a framework that allows for data-driven improvements in subsequent phases.
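For the LLM-as-a-Judge piece, a minimal sketch of the evaluation loop, with results tracked in MLFlow, might look like the following. `call_judge_llm` is a placeholder for whatever LLM client you use, and the 0-to-1 scoring scheme is an assumption on my part rather than a standard.

```python
# Minimal LLM-as-a-Judge loop with MLflow experiment tracking.
# `call_judge_llm` is a placeholder for your LLM client of choice.
import json
import statistics

import mlflow

JUDGE_PROMPT = """You are grading a RAG answer against a reference answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Return JSON: {{"correctness": <0.0-1.0>, "reasoning": "<short explanation>"}}"""


def call_judge_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in your LLM client here (OpenAI, Anthropic, local model, ...)")


def evaluate(eval_set: list[dict], run_name: str = "rag-baseline-v0") -> float:
    scores = []
    with mlflow.start_run(run_name=run_name):
        for example in eval_set:  # each dict has question, reference, candidate
            raw = call_judge_llm(JUDGE_PROMPT.format(**example))
            scores.append(json.loads(raw)["correctness"])
        mean_correctness = statistics.mean(scores)
        mlflow.log_metric("judge_correctness_mean", mean_correctness)
        mlflow.log_param("num_samples", len(eval_set))
    return mean_correctness
```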

Important Note: As I've learned from experience and written about in my post on LLM-as-Judge metrics, automated metrics are just tools—not absolute truths. The real test is whether these metrics correlate with actual user preferences.

Phase 3: Observability and User Feedback Analysis

Goal: Have a Simple UI and an Overview of User Feedback and Evaluations

This phase focuses on creating accessible interfaces for both users and developers while implementing robust feedback collection and analysis systems.

What I Build:

Basic Chat Interface
  • Using Streamlit for rapid development and deployment
  • Providing an intuitive, straightforward user experience
  • Ensuring accessibility for all stakeholders, including non-technical users

Integration with Trace Tools
  • Implementing connections to observability platforms like LangSmith, LangFuse, and MLFlow
  • Enabling developers to follow the execution path of queries

Simple User Feedback Mechanisms
  • Implementing thumbs-up/thumbs-down ratings integrated directly in the UI
  • Creating seamless integrations with Slack, tracing tools, and databases
  • Ensuring high participation rates for continuous feedback collection

Feedback Aggregation Dashboard
  • Developing dashboards for aggregated user feedback
  • Creating visualizations of feedback patterns and trends
  • Providing metrics on positive vs. negative feedback ratios

The observability infrastructure I design is lightweight yet comprehensive, capturing both system performance data and user sentiment. By integrating feedback collection directly into the user interface, I ensure high participation rates and create a continuous stream of evaluation data.
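To show how lightweight this can be, here is a minimal Streamlit sketch with thumbs-up/thumbs-down feedback. `generate_answer` and `log_feedback` are placeholders for the real pipeline and feedback store (database, tracing tool, Slack webhook); everything else is standard Streamlit.

```python
# streamlit_app.py -- minimal chat UI with thumbs-up/down feedback capture.
# `generate_answer` and `log_feedback` are placeholders, not a real pipeline.
import datetime

import streamlit as st


def generate_answer(question: str) -> str:
    # Placeholder for the real RAG pipeline call.
    return f"(answer to: {question})"


def log_feedback(question: str, answer: str, rating: str) -> None:
    # Replace with an insert into your feedback store / observability platform.
    print(datetime.datetime.now().isoformat(), rating, question, answer)


st.title("Internal assistant (beta)")

question = st.chat_input("Ask a question about the knowledge base")
if question:
    st.session_state["last_q"] = question
    st.session_state["last_a"] = generate_answer(question)

if "last_a" in st.session_state:
    with st.chat_message("assistant"):
        st.write(st.session_state["last_a"])

    col_up, col_down = st.columns(2)
    if col_up.button("👍 Helpful"):
        log_feedback(st.session_state["last_q"], st.session_state["last_a"], "up")
        st.toast("Thanks for the feedback!")
    if col_down.button("👎 Not helpful"):
        log_feedback(st.session_state["last_q"], st.session_state["last_a"], "down")
        st.toast("Thanks, we'll look into it.")
```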

Phase 4: Beyond Observability - Generating Actionable Insights

Goal: Aggregate and prioritize user feedback and iterate

This phase is where I transform raw observability data into actionable insights through sophisticated analysis and clustering techniques.

My Analysis Framework:

Data Collection Integration
  • Setting up ingestion pipelines from observability tools
  • Collecting both structured feedback and unstructured comments
  • Capturing user queries with corresponding timestamps and metadata

Analysis Implementation
  • Developing clustering algorithms to group similar feedback patterns
  • Creating categorization systems for user queries by domain and intent
  • Implementing automated analysis of response quality metrics

Insight Generation Systems
  • Building dashboards for visualization of feedback trends and patterns
  • Developing recommendation engines to prioritize system improvements
  • Establishing automated reporting with actionable next steps

This phase transforms my observability from passive monitoring into a strategic development driver by identifying exactly where improvements will have the greatest impact on user satisfaction and system performance.
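As one possible sketch of the clustering step: embed the feedback comments and group them with k-means so that the largest clusters surface the recurring failure themes. The `embed` function is a placeholder for whatever embedding model you already use, and k-means is just one reasonable choice among many.

```python
# Sketch: cluster negative-feedback comments to find recurring failure themes.
# `embed` is a placeholder for whatever embedding model you already use.
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans


def embed(texts: list[str]) -> np.ndarray:
    raise NotImplementedError("Plug in your embedding model (OpenAI, sentence-transformers, ...)")


def cluster_feedback(comments: list[str], n_clusters: int = 8) -> dict[int, list[str]]:
    vectors = embed(comments)
    labels = KMeans(n_clusters=n_clusters, random_state=42).fit_predict(vectors)

    clusters: dict[int, list[str]] = {}
    for comment, label in zip(comments, labels):
        clusters.setdefault(int(label), []).append(comment)

    # Biggest clusters first: these are the themes worth prioritizing.
    sizes = Counter(labels)
    for label, size in sizes.most_common():
        print(f"Cluster {label}: {size} comments, e.g. {clusters[int(label)][0]!r}")
    return clusters
```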

Phase 5: A/B Testing & Evaluation

Goal: Validate improvements through controlled experiments

This phase implements controlled experiments to validate improvements and ensure changes positively impact user experience before release.

My A/B Testing Framework:

Controlled Experimentation
  • Developing capability to serve multiple response versions simultaneously
  • Implementing variant assignment methodology for unbiased testing
  • Creating monitoring systems to track performance differences

Correlation Analysis
  • Establishing systems to monitor correlation between user preferences and LLM-as-a-judge metrics
  • Validating that automated evaluation metrics align with actual user satisfaction
  • Identifying discrepancies that require metric adjustment

This correlation analysis is crucial—I've experienced firsthand how LLM-as-judge metrics can diverge from user preferences. In one project, our automated metrics showed only 55% correctness while users preferred our system over 70% of the time. The lesson: always validate your metrics against real user feedback.
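A minimal way to run that check, assuming you can join each interaction's judge score with its thumbs rating, is a rank correlation such as Spearman's. The numbers below are made-up placeholders for illustration only.

```python
# Sketch: check whether LLM-as-a-judge scores track user thumbs ratings.
# Assumes each interaction has a judge score (0-1) and a user rating (1 = 👍, 0 = 👎).
from scipy.stats import spearmanr

judge_scores = [0.9, 0.4, 0.7, 0.2, 0.8, 0.95, 0.3]   # per-interaction judge metric
user_ratings = [1,   0,   1,   0,   1,   1,    1]     # per-interaction thumbs

rho, p_value = spearmanr(judge_scores, user_ratings)
print(f"Spearman rho={rho:.2f}, p={p_value:.3f}")

# A low or negative rho is a red flag: the judge metric is measuring
# something users don't actually care about.
```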

Release Confidence Assessment
  • Defining threshold criteria for determining release readiness
  • Implementing statistical significance testing for experimental results
  • Creating dashboards for visualizing confidence intervals and performance differences

This phase provides scientific validation of my improvements through controlled experimentation, ensuring that development decisions are based on empirical evidence rather than assumptions.
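For the significance testing piece, one standard option for a preference split between two variants is a two-proportion z-test. The sketch below is a generic implementation, and the example counts are invented.

```python
# Sketch: two-proportion z-test for an A/B preference experiment.
# wins_a / n_a = sessions preferring variant A; wins_b / n_b = preferring variant B.
from math import sqrt

from scipy.stats import norm


def ab_significance(wins_a: int, n_a: int, wins_b: int, n_b: int) -> tuple[float, float]:
    p_a, p_b = wins_a / n_a, wins_b / n_b
    pooled = (wins_a + wins_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided test
    return z, p_value


# Example: variant A preferred in 140/200 sessions, variant B in 115/200.
z, p = ab_significance(140, 200, 115, 200)
print(f"z={z:.2f}, p={p:.4f}")  # p < 0.05 would suggest a real difference
```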

The Three-Level Evaluation System I Use

I've developed a comprehensive, multi-level approach that ensures continuous improvement through systematic testing:

Level 1: Unit Tests
  • Goal: Quickly catch obvious issues with minimal resources
  • Process: Fast, automated tests that run on every code change
  • Characteristics: Fastest and cheapest, focus on basic assertions, avoid actual LLM calls

Level 2: Human & Model Evaluation
  • Goal: Identify subtle issues and potential improvements
  • Process: Combined human review with LLM-as-a-Judge evaluation
  • Characteristics: Log and analyze conversation traces, combines human expertise with automated metrics

Level 3: A/B Testing
  • Goal: Validate user value and business outcomes
  • Process: Live testing with real users and statistical analysis
  • Characteristics: Tests with actual users in real scenarios, measures impact on business metrics

The Continuous Improvement Loop

The entire process I've developed forms a continuous improvement cycle:

  1. Observability provides raw data and feedback
  2. Review Evaluation analyzes this data at multiple levels
  3. Optimization translates insights into targeted improvements
  4. Iterative Improvement implements changes and updates

Key Lessons Learned

Through this journey, I've discovered several critical success factors:

  • Commitment to data-driven decision making - Every improvement needs to be validated
  • Integration of both automated and human evaluation - Neither alone is sufficient
  • Clear prioritization based on quantifiable metrics - Focus efforts where they'll have the most impact
  • Continuous experimentation and validation - Build learning into every release
  • Cross-functional collaboration - Great AI systems require diverse expertise

Sample Size Matters for Statistical Confidence

One practical challenge I've encountered is determining how many samples we need for reliable evaluation. As I discuss in my post about evaluation dataset sample sizes, statistical significance matters immensely. For 95% confidence with ±5% margin of error, you need around 385 samples. This becomes a resource planning question: how much time can you borrow from domain experts for annotation?

The Enterprise Reality

The broader context here is what I've observed about enterprise AI adoption: the real bottleneck isn't model capability—it's domain-specific evaluation. Most organizations are still between Level 1 and Level 2 AI maturity, using AI as a productivity boost rather than as a true system backbone. Without rigorous evaluation, enterprise trust is impossible.

Looking Forward

This evaluation-driven approach has fundamentally changed how I build AI systems. Instead of hoping my solutions work well, I now have systematic ways to measure, understand, and improve them based on real user feedback and rigorous testing.

The framework isn't just about building better AI—it's about building AI that gets better over time. And in a field that's evolving as rapidly as generative AI, that continuous improvement capability might be the most valuable feature of all.

If you're building AI systems and struggling with evaluation, I'd encourage you to start with even simple feedback collection. The insights you'll gain from real users interacting with your system will be more valuable than any synthetic benchmark.

As I've written about in my thoughts on context engineering and evaluation, we're moving beyond prompt engineering to become "context engineers." With solutions like Anthropic's Model Context Protocol standardizing data plumbing, the real differentiator for AI engineers will be building robust evaluation workflows and feedback loops.

What's your experience been with AI evaluation? I'd love to hear about the approaches you've tried and what's worked (or hasn't worked) for your use cases.

If you found this post helpful, you might also be interested in the related posts below:

LLMs as Judges: Why Automated Metrics Aren't Enough

LLMs as Judges are just tools, not absolute truths. In the context of generative AI systems, evals are sometimes misunderstood. Some people think that adding another framework or LLM-as-judge metric will solve their problems and save the day.

I don't really care how high your "Factuality" or "Correctness" metrics are if users don't like the answers your system generates. If these metrics don't correlate with user preference, you have a bigger problem to solve.

LLM-as-judge metrics are simply tools pointing to where we should look deeper.

Automated metrics help us identify weak spots in our systems that deserve human attention. That's it.

No fancy evaluation framework will magically solve your product problems.

What matters is the process. Build one that helps you monitor, annotate, measure what real users actually like, and iterate. This process will be different in each organisation. How you sample the data for domain experts to validate, and what "a correct answer" actually means in your case, will depend on your context and internal processes.

Remember: evals aren’t static datasets or metrics.

They’re a living process that enables you to apply the scientific method.

Observe, annotate, form hypotheses, design experiments, measure, repeat.

When LLM-as-Judge Metrics and User Preferences Diverge: Lessons from Real-World Evaluation

Why you should deeply understand what your LLM-as-a-Judge metric is actually measuring

When our automated evaluation metrics showed only 55% correctness for our LLM-generated answers, but users consistently preferred our system over 70% of the time, we knew something was off.

After deep analysis, we found that our Ragas-inspired correctness metric was actually penalizing our system for being 'too informative'. The metric counts additional facts beyond the ground truth as 'false positives', effectively punishing more comprehensive answers.

We shifted our focus to another metric that, according to our A/B testing and correlation analysis, provided a more accurate picture.

This metric now categorises answers as:

  • Subsets of expert-validated answers (consistent but less comprehensive)
  • Supersets of expert-validated answers (consistent with additional information)
  • Fully consistent/equivalent to expert-validated answers
  • In disagreement with expert-validated answers

Using this approach, our actual factual accuracy jumped to 78%, much closer to what our user-preference A/B tests suggested.
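For illustration, here is a hedged sketch of what such a categorical judge could look like as a prompt plus a thin parser. The category names mirror the list above, `call_judge_llm` is a placeholder for your LLM client, and treating subsets, supersets, and equivalents as "correct" is my reading of the approach rather than a fixed rule.

```python
# Sketch: a categorical LLM judge that classifies a candidate answer relative
# to an expert-validated reference instead of counting extra facts as errors.
# `call_judge_llm` is a placeholder for your LLM client.
CATEGORIES = ["SUBSET", "SUPERSET", "EQUIVALENT", "DISAGREEMENT"]

CATEGORICAL_JUDGE_PROMPT = """Compare the candidate answer with the expert-validated answer.
Question: {question}
Expert-validated answer: {reference}
Candidate answer: {candidate}

Reply with exactly one word:
- SUBSET: consistent with the expert answer but less comprehensive
- SUPERSET: consistent with the expert answer, with additional information
- EQUIVALENT: fully consistent / equivalent to the expert answer
- DISAGREEMENT: contradicts the expert answer"""


def call_judge_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in your LLM client here")


def categorize(question: str, reference: str, candidate: str) -> str:
    verdict = call_judge_llm(
        CATEGORICAL_JUDGE_PROMPT.format(question=question, reference=reference, candidate=candidate)
    ).strip().upper()
    return verdict if verdict in CATEGORIES else "UNPARSED"


# Factual accuracy can then count SUBSET, SUPERSET and EQUIVALENT as correct,
# and only DISAGREEMENT (and unparsed verdicts) as incorrect.
```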

Key takeaways

  1. Be careful with LLM-as-a-Judge metrics: they may not align with what users actually value. Always calibrate your metrics with A/B testing results. What the users prefer is more important than any LLM judge.

  2. Always verify correlation between automated LLM metrics and user preferences through A/B testing

Finally, the most accurate metric isn't always the most complex one. It's the one that best predicts user satisfaction. I cannot stress enough how important A/B tests are for that.

Proper user A/B testing is more important than any LLM-as-a-Judge metric.

Revisit and challenge your ground truth and your metrics periodically, and ask your domain experts to re-validate the ground-truth dataset whenever you detect misalignments like this.

The Real Bottleneck in Enterprise AI Adoption: Evaluation, Not Models

Enterprise AI adoption is moving much slower than the hype would suggest, and there’s a clear reason: domain-specific evaluation. Public benchmarks and leaderboard scores mean little when it comes to real-world business needs. What matters is whether these systems can be trusted to perform reliably in your unique context—and that requires custom evaluation pipelines and validation from domain experts.

The real bottleneck isn’t model capability or even energy consumption (at least, not yet). It’s the scarcity of human expertise available to rigorously validate these systems. Without that, trust is impossible.

It’s tempting to think that every new model release from OpenAI or another big lab will instantly unlock new business value. But the reality is, just because a new reasoning model drops doesn’t mean it will suddenly understand your business logic. Enterprise software isn’t built on vibes—it’s built on structure, reliability, and determinism.

Building production-grade, reasoning-based AI systems is hard. Getting agents to do real, valuable work is even harder. Most companies are still wrestling with the basics—like generating a clean summary from a SharePoint folder. And that’s okay. That’s where the real work begins.

Despite the noise from AI opportunists selling quick fixes, the truth is that almost no one has cracked operational AI at scale. Real credibility comes not from clever hacks, but from putting in the work: shipping infrastructure that quietly, reliably, and deeply integrates into the fabric of a company.

AI in operations isn’t a magic prompt. It’s a ladder you climb, step by step. Using ChatGPT at work is just the entry point.

Here’s what the real AI adoption stack looks like:

Level 1: Generative AI + Proprietary Data
  • Description: This is ChatGPT, but on your own documents.
  • Use cases: Writing reports, summarizing docs, answering FAQs
  • Skills: Prompting, metadata management, content accuracy
  • Org needs: Basic data hygiene, light governance

Level 2: Contextual AI + Knowledge Integration
  • Description: Now the model can access and use internal data automatically (think RAG).
  • Use cases: LLMs pulling in internal data
  • Skills: Data pipelines, embeddings, retrieval tuning
  • Org needs: Strong taxonomy, content architecture, access controls

Level 3: AI Agents for Business Tasks
  • Description: LLMs don’t just talk—they act.
  • Use cases: Agents processing tickets, scheduling meetings, writing emails
  • Skills: API integration, reasoning and acting (ReAct prompting), tool orchestration
  • Org needs: Clear processes, oversight, evaluation

Level 4: Multi-Agent Workflow Orchestration
  • Description: Agents coordinate with each other to automate entire workflows.
  • Use cases: Specialized agents collaborating and adapting
  • Skills: Multi-agent architecture, AI-Ops, fallback design
  • Org needs: High AI maturity, observability, risk controls

Most teams today are somewhere between Level 1 and Level 2, using AI as a productivity boost rather than as a true system backbone. If you’re eager to deploy agents but haven’t yet mastered generating meaningful reports, you’re skipping crucial steps.

There are no shortcuts to maturity. No prompt hack or agent will build the necessary infrastructure for you. True operational AI isn’t about looking clever—it’s about building systems that work, even when no one’s watching.

From Prompt to Context Engineering: Why Evaluation is the Real AI Differentiator

Yesterday I had a fantastic dinner in London with Samuel Colvin, the mind behind Pydantic (one of the most ubiquitous Python frameworks), now focused on PydanticAI and Logfire; Laura Modiano, Strategic Partnerships Developer in Europe at OpenAI; David Soria Parra, the Anthropic engineer behind the Model Context Protocol; Joon-sang Lee, CEO of Pentaform; industry legends Jason Liu and Ivan Leo, creators of instructor; and Andreas Stathopoulos.


I had a realization: everyone building generative AI agents and assistants is writing remarkably similar code. We're all building redundant data plumbing systems to manage context from data sources to agents, backends, and UIs.

Enough of AI data plumbing. The future, and what will make or break your AI agent engineering, is domain-specific evaluation.

Let’s acknowledge that this tech is in its infancy. As we mature, solutions like Anthropic's Model Context Protocol (MCP) are standardizing how we manage context and connect AI systems with data sources more efficiently. If you are living under a rock and don’t know MCP, please check here.

We're evolving beyond prompt engineering (concatenating strings) to become what I like to call “context engineers”. Our primary job is finding the most relevant context to place in front of LLMs to answer user needs.

This context engineering for AI agents is fundamentally repetitive across domains. We're all running some version of the same loop (sketched below):

  • Turning user questions into research plans using chain-of-thought reasoning
  • Searching for relevant context
  • Synthesizing answers
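That shared loop fits in a few lines. The sketch below is generic: `llm` and `search` are placeholders for whatever client and retriever you use, not any particular framework's API.

```python
# Sketch of the generic "context engineering" loop we all keep rebuilding.
# `llm`, `search`, and the prompt wording are placeholders, not a framework API.
def answer(question: str, llm, search) -> str:
    # 1. Derive a research plan from the user question (chain-of-thought style).
    plan = llm(f"Break this question into 2-4 search queries:\n{question}")

    # 2. Search for relevant context for each sub-query.
    context_chunks = []
    for query in plan.splitlines():
        if query.strip():
            context_chunks.extend(search(query.strip(), top_k=3))

    # 3. Synthesize an answer grounded in the retrieved context.
    context = "\n\n".join(context_chunks)
    return llm(f"Answer the question using only this context.\n\nContext:\n{context}\n\nQuestion: {question}")
```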

What struck me is where applied AI engineers should actually be focusing their time: building effective domain-specific evaluation workflows.

As I discussed in my previous posts, the enterprise AI adoption bottleneck is not model capability; it's the ability to prove (with statistical rigor) the accuracy of these systems in their specific domain.

I have been spending most of my time building evaluation pipelines for user validation, and it is hard. Not everyone has the luxury of millions of users to A/B test against. At the enterprise level, the time you can borrow from domain experts for validation is scarce.

We need to shift our energy from repetitive data plumbing to building robust evaluation workflows and feedback loops. We should be assessing our systems' confidence when handling knowledge-specific topics and measuring accuracy with rigor.

This is where I want to allocate the majority of project budgets going forward. With MCP's exponential adoption across the industry, the plumbing problem is being solved. The differentiator will be how well we can evaluate and improve our systems in specific domains.

What are your thoughts? Are you seeing this shift in your AI projects?

Also, I highly recommend checking out PydanticAI and the AI observability platform Logfire.

How many samples do we need in our evaluation dataset?

The most common question I hear when teams set up their evaluation framework: "How many samples do we need?"

This question also comes in different forms, like "How much time do we need from our domain experts to annotate the dataset?"

In a big corporate context, domain experts are usually the most expensive resource in the company. Getting one hour of their time is serious money. So we need to show exactly how much of their time we need to borrow to annotate the dataset for our brand-new AI project.

The answer? It depends on how much confidence you want in your results.

Statistical significance matters and it matters even more in the context of evaluating generative AI systems. If you're making project decisions based on these evaluations, you need to know they're reliable.

Here's how we frame this question: consider the confidence level and margin of error you are comfortable with. Three things to decide:

  • The confidence level you require (90%, 95%, 99%)
  • The margin of error you can accept
  • The expected variance in your data

For example:

  • For 90% confidence with ±5% margin of error: ~270 samples
  • For 95% confidence with ±5% margin of error: ~385 samples
  • For 99% confidence with ±5% margin of error: ~665 samples

These numbers can be calculated using standard statistical formulas for sample size determination.
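For reference, the standard sample-size formula for estimating a proportion, n = z² · p(1 − p) / e² with the conservative p = 0.5, reproduces those numbers. Here is a small sketch:

```python
# Standard sample-size formula for estimating a proportion:
#   n = z^2 * p * (1 - p) / e^2   (p = 0.5 is the conservative worst case)
from math import ceil

from scipy.stats import norm


def required_samples(confidence: float, margin_of_error: float, p: float = 0.5) -> int:
    z = norm.ppf(1 - (1 - confidence) / 2)  # two-sided z-score
    return ceil(z**2 * p * (1 - p) / margin_of_error**2)


for conf in (0.90, 0.95, 0.99):
    print(f"{conf:.0%} confidence, ±5% margin: {required_samples(conf, 0.05)} samples")
# Roughly 271, 385 and 664, matching the figures above.
```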

Remember: Insufficient samples can lead to misleading results and poor decisions about your AI project. Invest in proper evaluation to build trust in your models and your decision-making process.

Even better, quantify how much it will cost in annotation time versus the level of confidence you want to achieve.
