
Applied AI

Why Qualitative Research Experts Are the Secret Weapon for Generative AI Evaluation

To my colleagues in digital humanities, anthropology, and qualitative research: you're sitting on a goldmine and don't even know it.

While everyone's trying to learn "prompt engineering," you are the true experts because you have been doing this for your entire career.

You know how to write precise, nuanced questions that extract meaningful responses.

You understand context, subtext, and how framing shapes answers. You're experts in context analysis, iterative inquiry, and grounded theory.

You evaluate sources critically, find biases and inconsistencies, and know that the most interesting insights often lie in what's not being said.

You understand that interpretation requires both rigor and creativity.

These aren't just transferable skills; they're the skills that matter most when designing, evaluating, and aligning LLMs and AI agents.

The AI world is busy teaching engineers to think like qualitative researchers.

Meanwhile, you already think like AI researchers. You just need to learn the syntax. When this clicks for our field, we won't just be catching up; we'll be leading the conversation on responsible AI evaluation and human-centered AI systems design.

The skills to evaluate AI agents are not that different from evaluating humans.

As I've specialized more in AI agent evaluation, I can't stop thinking about how my colleagues and I, as engineers, are 'discovering' methodologies that you have known for ages.

You are so well equipped to help design evaluation or fine-tuning datasets and you don't even know it.

The future of AI isn't just engineering. It is more human than ever. And that's your territory.

The Real Bottleneck in Enterprise AI Adoption: Evaluation, Not Models

Enterprise AI adoption is moving much slower than the hype would suggest, and there’s a clear reason: domain-specific evaluation. Public benchmarks and leaderboard scores mean little when it comes to real-world business needs. What matters is whether these systems can be trusted to perform reliably in your unique context—and that requires custom evaluation pipelines and validation from domain experts.

The real bottleneck isn’t model capability or even energy consumption (at least, not yet). It’s the scarcity of human expertise available to rigorously validate these systems. Without that, trust is impossible.

It’s tempting to think that every new model release from OpenAI or another big lab will instantly unlock new business value. But the reality is, just because a new reasoning model drops doesn’t mean it will suddenly understand your business logic. Enterprise software isn’t built on vibes—it’s built on structure, reliability, and determinism.

Building production-grade, reasoning-based AI systems is hard. Getting agents to do real, valuable work is even harder. Most companies are still wrestling with the basics—like generating a clean summary from a SharePoint folder. And that’s okay. That’s where the real work begins.

Despite the noise from AI opportunists selling quick fixes, the truth is that almost no one has cracked operational AI at scale. Real credibility comes not from clever hacks, but from putting in the work: shipping infrastructure that quietly, reliably, and deeply integrates into the fabric of a company.

AI in operations isn’t a magic prompt. It’s a ladder you climb, step by step. Using ChatGPT at work is just the entry point.

Here’s what the real AI adoption stack looks like:

| Level | Description | Use Cases | Skills | Org Needs |
| --- | --- | --- | --- | --- |
| Level 1: Generative AI + Proprietary Data | This is ChatGPT, but on your own documents. | Writing reports, summarizing docs, answering FAQs | Prompting, metadata management, content accuracy | Basic data hygiene, light governance |
| Level 2: Contextual AI + Knowledge Integration | Now the model can access and use internal data automatically (think RAG). | LLMs pulling in internal data | Data pipelines, embeddings, retrieval tuning | Strong taxonomy, content architecture, access controls |
| Level 3: AI Agents for Business Tasks | LLMs don't just talk—they act. | Agents processing tickets, scheduling meetings, writing emails | API integration, reasoning and acting (ReAct prompting), tool orchestration | Clear processes, oversight, evaluation |
| Level 4: Multi-Agent Workflow Orchestration | Agents coordinate with each other to automate entire workflows. | Specialized agents collaborating and adapting | Multi-agent architecture, AI-Ops, fallback design | High AI maturity, observability, risk controls |
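To make the jump from Level 2 to Level 3 concrete, here is a minimal, illustrative sketch of a single reason-then-act step: the model chooses a tool and its arguments, and the surrounding system executes it. The `call_llm` stub and the `schedule_meeting` tool are hypothetical placeholders, not part of any specific framework.

```python
# Minimal sketch of a Level 3 step: the model doesn't just answer, it chooses
# an action (a tool call) that the surrounding system then executes.
# `call_llm` and `schedule_meeting` are illustrative placeholders.
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def schedule_meeting(topic: str, attendees: list[str]) -> str:
    # A real implementation would call your calendar API.
    return f"Scheduled '{topic}' with {', '.join(attendees)}"

TOOLS = {"schedule_meeting": schedule_meeting}

def run_agent_step(task: str) -> str:
    """One reason-then-act step: the model replies with a JSON tool call."""
    decision = call_llm(
        "You can call one tool: schedule_meeting(topic, attendees).\n"
        'Reply only with JSON: {"tool": "...", "args": {...}}.\n'
        f"Task: {task}"
    )
    parsed = json.loads(decision)
    return TOOLS[parsed["tool"]](**parsed["args"])
```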

Most teams today are somewhere between Level 1 and Level 2, using AI as a productivity boost rather than as a true system backbone. If you’re eager to deploy agents but haven’t yet mastered generating meaningful reports, you’re skipping crucial steps.

There are no shortcuts to maturity. No prompt hack or agent will build the necessary infrastructure for you. True operational AI isn’t about looking clever—it’s about building systems that work, even when no one’s watching.

From Prompt to Context Engineering: Why Evaluation is the Real AI Differentiator

Yesterday I had a fantastic dinner in London with Samuel Colvin, the mind behind Pydantic (one of the most ubiquitous Python frameworks), now focused on PydanticAI and Logfire; Laura Modiano, Strategic Partnerships Developer in Europe at OpenAI; David Soria Parra, the Anthropic engineer behind the Model Context Protocol; Joon-sang Lee, CEO of Pentaform; industry legends Jason Liu and Ivan Leo, creators of instructor; and Andreas Stathopoulos.

[Photo: Dinner]

I had a realization: everyone building generative AI agents and assistants is writing remarkably similar code. We're all building redundant data plumbing systems to manage context from data sources to agents, backends, and UIs.

Enough of AI data plumbing. The future, and what will make or break your AI agent engineering, is domain-specific evaluation.

Let’s acknowledge that we are at the infancy of this tech. As we get more mature, solutions like Anthropic's Model Context Protocol (MCP) are standardizing how we manage context and connect AI systems to data sources more efficiently. If you are living under a rock and don’t know MCP, please check here.

We're evolving beyond prompt engineering (concatenating strings) to become what I like to call “context engineers”. Our primary job is finding the most relevant context to place in front of LLMs to answer user needs.

This context engineering for AI agents is fundamentally repetitive across domains. We're all:

  • Turning user questions into research plans using chain-of-thought reasoning
  • Searching for relevant context
  • Synthesizing answers
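As a rough illustration of how similar this code ends up looking everywhere, here is a bare-bones sketch of that loop. `call_llm` and `search` are placeholders for whatever model client and retrieval layer you happen to use; none of the names come from a specific framework.

```python
# Sketch of the plan -> search -> synthesize pattern described above.
# `call_llm` and `search` are placeholders: every team wires these to its own
# model client and data sources, which is exactly the redundant plumbing.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("your model client goes here")

def search(query: str) -> list[str]:
    raise NotImplementedError("your retrieval layer goes here")

def answer(user_question: str) -> str:
    # 1. Turn the user question into a short research plan (chain of thought).
    plan = call_llm(
        "Break this question into three concrete research steps, one per line:\n"
        + user_question
    )
    # 2. Search for relevant context for each step of the plan.
    context: list[str] = []
    for step in plan.splitlines():
        if step.strip():
            context.extend(search(step.strip()))
    # 3. Synthesize an answer grounded in the retrieved context.
    joined_context = "\n".join(context)
    return call_llm(
        "Using only the context below, answer the question.\n\n"
        f"Context:\n{joined_context}\n\nQuestion: {user_question}"
    )
```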

What struck me is where applied AI engineers should actually be focusing their time: building effective domain-specific evaluation workflows.

As I discussed in my previous posts, the AI adoption bottleneck for enterprise is not model capabilities; it's the ability to prove (with statistical rigor) the accuracy of these systems in the specific domain.

I have been spending most of my time building evaluation pipelines for user validation, and it is hard. Not everyone has the luxury of millions of users to A/B test. At the enterprise level, the time you can borrow from domain experts for validation is scarce.

We need to shift our energy from repetitive data plumbing to building robust evaluation workflows and feedback loops. We should be assessing our systems' confidence when handling knowledge-specific topics and measuring accuracy with rigor.
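As a simplified example of what "measuring accuracy with rigor" can look like: score the system on a set of expert-annotated examples and report a confidence interval alongside the raw accuracy. The Wilson score interval below is standard; how each outcome gets graded (a domain expert, or an LLM judge calibrated against one) is the hard, domain-specific part. The numbers in the example run are hypothetical.

```python
# Report accuracy with a confidence interval, not just a single number.
# Uses the standard Wilson score interval; only the Python stdlib is needed.
import math

def accuracy_with_ci(outcomes: list[bool], z: float = 1.96) -> tuple[float, float, float]:
    """Return (accuracy, lower, upper) with a Wilson score interval (95% for z=1.96)."""
    n = len(outcomes)
    p = sum(outcomes) / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return p, centre - half, centre + half

# Hypothetical run: 312 of 385 expert-annotated answers judged correct.
outcomes = [True] * 312 + [False] * 73
acc, low, high = accuracy_with_ci(outcomes)
print(f"accuracy = {acc:.1%}, 95% CI = [{low:.1%}, {high:.1%}]")
```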

This is where I want to allocate the majority of project budgets going forward. With MCP's exponential adoption across the industry, the plumbing problem is being solved. The differentiator will be how well we can evaluate and improve our systems in specific domains.

What are your thoughts? Are you seeing this shift in your AI projects?

Also, I highly recommend checking out PydanticAI and the AI observability platform Logfire.

How many samples do we need in our evaluation dataset?

The most common question I hear when teams set up their evaluation framework: "How many samples do we need?"

This question also comes in different forms, like "How much time do we need from our domain experts to annotate the dataset?"

In the context of big corporate business, domain experts are usually the most expensive resource in the company. Getting one hour of their time is serious money, so we need to show exactly how much of it we need to borrow to annotate our brand-new AI project.

The answer? It depends on how much confidence you want in your results.

Statistical significance matters and it matters even more in the context of evaluating generative AI systems. If you're making project decisions based on these evaluations, you need to know they're reliable.

Here's how we frame the question. There are three things to decide:

  • The confidence level you require (90%, 95%, 99%)
  • The margin of error you can accept
  • The expected variance in your data

For example:

  • For 90% confidence with ±5% margin of error: ~270 samples
  • For 95% confidence with ±5% margin of error: ~385 samples
  • For 99% confidence with ±5% margin of error: ~665 samples

These numbers can be calculated using standard statistical formulas for sample size determination.

Remember: Insufficient samples can lead to misleading results and poor decisions about your AI project. Invest in proper evaluation to build trust in your models and your decision-making process.

Even better, quantify how much it will cost in annotation time versus the level of confidence you want to achieve.
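Here is a small sketch of that calculation: the standard proportion formula behind the figures above (n = z² · p(1−p) / e², with p = 0.5 as the most conservative assumption), plus an illustrative annotation-time estimate. The minutes-per-sample figure is an assumption for the example, not a number from this post, and small differences from the figures above come from rounding z.

```python
# Standard sample size formula for estimating a proportion:
#   n = z^2 * p * (1 - p) / e^2, with p = 0.5 as the worst-case variance.
# The 3-minutes-per-sample annotation estimate is an illustrative assumption.
import math

Z_SCORES = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}

def sample_size(confidence: float, margin_of_error: float, p: float = 0.5) -> int:
    z = Z_SCORES[confidence]
    return math.ceil(z**2 * p * (1 - p) / margin_of_error**2)

for conf in (0.90, 0.95, 0.99):
    n = sample_size(conf, 0.05)
    expert_hours = n * 3 / 60  # assumed ~3 minutes of expert time per sample
    print(f"{conf:.0%} confidence, ±5% error: {n} samples (~{expert_hours:.0f} expert-hours)")
```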

[Figure: Sample size]