Applied AI · RN-007

Separating real capability from market hype

New AI models ship faster than evaluations of them. A working approach for assessing whether a frontier capability is actually production-ready, or whether the demo is doing all the work.

Published: 2023 · 07

Read: 7 min

Author: GET Team

Why frontier model evaluation is harder than it looks

Public benchmarks have a contamination problem. Once a test set exists on the open web, there is no guarantee that subsequent model training did not include it, directly or through synthetic derivatives. A score above the prior state of the art on a well-known benchmark is no longer a clean signal — it is a hypothesis that requires verification. Leaderboard saturation compounds this: when ten models are within two points of each other on the same eval, the eval has stopped discriminating.
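The saturation point can be illustrated with a small calculation. The sketch below uses the normal approximation for a proportion to estimate the sampling noise on a benchmark score. The benchmark size (1,000 items) and the accuracy level (90%) are hypothetical numbers chosen for illustration, not figures from any real leaderboard.

```python
import math

def ci_half_width(accuracy: float, n_items: int, z: float = 1.96) -> float:
    """Approximate 95% interval half-width for an accuracy measured on n_items.

    Uses the normal approximation to the binomial, which is a simplification.
    """
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_items)

# Hypothetical leaderboard: 1,000 test items, top models near 90% accuracy.
half_width = ci_half_width(accuracy=0.90, n_items=1000)
print(f"95% interval: +/- {100 * half_width:.1f} points")
```

Under these assumptions the interval is close to two points wide on each side, so models separated by two points on such a benchmark are hard to tell apart from sampling noise alone, even before contamination is considered.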

Demos suffer from a related issue. Vendors pick the prompts, the seeds, the system message, the retrieval corpus, and often the model temperature. They show the runs that worked. None of these choices are wrong on their own, but together they describe a best-case envelope that has almost no overlap with how the model will be invoked in production — under latency budgets, against messy inputs, by users who do not phrase questions the way the demo did.

Finally, frontier capability is not uniform. A model can be excellent at competitive math and mediocre at extracting a clause from a 60-page contract. The headline capability — the one driving the announcement cycle — is rarely the one that matters for a given enterprise workload. Evaluation has to be specific to the job, not to the brand.

What a real-world AI performance test looks like

A useful evaluation does three things at once. It measures the model on a representative sample of the actual work. It compares against a defensible baseline — usually the incumbent system or a prior model version. And it produces a number that survives being shown to a skeptical reviewer who does not work on the project.

The foundation is a golden set: 200 to 500 examples drawn from real workflows, labeled by domain experts, kept out of every prompt and every vendor demo. The set is versioned. It is never used for prompt tuning. When the team wants to iterate on prompts or retrieval, that happens on a separate development set; the golden set is held in reserve for the decision moments. This single discipline does more to defend against self-deception than any other practice.
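One way to enforce the separation between golden and development data is to split once, with a fixed seed, and record a fingerprint of the golden set so that every later report can state which version it was run against. The sketch below is a minimal illustration. The record fields (`id`, `input`, `label`) and the function name are assumptions made for the example, not a prescribed schema.

```python
import hashlib
import json
import random

def split_and_fingerprint(examples, golden_size, seed=0):
    """Split labeled examples into a held-out golden set and a development set.

    Returns (golden, dev, fingerprint), where fingerprint is a SHA-256 hash
    that identifies this exact version of the golden set.
    """
    if golden_size > len(examples):
        raise ValueError("golden_size exceeds the number of examples")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    golden, dev = shuffled[:golden_size], shuffled[golden_size:]
    # Canonical JSON (records sorted by id, keys sorted) so the hash
    # depends on content only, not on ordering.
    canonical = json.dumps(sorted(golden, key=lambda e: e["id"]), sort_keys=True)
    fingerprint = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return golden, dev, fingerprint

# Hypothetical labeled examples drawn from a real workflow.
examples = [{"id": i, "input": f"document {i}", "label": "ok"} for i in range(1000)]
golden, dev, version = split_and_fingerprint(examples, golden_size=300)
print(len(golden), len(dev), version[:12])
```

Prompt and retrieval iteration would then read only from `dev`. Any change to a golden example or label changes the fingerprint, which makes silent edits visible.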

Around the golden set sits the eval harness — the scripts that run the model, score the outputs, and log latency, token cost, and failure mode. The harness is the artifact that lets a team rerun the same test against a new model version six months later and detect model regression. Without it, every vendor swap becomes a fresh argument about whether things got better.
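A harness of this kind can start very small. The sketch below assumes the model is wrapped in a callable that returns the output text and a token count. That interface, the `EvalRecord` fields, and the failure-mode labels are illustrative assumptions, and `stub_model` is a stand-in rather than a real vendor API.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class EvalRecord:
    example_id: str
    correct: bool
    latency_s: float
    tokens: int
    failure_mode: Optional[str]  # None when the output was scored correct

def run_harness(model: Callable[[str], Tuple[str, int]],
                examples,
                score: Callable[[str, str], bool]):
    """Run the model on each example and log correctness, latency, tokens, and failure mode."""
    records = []
    for ex in examples:
        start = time.perf_counter()
        try:
            output, tokens = model(ex["input"])
            failure = None
        except Exception as exc:  # record the failure instead of aborting the run
            output, tokens, failure = "", 0, f"error:{type(exc).__name__}"
        latency = time.perf_counter() - start
        correct = failure is None and score(output, ex["label"])
        if failure is None and not correct:
            failure = "wrong_answer"
        records.append(EvalRecord(ex["id"], correct, latency, tokens, failure))
    return records

def stub_model(prompt: str) -> Tuple[str, int]:
    """Hypothetical stand-in for a real model call."""
    if "crash" in prompt:
        raise TimeoutError("simulated timeout")
    return prompt.upper(), len(prompt.split())

examples = [
    {"id": "a", "input": "hello", "label": "HELLO"},
    {"id": "b", "input": "world", "label": "EARTH"},
    {"id": "c", "input": "crash here", "label": "X"},
]
records = run_harness(stub_model, examples, score=lambda out, label: out.strip() == label)
for r in records:
    print(r.example_id, r.correct, r.failure_mode)
```

Swapping in a new model version means replacing only the callable. The examples, the scoring function, and the record format stay fixed, which is what makes results comparable across runs.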

How to pressure-test a vendor demo

Treat any vendor demonstration as a hypothesis-generating exercise, not as evidence. The right response to an impressive demo is a structured request that converts marketing claims into testable predictions. Most vendors will agree to a bounded pilot if the alternative is losing the deal; the ones that refuse are telling buyers something useful.

The pilot should be designed before the vendor sees it. That means specifying the inputs, the success criteria, and the evaluation method in advance, and refusing to renegotiate them when results come in. A common failure mode in enterprise AI assessment is the moving target — the criteria drift to match whatever the model happens to be good at. Lock the rubric before the run.

- Submit a sealed test set the vendor has not seen and cannot retain.
- Require deterministic settings where possible: fixed temperature, fixed seeds, documented system prompt.
- Measure tail latency and cost per call, not just averages. The long tail is where production breaks.
- Probe distribution shift deliberately: include inputs outside the vendor's stated training distribution and observe failure modes.
- Test prompt brittleness by rephrasing the same request in three plausible ways and comparing outputs.
- Ask for the model card and the eval methodology in writing; absence of either is a finding.
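Two of these checks reduce to short calculations. The sketch below shows a nearest-rank percentile for tail latency and a simple agreement rate across paraphrased prompts. The latency values and the outputs are made-up sample data, and exact string matching after normalization is a deliberately crude comparison that a real pilot would likely replace with a task-specific scorer.

```python
import math
import statistics

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def paraphrase_agreement(outputs):
    """Fraction of paraphrased prompts whose normalized output matches the most common answer."""
    normalized = [o.strip().lower() for o in outputs]
    most_common = max(set(normalized), key=normalized.count)
    return normalized.count(most_common) / len(normalized)

# Made-up latencies in seconds: most calls are fast, a few are very slow.
latencies = [0.8] * 95 + [6.0] * 5
print(f"mean={statistics.mean(latencies):.2f}s "
      f"p50={percentile(latencies, 50)}s p99={percentile(latencies, 99)}s")

# Made-up outputs for three phrasings of the same request.
agreement = paraphrase_agreement(["Net 30", "net 30 ", "Net 45"])
print(f"agreement={agreement:.2f}")
```

In this sample the mean sits near one second while the 99th percentile is six seconds, which is the kind of gap an average hides.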

Capabilities that look real but rarely are

Several categories of claim deserve sharper scrutiny because the gap between demo and production is reliably large. Long-context comprehension is one — a model that can summarize a 200-page document in a demo may attend only to the first and last sections, missing the material in the middle. Standard evals catch the obvious failures but miss the subtle ones, where the summary reads well and is quietly wrong.
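A position-sensitivity probe is one way to test for this. The sketch below inserts a known fact at several relative depths in a filler document and checks whether the answer recovers it at each depth. The filler text, the fact, and `edge_only_model` are all hypothetical. The stand-in deliberately reads only the start and end of its input to show what a failing pattern looks like.

```python
def build_depth_probes(filler_paragraphs, needle, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return {depth: document} with the needle inserted at each relative depth."""
    probes = {}
    for depth in depths:
        position = round(depth * len(filler_paragraphs))
        paragraphs = filler_paragraphs[:position] + [needle] + filler_paragraphs[position:]
        probes[depth] = "\n\n".join(paragraphs)
    return probes

def recall_by_depth(probes, ask_model, expected):
    """Map each depth to whether the model's answer contains the expected string."""
    return {depth: expected.lower() in ask_model(doc).lower()
            for depth, doc in probes.items()}

def edge_only_model(document):
    """Hypothetical stand-in that only 'reads' the first and last 300 characters."""
    return document[:300] + document[-300:]

filler = [f"Paragraph {i}: routine background text." for i in range(100)]
needle = "The termination notice period is 45 days."
probes = build_depth_probes(filler, needle)
results = recall_by_depth(probes, edge_only_model, expected="45 days")
print(results)
# → {0.0: True, 0.25: False, 0.5: False, 0.75: False, 1.0: True}
```

With a real model, `ask_model` would send the document plus a question about the inserted fact. Recall that holds at the edges and drops in the middle is the signature described above.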

Agentic workflows are another. A demonstration of a model orchestrating five tools in sequence is a different artifact from a system that handles thousands of such sequences a day with logging, retries, and human handoff. The demo proves the capability exists in principle; it does not prove the operational envelope. Reasoning benchmarks are a third — strong performance on competition math has limited transfer to the kind of structured analytical work most enterprises need.

None of this means the capabilities are fake. It means the claim being demonstrated and the claim being implied are different claims, and the gap is where procurement risk lives.

Building an AI procurement process that scales

AI procurement should not be a one-off conversation each time a new model lands. The organizations that handle this well treat evaluation as standing infrastructure: a golden set for each high-value workflow, a harness that can run any new model overnight, and a governance process that documents which models passed which bars on which dates.
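The governance record can be as simple as one structured log line per decision. The sketch below shows one possible shape. Every field name and value in it is a hypothetical placeholder, including the model name, the workflow, and the threshold.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class EvalDecision:
    model: str
    workflow: str
    golden_set_version: str  # e.g. a content hash of the golden set
    metric: str
    score: float
    threshold: float
    evaluated_on: str  # ISO date string

    @property
    def passed(self) -> bool:
        """A model passes when its score meets or exceeds the agreed bar."""
        return self.score >= self.threshold

def to_log_line(decision: EvalDecision) -> str:
    """Serialize a decision, including the derived pass/fail flag, as one JSON line."""
    record = asdict(decision)
    record["passed"] = decision.passed
    return json.dumps(record, sort_keys=True)

# Hypothetical entry for illustration only.
decision = EvalDecision(
    model="vendor-model-v2",
    workflow="contract-clause-extraction",
    golden_set_version="3f9a1c",
    metric="exact_match",
    score=0.87,
    threshold=0.85,
    evaluated_on="2023-07-15",
)
line = to_log_line(decision)
print(line)
```

Appending such lines to a versioned file gives a dated trail of which model cleared which bar on which golden set.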

The payoff compounds. The second model is faster to evaluate than the first because the harness already exists. The third decision is faster than the second because the team has a calibrated sense of what a real two-percent improvement looks like versus a benchmark artifact. Vendors notice when buyers ask sharper questions, and the questions themselves shape what the vendor brings to the next conversation.
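One way to build that calibration is a paired bootstrap over per-example results from the golden set. The sketch below resamples examples with replacement and reports how often the candidate fails to beat the baseline. The result lists are synthetic, constructed to give a two-point gain on 300 examples, and the function name and resample count are assumptions.

```python
import random

def paired_bootstrap(baseline_correct, candidate_correct, n_resamples=2000, seed=0):
    """Return (observed accuracy gain, fraction of resamples where the candidate is not better).

    Both inputs are per-example booleans for the same examples in the same order.
    """
    if len(baseline_correct) != len(candidate_correct):
        raise ValueError("both result lists must cover the same examples")
    n = len(baseline_correct)
    # Per-example difference: +1 candidate-only win, -1 baseline-only win, 0 tie.
    diffs = [int(c) - int(b) for b, c in zip(baseline_correct, candidate_correct)]
    observed = sum(diffs) / n
    rng = random.Random(seed)
    not_better = 0
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # resample examples with replacement
        if sum(sample) <= 0:
            not_better += 1
    return observed, not_better / n_resamples

# Synthetic results on 300 examples: baseline 80% correct, candidate 82% correct.
baseline = [True] * 240 + [False] * 60
candidate = [True] * 228 + [False] * 12 + [True] * 18 + [False] * 42
gain, frac_not_better = paired_bootstrap(baseline, candidate)
print(f"observed gain={gain:.3f}, resamples with no gain={frac_not_better:.2f}")
```

If a meaningful share of resamples shows no gain, a two-point improvement on a set of this size is weak evidence on its own, and the decision should lean on cost, latency, and failure modes as well.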

What to do this quarter

Pick one high-value workflow where AI is in production or about to be. Build a golden set of 200 to 500 real examples, labeled by people who do the work. Write down the three failure modes that would matter most if they appeared at scale. Run the current model against the set and record the numbers; these are the baseline. Then run the next vendor's model against the same set, under the same conditions, and compare.

Most of the disagreement in AI procurement comes from the absence of a shared measuring stick. Build the stick once, and the next ten decisions get easier. The hype cycle will not slow down; the defense is internal infrastructure that does not depend on vendor narratives.

Authored by GET Team · GET AI Labs


