Evals aren't tests, they're product specs

Sana Iyer · Mar 31, 2026

Every team building agents eventually invents an eval suite. The shape of the suite is the shape of the product. Done badly, it is a list of unit tests that pass when the model parrots the right answer. Done well, it is the most precise spec your team has.

An eval is not a question and an expected string. It is a scenario, a behavior, and a check. The scenario is the input the agent receives, in the format the agent will see it in production. The behavior is the rule the agent must obey. The check is the rubric a human or model uses to score the output.
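To make the three parts concrete, here is a minimal sketch of an eval case as a data structure. The names `EvalCase`, `run_case`, and the refund scenario are hypothetical, chosen only for illustration; they are not taken from any particular eval framework or from NatorOS itself.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One eval: a scenario, a behavior, and a check (all names are illustrative)."""
    scenario: str                  # input in the format the agent sees in production
    behavior: str                  # the rule the agent must obey, in plain language
    check: Callable[[str], bool]   # scores the agent's output against that rule


def run_case(case: EvalCase, agent: Callable[[str], str]) -> bool:
    """Feed the scenario to the agent and apply the check to its output."""
    output = agent(case.scenario)
    return case.check(output)


# Hypothetical case: the rule is about routing, not about an exact string.
refund_case = EvalCase(
    scenario="Customer requests a $4,200 refund on invoice INV-1.",
    behavior="Refunds above $1,000 must be routed to a finance approver.",
    check=lambda output: "finance approver" in output.lower(),
)
```

The `behavior` field is not used by the code at all. It is there for the human reading a failure, which is the point: the case documents the rule as well as testing it.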

Most teams skip the rubric. They write "expected: 'Yes'" and call it done. That tells you nothing about why a different answer was wrong. A good rubric reads like a code-review note: did the agent cite the right policy, did it route to the right approver, did it stop when it should have stopped.
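One way to picture the difference is a rubric made of named criteria instead of a single expected string. The sketch below assumes the agent's output has already been parsed into a dict; the field names (`policy_cited`, `routed_to`, `stopped`) and the policy ID `REFUND-7` are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    """A single review-note-style question about the agent's output."""
    name: str
    passed: Callable[[dict], bool]


# Hypothetical rubric mirroring the three questions above.
RUBRIC = [
    Criterion("cites_right_policy", lambda out: out.get("policy_cited") == "REFUND-7"),
    Criterion("routes_to_right_approver", lambda out: out.get("routed_to") == "finance"),
    Criterion("stops_when_required", lambda out: out.get("stopped") is True),
]


def score(output: dict, rubric: list = RUBRIC) -> dict:
    """Return a pass/fail result per criterion instead of one overall boolean."""
    return {c.name: c.passed(output) for c in rubric}


def failures(output: dict, rubric: list = RUBRIC) -> list:
    """List the names of the criteria the output failed, in rubric order."""
    return [name for name, ok in score(output, rubric).items() if not ok]
```

An output that cites the right policy and stops correctly but routes to the wrong team fails exactly one criterion, and the failure names it. An exact-match check would only say the strings differed.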

We do not invent eval cases at a whiteboard. Every case in every NatorOS eval suite originated in a real production exception. The operator marked it as wrong, or the agent paused and asked, or the approver rejected the output. That moment became a fingerprint. The fingerprint became an eval.
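This post does not spell out what a fingerprint contains, so the following is only one possible reading: a stable ID derived from the production record, attached to a stub that a person then turns into a full eval. Every name here (`ProductionException`, `fingerprint`, `to_eval_stub`) and the choice of a truncated SHA-256 hash are assumptions for illustration.

```python
import hashlib
import json
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    """The three kinds of production exception named above."""
    OPERATOR_MARKED_WRONG = "operator_marked_wrong"
    AGENT_PAUSED = "agent_paused"
    APPROVER_REJECTED = "approver_rejected"


@dataclass(frozen=True)
class ProductionException:
    source: Source
    agent_input: str    # what the agent saw
    agent_output: str   # what the agent produced (or its clarifying question)
    note: str           # what the operator or approver said should have happened


def fingerprint(exc: ProductionException) -> str:
    """Assumed scheme: hash source, input, and output so the same moment maps to the same ID."""
    payload = json.dumps(
        {"source": exc.source.value, "input": exc.agent_input, "output": exc.agent_output},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def to_eval_stub(exc: ProductionException) -> dict:
    """Turn the exception into an eval stub; the check itself still has to be written by a person."""
    return {
        "id": fingerprint(exc),
        "scenario": exc.agent_input,
        "behavior": exc.note,
        "origin": exc.source.value,
    }
```

The stub deliberately has no check. The production record supplies the scenario and a first draft of the behavior; the rubric still has to be written by someone who understands why the output was wrong.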
