Eval Driven Design
This article is based on the OpenAI cookbook on eval-driven system design. It is written mostly without AI, since I want to get back into proper writing.
My Key Takeaways
The points I found most insightful were:
- Connecting evals to business metrics, and business metrics back to evals
- Using Graders
- The flywheel setup
Connecting Evals to Business
We need to imbue our measurements with a raison d'être, a reason to exist. Besides the evaluation, we need a set of business metrics. Always connect the model's eval performance to US dollars (money saved for the company, money made for the company); otherwise you cannot measure your impact. Write this connection as a simple parametrized Python function that takes the false-positive and false-negative rates from your eval results as parameters.
Therefore, use the classic McKinsey-style calculations:
- our company processes 1 million receipts a year, at a baseline cost of $0.20 / receipt = $200,000
- auditing a receipt costs about $2
- failing to audit a receipt we should have audited costs an average of $30
- 5% of receipts need to be audited = 50,000 receipts
- the existing process identifies receipts that need to be audited 97% of the time, and misidentifies receipts that don't need to be audited 2% of the time
This gives us two baseline comparisons:
- if we identified every receipt correctly, we would spend $100,000 on audits
- our current process spends $135,000 on audits (67,500 audits at $2 each, including false positives) and loses $45,000 to un-audited expenses
- on top of that, the human-driven process costs an additional $200,000
We're expecting our service to save money by costing less to run (≈1¢/receipt if we use the prompts shown later in this article with o4-mini), but whether we save or lose money on audits and missed audits depends on how well our system performs. It might be worth writing this as a simple function; the version below includes the factors above but neglects nuance and ignores development, maintenance, and serving costs.
def calculate_costs(fp_rate: float, fn_rate: float, per_receipt_cost: float) -> float:
    # Fixed business assumptions from the McKinsey-style estimate above.
    audit_cost = 2
    missed_audit_cost = 30
    receipt_count = 1e6
    audit_fraction = 0.05

    needs_audit_count = receipt_count * audit_fraction
    no_needs_audit_count = receipt_count - needs_audit_count

    missed_audits = needs_audit_count * fn_rate
    total_audits = needs_audit_count * (1 - fn_rate) + no_needs_audit_count * fp_rate

    total_audit_cost = total_audits * audit_cost
    total_missed_audit_cost = missed_audits * missed_audit_cost
    processing_cost = receipt_count * per_receipt_cost

    return total_audit_cost + total_missed_audit_cost + processing_cost

perfect_system_cost = calculate_costs(0, 0, 0)
current_system_cost = calculate_costs(0.02, 0.03, 0.20)
print(f"Perfect system cost: ${perfect_system_cost:,.0f}")
print(f"Current system cost: ${current_system_cost:,.0f}")
Then ALWAYS connect these numbers back to the model / eval metrics to inform your model development efforts. Their example is:
_The point of the above model is it lets us apply meaning to an eval that would otherwise just be a number. For instance, when we ran the system above we were wrong 85% of the time for merchant names. But digging in, it seems like most instances are capitalization issues or “Shell Gasoline” vs. “Shell Oil #2144” — problems that when we follow through, do not appear to affect our audit decision or change our fundamental costs.
On the other hand, it seems like we fail to catch handwritten “X”s on receipts about half the time, and about half of the time when there’s an “X” on a receipt that gets missed, it results in a receipt not getting audited when it should. Those are overrepresented in our dataset, but if that makes up even 1% of receipts, that 50% failure would cost us $75,000 a year.
Similarly, it seems like we have OCR errors that cause us to audit receipts quite often on account of the math not working out, up to 20% of the time. This could cost us almost $400,000!_
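To turn findings like these into model-development priorities, plug candidate error rates into calculate_costs and compare against the current process. A minimal sketch, using hypothetical eval results (the rates below are illustrative, not from the cookbook):

# Hypothetical eval results for two candidate prompts (illustrative, not from the cookbook).
candidates = {
    "prompt_v1": {"fp_rate": 0.05, "fn_rate": 0.10},
    "prompt_v2": {"fp_rate": 0.03, "fn_rate": 0.04},
}
llm_per_receipt_cost = 0.01  # roughly 1 cent per receipt, per the estimate above

for name, rates in candidates.items():
    cost = calculate_costs(rates["fp_rate"], rates["fn_rate"], llm_per_receipt_cost)
    savings = current_system_cost - cost
    print(f"{name}: total ${cost:,.0f}, saves ${savings:,.0f} vs. the current process")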
Using Graders
OpenAI uses graders to score evaluations. They provide narrow checks for specific fields or some portion of the predicted output.
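The cookbook wires these up through the OpenAI platform; as a rough, platform-agnostic sketch (my own, not the cookbook's grader API), a field-level grader can be as simple as a function that scores one field of the predicted extraction against the ground truth:

def grade_merchant(predicted: dict, expected: dict) -> float:
    """Exact match on the merchant name, ignoring case and surrounding whitespace."""
    pred, exp = predicted.get("merchant"), expected.get("merchant")
    if pred is None or exp is None:
        return float(pred == exp)
    return float(pred.strip().lower() == exp.strip().lower())

def grade_total(predicted: dict, expected: dict) -> float:
    """Exact string match on the receipt total."""
    return float(predicted.get("total") == expected.get("total"))

def run_graders(predicted: dict, expected: dict) -> dict[str, float]:
    """Run every narrow grader and return a per-field score between 0 and 1."""
    graders = {"merchant": grade_merchant, "total": grade_total}
    return {name: grader(predicted, expected) for name, grader in graders.items()}

Keeping each grader narrow makes it obvious which field regressed when an aggregate score drops.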
The Flywheel Setup
Importantly, you want a flywheel setup even as you grow from ~20 examples to more: each run of the system surfaces failures that, once reviewed and labeled, flow back into the eval set, which in turn drives the next round of prompt and model improvements. A rough sketch of such a loop is shown below.
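A minimal sketch of one turn of that flywheel, assuming a JSONL eval set and reusing the graders above plus the extract_receipt_details function introduced later in this article (the file names and record layout are my own, not the cookbook's):

import json
from pathlib import Path

# Hypothetical layout: each line of the eval set pairs a receipt image with an
# expected (hand-labeled) extraction. Both file names are placeholders.
EVAL_SET = Path("eval_set.jsonl")
FAILURES = Path("failures_to_review.jsonl")

async def run_flywheel_pass() -> None:
    failures = []
    for line in EVAL_SET.read_text().splitlines():
        example = json.loads(line)
        predicted = await extract_receipt_details(example["image_path"])
        scores = run_graders(predicted.model_dump(), example["expected"])
        if any(score < 1.0 for score in scores.values()):
            failures.append(
                {"example": example, "predicted": predicted.model_dump(), "scores": scores}
            )
    # Failures get reviewed by a human; corrected labels are appended back into
    # EVAL_SET, and the patterns they reveal drive the next prompt/model change.
    FAILURES.write_text("\n".join(json.dumps(f) for f in failures))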
The Improvement Process
Furthermore, they describe an improvement process for increasing the eval performance of the model.
General Functions
The team from Fractional (a consultancy) and OpenAI introduces a fictional receipt processing and audit classification system. It is built around two functions:
- extract_receipt_details: extracts the Pydantic-defined structured output from a receipt image
- evaluate_receipt_for_audit: decides, from those extracted details, whether the receipt needs to be audited
The extraction function
The Pydantic schema simply captures the merchant, its location, and the receipt details with their line items:
from pydantic import BaseModel

class Location(BaseModel):
    city: str | None
    state: str | None
    zipcode: str | None

class LineItem(BaseModel):
    description: str | None
    product_code: str | None
    category: str | None
    item_price: str | None
    sale_price: str | None
    quantity: str | None
    total: str | None

class ReceiptDetails(BaseModel):
    merchant: str | None
    location: Location
    time: str | None
    items: list[LineItem]
    subtotal: str | None
    tax: str | None
    total: str | None
    handwritten_notes: list[str]
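For a sense of the target shape, here is a hand-written example instance (all values are made up for illustration):

example = ReceiptDetails(
    merchant="Shell Oil #2144",
    location=Location(city="Austin", state="TX", zipcode="78701"),
    time="2024-03-14T09:32:00",
    items=[
        LineItem(
            description="Unleaded fuel", product_code=None, category="Fuel",
            item_price="3.49", sale_price=None, quantity="12.5", total="43.63",
        )
    ],
    subtotal="43.63",
    tax="0.00",
    total="43.63",
    handwritten_notes=["client visit"],
)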
The basic v0 prompt asks the model to identify all of these elements from the receipt image:
BASIC_PROMPT = """
Given an image of a retail receipt, extract all relevant information and format it as a structured response.
# Task Description
Carefully examine the receipt image and identify the following key information:
1. Merchant name and any relevant store identification
2. Location information (city, state, ZIP code)
3. Date and time of purchase
4. All purchased items with their:
* Item description/name
* Item code/SKU (if present)
* Category (infer from context if not explicit)
* Regular price per item (if available)
* Sale price per item (if discounted)
* Quantity purchased
* Total price for the line item
5. Financial summary:
* Subtotal before tax
* Tax amount
* Final total
6. Any handwritten notes or annotations on the receipt (list each separately)
## Important Guidelines
* If information is unclear or missing, return null for that field
* Format dates as ISO format (YYYY-MM-DDTHH:MM:SS)
* Format all monetary values as decimal numbers
* Distinguish between printed text and handwritten notes
* Be precise with amounts and totals
* For ambiguous items, use your best judgment based on context
Your response should be structured and complete, capturing all available information
from the receipt.
"""
And the actual extraction function just puts these together, inferring the mime_type from the image path.
import base64
import mimetypes  # for inferring the image MIME type
from pathlib import Path

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def extract_receipt_details(
    image_path: str, model: str = "o4-mini"
) -> ReceiptDetails:
    """Extract structured details from a receipt image."""
    # Determine image type for data URI.
    mime_type, _ = mimetypes.guess_type(image_path)

    # Read and base64 encode the image.
    b64_image = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    image_data_url = f"data:{mime_type};base64,{b64_image}"

    response = await client.responses.parse(
        model=model,
        input=[
            {
                "role": "user",
                "content": [
                    {"type": "input_text", "text": BASIC_PROMPT},
                    {"type": "input_image", "image_url": image_data_url},
                ],
            }
        ],
        text_format=ReceiptDetails,
    )

    return response.output_parsed
Audit Portion
The audit prompt defines a rubric of four criteria for evaluating the receipt:
from pydantic import BaseModel, Field
audit_prompt = """
Evaluate this receipt data to determine if it needs to be audited based on the following
criteria:
1. NOT_TRAVEL_RELATED:
- IMPORTANT: For this criterion, travel-related expenses include but are not limited
to: gas, hotel, airfare, or car rental.
- If the receipt IS for a travel-related expense, set this to FALSE.
- If the receipt is NOT for a travel-related expense (like office supplies), set this
to TRUE.
- In other words, if the receipt shows FUEL/GAS, this would be FALSE because gas IS
travel-related.
2. AMOUNT_OVER_LIMIT: The total amount exceeds $50
3. MATH_ERROR: The math for computing the total doesn't add up (line items don't sum to
total)
4. HANDWRITTEN_X: There is an "X" in the handwritten notes
For each criterion, determine if it is violated (true) or not (false). Provide your
reasoning for each decision, and make a final determination on whether the receipt needs
auditing. A receipt needs auditing if ANY of the criteria are violated.
Return a structured response with your evaluation.
"""
The Pydantic schema then mirrors the rubric: four boolean criteria (all of which must be false for the receipt to skip an audit), a reasoning field, and the final audit decision.
class AuditDecision(BaseModel):
    not_travel_related: bool = Field(
        description="True if the receipt is not travel-related"
    )
    amount_over_limit: bool = Field(description="True if the total amount exceeds $50")
    math_error: bool = Field(description="True if there are math errors in the receipt")
    handwritten_x: bool = Field(
        description="True if there is an 'X' in the handwritten notes"
    )
    reasoning: str = Field(description="Explanation for the audit decision")
    needs_audit: bool = Field(
        description="Final determination if receipt needs auditing"
    )
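Two of the four criteria (the amount limit and the math check) are mechanical, so it can be worth re-deriving them directly from the extracted fields and grading the model's AuditDecision against that. A small sketch of my own, not from the cookbook, using the schemas above:

from decimal import Decimal, InvalidOperation

def check_mechanical_criteria(details: ReceiptDetails) -> dict[str, bool]:
    """Recompute the two mechanical audit criteria directly from the extraction."""

    def to_decimal(value: str | None) -> Decimal | None:
        try:
            return Decimal(value) if value is not None else None
        except InvalidOperation:
            return None

    total = to_decimal(details.total)
    amount_over_limit = total is not None and total > Decimal("50")

    # Simplification: compare the sum of line-item totals against the pre-tax subtotal.
    line_totals = [to_decimal(item.total) for item in details.items]
    subtotal = to_decimal(details.subtotal)
    math_error = (
        subtotal is not None
        and all(t is not None for t in line_totals)
        and sum(line_totals, Decimal("0")) != subtotal
    )

    return {"amount_over_limit": amount_over_limit, "math_error": math_error}

Disagreements between these recomputed flags and the model's booleans make for cheap, narrow graders on the audit step.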
Finally, the evaluate_receipt_for_audit function is pretty standard: it serializes the receipt details to JSON and sends them to the model.
async def evaluate_receipt_for_audit(
    receipt_details: ReceiptDetails, model: str = "o4-mini"
) -> AuditDecision:
    """Determine if a receipt needs to be audited based on defined criteria."""
    # Convert receipt details to JSON for the prompt.
    receipt_json = receipt_details.model_dump_json(indent=2)

    response = await client.responses.parse(
        model=model,
        input=[
            {
                "role": "user",
                "content": [
                    {"type": "input_text", "text": audit_prompt},
                    {"type": "input_text", "text": f"Receipt details:\n{receipt_json}"},
                ],
            }
        ],
        text_format=AuditDecision,
    )

    return response.output_parsed
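Putting the two calls together, a single receipt flows through the pipeline roughly like this (the image path is a placeholder):

import asyncio

async def process_receipt(image_path: str) -> AuditDecision:
    """Run extraction and the audit decision back to back for one receipt."""
    details = await extract_receipt_details(image_path)
    return await evaluate_receipt_for_audit(details)

if __name__ == "__main__":
    decision = asyncio.run(process_receipt("receipts/example.jpg"))  # placeholder path
    print(decision.model_dump_json(indent=2))

From a batch of such decisions compared against labels, the false-positive and false-negative rates feed straight back into calculate_costs from the beginning of the article, closing the loop between evals and dollars.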