
OGS Benchmark Specification

Open Gastronomy Standard — Benchmark v0.1


1. Purpose

The OGS Benchmark (OGS-Bench) defines a reproducible, open methodology for measuring the gastronomy intelligence of large language models and other AI systems against the Open Gastronomy Standard. It is intended to become the official global benchmark for evaluating gastronomy-capable AI across the eight capability categories defined in §3.

OGS-Bench is model-agnostic. Any system that can consume text prompts and produce text (ideally JSON) responses can be scored — including "gpt-5.4", "gpt-4.1", open-source LLMs, rule-based systems, and human panels.


2. Design Principles

| Principle | Description |
| --- | --- |
| Schema-grounded | Every task consumes and/or produces OGS-conformant data. |
| Category-balanced | The benchmark covers eight capability categories; an aggregate score is only meaningful when all categories are attempted. |
| Objective scoring | Each task specifies a deterministic scorer (exact match, set F1, numeric MAE, schema validation, etc.). Free-form judge-based scoring is explicitly out of scope for v0.1. |
| Explainable | Each task declares the capability it targets and the rationale for its gold answer. |
| Extensible | Third parties may publish their own task packs using the same schema under a non-core namespace. |
| Reproducible | Prompts are fully specified by the benchmark runner — there is no hidden prompting. A fixed benchmark_version pins the task set and scoring weights. |
| Ethical & license-clean | All reference data is either original to the OGS project or cited to permissively licensed sources. No proprietary tasting notes are used verbatim. |

3. Capability Categories

OGS-Bench v0.1 defines eight capability categories. Each has its own ID prefix, scoring methods, and weight in the aggregate score.

| Code | Category | What it measures | Default weight |
| --- | --- | --- | --- |
| VOCAB | Controlled Vocabulary | Knowledge of OGS enums: component roles, cooking methods, aromatic families, match types, explanation-code categories | 0.10 |
| SENSE | Sensory Estimation | Predicting basic tastes, structural attributes, and dominant aromatics on the 0–10 OGS scale | 0.15 |
| COMP | Composition Analysis | Identifying component roles, cooking methods, Maillard index, and reduction levels from a dish description | 0.15 |
| PAIR | Pairing Judgment | Selecting the best beverage for a dish from a short list, or ranking candidates | 0.20 |
| XPLAIN | Explanation Codes | Choosing the correct set of positive explanation codes that describe why a known good pairing works | 0.10 |
| RISK | Risk Identification | Detecting negative interactions (tannin-fish clash, intensity mismatch, etc.) in a proposed pairing | 0.10 |
| STRUCT | Structured Output | Producing OGS-valid JSON (ingredient, dish, beverage, or pairing) from a natural-language brief | 0.15 |
| REGION | Regional & Cultural Knowledge | Short-answer knowledge of cuisines, classic dishes, wine regions, grape varieties, and traditional pairings | 0.05 |

Implementers MAY use alternate weights for their own leaderboards, but the default weights above MUST be reported for any result that claims to be an "OGS-Bench v0.1 aggregate score".
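
As a sanity check, the default weights in the table sum to exactly 1.0, which is what puts the aggregate on a 0–100 scale. A minimal sketch (category codes and weights transcribed from the table above; the constant name is illustrative):

```python
# Default OGS-Bench v0.1 category weights, transcribed from the table in §3.
DEFAULT_WEIGHTS = {
    "VOCAB": 0.10, "SENSE": 0.15, "COMP": 0.15, "PAIR": 0.20,
    "XPLAIN": 0.10, "RISK": 0.10, "STRUCT": 0.15, "REGION": 0.05,
}

# The weights must sum to 1.0 so a perfect score aggregates to exactly 100.
assert abs(sum(DEFAULT_WEIGHTS.values()) - 1.0) < 1e-9
```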


4. Task Identifiers

Benchmark task IDs follow the OGS ID convention with a new entity type benchmark-task:

ogs:core:benchmark-task:<category-code>-<NNN>

Example: ogs:core:benchmark-task:pair-004.

The core namespace is reserved for the official task set shipped with this specification. Third-party task packs MUST use a distinct namespace (e.g., ogs:acme:benchmark-task:pair-001).
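
The ID convention above can be checked mechanically. A small sketch, with the caveat that the regex below is an assumption for illustration — the normative grammar is the OGS core ID convention, not this pattern:

```python
import re

# <namespace>:benchmark-task:<category-code>-<NNN>,
# e.g. ogs:core:benchmark-task:pair-004.
# The exact namespace and category grammars are assumed here, not normative.
TASK_ID_RE = re.compile(r"^ogs:([a-z][a-z0-9-]*):benchmark-task:([a-z]+)-(\d{3})$")

def parse_task_id(task_id: str) -> tuple[str, str, int]:
    """Return (namespace, category_code, number) or raise ValueError."""
    m = TASK_ID_RE.match(task_id)
    if not m:
        raise ValueError(f"not a valid benchmark-task ID: {task_id!r}")
    return m.group(1), m.group(2), int(m.group(3))
```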


5. Task Document Structure

Every task is a standalone JSON document matching schema/benchmark-task.schema.json.

5.1 Required fields

| Field | Type | Description |
| --- | --- | --- |
| id | string | OGS benchmark-task identifier |
| ogs_version | string | OGS core version the task targets |
| benchmark_version | string | Benchmark specification version |
| category | string | One of vocabulary, sensory, composition, pairing, explanation, risk, structured_output, regional |
| capability | string | Free-text short label for the specific capability probed |
| difficulty | string | easy, medium, hard, or expert |
| prompt | object | The instruction and input payload shown to the model (see §5.3) |
| expected | object | The gold answer used for scoring (see §5.4) |
| scoring | object | Scoring configuration (see §6) |

5.2 Optional fields

| Field | Type | Description |
| --- | --- | --- |
| name | string | Human-readable task name |
| description | string | Long-form description |
| rationale | string | Explanation of why the gold answer is correct |
| tags | array | Free-form tags (e.g., tannin, wine, italian) |
| references | array | Source citations |
| metadata | object | Standard OGS metadata envelope |

5.3 Prompt object

"prompt": {
  "instruction": "Select the OGS aromatic family for 'lemon zest'.",
  "input": {
    "item": "lemon zest"
  },
  "answer_format": "one_of:fruit,floral,herbal,spice,earth,wood,dairy,savory,confection,vegetal,marine",
  "response_schema": null
}

5.4 Expected object

The shape of expected depends on the scoring type:

| Scoring type | Shape of expected |
| --- | --- |
| exact_match | { "value": <string> } |
| numeric_mae / numeric_rmse | { "value": <number>, "tolerance": <number?> } |
| numeric_vector_mae | { "values": { "<key>": <number>, ... } } |
| set_f1 | { "values": [<string>, ...] } |
| ranked_selection | { "best": <string>, "ranking": [<string>, ...]? } |
| json_match | { "value": <object>, "required_keys": [<string>, ...]? } |
| schema_validation | { "schema_ref": "<path>" } |
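
Putting §5 together, an illustrative task document might look like the following. This example is not part of the official task set; the ID, version strings, capability label, and gold answer are invented for illustration:

```json
{
  "id": "ogs:core:benchmark-task:vocab-001",
  "ogs_version": "0.1.0",
  "benchmark_version": "0.1.0",
  "category": "vocabulary",
  "capability": "aromatic-family lookup",
  "difficulty": "easy",
  "prompt": {
    "instruction": "Select the OGS aromatic family for 'lemon zest'.",
    "input": { "item": "lemon zest" },
    "answer_format": "one_of:fruit,floral,herbal,spice,earth,wood,dairy,savory,confection,vegetal,marine",
    "response_schema": null
  },
  "expected": { "value": "fruit" },
  "scoring": { "type": "exact_match" }
}
```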

6. Scoring

Each task declares exactly one scoring.type. All scorers produce a real-valued score in [0, 1]. The benchmark runner aggregates these at four levels:

  1. Task score — raw scorer output for a single task.
  2. Capability score — unweighted mean of scores across tasks that target the same capability.
  3. Category score — same as capability score for v0.1 (one capability per category for now).
  4. Aggregate score — weighted sum over categories using the weights in §3, expressed on a 0–100 scale.

6.1 Scorer specifications

| Scorer | Definition |
| --- | --- |
| exact_match | 1.0 if the normalized string response equals expected.value, else 0.0. Normalization: lowercase, strip, collapse whitespace. |
| numeric_mae | Score = max(0, 1 - \|pred - expected.value\| / tolerance) using tolerance (default 2.0 on the 0–10 OGS intensity scale). |
| numeric_rmse | Same as numeric_mae but penalizes large errors quadratically. |
| numeric_vector_mae | Mean of per-key numeric_mae scores over the shared keys. Missing keys count as maximum error. |
| set_f1 | Standard set-F1 between the predicted set of tokens/codes and expected.values. |
| ranked_selection | 1.0 for correct top-1 choice; 0.5 if the correct item appears in top-2; 0.0 otherwise. If a ranking is requested, uses NDCG over the provided ranking. |
| json_match | Score is the fraction of required_keys whose predicted values match the expected values (deep-equal for primitives; for numbers, within 10 % relative tolerance). |
| schema_validation | 1.0 if the response validates against the referenced JSON Schema and is internally consistent (no unknown enum values, IDs well-formed). 0.5 for schema-valid but inconsistent; 0.0 otherwise. |
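
The three simplest scorers above are fully determined by their definitions. A minimal sketch of exact_match, numeric_mae, and set_f1 (the reference runner's code may differ in detail; function names are illustrative):

```python
def exact_match(pred: str, expected: str) -> float:
    """1.0 iff the normalized strings match (lowercase, strip, collapse whitespace)."""
    norm = lambda s: " ".join(s.lower().split())
    return 1.0 if norm(pred) == norm(expected) else 0.0

def numeric_mae(pred: float, expected: float, tolerance: float = 2.0) -> float:
    """Linear falloff: full credit at zero error, no credit at >= tolerance."""
    return max(0.0, 1.0 - abs(pred - expected) / tolerance)

def set_f1(pred: set, expected: set) -> float:
    """Standard set-F1 between predicted and gold sets."""
    if not pred or not expected:
        return 1.0 if pred == expected else 0.0
    tp = len(pred & expected)  # true positives
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(expected)
    return 2 * precision * recall / (precision + recall)
```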

6.2 Aggregate formula

Let C = {VOCAB, SENSE, COMP, PAIR, XPLAIN, RISK, STRUCT, REGION} and w_c the weight of category c. Let s_c be the mean of all task scores in category c. Then:

aggregate_score = 100 * Σ_c w_c * s_c

If a category has zero attempted tasks, the aggregate is reported as partial and the missing category weight is redistributed proportionally.
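
The aggregate formula and the proportional-redistribution rule can be sketched together (weights transcribed from §3; the function name and return shape are illustrative, not the runner's API):

```python
DEFAULT_WEIGHTS = {
    "VOCAB": 0.10, "SENSE": 0.15, "COMP": 0.15, "PAIR": 0.20,
    "XPLAIN": 0.10, "RISK": 0.10, "STRUCT": 0.15, "REGION": 0.05,
}

def aggregate(category_means: dict[str, float]) -> tuple[float, bool]:
    """Weighted aggregate on a 0-100 scale.

    category_means maps category code -> mean task score in [0, 1] for the
    attempted categories. When categories are missing, the result is flagged
    partial and their weight is redistributed proportionally over the rest.
    """
    attempted = {c: w for c, w in DEFAULT_WEIGHTS.items() if c in category_means}
    if not attempted:
        raise ValueError("no categories attempted")
    total_w = sum(attempted.values())  # renormalize over attempted categories
    score = 100 * sum(w / total_w * category_means[c] for c, w in attempted.items())
    partial = len(attempted) < len(DEFAULT_WEIGHTS)
    return score, partial
```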


7. Running the Benchmark

The reference runner lives in bench/ and exposes a small CLI:

# List available tasks
python -m bench list

# Run a specific model on all tasks
python -m bench run --model gpt-5.4 --out results/gpt-5.4.json

# Run only a subset of categories
python -m bench run --model gpt-4.1 --categories pairing,risk --out results/gpt-4.1.pair.json

# Compare two result files
python -m bench report results/gpt-5.4.json results/gpt-4.1.json

Model adapters currently supported out of the box:

| Adapter | Model-ID prefixes | Requires |
| --- | --- | --- |
| baseline | baseline, baseline:* | Nothing — deterministic heuristic, used for smoke-testing |
| openai | gpt-, o1-, o3-, o4-, o5- | OPENAI_API_KEY |
| anthropic | claude- | ANTHROPIC_API_KEY |
| echo | echo | Nothing — returns the prompt (for debugging) |

Adapters are pluggable: new ones may be registered via bench.models.register.
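
The exact signature of bench.models.register is defined by the runner, not by this specification. Purely as an illustration of the prefix-based adapter-registry pattern, a self-contained sketch might look like this:

```python
# Illustrative only: a minimal prefix-based adapter registry in the spirit of
# bench.models.register. The real registration API lives in the bench package.
_ADAPTERS: dict[str, type] = {}

def register(prefix: str):
    """Class decorator: route model IDs starting with `prefix` to this adapter."""
    def wrap(cls):
        _ADAPTERS[prefix] = cls
        return cls
    return wrap

def resolve(model_id: str) -> type:
    """Pick the registered adapter with the longest matching prefix."""
    matches = [p for p in _ADAPTERS if model_id.startswith(p)]
    if not matches:
        raise KeyError(f"no adapter for {model_id!r}")
    return _ADAPTERS[max(matches, key=len)]

@register("echo")
class EchoAdapter:
    def complete(self, prompt: str) -> str:
        return prompt  # returns the prompt verbatim, for debugging
```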

Every run produces a result document matching schema/benchmark-result.schema.json, which records the model ID, per-task scores, per-category aggregates, prompts, raw responses, and the benchmark version used.


8. Conformance

A benchmark submission is OGS-Bench v0.1 conformant if it:

  1. Uses the official task set at the declared benchmark_version.
  2. Reports both per-category and aggregate scores using the weights in §3.
  3. Publishes its result document, which MUST include raw model responses and the exact prompts (the reference runner does this by default).
  4. Discloses the adapter and any non-default prompting options (temperature, system message, tool use, etc.).

Submissions MAY additionally report results on third-party task packs, but those do not count toward the OGS-Bench v0.1 aggregate.


9. Versioning

OGS-Bench follows the same Semantic Versioning rules as OGS core.

Every task carries an explicit benchmark_version. Runners MUST refuse to score tasks whose MAJOR version they do not support.


10. References