
OGS Benchmark Specification

Open Gastronomy Standard — Benchmark v0.1


1. Purpose

The OGS Benchmark (OGS-Bench) defines a reproducible, open methodology for measuring the gastronomy intelligence of large language models and other AI systems against the Open Gastronomy Standard. It is intended to become the official global benchmark for evaluating gastronomy-capable AI across the eight capability categories defined in §3.

OGS-Bench is model-agnostic. Any system that can consume text prompts and produce text (ideally JSON) responses can be scored — including "gpt-5.4", "gpt-4.1", open-source LLMs, rule-based systems, and human panels.


2. Design Principles

| Principle | Description |
| --- | --- |
| Schema-grounded | Every task consumes and/or produces OGS-conformant data. |
| Category-balanced | The benchmark covers eight capability categories; an aggregate score is only meaningful when all categories are attempted. |
| Objective scoring | Each task specifies a deterministic scorer (exact match, set F1, numeric MAE, schema validation, etc.). Free-form judge-based scoring is explicitly out of scope for v0.1. |
| Explainable | Each task declares the capability it targets and the rationale for its gold answer. |
| Extensible | Third parties may publish their own task packs using the same schema under a non-core namespace. |
| Reproducible | Prompts are fully specified by the benchmark runner — there is no hidden prompting. A fixed benchmark_version pins the task set and scoring weights. |
| Ethical & license-clean | All reference data is either original to the OGS project or cited to permissively licensed sources. No proprietary tasting notes are used verbatim. |

3. Capability Categories

OGS-Bench v0.1 defines eight capability categories. Each has its own ID prefix, scoring methods, and weight in the aggregate score.

| Code | Category | What it measures | Default weight |
| --- | --- | --- | --- |
| VOCAB | Controlled Vocabulary | Knowledge of OGS enums: component roles, cooking methods, aromatic families, match types, explanation-code categories | 0.10 |
| SENSE | Sensory Estimation | Predicting basic tastes, structural attributes, and dominant aromatics on the 0–10 OGS scale | 0.15 |
| COMP | Composition Analysis | Identifying component roles, cooking methods, Maillard index, and reduction levels from a dish description | 0.15 |
| PAIR | Pairing Judgment | Selecting the best beverage for a dish from a short list, or ranking candidates | 0.20 |
| XPLAIN | Explanation Codes | Choosing the correct set of positive explanation codes that describe why a known good pairing works | 0.10 |
| RISK | Risk Identification | Detecting negative interactions (tannin-fish clash, intensity mismatch, etc.) in a proposed pairing | 0.10 |
| STRUCT | Structured Output | Producing OGS-valid JSON (ingredient, dish, beverage, or pairing) from a natural-language brief | 0.15 |
| REGION | Regional & Cultural Knowledge | Short-answer knowledge of cuisines, classic dishes, wine regions, grape varieties, and traditional pairings | 0.05 |

Implementers MAY use alternate weights for their own leaderboards, but the default weights above MUST be reported for any result that claims to be an "OGS-Bench v0.1 aggregate score".
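
As a sanity check, the default weights in the table sum to exactly 1.0, which is what puts the aggregate on a 0–100 scale. A minimal sketch (category codes and weights transcribed from the table above; the constant name is illustrative):

```python
# Default OGS-Bench v0.1 category weights, transcribed from the table in §3.
DEFAULT_WEIGHTS = {
    "VOCAB": 0.10, "SENSE": 0.15, "COMP": 0.15, "PAIR": 0.20,
    "XPLAIN": 0.10, "RISK": 0.10, "STRUCT": 0.15, "REGION": 0.05,
}

# The weights must sum to 1.0 so a perfect score aggregates to exactly 100.
assert abs(sum(DEFAULT_WEIGHTS.values()) - 1.0) < 1e-9
```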


4. Task Identifiers

Benchmark task IDs follow the OGS ID convention with a new entity type benchmark-task:

ogs:core:benchmark-task:<category-code>-<NNN>

Example: ogs:core:benchmark-task:pair-004.

The core namespace is reserved for the official task set shipped with this specification. Third-party task packs MUST use a distinct namespace (e.g., ogs:acme:benchmark-task:pair-001).
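
The ID convention above can be checked mechanically. A small sketch, with the caveat that the regex below is an assumption for illustration — the normative grammar is the OGS core ID convention, not this pattern:

```python
import re

# <namespace>:benchmark-task:<category-code>-<NNN>,
# e.g. ogs:core:benchmark-task:pair-004.
# The exact namespace and category grammars are assumed here, not normative.
TASK_ID_RE = re.compile(r"^ogs:([a-z][a-z0-9-]*):benchmark-task:([a-z]+)-(\d{3})$")

def parse_task_id(task_id: str) -> tuple[str, str, int]:
    """Return (namespace, category_code, number) or raise ValueError."""
    m = TASK_ID_RE.match(task_id)
    if not m:
        raise ValueError(f"not a valid benchmark-task ID: {task_id!r}")
    return m.group(1), m.group(2), int(m.group(3))
```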


5. Task Document Structure

Every task is a standalone JSON document matching schema/benchmark-task.schema.json.

5.1 Required fields

| Field | Type | Description |
| --- | --- | --- |
| id | string | OGS benchmark-task identifier |
| ogs_version | string | OGS core version the task targets |
| benchmark_version | string | Benchmark specification version |
| category | string | One of vocabulary, sensory, composition, pairing, explanation, risk, structured_output, regional |
| capability | string | Free-text short label for the specific capability probed |
| difficulty | string | easy, medium, hard, or expert |
| prompt | object | The instruction and input payload shown to the model (see §5.3) |
| expected | object | The gold answer used for scoring (see §5.4) |
| scoring | object | Scoring configuration (see §6) |

5.2 Optional fields

| Field | Type | Description |
| --- | --- | --- |
| name | string | Human-readable task name |
| description | string | Long-form description |
| rationale | string | Explanation of why the gold answer is correct |
| tags | array | Free-form tags (e.g., tannin, wine, italian) |
| references | array | Source citations |
| metadata | object | Standard OGS metadata envelope |

5.3 Prompt object

"prompt": {
  "instruction": "Select the OGS aromatic family for 'lemon zest'.",
  "input": {
    "item": "lemon zest"
  },
  "answer_format": "one_of:fruit,floral,herbal,spice,earth,wood,dairy,savory,confection,vegetal,marine",
  "response_schema": null
}

5.4 Expected object

The shape of expected depends on the scoring type:

| Scoring type | Shape of expected |
| --- | --- |
| exact_match | { "value": <string> } |
| numeric_mae / numeric_rmse | { "value": <number>, "tolerance": <number?> } |
| numeric_vector_mae | { "values": { "<key>": <number>, ... } } |
| set_f1 | { "values": [<string>, ...] } |
| ranked_selection | { "best": <string>, "ranking": [<string>, ...]? } |
| json_match | { "value": <object>, "required_keys": [<string>, ...]? } |
| schema_validation | { "schema_ref": "<path>" } |
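
Putting §5 together, an illustrative task document might look like the following. This example is not part of the official task set; the ID, version strings, capability label, and gold answer are invented for illustration:

```json
{
  "id": "ogs:core:benchmark-task:vocab-001",
  "ogs_version": "0.1.0",
  "benchmark_version": "0.1.0",
  "category": "vocabulary",
  "capability": "aromatic-family lookup",
  "difficulty": "easy",
  "prompt": {
    "instruction": "Select the OGS aromatic family for 'lemon zest'.",
    "input": { "item": "lemon zest" },
    "answer_format": "one_of:fruit,floral,herbal,spice,earth,wood,dairy,savory,confection,vegetal,marine",
    "response_schema": null
  },
  "expected": { "value": "fruit" },
  "scoring": { "type": "exact_match" }
}
```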

6. Scoring

Each task declares exactly one scoring.type. All scorers produce a real-valued score in [0, 1]. The benchmark runner aggregates these at four levels:

  1. Task score — raw scorer output for a single task.
  2. Capability score — unweighted mean of scores across tasks that target the same capability.
  3. Category score — same as capability score for v0.1 (one capability per category for now).
  4. Aggregate score — weighted sum over categories using the weights in §3, expressed on a 0–100 scale.

6.1 Scorer specifications

| Scorer | Definition |
| --- | --- |
| exact_match | 1.0 if the normalized string response equals expected.value, else 0.0. Normalization: lowercase, strip, collapse whitespace. |
| numeric_mae | Score = max(0, 1 - \|pred - expected.value\| / tolerance) using tolerance (default 2.0 on the 0–10 OGS intensity scale). |
| numeric_rmse | Same as numeric_mae but penalizes large errors quadratically. |
| numeric_vector_mae | Mean of per-key numeric_mae scores over the shared keys. Missing keys count as maximum error. |
| set_f1 | Standard set-F1 between the predicted set of tokens/codes and expected.values. |
| ranked_selection | 1.0 for correct top-1 choice; 0.5 if the correct item appears in top-2; 0.0 otherwise. If a ranking is requested, uses NDCG over the provided ranking. |
| json_match | Score is the fraction of required_keys whose predicted values match the expected values (deep-equal for primitives; for numbers, within 10 % relative tolerance). |
| schema_validation | 1.0 if the response validates against the referenced JSON Schema and is internally consistent (no unknown enum values, IDs well-formed). 0.5 for schema-valid but inconsistent; 0.0 otherwise. |
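
The three simplest scorers above are fully determined by their definitions. A minimal sketch of exact_match, numeric_mae, and set_f1 (the reference runner's code may differ in detail; function names are illustrative):

```python
def exact_match(pred: str, expected: str) -> float:
    """1.0 iff the normalized strings match (lowercase, strip, collapse whitespace)."""
    norm = lambda s: " ".join(s.lower().split())
    return 1.0 if norm(pred) == norm(expected) else 0.0

def numeric_mae(pred: float, expected: float, tolerance: float = 2.0) -> float:
    """Linear falloff: full credit at zero error, no credit at >= tolerance."""
    return max(0.0, 1.0 - abs(pred - expected) / tolerance)

def set_f1(pred: set, expected: set) -> float:
    """Standard set-F1 between predicted and gold sets."""
    if not pred or not expected:
        return 1.0 if pred == expected else 0.0
    tp = len(pred & expected)  # true positives
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(expected)
    return 2 * precision * recall / (precision + recall)
```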

6.2 Aggregate formula

Let C = {VOCAB, SENSE, COMP, PAIR, XPLAIN, RISK, STRUCT, REGION} and w_c the weight of category c. Let s_c be the mean of all task scores in category c. Then:

aggregate_score = 100 * Σ_c w_c * s_c

If a category has zero attempted tasks, the aggregate is reported as partial and the missing category weight is redistributed proportionally.
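
The aggregate formula and the proportional-redistribution rule can be sketched together (weights transcribed from §3; the function name and return shape are illustrative, not the runner's API):

```python
DEFAULT_WEIGHTS = {
    "VOCAB": 0.10, "SENSE": 0.15, "COMP": 0.15, "PAIR": 0.20,
    "XPLAIN": 0.10, "RISK": 0.10, "STRUCT": 0.15, "REGION": 0.05,
}

def aggregate(category_means: dict[str, float]) -> tuple[float, bool]:
    """Weighted aggregate on a 0-100 scale.

    category_means maps category code -> mean task score in [0, 1] for the
    attempted categories. When categories are missing, the result is flagged
    partial and their weight is redistributed proportionally over the rest.
    """
    attempted = {c: w for c, w in DEFAULT_WEIGHTS.items() if c in category_means}
    if not attempted:
        raise ValueError("no categories attempted")
    total_w = sum(attempted.values())  # renormalize over attempted categories
    score = 100 * sum(w / total_w * category_means[c] for c, w in attempted.items())
    partial = len(attempted) < len(DEFAULT_WEIGHTS)
    return score, partial
```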


7. Running the Benchmark

The reference runner lives in bench/ and exposes a small CLI:

# List available tasks
python -m bench list

# Run a specific model on all tasks
python -m bench run --model gpt-5.4 --out results/gpt-5.4.json

# Run only a subset of categories
python -m bench run --model gpt-4.1 --categories pairing,risk --out results/gpt-4.1.pair.json

# Compare two result files
python -m bench report results/gpt-5.4.json results/gpt-4.1.json

Model adapters currently supported out of the box:

| Adapter | Model-ID prefixes | Requires |
| --- | --- | --- |
| baseline | baseline, baseline:* | Nothing — deterministic heuristic, used for smoke-testing |
| openai | gpt-, o1-, o3-, o4-, o5- | OPENAI_API_KEY |
| anthropic | claude- | ANTHROPIC_API_KEY |
| echo | echo | Nothing — returns the prompt (for debugging) |

Adapters are pluggable: new ones may be registered via bench.models.register.
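
The exact signature of bench.models.register is defined by the runner, not by this specification. Purely as an illustration of the prefix-based adapter-registry pattern, a self-contained sketch might look like this:

```python
# Illustrative only: a minimal prefix-based adapter registry in the spirit of
# bench.models.register. The real registration API lives in the bench package.
_ADAPTERS: dict[str, type] = {}

def register(prefix: str):
    """Class decorator: route model IDs starting with `prefix` to this adapter."""
    def wrap(cls):
        _ADAPTERS[prefix] = cls
        return cls
    return wrap

def resolve(model_id: str) -> type:
    """Pick the registered adapter with the longest matching prefix."""
    matches = [p for p in _ADAPTERS if model_id.startswith(p)]
    if not matches:
        raise KeyError(f"no adapter for {model_id!r}")
    return _ADAPTERS[max(matches, key=len)]

@register("echo")
class EchoAdapter:
    def complete(self, prompt: str) -> str:
        return prompt  # returns the prompt verbatim, for debugging
```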

Every run produces a result document matching schema/benchmark-result.schema.json, which records the model ID, per-task scores, per-category aggregates, prompts, raw responses, and the benchmark version used.


8. Conformance

A benchmark submission is OGS-Bench v0.1 conformant if it:

  1. Uses the official task set at the declared benchmark_version.
  2. Reports both per-category and aggregate scores using the weights in §3.
  3. Publishes its result document, which MUST include raw model responses and the exact prompts (the reference runner does this by default).
  4. Discloses the adapter and any non-default prompting options (temperature, system message, tool use, etc.).

Submissions MAY additionally report results on third-party task packs, but those do not count toward the OGS-Bench v0.1 aggregate.


9. Versioning

OGS-Bench follows the same Semantic Versioning rules as OGS core.

Every task carries an explicit benchmark_version. Runners MUST refuse to score tasks whose MAJOR version they do not support.


10. References