OGS Benchmark Specification
Open Gastronomy Standard — Benchmark v0.1
1. Purpose
The OGS Benchmark (OGS-Bench) defines a reproducible, open methodology for measuring the gastronomy intelligence of large language models and other AI systems against the Open Gastronomy Standard. It is intended to become the official global benchmark for evaluating gastronomy-capable AI across:
- Taste, aroma, and texture reasoning
- Dish composition and cooking-method analysis
- Food-and-beverage pairing judgment and explainability
- Knowledge of classical cuisines, ingredients, and wine regions
- Structured, schema-conformant JSON generation against OGS itself
OGS-Bench is model-agnostic. Any system that can consume text prompts and produce text (ideally JSON) responses can be scored — including "gpt-5.4", "gpt-4.1", open-source LLMs, rule-based systems, and human panels.
2. Design Principles
| Principle | Description |
|---|---|
| Schema-grounded | Every task consumes and/or produces OGS-conformant data. |
| Category-balanced | The benchmark covers eight capability categories; an aggregate score is only meaningful when all categories are attempted. |
| Objective scoring | Each task specifies a deterministic scorer (exact match, set F1, numeric MAE, schema validation, etc.). Free-form judge-based scoring is explicitly out of scope for v0.1. |
| Explainable | Each task declares the capability it targets and the rationale for its gold answer. |
| Extensible | Third parties may publish their own task packs using the same schema under a non-core namespace. |
| Reproducible | Prompts are fully specified by the benchmark runner — there is no hidden prompting. A fixed benchmark_version pins the task set and scoring weights. |
| Ethical & license-clean | All reference data is either original to the OGS project or cited to permissively licensed sources. No proprietary tasting notes are used verbatim. |
3. Capability Categories
OGS-Bench v0.1 defines eight capability categories. Each has its own ID prefix, scoring methods, and weight in the aggregate score.
| Code | Category | What it measures | Default weight |
|---|---|---|---|
| `VOCAB` | Controlled Vocabulary | Knowledge of OGS enums: component roles, cooking methods, aromatic families, match types, explanation-code categories | 0.10 |
| `SENSE` | Sensory Estimation | Predicting basic tastes, structural attributes, and dominant aromatics on the 0–10 OGS scale | 0.15 |
| `COMP` | Composition Analysis | Identifying component roles, cooking methods, Maillard index, and reduction levels from a dish description | 0.15 |
| `PAIR` | Pairing Judgment | Selecting the best beverage for a dish from a short list, or ranking candidates | 0.20 |
| `XPLAIN` | Explanation Codes | Choosing the correct set of positive explanation codes that describe why a known good pairing works | 0.10 |
| `RISK` | Risk Identification | Detecting negative interactions (tannin–fish clash, intensity mismatch, etc.) in a proposed pairing | 0.10 |
| `STRUCT` | Structured Output | Producing OGS-valid JSON (ingredient, dish, beverage, or pairing) from a natural-language brief | 0.15 |
| `REGION` | Regional & Cultural Knowledge | Short-answer knowledge of cuisines, classic dishes, wine regions, grape varieties, and traditional pairings | 0.05 |
Implementers MAY use alternate weights for their own leaderboards, but the default weights above MUST be reported for any result that claims to be an "OGS-Bench v0.1 aggregate score".
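Because the aggregate in §6.2 is a weighted sum reported on a 0–100 scale, the default weights must form a proper distribution. A minimal sanity check — the dictionary below simply transcribes the table above and is illustrative, not part of the normative spec:

```python
# Default OGS-Bench v0.1 category weights, transcribed from the table above.
DEFAULT_WEIGHTS = {
    "VOCAB": 0.10, "SENSE": 0.15, "COMP": 0.15, "PAIR": 0.20,
    "XPLAIN": 0.10, "RISK": 0.10, "STRUCT": 0.15, "REGION": 0.05,
}

# The eight weights sum to exactly 1.0, so a perfect run scores 100.
assert abs(sum(DEFAULT_WEIGHTS.values()) - 1.0) < 1e-9
```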
4. Task Identifiers
Benchmark task IDs follow the OGS ID convention with a new entity type, `benchmark-task`:

```
ogs:core:benchmark-task:<category-code>-<NNN>
```
Example: ogs:core:benchmark-task:pair-004.
The core namespace is reserved for the official task set shipped with this
specification. Third-party task packs MUST use a distinct namespace
(e.g., ogs:acme:benchmark-task:pair-001).
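The convention above can be checked mechanically. The regex below is a sketch: the spec does not pin down the exact character set for namespaces or category codes, so those rules are assumptions for illustration.

```python
import re

# Assumed character rules: lowercase namespaces and category codes,
# three-digit task numbers (matching the <NNN> placeholder above).
TASK_ID = re.compile(r"^ogs:[a-z][a-z0-9-]*:benchmark-task:[a-z]+-\d{3}$")

def is_core_task(task_id: str) -> bool:
    """True only for tasks in the reserved 'core' namespace."""
    return task_id.startswith("ogs:core:benchmark-task:")
```

For example, `ogs:core:benchmark-task:pair-004` matches and is core, while `ogs:acme:benchmark-task:pair-001` matches but is third-party.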
5. Task Document Structure
Every task is a standalone JSON document matching
schema/benchmark-task.schema.json.
5.1 Required fields
| Field | Type | Description |
|---|---|---|
| `id` | string | OGS benchmark-task identifier |
| `ogs_version` | string | OGS core version the task targets |
| `benchmark_version` | string | Benchmark specification version |
| `category` | string | One of `vocabulary`, `sensory`, `composition`, `pairing`, `explanation`, `risk`, `structured_output`, `regional` |
| `capability` | string | Free-text short label for the specific capability probed |
| `difficulty` | string | `easy`, `medium`, `hard`, or `expert` |
| `prompt` | object | The instruction and input payload shown to the model (see §5.3) |
| `expected` | object | The gold answer used for scoring (see §5.4) |
| `scoring` | object | Scoring configuration (see §6) |
5.2 Optional fields
| Field | Type | Description |
|---|---|---|
| `name` | string | Human-readable task name |
| `description` | string | Long-form description |
| `rationale` | string | Explanation of why the gold answer is correct |
| `tags` | array | Free-form tags (e.g., `tannin`, `wine`, `italian`) |
| `references` | array | Source citations |
| `metadata` | object | Standard OGS metadata envelope |
5.3 Prompt object
```json
"prompt": {
  "instruction": "Select the OGS aromatic family for 'lemon zest'.",
  "input": {
    "item": "lemon zest"
  },
  "answer_format": "one_of:fruit,floral,herbal,spice,earth,wood,dairy,savory,confection,vegetal,marine",
  "response_schema": null
}
```
- `instruction`: always required. This is the core ask to the model.
- `input`: required, but MAY be an empty object. Contains the structured payload the model should reason about.
- `answer_format`: an advisory string describing how the response should be formatted (e.g., `json`, `one_of:a,b,c`, `integer_0_10`, `set_of_codes`).
- `response_schema`: optional JSON Schema the response will be validated against when `scoring.type` is `schema_validation` or `json_match`.
5.4 Expected object
The shape of expected depends on the scoring type:
| Scoring type | Shape of expected |
|---|---|
| `exact_match` | `{ "value": <string> }` |
| `numeric_mae` / `numeric_rmse` | `{ "value": <number>, "tolerance": <number?> }` |
| `numeric_vector_mae` | `{ "values": { "<key>": <number>, ... } }` |
| `set_f1` | `{ "values": [<string>, ...] }` |
| `ranked_selection` | `{ "best": <string>, "ranking": [<string>, ...]? }` |
| `json_match` | `{ "value": <object>, "required_keys": [<string>, ...]? }` |
| `schema_validation` | `{ "schema_ref": "<path>" }` |
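Putting §5.1–§5.4 together, an illustrative (non-normative) task document might look like the following; the concrete values are invented for illustration only:

```json
{
  "id": "ogs:core:benchmark-task:sense-001",
  "ogs_version": "0.1.0",
  "benchmark_version": "0.1.0",
  "category": "sensory",
  "capability": "citrus acidity estimation",
  "difficulty": "easy",
  "prompt": {
    "instruction": "Estimate the sourness of fresh lemon juice on the 0-10 OGS scale.",
    "input": { "item": "fresh lemon juice" },
    "answer_format": "integer_0_10",
    "response_schema": null
  },
  "expected": { "value": 9, "tolerance": 2.0 },
  "scoring": { "type": "numeric_mae" }
}
```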
6. Scoring
Each task declares exactly one scoring.type. All scorers produce a real-valued
score in [0, 1]. The benchmark runner aggregates these at four levels:
- Task score — raw scorer output for a single task.
- Capability score — unweighted mean of task scores in the same category.
- Category score — same as capability score for v0.1 (one capability per category for now).
- Aggregate score — weighted sum over categories using the weights in §3, expressed on a 0–100 scale.
6.1 Scorer specifications
| Scorer | Definition |
|---|---|
| `exact_match` | 1.0 if the normalized string response equals `expected.value`, else 0.0. Normalization: lowercase, strip, collapse whitespace. |
| `numeric_mae` | Score = max(0, 1 − \|pred − expected.value\| / tolerance), using `tolerance` (default 2.0 on the 0–10 OGS intensity scale). |
| `numeric_rmse` | Same as `numeric_mae` but penalizes large errors quadratically. |
| `numeric_vector_mae` | Mean of per-key `numeric_mae` scores over the shared keys. Missing keys count as maximum error. |
| `set_f1` | Standard set F1 between the predicted set of tokens/codes and `expected.values`. |
| `ranked_selection` | 1.0 for the correct top-1 choice; 0.5 if the correct item appears in the top 2; 0.0 otherwise. If a ranking is requested, uses NDCG over the provided ranking. |
| `json_match` | Score is the fraction of `required_keys` whose predicted values match the expected values (deep-equal for primitives; for numbers, within 10 % relative tolerance). |
| `schema_validation` | 1.0 if the response validates against the referenced JSON Schema and is internally consistent (no unknown enum values, well-formed IDs). 0.5 for schema-valid but inconsistent; 0.0 otherwise. |
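The first few scorers can be sketched in a few lines of Python. These are illustrative reimplementations of the definitions above, not the reference scorers shipped in bench/:

```python
def exact_match(response: str, expected: dict) -> float:
    """1.0 if the normalized response equals expected['value'], else 0.0."""
    norm = lambda s: " ".join(s.lower().split())  # lowercase, strip, collapse whitespace
    return 1.0 if norm(response) == norm(expected["value"]) else 0.0

def numeric_mae(pred: float, expected: dict) -> float:
    """Linear falloff: full credit at zero error, zero credit at >= tolerance."""
    tol = expected.get("tolerance", 2.0)  # default on the 0-10 OGS scale
    return max(0.0, 1.0 - abs(pred - expected["value"]) / tol)

def set_f1(pred: set, expected: dict) -> float:
    """Standard F1 between the predicted and gold sets of tokens/codes."""
    gold = set(expected["values"])
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, a prediction of 7 against a gold value of 6 with the default tolerance of 2.0 scores 0.5 under `numeric_mae`.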
6.2 Aggregate formula
Let C = {VOCAB, SENSE, COMP, PAIR, XPLAIN, RISK, STRUCT, REGION} and
w_c the weight of category c. Let s_c be the mean of all task scores in
category c. Then:
aggregate_score = 100 * Σ_c w_c * s_c
If a category has zero attempted tasks, the aggregate is reported as
partial and the missing category weight is redistributed proportionally.
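The aggregate formula, including the proportional redistribution for unattempted categories, can be sketched as follows. The weight values transcribe §3; the function itself is illustrative, not the reference runner's code:

```python
# Default OGS-Bench v0.1 category weights from section 3.
DEFAULT_WEIGHTS = {
    "VOCAB": 0.10, "SENSE": 0.15, "COMP": 0.15, "PAIR": 0.20,
    "XPLAIN": 0.10, "RISK": 0.10, "STRUCT": 0.15, "REGION": 0.05,
}

def aggregate_score(category_means: dict) -> float:
    """category_means maps category code -> mean task score s_c in [0, 1].

    Missing categories have their weight redistributed proportionally
    across the attempted ones; the result is on a 0-100 scale.
    """
    attempted = {c: w for c, w in DEFAULT_WEIGHTS.items() if c in category_means}
    total = sum(attempted.values())  # < 1.0 whenever a category is missing
    return 100 * sum((w / total) * category_means[c] for c, w in attempted.items())
```

A run with a perfect 1.0 in every category scores 100; a partial run covering only `PAIR` with s_PAIR = 0.5 scores 50, since the full weight mass is redistributed onto that one category.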
7. Running the Benchmark
The reference runner lives in bench/ and exposes a small CLI:
```shell
# List available tasks
python -m bench list

# Run a specific model on all tasks
python -m bench run --model gpt-5.4 --out results/gpt-5.4.json

# Run only a subset of categories
python -m bench run --model gpt-4.1 --categories pairing,risk --out results/gpt-4.1.pair.json

# Compare two result files
python -m bench report results/gpt-5.4.json results/gpt-4.1.json
```
Model adapters currently supported out of the box:
| Adapter | Model-ID prefixes | Requires |
|---|---|---|
| `baseline` | `baseline`, `baseline:*` | Nothing — deterministic heuristic, used for smoke-testing |
| `openai` | `gpt-`, `o1-`, `o3-`, `o4-`, `o5-` | `OPENAI_API_KEY` |
| `anthropic` | `claude-` | `ANTHROPIC_API_KEY` |
| `echo` | `echo` | Nothing — returns the prompt (for debugging) |
Adapters are pluggable: new ones may be registered via bench.models.register.
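The spec names bench.models.register but does not define its signature, so the snippet below is an assumed, self-contained reimplementation of prefix-based adapter dispatch, for illustration only:

```python
# Hypothetical registry: maps a model-ID prefix to an adapter class,
# mirroring the prefix column in the adapter table above.
_ADAPTERS: dict[str, type] = {}

def register(prefix: str):
    """Decorator associating an adapter class with a model-ID prefix."""
    def deco(cls):
        _ADAPTERS[prefix] = cls
        return cls
    return deco

def resolve(model_id: str):
    """Pick the adapter whose prefix matches the model ID (longest prefix wins)."""
    matches = [p for p in _ADAPTERS if model_id.startswith(p)]
    if not matches:
        raise KeyError(f"no adapter for {model_id!r}")
    return _ADAPTERS[max(matches, key=len)]

@register("echo")
class EchoAdapter:
    def complete(self, prompt: str) -> str:
        return prompt  # returns the prompt verbatim, for debugging
```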
Every run produces a result document matching
schema/benchmark-result.schema.json,
which records the model ID, per-task scores, per-category aggregates, prompts,
raw responses, and the benchmark version used.
8. Conformance
A benchmark submission is OGS-Bench v0.1 conformant if it:
- Uses the official task set at the declared `benchmark_version`.
- Reports both per-category and aggregate scores using the weights in §3.
- Publishes its result document, which MUST include raw model responses and the exact prompts (the reference runner does this by default).
- Discloses the adapter and any non-default prompting options (temperature, system message, tool use, etc.).
Submissions MAY additionally report results on third-party task packs, but those do not count toward the OGS-Bench v0.1 aggregate.
9. Versioning
OGS-Bench follows the same Semantic Versioning rules as OGS core:
- MAJOR — breaking changes to task/result schemas, scoring definitions, or category weights.
- MINOR — adding tasks, new categories, new scorers (backwards-compatible).
- PATCH — clarifications, bug fixes, added rationale text.
Every task carries an explicit benchmark_version. Runners MUST refuse to score
tasks whose MAJOR version they do not support.
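The MAJOR-version gate can be sketched as follows, assuming semver-style "MAJOR.MINOR.PATCH" strings; `SUPPORTED_MAJOR` is a hypothetical runner-side constant:

```python
# Hypothetical constant: the one MAJOR version this runner supports.
SUPPORTED_MAJOR = 0

def check_task_version(benchmark_version: str) -> None:
    """Refuse to score a task whose MAJOR version we do not support."""
    major = int(benchmark_version.split(".")[0])
    if major != SUPPORTED_MAJOR:
        raise ValueError(
            f"unsupported benchmark MAJOR version {major}; "
            f"this runner supports {SUPPORTED_MAJOR}.x.y"
        )
```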
10. References
- ogs-core.md — Core specification
- ogs-sense.md — Sensory model
- ogs-comp.md — Composition model
- ogs-match.md — Matching model
- schema/benchmark-task.schema.json — Task document schema
- schema/benchmark-result.schema.json — Result document schema
- bench/ — Reference runner and tasks