OGS-Bench
OGS-Bench is the official, schema-grounded evaluation for models that claim gastronomy capability. Tasks run in open-book mode: the OGS vocabulary and schemas are injected into the prompt, so scores reflect reasoning with the standard rather than memorisation of a private rubric. Each task is scored with a deterministic metric (exact match, set F1, numeric error, NDCG, JSON validity, and more), and the results roll up into aggregate scores on a 0–100 scale.
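
To make the deterministic metrics named above concrete, here is a minimal Python sketch of three of them using their standard textbook definitions. The function names, the whitespace and case normalisation in `exact_match`, and the log2 discount in `ndcg` are assumptions made for illustration; the spec remains the authority on how OGS-Bench actually computes them.

```python
import math
from typing import Iterable, Optional, Sequence


def exact_match(predicted: str, expected: str) -> float:
    """Return 1.0 if the answers match after trimming and lowercasing, else 0.0.

    The normalisation used here is an assumption, not taken from the spec.
    """
    return 1.0 if predicted.strip().lower() == expected.strip().lower() else 0.0


def set_f1(predicted: Iterable[str], expected: Iterable[str]) -> float:
    """Harmonic mean of precision and recall over two unordered label sets."""
    pred, gold = set(predicted), set(expected)
    if not pred and not gold:
        return 1.0  # both empty: treated as perfect agreement (assumption)
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def ndcg(relevances: Sequence[float], k: Optional[int] = None) -> float:
    """NDCG for graded relevances listed in the order the model ranked them."""
    k = len(relevances) if k is None else k

    def dcg(rels: Sequence[float]) -> float:
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

All three functions are pure, so the same inputs always give the same score, which is the property that makes a metric deterministic.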
Categories (v0.3)
| Category | Measures |
|---|---|
| vocabulary | Knowledge of OGS controlled enums |
| sensory | Tastes, structure, and aromatics on 0–10 scales |
| composition | Roles, cooking methods, Maillard, reduction |
| pairing | Selecting beverages for dishes |
| explanation | OGS explanation codes |
| risk | Negative pairing interactions |
| structured_output | Valid OGS JSON from a brief |
| regional | Cuisines, classics, wine regions, varieties |
| dietary | Dietary classifications and substitutions |
| safety | Food-safety reasoning |
| nutrition | Nutritional estimation |
| menu_engineering | Menu balance, pricing, composition |
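
As a rough illustration of how per-category results could roll up into a single 0–100 aggregate, the sketch below takes a weighted mean of category scores. It assumes each category score is already normalised to the range 0 to 1 and falls back to equal weights. The function name `aggregate_score` and both assumptions are hypothetical; the actual weights are defined in the spec.

```python
from typing import Dict, Optional


def aggregate_score(
    category_scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted mean of category scores in [0, 1], rescaled to 0-100.

    Equal weighting is a placeholder assumption, not the official scheme.
    """
    if not category_scores:
        raise ValueError("at least one category score is required")
    if weights is None:
        weights = {name: 1.0 for name in category_scores}

    missing = set(category_scores) - set(weights)
    if missing:
        raise ValueError(f"no weight given for: {sorted(missing)}")

    total_weight = sum(weights[name] for name in category_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    weighted_sum = sum(
        category_scores[name] * weights[name] for name in category_scores
    )
    return 100.0 * weighted_sum / total_weight
```

For example, `aggregate_score({"vocabulary": 1.0, "pairing": 0.5})` gives 75.0 under equal weights, and passing `weights={"vocabulary": 1.0, "pairing": 3.0}` moves the result towards the pairing score.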
Reproducibility
Each published run records task-set SHA, configuration, judge metadata, and per-task scores in a single JSON document. The reference runner lives in the bench/ package; see the spec for weights, CI estimation, and format rules.
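
The sketch below shows one way a run record with those four parts might be assembled into a single JSON document. It is not the reference runner: the field names (`task_set_sha`, `config`, `judge`, `per_task_scores`), the choice of SHA-256, and the canonical-JSON hashing scheme are all assumptions made for illustration. The format rules in the spec take precedence.

```python
import hashlib
import json
from typing import Any, Dict, List


def task_set_sha(tasks: List[Dict[str, Any]]) -> str:
    """Hash a task set via a canonical JSON encoding.

    SHA-256 and sorted-key serialisation are assumptions; the source only
    says that a task-set SHA is recorded.
    """
    canonical = json.dumps(
        tasks, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_run_record(
    tasks: List[Dict[str, Any]],
    config: Dict[str, Any],
    judge: Dict[str, Any],
    per_task_scores: Dict[str, float],
) -> str:
    """Serialise one run as a single JSON document (hypothetical layout)."""
    record = {
        "task_set_sha": task_set_sha(tasks),
        "config": config,
        "judge": judge,
        "per_task_scores": per_task_scores,
    }
    return json.dumps(record, sort_keys=True, indent=2)
```

Hashing a canonical encoding means two runs over the same tasks produce the same SHA even if key order differs, so a mismatch indicates that the task set itself changed.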