OGS-Bench

OGS-Bench is the official, schema-grounded evaluation for models that claim gastronomy capability. Tasks inject OGS vocabulary and schemas into the prompt (open-book mode), so scores reflect reasoning with the standard rather than memorisation of a private rubric. Aggregate scores are reported on a 0–100 scale and are computed with deterministic metrics (exact match, set F1, numeric error, NDCG, JSON validity, and more).
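As an illustration of what "deterministic" means here, the sketch below shows one common way to compute exact match and set F1. It is not taken from the reference runner: the function names `exact_match` and `set_f1`, the whitespace and case normalisation, and the handling of two empty sets are all assumptions made for the example.

```python
def exact_match(predicted: str, expected: str) -> float:
    """Return 1.0 if the answers agree after trimming and lowercasing, else 0.0.

    The normalisation is an assumption; the official format rules may differ.
    """
    return 1.0 if predicted.strip().lower() == expected.strip().lower() else 0.0


def set_f1(predicted, expected) -> float:
    """F1 between two collections treated as sets (order and duplicates ignored)."""
    pred, gold = set(predicted), set(expected)
    if not pred and not gold:
        return 1.0  # assumption: two empty sets count as a perfect match
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical task: the model names two labels, one of which is expected.
score = set_f1(["acid", "tannin"], ["tannin", "sweetness"])
```

Because both functions depend only on their inputs, the same prediction always produces the same score, which is what makes a published run checkable.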

Leaderboard Specification

Categories (v0.3)

Category	Measures
vocabulary	Knowledge of OGS controlled enums
sensory	Tastes, structure, aromatics on 0–10 scales
composition	Roles, cooking methods, Maillard, reduction
pairing	Selecting beverages for dishes
explanation	OGS explanation codes
risk	Negative pairing interactions
structured_output	Valid OGS JSON from a brief
regional	Cuisines, classics, wine regions, varieties
dietary	Dietary classifications and substitutions
safety	Food-safety reasoning
nutrition	Nutritional estimation
menu_engineering	Menu balance, pricing, composition
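
The categories above feed the aggregate 0–100 score. The following is a minimal sketch of such a roll-up, assuming per-category scores in the range 0 to 1 and equal weights by default. The real weights are defined in the spec and are not reproduced here; the function name `aggregate_score` and the sample values are hypothetical.

```python
def aggregate_score(category_scores, weights=None):
    """Weighted mean of per-category scores (each in [0, 1]), scaled to 0-100.

    Equal weighting is an assumption for illustration only.
    """
    if not category_scores:
        raise ValueError("no category scores supplied")
    if weights is None:
        weights = {name: 1.0 for name in category_scores}
    total_weight = sum(weights[name] for name in category_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(weights[name] * value for name, value in category_scores.items())
    return 100.0 * weighted / total_weight


# Hypothetical per-category results for two of the v0.3 categories.
example = aggregate_score({"vocabulary": 1.0, "sensory": 0.5})
```

Passing an explicit `weights` mapping changes only the relative influence of each category; the result stays on the same 0–100 scale.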

Reproducibility

Each published run records the task-set SHA, the configuration, judge metadata, and per-task scores in a single JSON document. The reference runner lives in the bench/ package; see the spec for weights, CI estimation, and format rules.
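
A run record of this kind could be assembled roughly as follows. This is a sketch under stated assumptions, not the reference runner's format: the field names (`task_set_sha`, `config`, `judge`, `per_task_scores`), the choice of SHA-256, and all sample values are hypothetical, and the authoritative layout is whatever the spec's format rules define.

```python
import hashlib
import json


def build_run_record(task_set_bytes, config, judge, per_task_scores):
    """Bundle everything needed to reproduce a run into one dictionary."""
    return {
        # Assumption: SHA-256 over the raw task-set bytes.
        "task_set_sha": hashlib.sha256(task_set_bytes).hexdigest(),
        "config": config,
        "judge": judge,
        "per_task_scores": per_task_scores,
    }


# Hypothetical inputs, for illustration only.
record = build_run_record(
    task_set_bytes=b'{"tasks": []}',
    config={"mode": "open-book", "temperature": 0.0},
    judge={"type": "deterministic"},
    per_task_scores={"vocabulary-001": 1.0, "pairing-007": 0.5},
)

# sort_keys makes the serialised document stable across runs.
document = json.dumps(record, sort_keys=True, indent=2)
```

Hashing the task set ties each score to the exact tasks it was computed on, so two runs can be compared only when their hashes match.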