OGS-Bench

OGS-Bench is the official, schema-grounded evaluation for models that claim gastronomy capability. Tasks inject OGS vocabulary and schemas into the prompt (open-book mode), so scores reflect reasoning with the standard rather than memorisation of a private rubric. Aggregate scores are reported on a 0–100 scale and are computed from deterministic metrics, including exact match, set F1, numeric error, NDCG, and JSON validity.
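To make the deterministic metrics named above concrete, here is a minimal sketch of three of them: exact match, set F1, and NDCG. It is an illustration only, not the reference runner's code. The function names, the case-insensitive normalisation in exact_match, and the log2 discount in ndcg are assumptions; the spec defines the authoritative rules.

```python
import math


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the strings match after trimming and lowercasing (assumed normalisation)."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0


def set_f1(pred, gold) -> float:
    """F1 over two unordered collections of labels, ignoring duplicates."""
    pred_set, gold_set = set(pred), set(gold)
    if not pred_set and not gold_set:
        return 1.0  # both empty: treated as perfect agreement (assumption)
    overlap = len(pred_set & gold_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def ndcg(ranked, relevance, k=None) -> float:
    """Normalised discounted cumulative gain for a ranked list of item ids.

    `relevance` maps item id -> graded gain; items missing from it score 0.
    """
    k = len(ranked) if k is None else k

    def dcg(gains):
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    actual = dcg([relevance.get(item, 0.0) for item in ranked[:k]])
    ideal = dcg(sorted(relevance.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0
```

Each function returns a value in [0, 1]; multiplying by 100 maps it onto the leaderboard's 0–100 scale. How per-task values are weighted into the aggregate is left to the spec.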

Leaderboard Specification

Categories (v0.3)

Category            Measures
vocabulary          Knowledge of OGS controlled enums
sensory             Tastes, structure, aromatics on 0–10 scales
composition         Roles, cooking methods, Maillard, reduction
pairing             Selecting beverages for dishes
explanation         OGS explanation codes
risk                Negative pairing interactions
structured_output   Valid OGS JSON from a brief
regional            Cuisines, classics, wine regions, varieties
dietary             Dietary classifications and substitutions
safety              Food-safety reasoning
nutrition           Nutritional estimation
menu_engineering    Menu balance, pricing, composition

Reproducibility

Each published run records the task-set SHA, configuration, judge metadata, and per-task scores in a single JSON document. The reference runner lives in the bench/ package; see the spec for weights, CI estimation, and format rules.
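As a rough illustration of what such a run document could contain, the sketch below assembles the four recorded items into one JSON string. Every field name here (task_set_sha, config, judge, per_task_scores), the choice of SHA-256, and the sample values are hypothetical; the actual layout is defined by the spec's format rules, not by this example.

```python
import hashlib
import json


def build_run_record(task_set_bytes: bytes, config: dict, judge: dict,
                     per_task_scores: dict) -> dict:
    """Bundle the items a published run records into one dictionary.

    Field names and the SHA-256 digest are assumptions for illustration.
    """
    return {
        "task_set_sha": hashlib.sha256(task_set_bytes).hexdigest(),
        "config": config,
        "judge": judge,
        "per_task_scores": per_task_scores,
    }


# Hypothetical inputs: a tiny in-memory task set and made-up scores.
record = build_run_record(
    task_set_bytes=b'{"tasks": ["pairing-001", "sensory-001"]}',
    config={"mode": "open-book", "temperature": 0.0},
    judge={"type": "deterministic"},
    per_task_scores={"pairing-001": 0.75, "sensory-001": 1.0},
)

# sort_keys gives a stable serialisation, so identical runs produce identical text.
document = json.dumps(record, indent=2, sort_keys=True)
```

Hashing the exact bytes of the task set lets a reader check that two runs were scored against the same tasks before comparing their numbers.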