v0.3.0 · open-book · interactive

OGS-Bench Leaderboard

The reference leaderboard for Open Gastronomy Standard (OGS-Bench). Models are scored on gastronomy reasoning with the OGS vocabulary and schemas injected into the prompt, so the task measures reasoning-given-the-spec rather than memorisation of it. Filter by category, difficulty, or language, try alternate weightings, and drill into any task to see every model's response.
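Since the leaderboard lets you try alternate weightings, here is a minimal sketch of how a category-weighted aggregate and a bootstrap ±CI over tasks could be computed. The category names, scores, and the choice of a percentile bootstrap are illustrative assumptions, not the actual OGS-Bench scoring code.

```python
import random

def weighted_aggregate(per_task, weights):
    """per_task: list of (category, score in [0, 100]); weights: {category: weight}.

    Hypothetical scoring: each task's score is weighted by its category's
    weight, then normalised by the total weight over the sampled tasks.
    """
    total_w = sum(weights[c] for c, _ in per_task)
    return sum(weights[c] * s for c, s in per_task) / total_w

def bootstrap_ci_halfwidth(per_task, weights, n_resamples=2000, alpha=0.05, seed=0):
    """Half-width of a percentile bootstrap CI over tasks (one common way to
    produce a ±CI like those in the table; the real method may differ)."""
    rng = random.Random(seed)
    stats = sorted(
        weighted_aggregate(rng.choices(per_task, k=len(per_task)), weights)
        for _ in range(n_resamples)
    )
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return (hi - lo) / 2

# Illustrative data: three tasks, with "safety" weighted double.
tasks = [("pairing", 80.0), ("technique", 90.0), ("safety", 70.0)]
weights = {"pairing": 1, "technique": 1, "safety": 2}
print(weighted_aggregate(tasks, weights))  # → 77.5
```

Reweighting only changes how per-task scores are averaged; the underlying responses and per-task scores stay fixed, which is why the interactive view can recompute rankings instantly.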

Models: 13
Tasks in latest set: 213
Task-set SHAs seen: 1
Generated: 2026-04-20 06:14 UTC

The interactive view requires JavaScript. Showing a static summary:

| # | Model | Provider | Aggregate (±CI) | Median latency | Total cost (USD) | Tasks | Run date |
|---|-------|----------|-----------------|----------------|------------------|-------|----------|
| 1 | claude-opus-4-7 | anthropic | 85.9 ±3.6 | 1.35 s | $15.732 | 213 | 2026-04-18 |
| 2 | gpt-5.4 | openai | 85.3 ±3.6 | 787 ms | $1.012 | 213 | 2026-04-18 |
| 3 | gemini-2.5-pro | gemini | 85.0 ±3.0 | 8.77 s | $1.027 | 213 | 2026-04-19 |
| 4 | gemini-2.5-flash | gemini | 80.8 ±3.8 | 2.96 s | $0.255 | 213 | 2026-04-19 |
| 5 | gpt-4.1 | openai | 80.5 ±4.0 | 691 ms | $1.272 | 213 | 2026-04-18 |
| 6 | claude-sonnet-4-6 | anthropic | 80.2 ±4.0 | 1.28 s | $2.735 | 213 | 2026-04-18 |
| 7 | gpt-5.4-mini | openai | 79.6 ±3.9 | 661 ms | $0.180 | 213 | 2026-04-18 |
| 8 | claude-haiku-4-5 | anthropic | 78.7 ±4.1 | 941 ms | $0.657 | 213 | 2026-04-18 |
| 9 | gpt-4.1-mini | openai | 76.4 ±4.1 | 634 ms | $0.241 | 213 | 2026-04-18 |
| 10 | gemini-2.5-flash-lite | gemini | 75.1 ±4.4 | 525 ms | $0.065 | 213 | 2026-04-19 |
| 11 | gpt-5.4-nano | openai | 71.3 ±4.4 | 653 ms | $0.044 | 213 | 2026-04-18 |
| 12 | gpt-4.1-nano | openai | 61.9 ±5.4 | 466 ms | $0.057 | 213 | 2026-04-18 |
| 13 | baseline | baseline | 43.7 ±5.2 | — | — | 213 | 2026-04-18 |