OGS-Bench Leaderboard
The reference leaderboard for the Open Gastronomy Standard benchmark (OGS-Bench). Models are scored on gastronomy reasoning with the OGS vocabulary and schemas injected into the prompt, so the task measures reasoning given the spec rather than memorisation of it. Filter by category, difficulty, or language, try alternate weightings (a sketch of recomputing a weighted aggregate follows the table), and drill into any task to see every model's response.
The interactive view requires JavaScript. A static summary is shown below:
| # | Model | Aggregate (±CI) | Latency (median) | Total cost (USD) | Tasks | Run date |
|---|---|---|---|---|---|---|
| 1 | claude-opus-4-7 (anthropic) | 85.9 ±3.6 | 1.35 s | $15.732 | 213 | 2026-04-18 |
| 2 | gpt-5.4 (openai) | 85.3 ±3.6 | 787 ms | $1.012 | 213 | 2026-04-18 |
| 3 | gemini-2.5-pro (gemini) | 85.0 ±3.0 | 8.77 s | $1.027 | 213 | 2026-04-19 |
| 4 | gemini-2.5-flash (gemini) | 80.8 ±3.8 | 2.96 s | $0.255 | 213 | 2026-04-19 |
| 5 | gpt-4.1 (openai) | 80.5 ±4.0 | 691 ms | $1.272 | 213 | 2026-04-18 |
| 6 | claude-sonnet-4-6 (anthropic) | 80.2 ±4.0 | 1.28 s | $2.735 | 213 | 2026-04-18 |
| 7 | gpt-5.4-mini (openai) | 79.6 ±3.9 | 661 ms | $0.180 | 213 | 2026-04-18 |
| 8 | claude-haiku-4-5 (anthropic) | 78.7 ±4.1 | 941 ms | $0.657 | 213 | 2026-04-18 |
| 9 | gpt-4.1-mini (openai) | 76.4 ±4.1 | 634 ms | $0.241 | 213 | 2026-04-18 |
| 10 | gemini-2.5-flash-lite (gemini) | 75.1 ±4.4 | 525 ms | $0.065 | 213 | 2026-04-19 |
| 11 | gpt-5.4-nano (openai) | 71.3 ±4.4 | 653 ms | $0.044 | 213 | 2026-04-18 |
| 12 | gpt-4.1-nano (openai) | 61.9 ±5.4 | 466 ms | $0.057 | 213 | 2026-04-18 |
| 13 | baseline | 43.7 ±5.2 | — | — | 213 | 2026-04-18 |
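
As a rough illustration of what "alternate weightings" and the ±CI column refer to, here is a minimal Python sketch that recomputes a weighted aggregate over per-task scores and a percentile-bootstrap confidence interval. The record layout (`category`, `score`), the per-category averaging, and the bootstrap procedure are assumptions made for illustration, not the OGS-Bench scoring code.

```python
import random
from collections import defaultdict

def weighted_aggregate(tasks, weights=None):
    """Mean score per category, then a weighted mean across categories.

    tasks:   list of dicts like {"category": str, "score": float in [0, 100]}
             (hypothetical record layout, not the official export format)
    weights: optional {category: weight}; unspecified categories default to 1.0.
    """
    by_cat = defaultdict(list)
    for task in tasks:
        by_cat[task["category"]].append(task["score"])
    weights = weights or {}
    total, weight_sum = 0.0, 0.0
    for cat, scores in by_cat.items():
        w = weights.get(cat, 1.0)
        total += w * (sum(scores) / len(scores))
        weight_sum += w
    return total / weight_sum

def bootstrap_ci(tasks, weights=None, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample tasks with replacement, re-aggregate."""
    rng = random.Random(seed)
    stats = sorted(
        weighted_aggregate([rng.choice(tasks) for _ in tasks], weights)
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Toy example: upweight the "pairing" category 2x relative to everything else.
tasks = [
    {"category": "pairing", "score": 92.0},
    {"category": "pairing", "score": 88.0},
    {"category": "technique", "score": 74.5},
    {"category": "safety", "score": 81.0},
]
print(weighted_aggregate(tasks))                    # equal category weights
print(weighted_aggregate(tasks, {"pairing": 2.0}))  # alternate weighting
print(bootstrap_ci(tasks, n_resamples=500))         # rough CI on toy data
```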