v0.3.0 · open-book · interactive

OGS-Bench Leaderboard

The reference leaderboard for Open Gastronomy Standard (OGS-Bench). Models are scored on gastronomy reasoning with the OGS vocabulary and schemas injected into the prompt, so the task measures reasoning-given-the-spec rather than memorisation of it. Filter by category, difficulty, or language, try alternate weightings, and drill into any task to see every model's response.
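Since the leaderboard lets you try alternate weightings, here is a minimal sketch of how a category-weighted aggregate and a bootstrap ±CI over tasks could be computed. The category names, scores, and the choice of a percentile bootstrap are illustrative assumptions, not the actual OGS-Bench scoring code.

```python
import random

def weighted_aggregate(per_task, weights):
    """per_task: list of (category, score in [0, 100]); weights: {category: weight}.

    Hypothetical scoring: each task's score is weighted by its category's
    weight, then normalised by the total weight over the sampled tasks.
    """
    total_w = sum(weights[c] for c, _ in per_task)
    return sum(weights[c] * s for c, s in per_task) / total_w

def bootstrap_ci_halfwidth(per_task, weights, n_resamples=2000, alpha=0.05, seed=0):
    """Half-width of a percentile bootstrap CI over tasks (one common way to
    produce a ±CI like those in the table; the real method may differ)."""
    rng = random.Random(seed)
    stats = sorted(
        weighted_aggregate(rng.choices(per_task, k=len(per_task)), weights)
        for _ in range(n_resamples)
    )
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return (hi - lo) / 2

# Illustrative data: three tasks, with "safety" weighted double.
tasks = [("pairing", 80.0), ("technique", 90.0), ("safety", 70.0)]
weights = {"pairing": 1, "technique": 1, "safety": 2}
print(weighted_aggregate(tasks, weights))  # → 77.5
```

Reweighting only changes how per-task scores are averaged; the underlying responses and per-task scores stay fixed, which is why the interactive view can recompute rankings instantly.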

Models: 13
Tasks in latest set: 213
Task-set SHAs seen: 1
Generated: 2026-04-20 06:14 UTC

The interactive view requires JavaScript. Showing a static summary:

| # | Model | Provider | Aggregate (±CI) | Median latency | Total cost (USD) | Tasks | Run date |
|---|-------|----------|-----------------|----------------|------------------|-------|----------|
| 1 | claude-opus-4-7 | anthropic | 85.9 ±3.6 | 1.35 s | $15.732 | 213 | 2026-04-18 |
| 2 | gpt-5.4 | openai | 85.3 ±3.6 | 787 ms | $1.012 | 213 | 2026-04-18 |
| 3 | gemini-2.5-pro | gemini | 85.0 ±3.0 | 8.77 s | $1.027 | 213 | 2026-04-19 |
| 4 | gemini-2.5-flash | gemini | 80.8 ±3.8 | 2.96 s | $0.255 | 213 | 2026-04-19 |
| 5 | gpt-4.1 | openai | 80.5 ±4.0 | 691 ms | $1.272 | 213 | 2026-04-18 |
| 6 | claude-sonnet-4-6 | anthropic | 80.2 ±4.0 | 1.28 s | $2.735 | 213 | 2026-04-18 |
| 7 | gpt-5.4-mini | openai | 79.6 ±3.9 | 661 ms | $0.180 | 213 | 2026-04-18 |
| 8 | claude-haiku-4-5 | anthropic | 78.7 ±4.1 | 941 ms | $0.657 | 213 | 2026-04-18 |
| 9 | gpt-4.1-mini | openai | 76.4 ±4.1 | 634 ms | $0.241 | 213 | 2026-04-18 |
| 10 | gemini-2.5-flash-lite | gemini | 75.1 ±4.4 | 525 ms | $0.065 | 213 | 2026-04-19 |
| 11 | gpt-5.4-nano | openai | 71.3 ±4.4 | 653 ms | $0.044 | 213 | 2026-04-18 |
| 12 | gpt-4.1-nano | openai | 61.9 ±5.4 | 466 ms | $0.057 | 213 | 2026-04-18 |
| 13 | baseline | baseline | 43.7 ±5.2 | — | — | 213 | 2026-04-18 |