OGS Internationalization Specification
Open Gastronomy Standard — i18n v0.2
1. Purpose
This specification defines how OGS documents represent human-readable text in multiple languages, scripts, and regional variants. It is the binding contract for every human-readable field across all OGS entity types: ingredient names, dish descriptions, pairing motivations, explanation-code labels, and the controlled vocabularies themselves.
The v0.2 i18n layer is additive and backwards-compatible: documents
written against OGS v0.1 (which used bare strings for all human-readable
fields) remain valid — plain strings are treated as Form 1 of the new
LString type, implicitly in the document's declared primary_language.
2. Design principles
| Principle | Rationale |
|---|---|
| One reusable shape for every localizable string. | Avoid a different i18n pattern per module. A single LString type, reused everywhere. |
| BCP 47 language tags, normative. | en, en-GB, zh-Hans, ja-Latn. Two-letter ISO 639-1 alone is not sufficient; scripts and regions matter. |
| Endonym-first. | Every entity declares a primary_language. Other renderings are derivatives. This preserves cultural attribution and drives fallback. |
| Progressive disclosure. | Three forms of LString (bare string, language map, rich array). Producers adopt the form that matches their conformance level. |
| Controlled vocabularies are thesauri. | OGS enums get labels blocks keyed by BCP 47 tag, with preferred/alternate labels and definition text. Translations ship as versioned data files. |
| Unicode NFC + UTF-8, normative. | Prevents é vs e + combining acute collisions, canonicalizes sort/search. |
| Separate i18n from markets, units, and currency. | Swedish-language ≠ sold-in-Sweden. These axes live in sibling modules (ogs-market.md, ogs-units.md). |
| Translations are data. | Localized labels for core vocabularies ship as PR-reviewable JSON files, not as spec prose. |
| Consumers specify preference; standard specifies matching. | Consumers provide a prioritized list of language ranges; resolution uses RFC 4647 Lookup. |
3. The six kinds of strings
Not every string needs the same treatment. OGS distinguishes:
| Kind | Example | Treatment |
|---|---|---|
| Opaque identifier | ogs:core:ingredient:black-truffle, seared, STRUCT_TANNIN_FAT | Never translated. ASCII, kebab-case, language-neutral. |
| Canonical entity name (endonym) | Tartufo nero, 親子丼, Château Margaux | Stored in the entity's primary_language; the authoritative name. |
| Exonym / translation | Black truffle (en) for Tartufo nero (it) | An LString label with role: "preferred" or "descriptive" in another language. |
| Controlled-vocab label | seared → "seared" (en), "saisi" (fr) | Lives in vocabulary files, keyed by BCP 47 tag. |
| Free-form narrative | Pairing motivation, entity description | LString; may be human-written or machine-translated (with provenance). |
| Sensory / aromatic term | lemon_zest, wet_stone | Opaque ID + multilingual thesaurus label (same as controlled-vocab label). |
4. Language tags
4.1 Format
All language tags MUST conform to IETF BCP 47 (RFC 5646). Common tags:
| Tag | Meaning |
|---|---|
| en | English (unspecified region) |
| en-US, en-GB, en-IN | English regional variants |
| fr, fr-CA | French / Canadian French |
| zh-Hans, zh-Hant | Simplified / Traditional Chinese |
| sr-Cyrl, sr-Latn | Serbian in Cyrillic / Latin script |
| ja, ja-Latn | Japanese / romanized Japanese |
| ar, ar-Latn | Arabic / romanized Arabic |
| zxx | No linguistic content (e.g. a producer code) |
Consumers MUST treat tags as case-insensitive but producers SHOULD follow the canonical form (language lowercase, script title-case, region uppercase).
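The casing rule above can be sketched in a few lines of Python. This is a minimal illustration, not a full BCP 47 parser: the function names `canonicalize_tag` and `tags_equal` are hypothetical, and the sketch assumes the input is already a syntactically valid tag.

```python
def canonicalize_tag(tag):
    """Apply the conventional casing: language lowercase, script Title-case, region UPPERCASE."""
    subtags = tag.split("-")
    out = []
    for i, sub in enumerate(subtags):
        if i == 0:
            out.append(sub.lower())
        elif len(sub) == 1:
            # A singleton opens an extension or private-use sequence;
            # everything from here on is conventionally lowercase.
            out.extend(s.lower() for s in subtags[i:])
            break
        elif len(sub) == 4 and sub.isalpha():
            out.append(sub.title())   # script subtag, e.g. Hant
        elif len(sub) == 2 and sub.isalpha():
            out.append(sub.upper())   # region subtag, e.g. GB
        else:
            out.append(sub.lower())   # variants, numeric regions, etc.
    return "-".join(out)


def tags_equal(a, b):
    """Consumer-side comparison: tags are case-insensitive."""
    return a.lower() == b.lower()
```

A producer would call `canonicalize_tag` before writing a tag; a consumer would compare with `tags_equal` and never rely on the casing it receives.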
4.2 Transliteration
When a label is a transliteration of a non-Latin original, producers SHOULD:
- Include the script subtag (ja-Latn, ru-Latn, zh-Latn).
- Include a transliteration_scheme sibling field with the convention name (hepburn, kunrei, pinyin, wade_giles, iast, iso_9, bgn_pcgn).
- Set role: "transliteration".
Producers MAY additionally use the RFC 6497 -t- extension
(e.g. ja-Latn-t-ja-hepburn) for machine-processable transliteration chains.
4.3 Scripts
For languages that routinely use multiple scripts (Japanese, Chinese,
Serbian, Mongolian, Azerbaijani), the script field on a label entry
carries the ISO 15924 code: Jpan, Hira, Kana, Latn, Cyrl, Hans,
Hant, Arab, Hebr.
When the script subtag is already present in the BCP 47 tag
(zh-Hant, sr-Latn), the script field is redundant and MAY be omitted.
5. LString — the localizable string type
Every human-readable field in OGS accepts LString. Three forms, progressively
more expressive. Validators MUST accept all three.
5.1 Form 1 — bare string (L1 shortcut)
```json
{
  "name": "Black Truffle"
}
```
The enclosing entity declares primary_language at the document level; any
bare-string LString is implicitly in that language.
5.2 Form 2 — language map (common case, L2)
```json
{
  "name": {
    "it": "Tartufo nero",
    "en": "Black truffle",
    "fr": "Truffe noire"
  }
}
```
Keys are BCP 47 tags. Values are NFC-normalized UTF-8 strings. The entity's
primary_language identifies which entry is the endonym.
5.3 Form 3 — rich label array (L3, full fidelity)
```json
{
  "name": {
    "primary_language": "ja",
    "labels": [
      {
        "value": "親子丼",
        "language": "ja",
        "script": "Jpan",
        "role": "preferred"
      },
      {
        "value": "おやこどん",
        "language": "ja",
        "script": "Hira",
        "role": "alternate"
      },
      {
        "value": "oyakodon",
        "language": "ja-Latn",
        "role": "transliteration",
        "transliteration_scheme": "hepburn"
      },
      {
        "value": "Chicken and Egg Rice Bowl",
        "language": "en",
        "role": "descriptive",
        "translation": {
          "method": "human_expert",
          "source_language": "ja",
          "source_version": "0.2.0",
          "reviewed_at": "2026-04-18T00:00:00Z",
          "confidence": 0.95
        }
      }
    ]
  }
}
```
5.3.1 Label fields
| Field | Type | Required | Description |
|---|---|---|---|
| value | string | YES | NFC-normalized UTF-8 text. |
| language | string | YES | BCP 47 tag. |
| script | string | NO | ISO 15924 script code. |
| role | string | NO | preferred, alternate, transliteration, descriptive, loan, historical, deprecated, ipa. Defaults to preferred. |
| transliteration_scheme | string | NO | Name of the convention (e.g. hepburn). Required when role: "transliteration". |
| translation | object | NO | Per-label provenance (see §7). |
5.3.2 Label roles
| Role | Meaning |
|---|---|
| preferred | Default label for its language tag. At most one per language tag. |
| alternate | Synonym, regional variant, or historical name. Multiple allowed. |
| transliteration | Script conversion of the endonym (e.g. romaji of Japanese). |
| descriptive | Translated gloss rather than a direct name (e.g. "Chicken and Egg Rice Bowl" for 親子丼). |
| loan | Imported form used in another language ("coq au vin" in English). |
| historical | Former official name retained for reference. |
| deprecated | No longer preferred; kept for round-trip compatibility. |
| ipa | International Phonetic Alphabet pronunciation (language is typically und-fonipa). |
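Because validators must accept all three forms, a consumer will usually want to expand them into one internal shape first. The following Python sketch shows one way to do that; the helper names `to_labels` and `duplicate_preferred` are hypothetical, and the choice to detect Form 3 by the presence of a labels array is an assumption of this example rather than something the specification prescribes.

```python
import unicodedata


def to_labels(lstring, primary_language="und"):
    """Expand any LString form into a list of Form 3 style label dicts."""
    def nfc(text):
        return unicodedata.normalize("NFC", text)

    if isinstance(lstring, str):
        # Form 1: bare string, implicitly in the document's primary language.
        return [{"value": nfc(lstring), "language": primary_language, "role": "preferred"}]
    if isinstance(lstring, dict) and isinstance(lstring.get("labels"), list):
        # Form 3: copy each label, applying the default role from 5.3.1.
        return [
            {"role": "preferred", **label, "value": nfc(label["value"])}
            for label in lstring["labels"]
        ]
    if isinstance(lstring, dict):
        # Form 2: language map, one preferred label per BCP 47 key.
        return [
            {"value": nfc(text), "language": tag, "role": "preferred"}
            for tag, text in lstring.items()
        ]
    raise TypeError("not an LString")


def duplicate_preferred(labels):
    """Return language tags that carry more than one preferred label."""
    seen, duplicates = set(), set()
    for label in labels:
        if label.get("role", "preferred") != "preferred":
            continue
        tag = label["language"].lower()
        if tag in seen:
            duplicates.add(tag)
        seen.add(tag)
    return duplicates
```

A non-empty result from `duplicate_preferred` signals a violation of the "at most one per language tag" rule for the preferred role.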
6. Entity-level language fields
Every OGS document MAY carry:
| Field | Type | Required | Description |
|---|---|---|---|
| primary_language | string (BCP 47) | RECOMMENDED at L2, REQUIRED at L3 | The authoritative / endonym language. Drives fallback. |
| available_languages | array of BCP 47 tags | NO | Advisory list of languages covered across all LString fields. |
When primary_language is absent, consumers SHOULD treat it as en for
historical (OGS v0.1) documents and as und otherwise.
7. Translation provenance
For any label whose role implies it is a translation
(preferred in a non-primary language, descriptive, loan, historical),
producers SHOULD carry a translation object:
```json
{
  "translation": {
    "method": "human_expert",
    "translator": "ogs:core:agent:alice",
    "reviewed_at": "2026-04-18T10:00:00Z",
    "source_language": "it",
    "source_version": "0.2.0",
    "confidence": 0.95
  }
}
```
Key semantics:
- source_language: what this label was translated from. Not always equal to primary_language if the translation was chained through a pivot language.
- source_version: the version of the source text this translation was produced against. Enables staleness detection when the source changes.
- method: one of human_expert, machine, machine_reviewed, community.
L3 conformance REQUIRES translation provenance on every non-primary-language
label with role: "preferred" | "descriptive" | "loan".
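The L3 provenance rule and the staleness check enabled by source_version can be sketched as follows. This is an illustrative Python example: the function names are hypothetical, labels are assumed to be Form 3 label objects, and "non-primary" is decided here by a case-insensitive comparison of the whole tag, which is a simplifying assumption.

```python
# Roles for which L3 requires provenance on non-primary-language labels.
PROVENANCE_ROLES = {"preferred", "descriptive", "loan"}


def missing_provenance(labels, primary_language):
    """Return the labels that need a translation object at L3 but lack one."""
    missing = []
    for label in labels:
        role = label.get("role", "preferred")
        is_primary = label["language"].lower() == primary_language.lower()
        if not is_primary and role in PROVENANCE_ROLES and "translation" not in label:
            missing.append(label)
    return missing


def is_stale(label, current_source_version):
    """True when the label was translated against an older version of the source text."""
    translation = label.get("translation")
    if translation is None:
        return False  # no provenance recorded, so staleness cannot be judged
    return translation.get("source_version") != current_source_version
```

A validator would report every entry returned by `missing_provenance` as an L3 failure, while `is_stale` is a hint for translators rather than a conformance check.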
8. Controlled vocabularies as multilingual thesauri
OGS controlled vocabularies (cooking methods, component roles, match types,
aromatic families, explanation codes, cuisines) ship with a labels block
per concept, keyed by BCP 47 tag:
```json
{
  "id": "seared",
  "labels": {
    "en": { "preferred": "seared", "definition": "Brief high-heat surface cooking." },
    "fr": { "preferred": "saisi", "definition": "Cuisson brève à haute température en surface." },
    "ja": { "preferred": "表面焼き", "definition": "高温で表面のみを短時間焼く調理法。" },
    "sv": { "preferred": "bryna", "definition": "Snabb bryning vid hög värme." }
  }
}
```
Per-concept entries MAY also carry alternates (array of strings),
hidden (array of search-only aliases, e.g. common misspellings), and
notes.
8.1 Shipping policy
- Core vocab files (vocab/*.json) always carry en as a baseline.
- Additional languages ship in vocab/translations/{bcp47}.json bundles, each a partial object merged onto the core file at load time.
- Adding a language is a MINOR version bump.
- Correcting a translation is a PATCH.
8.2 Consumer merge algorithm
merged[concept_id].labels = { ...core.labels, ...translations[lang].labels }
Later entries override earlier ones. Consumers MUST preserve per-concept
id values unchanged.
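The merge rule can be illustrated with a short Python sketch. It assumes, for the sake of the example, that both the core file and a translation bundle have been loaded as dictionaries keyed by concept id; the on-disk layout of the files is not fixed by this section, and `merge_vocab` is a hypothetical name.

```python
import copy


def merge_vocab(core, bundle):
    """Overlay a translation bundle onto the core vocabulary.

    Both arguments map concept id -> concept object with a "labels" block.
    Bundle labels override core labels for the same language tag.
    """
    merged = copy.deepcopy(core)  # never mutate the loaded core file
    for concept_id, partial in bundle.items():
        if concept_id not in merged:
            # Assumption: bundles only translate existing concepts,
            # so unknown ids are ignored rather than created.
            continue
        labels = merged[concept_id].setdefault("labels", {})
        labels.update(copy.deepcopy(partial.get("labels", {})))
    return merged
```

Only the labels block is touched, so each concept's id is carried over unchanged, as the rule above requires.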
9. Fallback and language matching
Consumers SHOULD resolve localized fields using the RFC 4647 Lookup algorithm against a prioritized list of language ranges.
9.1 Resolution order
Given a consumer's preference list ["fr-CA", "fr", "en"], resolution for a
given LString:
- Try each preference range in order using RFC 4647 Lookup (progressive truncation): fr-CA → fr → en. Lookup only truncates the requested range; it does not match a more specific tag such as fr-FR against the range fr.
- If no match is found, fall back to the entity's primary_language.
- If the entity declares no primary_language, fall back to the first label in document order (Form 3) or the first map entry (Form 2).
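The resolution order can be sketched in Python for the Form 2 (language map) case. This is a minimal illustration under stated assumptions: `lookup_candidates` and `resolve` are hypothetical names, wildcard ranges are not handled, and the final fallback is also used when a declared primary_language has no entry in the map.

```python
def lookup_candidates(language_range):
    """Yield the tags tried by RFC 4647 Lookup for one range, most specific first."""
    subtags = language_range.lower().split("-")
    while subtags:
        yield "-".join(subtags)
        subtags.pop()
        # A single-character subtag (extension or private-use marker)
        # is removed together with the subtag that followed it.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()


def resolve(language_map, preferences, primary_language=None):
    """Pick the best value from a Form 2 LString for a prioritized preference list."""
    by_tag = {tag.lower(): value for tag, value in language_map.items()}
    for language_range in preferences:
        for candidate in lookup_candidates(language_range):
            if candidate in by_tag:
                return by_tag[candidate]
    if primary_language and primary_language.lower() in by_tag:
        return by_tag[primary_language.lower()]
    # Last resort: the first map entry (None for an empty map).
    return next(iter(language_map.values()), None)
```

With the preference list from the text, the sketch tries fr-CA, then fr, then en, and only then falls back to the primary language.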
9.2 Silent machine translation
Consumers MUST NOT silently substitute machine-translated content for a missing label. Translations must be carried in the document with explicit provenance (§7).
10. Normalization, encoding, and validation
Normative:
- Encoding: UTF-8.
- Unicode normalization: Form C (NFC) on all value strings. Validators MUST verify and MAY auto-normalize input.
- Case folding for identifier comparisons: ASCII lowercase (identifiers are ASCII by construction).
- Case folding for label search: Unicode default case folding (CLDR).
- Sort order: consumer concern; reference to CLDR collation.
- Bidirectional text: stored as plain Unicode; rendering is the UI's job. Canonical storage MUST NOT contain U+200E/U+200F markers.
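The normalization and bidi rules lend themselves to a small validator. The sketch below uses Python's standard unicodedata module; the function names are hypothetical, and `str.casefold` is used as a stand-in for Unicode default case folding in label search.

```python
import unicodedata

# Directional marks that canonical storage must not contain.
BIDI_MARKS = {"\u200e", "\u200f"}


def check_value(value):
    """Return a list of problems found in a label value (empty when it is clean)."""
    problems = []
    if not unicodedata.is_normalized("NFC", value):
        problems.append("not NFC-normalized")
    if any(ch in BIDI_MARKS for ch in value):
        problems.append("contains U+200E/U+200F")
    return problems


def normalize_value(value):
    """Auto-normalize input to NFC, as validators are permitted to do."""
    return unicodedata.normalize("NFC", value)


def search_key(label):
    """Case-insensitive key for label search."""
    return unicodedata.normalize("NFC", label).casefold()
```

For example, the decomposed sequence e + U+0301 fails `check_value`, while its NFC form é passes, which is exactly the collision the normalization rule is meant to prevent.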
11. Identifiers and non-Latin content
OGS identifiers (ogs:<ns>:<type>:<id>) are ASCII-only. For entities whose
canonical name is not ASCII-representable, the <id> segment uses:
- A Latin transliteration of the endonym (oyakodon, boeuf-bourguignon), or
- An ASCII-safe descriptive form (chicken-egg-rice-bowl).
The display name lives in name (an LString). Producers MUST NOT encode
the native script in the identifier. The ID is opaque; it exists for routing,
not for reading.
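A producer can guard against native-script identifiers with a simple pattern check. The Python sketch below is illustrative only: it assumes each of the three segments after the ogs prefix is lowercase ASCII kebab-case, which is a stricter grammar than this section spells out, and `is_valid_ogs_id` is a hypothetical name.

```python
import re

# One kebab-case segment: lowercase ASCII letters and digits, hyphen-separated.
_SEGMENT = r"[a-z0-9]+(?:-[a-z0-9]+)*"

# Assumed shape: ogs:<ns>:<type>:<id>
OGS_ID = re.compile(rf"ogs:{_SEGMENT}:{_SEGMENT}:{_SEGMENT}")


def is_valid_ogs_id(identifier):
    """True when the identifier is ASCII-only and matches the assumed ogs:<ns>:<type>:<id> shape."""
    return OGS_ID.fullmatch(identifier) is not None
```

Under this check, ogs:core:dish:oyakodon is accepted, while an identifier that embeds 親子丼 or the ligature œ is rejected and must be transliterated first.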
12. Conformance
The existing L1/L2/L3 levels (see ogs-core.md §7) gain i18n requirements:
| Level | i18n requirement |
|---|---|
| L1 (Structural) | UTF-8 encoding; NFC normalization. Bare-string LString values permitted (Form 1). |
| L2 (Vocabulary) | All BCP 47 tags appearing in documents MUST parse. Entity SHOULD declare primary_language. Controlled-vocab labels resolved via the shipped translation bundles. |
| L3 (Semantic) | Every entity MUST declare primary_language. Every human-readable field SHOULD use Form 2 or Form 3 with at least the primary language populated. Non-primary translations SHOULD carry translation provenance. |
13. What OGS deliberately does NOT do
Out of scope for the i18n module:
- ICU MessageFormat / Fluent / gettext pluralization and grammatical gender. Application layer.
- Locale-aware number and date formatting. Always canonical numeric JSON.
- Runtime machine translation. Translations must be durable and reviewable.
- Audio pronunciation. A role: "ipa" label is supported; binary audio references wait for a future version.
- Honorific registers. Application concern.
- Speech/TTS hints. Application concern.
- Automatic language detection. Producers and consumers declare tags explicitly.
14. Benchmark implications
Benchmark tasks (schema/benchmark-task.schema.json) gain:
- primary_language (BCP 47).
- available_languages (advisory).
- LString support on name, description, rationale, and (optionally) on prompt.instruction and expected.value for multilingual tasks.
Aggregate leaderboards remain language-aware: a per-language sub-leaderboard is computed per BCP 47 tag present across tasks, so consumers can evaluate "how good is model X at gastronomy in French?" independently of English performance.
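As a rough illustration of the per-language breakdown, the Python sketch below groups task results by language tag and averages them. The input shape (pairs of tag and score) and the use of a plain mean are assumptions of this example; the benchmark schema defines the actual result format and scoring.

```python
from collections import defaultdict


def per_language_scores(results):
    """Average scores per BCP 47 tag.

    results: iterable of (language_tag, score) pairs, a hypothetical shape.
    Tags are grouped case-insensitively.
    """
    buckets = defaultdict(list)
    for tag, score in results:
        buckets[tag.lower()].append(score)
    return {tag: sum(scores) / len(scores) for tag, scores in buckets.items()}
```

Each key of the returned mapping corresponds to one sub-leaderboard entry for a given model.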
15. References
- ogs-core.md: Core specification
- IETF BCP 47 / RFC 5646: Language tags
- RFC 4647: Matching language tags (Lookup)
- RFC 6497: BCP 47 extension T (transliteration)
- ISO 15924: Script codes
- Unicode UAX #15: Normalization forms
- W3C SKOS: Simple Knowledge Organization System
- schema/lstring.schema.json: machine-readable LString shape
- vocab/translations/: language bundles for controlled vocabularies