OGS Internationalization Specification

Open Gastronomy Standard — i18n v0.2

1. Purpose

This specification defines how OGS documents represent human-readable text in multiple languages, scripts, and regional variants. It is the binding contract for every human-readable field across all OGS entity types: ingredient names, dish descriptions, pairing motivations, explanation-code labels, and the controlled vocabularies themselves.

The v0.2 i18n layer is additive and backwards-compatible: documents written against OGS v0.1 (which used bare strings for all human-readable fields) remain valid — plain strings are treated as Form 1 of the new LString type, implicitly in the document's declared primary_language.

2. Design principles

Principle	Rationale
One reusable shape for every localizable string.	Avoid a different i18n pattern per module. A single `LString` type, reused everywhere.
BCP 47 language tags, normative.	`en`, `en-GB`, `zh-Hans`, `ja-Latn`. Two-letter ISO 639-1 alone is not sufficient; scripts and regions matter.
Endonym-first.	Every entity declares a `primary_language`. Other renderings are derivatives. This preserves cultural attribution and drives fallback.
Progressive disclosure.	Three forms of `LString` (bare string, language map, rich array). Producers adopt the form that matches their conformance level.
Controlled vocabularies are thesauri.	OGS enums get `labels` blocks keyed by BCP 47 tag, with `preferred`/`alternate` labels and `definition` text. Translations ship as versioned data files.
Unicode NFC + UTF-8, normative.	Prevents `é` vs `e + combining acute` collisions, canonicalizes sort/search.
Separate i18n from markets, units, and currency.	Swedish-language ≠ sold-in-Sweden. These axes live in sibling modules (`ogs-market.md`, `ogs-units.md`).
Translations are data.	Localized labels for `core` vocabularies ship as PR-reviewable JSON files, not as spec prose.
Consumers specify preference; standard specifies matching.	Consumers provide a prioritized list of language ranges; resolution uses RFC 4647 Lookup.

3. The six kinds of strings

Not every string needs the same treatment. OGS distinguishes:

Kind	Example	Treatment
Opaque identifier	`ogs:core:ingredient:black-truffle`, `seared`, `STRUCT_TANNIN_FAT`	Never translated. ASCII and language-neutral (kebab-case for `ogs:` entity IDs).
Canonical entity name (endonym)	`Tartufo nero`, `親子丼`, `Château Margaux`	Stored in the entity's `primary_language`; the authoritative name.
Exonym / translation	`Black truffle` (en) for `Tartufo nero` (it)	An `LString` label with `role: "preferred"` or `"descriptive"` in another language.
Controlled-vocab label	`seared` → "seared" (en), "saisi" (fr)	Lives in vocabulary files, keyed by BCP 47 tag.
Free-form narrative	Pairing `motivation`, entity `description`	`LString`; may be human-written or machine-translated (with provenance).
Sensory / aromatic term	`lemon_zest`, `wet_stone`	Opaque ID + multilingual thesaurus label (same as controlled-vocab label).

4. Language tags

4.1 Format

All language tags MUST conform to IETF BCP 47 (RFC 5646). Common tags:

Tag	Meaning
`en`	English (unspecified region)
`en-US`, `en-GB`, `en-IN`	English regional variants
`fr`, `fr-CA`	French / Canadian French
`zh-Hans`, `zh-Hant`	Simplified / Traditional Chinese
`sr-Cyrl`, `sr-Latn`	Serbian in Cyrillic / Latin script
`ja`, `ja-Latn`	Japanese / romanized Japanese
`ar`, `ar-Latn`	Arabic / romanized Arabic
`zxx`	No linguistic content (e.g. a producer code)

Consumers MUST treat tags as case-insensitive; producers SHOULD emit the canonical form (language lowercase, script title-case, region uppercase).
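
The canonical casing can be sketched as below. This is a minimal illustration rather than a full BCP 47 parser: the helper name `canonical_case` is hypothetical, and the rule it applies is the positional convention from RFC 5646 §2.1.1 (two-letter subtags uppercase and four-letter subtags title-case, except at the start of the tag or after a singleton).

```python
def canonical_case(tag: str) -> str:
    """Return `tag` with conventional BCP 47 casing (hypothetical helper)."""
    out = []
    after_singleton = False
    for index, subtag in enumerate(tag.split("-")):
        if index == 0 or after_singleton:
            # Primary language, and everything after an extension or
            # private-use singleton, stays lowercase.
            out.append(subtag.lower())
        elif len(subtag) == 2:
            out.append(subtag.upper())   # region, e.g. GB
        elif len(subtag) == 4:
            out.append(subtag.title())   # script, e.g. Hant
        else:
            out.append(subtag.lower())
        if len(subtag) == 1:
            after_singleton = True
    return "-".join(out)
```

A consumer would typically compare tags case-insensitively (for example on `tag.lower()`) and use a helper like this only when emitting documents.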

4.2 Transliteration

When a label is a transliteration of a non-Latin original, producers SHOULD:

Include the script subtag (ja-Latn, ru-Latn, zh-Latn).
Include a transliteration_scheme sibling field with the convention name (hepburn, kunrei, pinyin, wade_giles, iast, iso_9, bgn_pcgn).
Set role: "transliteration".

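Producers MAY additionally use the RFC 6497 -t- extension (e.g. ja-Latn-t-ja, Latin-script Japanese transformed from Japanese) for machine-processable transliteration chains. RFC 6497 carries the transform mechanism in a separate m0 field rather than as a bare subtag after the source language.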

4.3 Scripts

For languages that routinely use multiple scripts (Japanese, Chinese, Serbian, Mongolian, Azerbaijani), the script field on a label entry carries the ISO 15924 code: Jpan, Hira, Kana, Latn, Cyrl, Hans, Hant, Arab, Hebr.

When the script subtag is already present in the BCP 47 tag (zh-Hant, sr-Latn), the script field is redundant and MAY be omitted.
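
A consumer that needs the script of a label can combine both sources. The sketch below is an assumption about how that might be done, not part of the specification: the helper name `effective_script` is hypothetical, and it treats the first four-letter alphabetic subtag before any singleton as the script subtag.

```python
from typing import Optional


def effective_script(label: dict) -> Optional[str]:
    """Return the ISO 15924 code for a label entry, or None if undeclared."""
    explicit = label.get("script")
    if explicit:
        return explicit
    # Fall back to the script subtag of the BCP 47 tag, if present.
    for subtag in label["language"].split("-")[1:]:
        if len(subtag) == 1:
            break  # extension / private-use section: no script subtag follows
        if len(subtag) == 4 and subtag.isascii() and subtag.isalpha():
            return subtag.title()
    return None
```

For example, `effective_script({"language": "zh-Hant"})` yields `"Hant"` without an explicit `script` field, while a plain `ja` label needs the field to distinguish `Hira` from `Jpan`.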

5. `LString` — the localizable string type

Every human-readable field in OGS accepts LString. It has three forms, each more expressive than the last. Validators MUST accept all three.

5.1 Form 1 — bare string (L1 shortcut)

{
  "name": "Black Truffle"
}

The enclosing entity declares primary_language at the document level; any bare-string LString is implicitly in that language.

5.2 Form 2 — language map (common case, L2)

{
  "name": {
    "it": "Tartufo nero",
    "en": "Black truffle",
    "fr": "Truffe noire"
  }
}

Keys are BCP 47 tags. Values are NFC-normalized UTF-8 strings. The entity's primary_language identifies which entry is the endonym.

5.3 Form 3 — rich label array (L3, full fidelity)

{
  "name": {
    "primary_language": "ja",
    "labels": [
      {
        "value": "親子丼",
        "language": "ja",
        "script": "Jpan",
        "role": "preferred"
      },
      {
        "value": "おやこどん",
        "language": "ja",
        "script": "Hira",
        "role": "alternate"
      },
      {
        "value": "oyakodon",
        "language": "ja-Latn",
        "role": "transliteration",
        "transliteration_scheme": "hepburn"
      },
      {
        "value": "Chicken and Egg Rice Bowl",
        "language": "en",
        "role": "descriptive",
        "translation": {
          "method": "human_expert",
          "source_language": "ja",
          "source_version": "0.2.0",
          "reviewed_at": "2026-04-18T00:00:00Z",
          "confidence": 0.95
        }
      }
    ]
  }
}

5.3.1 Label fields

Field	Type	Required	Description
`value`	string	YES	NFC-normalized UTF-8 text.
`language`	string	YES	BCP 47 tag.
`script`	string	NO	ISO 15924 script code.
`role`	string	NO	`preferred`, `alternate`, `transliteration`, `descriptive`, `loan`, `historical`, `deprecated`, `ipa`. Defaults to `preferred`.
`transliteration_scheme`	string	NO	Name of the convention (e.g. `hepburn`). SHOULD be present when `role` is `transliteration` (see §4.2).
`translation`	object	NO	Per-label provenance (see §7).

5.3.2 Label roles

Role	Meaning
`preferred`	Default label for its language tag. At most one per language tag.
`alternate`	Synonym, regional variant, or historical name. Multiple allowed.
`transliteration`	Script conversion of the endonym (e.g. romaji of Japanese).
`descriptive`	Translated gloss rather than a direct name (e.g. "Chicken and Egg Rice Bowl" for 親子丼).
`loan`	Imported form used in another language ("coq au vin" in English).
`historical`	Former official name retained for reference.
`deprecated`	No longer preferred; kept for round-trip compatibility.
`ipa`	International Phonetic Alphabet pronunciation (language is typically `und-fonipa`).
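
Because validators and consumers must handle all three forms, it can help to expand every LString into one uniform list of label entries before further processing. The following is a minimal sketch under stated assumptions: the helper name `to_labels` is hypothetical, Form 3 is recognized by the presence of a `labels` list, and a missing `role` defaults to `preferred` as described in §5.3.1.

```python
def to_labels(value, primary_language="und"):
    """Expand any LString form into a list of Form 3 style label dicts."""
    if isinstance(value, str):
        # Form 1: bare string, implicitly in the primary language.
        return [{"value": value, "language": primary_language, "role": "preferred"}]
    if isinstance(value, dict) and isinstance(value.get("labels"), list):
        # Form 3: rich label array; fill in the default role where absent.
        return [{"role": "preferred", **label} for label in value["labels"]]
    if isinstance(value, dict):
        # Form 2: language map, one preferred label per BCP 47 tag.
        return [
            {"value": text, "language": tag, "role": "preferred"}
            for tag, text in value.items()
        ]
    raise TypeError("not an LString: expected str or dict")
```

Calling `to_labels("Black Truffle", "en")` and `to_labels({"en": "Black Truffle"})` produce the same single entry, which is what makes v0.1 bare strings interchangeable with the richer forms downstream.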

6. Entity-level language fields

Every OGS document MAY carry:

Field	Type	Required	Description
`primary_language`	string (BCP 47)	RECOMMENDED at L2, REQUIRED at L3	The authoritative / endonym language. Drives fallback.
`available_languages`	array of BCP 47 tags	NO	Advisory list of languages covered across all `LString` fields.

When primary_language is absent, consumers SHOULD treat it as en for historical (OGS v0.1) documents and as und otherwise.

7. Translation provenance

For any label whose role implies it is a translation (preferred in a non-primary language, descriptive, loan, historical), producers SHOULD carry a translation object:

"translation": {
  "method": "human_expert | machine | machine_reviewed | community",
  "translator": "ogs:core:agent:alice",
  "reviewed_at": "2026-04-18T10:00:00Z",
  "source_language": "it",
  "source_version": "0.2.0",
  "confidence": 0.95
}

Key semantics:

source_language — the language this label was translated from. It may differ from primary_language when the translation was chained through a pivot language.
source_version — the version of the source text this translation was produced against. Enables staleness detection when the source changes.
method — human_expert, machine, machine_reviewed, community.

L3 conformance REQUIRES translation provenance on every non-primary-language label with role: "preferred" | "descriptive" | "loan".
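
The L3 rule and the staleness check can be sketched as follows. This is an illustrative validator fragment, not a reference implementation: the function names are hypothetical, it operates on a Form 3 LString, it compares language tags case-insensitively, and it assumes a translation is stale whenever its `source_version` differs from the current version of the source text.

```python
REQUIRES_PROVENANCE = {"preferred", "descriptive", "loan"}


def labels_missing_provenance(lstring: dict) -> list:
    """Return the values of labels that violate the L3 provenance rule."""
    primary = lstring["primary_language"].lower()
    missing = []
    for label in lstring["labels"]:
        role = label.get("role", "preferred")
        non_primary = label["language"].lower() != primary
        if non_primary and role in REQUIRES_PROVENANCE and "translation" not in label:
            missing.append(label["value"])
    return missing


def is_stale(label: dict, current_source_version: str) -> bool:
    """True if the label was translated against an older source version."""
    translation = label.get("translation")
    if translation is None:
        return False
    return translation.get("source_version") != current_source_version
```

Transliterations and alternates are not flagged, because the rule only names the `preferred`, `descriptive`, and `loan` roles.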

8. Controlled vocabularies as multilingual thesauri

OGS controlled vocabularies (cooking methods, component roles, match types, aromatic families, explanation codes, cuisines) ship with a labels block per concept, keyed by BCP 47 tag:

{
  "id": "seared",
  "labels": {
    "en": { "preferred": "seared", "definition": "Brief high-heat surface cooking." },
    "fr": { "preferred": "saisi", "definition": "Cuisson brève à haute température en surface." },
    "ja": { "preferred": "表面焼き", "definition": "高温で表面のみを短時間焼く調理法。" },
    "sv": { "preferred": "bryna", "definition": "Snabb bryning vid hög värme." }
  }
}

Per-concept entries MAY also carry alternates (array of strings), hidden (array of search-only aliases, e.g. common misspellings), and notes.

8.1 Shipping policy

Core vocab files (vocab/*.json) always carry en as a baseline.
Additional languages ship in vocab/translations/{bcp47}.json bundles, each a partial object merged onto the core file at load time.
Adding a language is a MINOR version bump.
Correcting a translation is a PATCH.

8.2 Consumer merge algorithm

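merged[concept_id].labels = { ...core[concept_id].labels, ...translations[lang][concept_id].labels }

Later entries override earlier ones, so a language tag present in both the core file and a translation bundle takes its labels from the bundle. Consumers MUST preserve per-concept id values unchanged.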
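
The merge can be sketched in Python as below. The file layout is an assumption made for illustration: both the core file and a translation bundle are taken to be objects keyed by concept id, each holding a `labels` map, and bundle entries for unknown concepts are ignored. The function name `merge_vocab` is hypothetical.

```python
import copy


def merge_vocab(core: dict, bundle: dict) -> dict:
    """Merge one translation bundle onto the core vocabulary.

    Returns a new dict; `core` is left untouched.
    """
    merged = copy.deepcopy(core)
    for concept_id, partial in bundle.items():
        if concept_id not in merged:
            continue  # assumption: bundles translate concepts, never add them
        concept = merged[concept_id]
        # Bundle labels override core labels that share a language tag.
        concept["labels"] = {**concept["labels"], **partial.get("labels", {})}
    return merged
```

Only the `labels` block is rewritten, so each concept's `id` and any other fields pass through unchanged.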

9. Fallback and language matching

Consumers SHOULD resolve localized fields using the RFC 4647 Lookup algorithm against a prioritized list of language ranges.

9.1 Resolution order

Given a consumer's preference list ["fr-CA", "fr", "en"], resolution for a given LString:

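Try each preference range in order using RFC 4647 Lookup (progressive truncation): fr-CA → fr, then fr, then en. Lookup only shortens a range; it never matches a range against a longer tag, so the range fr does not select a label tagged fr-FR.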
If no match is found, fall back to the entity's primary_language.
If the entity declares no primary_language, fall back to the first label in document order (Form 3) or the first map entry (Form 2).
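
The resolution order can be sketched for a Form 2 language map as follows. This is a minimal illustration, not a complete RFC 4647 implementation: the names `lookup` and `resolve` are hypothetical, the map is assumed to be non-empty, the primary-language fallback is assumed to be an exact case-insensitive match, and truncation follows RFC 4647 Lookup, where a range is only ever shortened and never matches a longer tag.

```python
def lookup(ranges, available):
    """RFC 4647 style Lookup: return the best matching tag or None."""
    by_lower = {}
    for tag in available:
        by_lower.setdefault(tag.lower(), tag)
    for language_range in ranges:
        if language_range == "*":
            continue  # the wildcard range is skipped during Lookup
        parts = language_range.lower().split("-")
        while parts:
            candidate = "-".join(parts)
            if candidate in by_lower:
                return by_lower[candidate]
            parts.pop()
            # A trailing single-letter subtag is removed with its successor.
            if parts and len(parts[-1]) == 1:
                parts.pop()
    return None


def resolve(lang_map, preferences, primary_language=None):
    """Return (tag, text) following the three-step resolution order."""
    tags = list(lang_map)
    tag = lookup(preferences, tags)
    if tag is None and primary_language is not None:
        wanted = primary_language.lower()
        tag = next((t for t in tags if t.lower() == wanted), None)
    if tag is None:
        tag = tags[0]  # first map entry in document order
    return tag, lang_map[tag]
```

With the map from §5.2 and the preference list `["fr-CA", "fr", "en"]`, the first range truncates from `fr-CA` to `fr` and selects "Truffe noire".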

9.2 Silent machine translation

Consumers MUST NOT silently substitute machine-translated content for a missing label. Translations must be carried in the document with explicit provenance (§7).

10. Normalization, encoding, and validation

Normative:

Encoding: UTF-8.
Unicode normalization: Form C (NFC) on all value strings. Validators MUST verify and MAY auto-normalize input.
Case folding for identifier comparisons: ASCII lowercase (identifiers are ASCII by construction).
Case folding for label search: Unicode default case folding (as defined by the Unicode Standard, CaseFolding.txt).
Sort order: consumer concern; CLDR collation is the reference.
Bidirectional text: stored as plain Unicode; rendering is the UI's job. Canonical storage MUST NOT contain U+200E/U+200F markers.
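
These checks map directly onto Python's standard library. The sketch below is illustrative only: the function names are hypothetical, `unicodedata.is_normalized` requires Python 3.8 or later, and `str.casefold` is used as the implementation of Unicode default case folding.

```python
import unicodedata

BIDI_MARKS = ("\u200e", "\u200f")  # LEFT-TO-RIGHT MARK, RIGHT-TO-LEFT MARK


def check_value(text: str) -> list:
    """Return a list of problems found in a label value (empty if clean)."""
    problems = []
    if not unicodedata.is_normalized("NFC", text):
        problems.append("not NFC-normalized")
    if any(mark in text for mark in BIDI_MARKS):
        problems.append("contains U+200E/U+200F")
    return problems


def normalize_value(text: str) -> str:
    """Auto-normalize a value to NFC, as validators MAY do."""
    return unicodedata.normalize("NFC", text)


def search_key(text: str) -> str:
    """Key for case-insensitive label search."""
    return normalize_value(text).casefold()
```

Here `normalize_value("e\u0301")` returns the single code point U+00E9, which is how the `é` versus `e + combining acute` collision is avoided.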

11. Identifiers and non-Latin content

OGS identifiers (ogs:<ns>:<type>:<id>) are ASCII-only. For entities whose canonical name is not ASCII-representable, the <id> segment uses:

A Latin transliteration of the endonym (oyakodon, boeuf-bourguignon), or
An ASCII-safe descriptive form (chicken-egg-rice-bowl).

The display name lives in name (an LString). Producers MUST NOT encode the native script in the identifier. The ID is opaque; it exists for routing, not for reading.
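
A structural check for identifiers might look like the sketch below. It verifies only what this section states (the `ogs:` prefix, four colon-separated segments, ASCII-only content); the function name is hypothetical, the ban on whitespace is an added assumption, and any stricter character set for individual segments is left to the core specification.

```python
def is_valid_ogs_id(identifier: str) -> bool:
    """Check the ogs:<ns>:<type>:<id> shape and the ASCII-only rule."""
    parts = identifier.split(":")
    return (
        len(parts) == 4
        and parts[0] == "ogs"
        and all(parts[1:])                               # no empty segments
        and identifier.isascii()                         # no native script
        and not any(ch.isspace() for ch in identifier)   # assumption
    )
```

An identifier such as `ogs:core:dish:oyakodon` passes, while one that embeds 親子丼 in the `<id>` segment is rejected and the native name belongs in `name` instead.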

12. Conformance

The existing L1/L2/L3 levels (see ogs-core.md §7) gain i18n requirements:

Level	i18n requirement
L1 (Structural)	UTF-8 encoding; NFC normalization. Bare-string `LString` values permitted (Form 1).
L2 (Vocabulary)	All BCP 47 tags appearing in documents MUST parse. Entity SHOULD declare `primary_language`. Controlled-vocab labels resolved via the shipped translation bundles.
L3 (Semantic)	Every entity MUST declare `primary_language`. Every human-readable field SHOULD use Form 2 or Form 3 with at least the primary language populated. Non-primary-language labels with role `preferred`, `descriptive`, or `loan` MUST carry `translation` provenance (see §7).

13. What OGS deliberately does NOT do

Out of scope for the i18n module:

ICU MessageFormat / Fluent / gettext pluralization and grammatical gender. Application layer.
Locale-aware number and date formatting. Always canonical numeric JSON.
Runtime machine translation. Translations must be durable and reviewable.
Audio pronunciation. A role: "ipa" label is supported; binary audio references wait for a future version.
Honorific registers. Application concern.
Speech/TTS hints. Application concern.
Automatic language detection. Producers and consumers declare tags explicitly.

14. Benchmark implications

Benchmark tasks (schema/benchmark-task.schema.json) gain:

primary_language (BCP 47).
available_languages (advisory).
LString support on name, description, rationale, and (optionally) on prompt.instruction and expected.value for multilingual tasks.

Leaderboards are language-aware: alongside the aggregate, a per-language sub-leaderboard is computed for each BCP 47 tag present across tasks, so consumers can evaluate "how good is model X at gastronomy in French?" independently of English performance.
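
The grouping behind a per-language sub-leaderboard can be sketched as below. The input shape and the scoring are assumptions made for illustration only: each result is a hypothetical `(language_tag, score)` pair for one task, and the sub-leaderboard value is taken to be the mean score per tag. The specification does not define either.

```python
from collections import defaultdict


def per_language_scores(results):
    """Group task scores by BCP 47 tag and return the mean per tag."""
    buckets = defaultdict(list)
    for language, score in results:
        buckets[language].append(score)
    return {language: sum(scores) / len(scores) for language, scores in buckets.items()}
```

For instance, `per_language_scores([("fr", 1.0), ("fr", 0.5), ("en", 0.25)])` reports French and English separately, so a weak French result is not hidden by strong English performance or vice versa. Tags are grouped as given, so inputs should already be in canonical case.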

15. References

ogs-core.md — Core specification
IETF BCP 47 / RFC 5646 — Language tags
RFC 4647 — Matching language tags (Lookup)
RFC 6497 — BCP 47 extension T (transformed content)
ISO 15924 — Script codes
Unicode UAX #15 — Normalization forms
W3C SKOS — Simple Knowledge Organization System
schema/lstring.schema.json — machine-readable LString shape
vocab/translations/ — language bundles for controlled vocabularies