OGS Internationalization Specification

Open Gastronomy Standard — i18n v0.2


1. Purpose

This specification defines how OGS documents represent human-readable text in multiple languages, scripts, and regional variants. It is the binding contract for every human-readable field across all OGS entity types: ingredient names, dish descriptions, pairing motivations, explanation-code labels, and the controlled vocabularies themselves.

The v0.2 i18n layer is additive and backwards-compatible: documents written against OGS v0.1 (which used bare strings for all human-readable fields) remain valid — plain strings are treated as Form 1 of the new LString type, implicitly in the document's declared primary_language.


2. Design principles

| Principle | Rationale |
| --- | --- |
| One reusable shape for every localizable string. | Avoid a different i18n pattern per module. A single LString type, reused everywhere. |
| BCP 47 language tags, normative. | en, en-GB, zh-Hans, ja-Latn. Two-letter ISO 639-1 alone is not sufficient; scripts and regions matter. |
| Endonym-first. | Every entity declares a primary_language. Other renderings are derivatives. This preserves cultural attribution and drives fallback. |
| Progressive disclosure. | Three forms of LString (bare string, language map, rich array). Producers adopt the form that matches their conformance level. |
| Controlled vocabularies are thesauri. | OGS enums get labels blocks keyed by BCP 47 tag, with preferred/alternate labels and definition text. Translations ship as versioned data files. |
| Unicode NFC + UTF-8, normative. | Prevents é vs e + combining acute collisions, canonicalizes sort/search. |
| Separate i18n from markets, units, and currency. | Swedish-language ≠ sold-in-Sweden. These axes live in sibling modules (ogs-market.md, ogs-units.md). |
| Translations are data. | Localized labels for core vocabularies ship as PR-reviewable JSON files, not as spec prose. |
| Consumers specify preference; standard specifies matching. | Consumers provide a prioritized list of language ranges; resolution uses RFC 4647 Lookup. |

3. The six kinds of strings

Not every string needs the same treatment. OGS distinguishes:

| Kind | Example | Treatment |
| --- | --- | --- |
| Opaque identifier | ogs:core:ingredient:black-truffle, seared, STRUCT_TANNIN_FAT | Never translated. ASCII, kebab-case, language-neutral. |
| Canonical entity name (endonym) | Tartufo nero, 親子丼, Château Margaux | Stored in the entity's primary_language; the authoritative name. |
| Exonym / translation | Black truffle (en) for Tartufo nero (it) | An LString label with role: "preferred" or "descriptive" in another language. |
| Controlled-vocab label | seared → "seared" (en), "saisi" (fr) | Lives in vocabulary files, keyed by BCP 47 tag. |
| Free-form narrative | Pairing motivation, entity description | LString; may be human-written or machine-translated (with provenance). |
| Sensory / aromatic term | lemon_zest, wet_stone | Opaque ID + multilingual thesaurus label (same as controlled-vocab label). |

4. Language tags

4.1 Format

All language tags MUST conform to IETF BCP 47 (RFC 5646). Common tags:

| Tag | Meaning |
| --- | --- |
| en | English (unspecified region) |
| en-US, en-GB, en-IN | English regional variants |
| fr, fr-CA | French / Canadian French |
| zh-Hans, zh-Hant | Simplified / Traditional Chinese |
| sr-Cyrl, sr-Latn | Serbian in Cyrillic / Latin script |
| ja, ja-Latn | Japanese / romanized Japanese |
| ar, ar-Latn | Arabic / romanized Arabic |
| zxx | No linguistic content (e.g. a producer code) |

Consumers MUST treat tags as case-insensitive. Producers SHOULD emit the canonical form (language lowercase, script title-case, region uppercase).
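The casing convention can be sketched in a few lines of Python. This is an illustration only: the function names `canonicalize_tag_case` and `tags_equal` are hypothetical, and the code adjusts casing without validating subtags against the IANA registry.

```python
def canonicalize_tag_case(tag: str) -> str:
    """Apply BCP 47 casing conventions (RFC 5646, section 2.1.1).

    Hypothetical helper: it only adjusts case and does not check that the
    subtags are registered or even well-formed.
    """
    subtags = tag.split("-")
    out = []
    for index, sub in enumerate(subtags):
        if len(sub) == 1:
            # A singleton starts an extension or private-use sequence;
            # everything from here on is lowercase.
            out.extend(s.lower() for s in subtags[index:])
            break
        if index == 0:
            out.append(sub.lower())       # primary language: lowercase
        elif len(sub) == 4 and sub.isalpha():
            out.append(sub.title())       # script: title-case
        elif len(sub) == 2 and sub.isalpha():
            out.append(sub.upper())       # region: uppercase
        else:
            out.append(sub.lower())       # variants, numeric regions, etc.
    return "-".join(out)


def tags_equal(a: str, b: str) -> bool:
    """Case-insensitive comparison, as consumers are required to do."""
    return a.lower() == b.lower()
```

A producer would call `canonicalize_tag_case` before writing a tag, while a consumer would compare with `tags_equal` and never rely on the casing it receives.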

4.2 Transliteration

When a label is a transliteration of a non-Latin original, producers SHOULD:

  1. Include the script subtag (ja-Latn, ru-Latn, zh-Latn).
  2. Include a transliteration_scheme sibling field with the convention name (hepburn, kunrei, pinyin, wade_giles, iast, iso_9, bgn_pcgn).
  3. Set role: "transliteration".

Producers MAY additionally use the RFC 6497 -t- extension (e.g. ja-Latn-t-ja-hepburn) for machine-processable transliteration chains.

4.3 Scripts

For languages that routinely use multiple scripts (Japanese, Chinese, Serbian, Mongolian, Azerbaijani), the script field on a label entry carries the ISO 15924 code: Jpan, Hira, Kana, Latn, Cyrl, Hans, Hant, Arab, Hebr.

When the script subtag is already present in the BCP 47 tag (zh-Hant, sr-Latn), the script field is redundant and MAY be omitted.


5. LString — the localizable string type

Every human-readable field in OGS accepts LString. Three forms, progressively more expressive. Validators MUST accept all three.

5.1 Form 1 — bare string (L1 shortcut)

{
  "name": "Black Truffle"
}

The enclosing entity declares primary_language at the document level; any bare-string LString is implicitly in that language.

5.2 Form 2 — language map (common case, L2)

{
  "name": {
    "it": "Tartufo nero",
    "en": "Black truffle",
    "fr": "Truffe noire"
  }
}

Keys are BCP 47 tags. Values are NFC-normalized UTF-8 strings. The entity's primary_language identifies which entry is the endonym.
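To illustrate what NFC normalization guards against, the following sketch uses Python's standard `unicodedata` module (`is_normalized` needs Python 3.8 or later). The helper names `to_nfc` and `is_nfc` are assumptions made for this example, not part of OGS.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return the NFC-normalized form of text."""
    return unicodedata.normalize("NFC", text)


def is_nfc(text: str) -> bool:
    """True when text is already in NFC."""
    return unicodedata.is_normalized("NFC", text)


# Two visually identical spellings of the same word:
composed = "caf\u00e9"      # 'é' as a single code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent (U+0301)

# They compare unequal as raw strings, but equal once normalized.
same_before = composed == decomposed
same_after = to_nfc(composed) == to_nfc(decomposed)
```

A producer could run `to_nfc` on every value before serializing, and a validator could reject any value for which `is_nfc` is false.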

5.3 Form 3 — rich label array (L3, full fidelity)

{
  "name": {
    "primary_language": "ja",
    "labels": [
      {
        "value": "親子丼",
        "language": "ja",
        "script": "Jpan",
        "role": "preferred"
      },
      {
        "value": "おやこどん",
        "language": "ja",
        "script": "Hira",
        "role": "alternate"
      },
      {
        "value": "oyakodon",
        "language": "ja-Latn",
        "role": "transliteration",
        "transliteration_scheme": "hepburn"
      },
      {
        "value": "Chicken and Egg Rice Bowl",
        "language": "en",
        "role": "descriptive",
        "translation": {
          "method": "human_expert",
          "source_language": "ja",
          "source_version": "0.2.0",
          "reviewed_at": "2026-04-18T00:00:00Z",
          "confidence": 0.95
        }
      }
    ]
  }
}
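Because validators must accept all three forms, a consumer may find it convenient to reduce every LString to a single list of label entries first. The sketch below is one possible approach, not a normative algorithm; `to_labels` is a hypothetical name, and the choice to distinguish Form 3 by a `labels` key holding a list is an assumption.

```python
from typing import Any


def to_labels(field: Any, primary_language: str = "und") -> list[dict]:
    """Reduce any LString form to a list of Form 3 style label entries.

    Hypothetical helper. `primary_language` is the document-level value
    used for bare strings (Form 1).
    """
    if isinstance(field, str):
        # Form 1: a bare string, implicitly in the primary language.
        return [{"value": field, "language": primary_language, "role": "preferred"}]
    if isinstance(field, dict) and isinstance(field.get("labels"), list):
        # Form 3: rich label array; a missing role defaults to "preferred".
        return [{"role": "preferred", **label} for label in field["labels"]]
    if isinstance(field, dict):
        # Form 2: language map, one preferred label per tag.
        return [
            {"value": value, "language": tag, "role": "preferred"}
            for tag, value in field.items()
        ]
    raise TypeError(f"not an LString: {type(field).__name__}")
```

For example, `to_labels("Black Truffle", "en")` yields a single preferred label tagged `en`, and the Form 2 map from §5.2 yields three preferred labels in document order.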

5.3.1 Label fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| value | string | YES | NFC-normalized UTF-8 text. |
| language | string | YES | BCP 47 tag. |
| script | string | NO | ISO 15924 script code. |
| role | string | NO | preferred, alternate, transliteration, descriptive, loan, historical, deprecated, ipa. Defaults to preferred. |
| transliteration_scheme | string | NO | Name of the convention (e.g. hepburn). Required when role: "transliteration". |
| translation | object | NO | Per-label provenance (see §7). |

5.3.2 Label roles

| Role | Meaning |
| --- | --- |
| preferred | Default label for its language tag. At most one per language tag. |
| alternate | Synonym, regional variant, or historical name. Multiple allowed. |
| transliteration | Script conversion of the endonym (e.g. romaji of Japanese). |
| descriptive | Translated gloss rather than a direct name (e.g. "Chicken and Egg Rice Bowl" for 親子丼). |
| loan | Imported form used in another language ("coq au vin" in English). |
| historical | Former official name retained for reference. |
| deprecated | No longer preferred; kept for round-trip compatibility. |
| ipa | International Phonetic Alphabet pronunciation (language is typically und-fonipa). |
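The field and role rules above lend themselves to a small structural check. The following is a minimal sketch, assuming labels arrive as plain dictionaries; `check_labels` and its message wording are hypothetical, and language tags are compared case-insensitively.

```python
ROLES = {
    "preferred", "alternate", "transliteration", "descriptive",
    "loan", "historical", "deprecated", "ipa",
}


def check_labels(labels: list[dict]) -> list[str]:
    """Return a list of human-readable problems found in a Form 3 label array."""
    problems = []
    preferred_seen = set()
    for index, label in enumerate(labels):
        # value and language are the two required fields.
        for field in ("value", "language"):
            if not isinstance(label.get(field), str) or not label[field]:
                problems.append(f"label {index}: missing required field {field!r}")
        role = label.get("role", "preferred")  # role defaults to preferred
        if role not in ROLES:
            problems.append(f"label {index}: unknown role {role!r}")
        if role == "transliteration" and not label.get("transliteration_scheme"):
            problems.append(f"label {index}: transliteration_scheme is required")
        if role == "preferred":
            tag = str(label.get("language", "")).lower()
            if tag in preferred_seen:
                problems.append(f"label {index}: second preferred label for {tag!r}")
            preferred_seen.add(tag)
    return problems
```

Run against the 親子丼 example in §5.3, this check reports no problems: the two `ja` labels have different roles, and the romanized label carries its scheme.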

6. Entity-level language fields

Every OGS document MAY carry:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| primary_language | string (BCP 47) | RECOMMENDED at L2, REQUIRED at L3 | The authoritative / endonym language. Drives fallback. |
| available_languages | array of BCP 47 tags | NO | Advisory list of languages covered across all LString fields. |

When primary_language is absent, consumers SHOULD treat it as en for historical (OGS v0.1) documents and as und otherwise.
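The defaulting rule can be expressed as a short helper. This is only a sketch: the name `effective_primary_language` is hypothetical, and because this section does not say how a consumer recognizes a v0.1 document, that decision is passed in as the `is_v01` flag.

```python
def effective_primary_language(document: dict, is_v01: bool) -> str:
    """Return the language that bare strings and fallback should use.

    `is_v01` is an assumption of this sketch: the caller decides, by
    whatever means it has, whether the document predates the i18n layer.
    """
    declared = document.get("primary_language")
    if declared:
        return declared
    # Historical documents are treated as English, everything else as
    # undetermined ("und").
    return "en" if is_v01 else "und"
```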


7. Translation provenance

For any label whose role implies it is a translation (preferred in a non-primary language, descriptive, loan, historical), producers SHOULD carry a translation object:

"translation": {
  "method": "human_expert | machine | machine_reviewed | community",
  "translator": "ogs:core:agent:alice",
  "reviewed_at": "2026-04-18T10:00:00Z",
  "source_language": "it",
  "source_version": "0.2.0",
  "confidence": 0.95
}

Key semantics:

L3 conformance REQUIRES translation provenance on every non-primary-language label with role: "preferred" | "descriptive" | "loan".
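A validator could enforce the L3 rule roughly as follows. The sketch assumes that "non-primary-language" means the label's tag differs from primary_language under case-insensitive comparison, which is an interpretation rather than something this section states; `missing_provenance` is a hypothetical name.

```python
NEEDS_PROVENANCE = {"preferred", "descriptive", "loan"}


def missing_provenance(labels: list[dict], primary_language: str) -> list[dict]:
    """Return the labels that L3 would reject for lacking a translation object."""
    primary = primary_language.lower()
    missing = []
    for label in labels:
        role = label.get("role", "preferred")
        is_foreign = label["language"].lower() != primary
        if is_foreign and role in NEEDS_PROVENANCE and "translation" not in label:
            missing.append(label)
    return missing
```

Under this reading, the `ja-Latn` transliteration in the §5.3 example needs no provenance because of its role, while the English descriptive label does and carries it.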


8. Controlled vocabularies as multilingual thesauri

OGS controlled vocabularies (cooking methods, component roles, match types, aromatic families, explanation codes, cuisines) ship with a labels block per concept, keyed by BCP 47 tag:

{
  "id": "seared",
  "labels": {
    "en": { "preferred": "seared", "definition": "Brief high-heat surface cooking." },
    "fr": { "preferred": "saisi", "definition": "Cuisson brève à haute température en surface." },
    "ja": { "preferred": "表面焼き", "definition": "高温で表面のみを短時間焼く調理法。" },
    "sv": { "preferred": "bryna", "definition": "Snabb bryning vid hög värme." }
  }
}

Per-concept entries MAY also carry alternates (array of strings), hidden (array of search-only aliases, e.g. common misspellings), and notes.

8.1 Shipping policy

8.2 Consumer merge algorithm

merged[concept_id].labels = { ...core.labels, ...translations[lang].labels }

Later entries override earlier ones. Consumers MUST preserve per-concept id values unchanged.
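The merge rule above can be sketched in Python as follows. The data shapes are assumptions: `core` maps each concept id to its concept object, and each translation bundle maps concept ids to an object with a `labels` block. The function name `merge_vocabulary` is hypothetical.

```python
def merge_vocabulary(core: dict, bundles: list[dict]) -> dict:
    """Overlay translation bundles on the core vocabulary, in order.

    Later bundles override earlier ones per language tag. The inputs are
    not mutated, and concept ids are carried over unchanged.
    """
    merged = {}
    for concept_id, concept in core.items():
        labels = dict(concept.get("labels", {}))  # copy, do not mutate core
        for bundle in bundles:
            labels.update(bundle.get(concept_id, {}).get("labels", {}))
        merged[concept_id] = {**concept, "labels": labels}
    return merged
```

As with the one-line rule it mirrors, the merge is shallow at the language-tag level: a bundle's entry for a tag replaces the core entry for that tag as a whole, including its definition.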


9. Fallback and language matching

Consumers SHOULD resolve localized fields using the RFC 4647 Lookup algorithm against a prioritized list of language ranges.

9.1 Resolution order

Given a consumer's preference list ["fr-CA", "fr", "en"], resolution for a given LString proceeds as follows:

  1. Try each preference range in order using RFC 4647 Lookup (progressive truncation): fr-CA → fr-* → fr → en-* → en.
  2. If no match is found, fall back to the entity's primary_language.
  3. If the entity declares no primary_language, fall back to the first label in document order (Form 3) or the first map entry (Form 2).
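The three steps above can be sketched for a Form 2 language map as follows. This is an illustration under stated assumptions: `lookup` and `resolve` are hypothetical names, only basic language ranges are handled (a bare `*` range is skipped), and a declared primary_language with no matching entry falls through to step 3.

```python
from typing import Iterable, Optional


def lookup(ranges: Iterable[str], available: Iterable[str]) -> Optional[str]:
    """RFC 4647 Lookup: progressively truncate each range until a tag matches."""
    by_lower = {tag.lower(): tag for tag in available}
    for language_range in ranges:
        if language_range == "*":
            continue  # a bare wildcard carries no preference in Lookup
        subtags = language_range.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_lower:
                return by_lower[candidate]
            subtags.pop()
            # A singleton is removed together with the subtag after it.
            if subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return None


def resolve(lstring: dict, preferences: list[str],
            primary_language: Optional[str] = None) -> Optional[str]:
    """Pick the best value from a Form 2 map for the given preference list."""
    if not lstring:
        return None
    tag = lookup(preferences, lstring)                # step 1
    if tag is None and primary_language is not None:
        tag = lookup([primary_language], lstring)     # step 2
    if tag is None:
        tag = next(iter(lstring))                     # step 3: first entry
    return lstring[tag]
```

With the map from §5.2 and the preference list `["fr-CA", "fr", "en"]`, `fr-CA` is truncated to `fr`, so the French entry is returned. Form 3 arrays could be handled the same way after collecting the candidate labels for each language tag.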

9.2 Silent machine translation

Consumers MUST NOT silently substitute machine-translated content for a missing label. Translations must be carried in the document with explicit provenance (§7).


10. Normalization, encoding, and validation

Normative:


11. Identifiers and non-Latin content

OGS identifiers (ogs:<ns>:<type>:<id>) are ASCII-only. For entities whose canonical name is not ASCII-representable, the <id> segment uses:

  1. A Latin transliteration of the endonym (oyakodon, boeuf-bourguignon), or
  2. An ASCII-safe descriptive form (chicken-egg-rice-bowl).

The display name lives in name (an LString). Producers MUST NOT encode the native script in the identifier. The ID is opaque; it exists for routing, not for reading.
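A producer might guard against non-ASCII identifiers with a check like the one below. The exact character set is an assumption: the pattern accepts lowercase ASCII letters, digits, and hyphens in each segment, in line with the kebab-case convention in §3, and `is_valid_ogs_id` is a hypothetical name.

```python
import re

# One kebab-case segment: lowercase ASCII letters and digits, hyphen-separated.
_SEGMENT = r"[a-z0-9]+(?:-[a-z0-9]+)*"
_OGS_ID = re.compile(rf"ogs:{_SEGMENT}:{_SEGMENT}:{_SEGMENT}")


def is_valid_ogs_id(identifier: str) -> bool:
    """True when identifier has the shape ogs:<ns>:<type>:<id> in ASCII kebab-case."""
    return _OGS_ID.fullmatch(identifier) is not None
```

Under this pattern `ogs:core:ingredient:black-truffle` passes, while an identifier that embeds 親子丼 directly is rejected; the native-script name belongs in the `name` LString instead.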


12. Conformance

The existing L1/L2/L3 levels (see ogs-core.md §7) gain i18n requirements:

| Level | i18n requirement |
| --- | --- |
| L1 (Structural) | UTF-8 encoding; NFC normalization. Bare-string LString values permitted (Form 1). |
| L2 (Vocabulary) | All BCP 47 tags appearing in documents MUST parse. Entity SHOULD declare primary_language. Controlled-vocab labels resolved via the shipped translation bundles. |
| L3 (Semantic) | Every entity MUST declare primary_language. Every human-readable field SHOULD use Form 2 or Form 3 with at least the primary language populated. Non-primary translations SHOULD carry translation provenance. |

13. What OGS deliberately does NOT do

Out of scope for the i18n module:

  1. ICU MessageFormat / Fluent / gettext pluralization and grammatical gender. Application layer.
  2. Locale-aware number and date formatting. Always canonical numeric JSON.
  3. Runtime machine translation. Translations must be durable and reviewable.
  4. Audio pronunciation. A role: "ipa" label is supported; binary audio references wait for a future version.
  5. Honorific registers. Application concern.
  6. Speech/TTS hints. Application concern.
  7. Automatic language detection. Producers and consumers declare tags explicitly.

14. Benchmark implications

Benchmark tasks (schema/benchmark-task.schema.json) gain:

Aggregate leaderboards remain language-aware: a per-language sub-leaderboard is computed per BCP 47 tag present across tasks, so consumers can evaluate "how good is model X at gastronomy in French?" independently of English performance.
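As a rough illustration of the per-language grouping, the sketch below averages scores per tag. The record shape is entirely hypothetical: it assumes each result carries a `language` tag and a numeric `score`, neither of which is defined in this section, and it keys the output by lowercased tag so that differently cased tags land in the same bucket.

```python
from collections import defaultdict
from statistics import mean


def per_language_scores(results: list[dict]) -> dict[str, float]:
    """Average score per BCP 47 tag (hypothetical result records)."""
    buckets = defaultdict(list)
    for result in results:
        # Tags are case-insensitive, so group on the lowercased form.
        buckets[result["language"].lower()].append(result["score"])
    return {tag: mean(scores) for tag, scores in buckets.items()}
```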


15. References