Understanding Unicode and CLDR
The Unicode Consortium and its Common Locale Data Repository (CLDR) form the foundation of modern internationalization, providing standardized data for displaying, formatting, and sorting text across languages and regions. Unicode defines the universal mapping between characters and numbers (code points) that lets computers represent and process text from all writing systems, from Latin to Arabic to Chinese. CLDR builds on Unicode by providing locale-specific data: how to format dates, numbers, and currencies; plural rules for grammatical correctness; collation rules for sorting; translations of language, region, and currency names; and localized time zone names.
Most modern operating systems, browsers, and programming-language runtimes rely on CLDR data, usually through ICU. When JavaScript's Intl.DateTimeFormat formats a date as "12. Februar 2026" in German, it is using CLDR data. Understanding CLDR lets developers rely on battle-tested locale data instead of reinventing solutions, keeps output consistent across platforms that share the same authoritative source, and makes it possible to contribute corrections back to CLDR when locale data is wrong. This deep dive explores CLDR structure, locale identifiers, data usage patterns, and how frameworks integrate CLDR into their internationalization APIs.
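As a quick illustration of the German date mentioned above, the sketch below formats one date for a few locales through the Intl API. The helper name `formatLongDate` is an illustrative assumption, and the output shown assumes a runtime with full locale data (the default in current browsers and Node.js builds).

```javascript
// Format the same date for several locales; the patterns and month names
// come from the CLDR data bundled with the runtime.
const date = new Date(2026, 1, 12); // 12 February 2026 (months are zero-based)

// Hypothetical helper: long date style for a given locale
function formatLongDate(locale, d) {
  return new Intl.DateTimeFormat(locale, { dateStyle: 'long' }).format(d);
}

console.log(formatLongDate('de-DE', date)); // "12. Februar 2026"
console.log(formatLongDate('en-US', date)); // "February 12, 2026"
console.log(formatLongDate('ja-JP', date)); // "2026年2月12日"
```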
Unicode Character Encoding
Unicode assigns a unique code point to every character across all writing systems:
```javascript
// Unicode code points
'A'.codePointAt(0).toString(16);  // "41"
'é'.codePointAt(0).toString(16);  // "e9"
'中'.codePointAt(0).toString(16); // "4e2d"
'😀'.codePointAt(0).toString(16); // "1f600"

// String.fromCodePoint creates characters from code points
String.fromCodePoint(0x41);    // "A"
String.fromCodePoint(0xe9);    // "é"
String.fromCodePoint(0x4e2d);  // "中"
String.fromCodePoint(0x1f600); // "😀"
```
UTF-8, UTF-16, UTF-32
Unicode can be encoded in different formats:
| Encoding | Characteristics | Use Case |
|---|---|---|
| UTF-8 | Variable width (1-4 bytes), ASCII-compatible | Web, files, APIs (most common) |
| UTF-16 | Variable width (2 or 4 bytes) | JavaScript, Java, Windows |
| UTF-32 | Fixed width (4 bytes) | Internal processing |
```javascript
// JavaScript strings are UTF-16
const text = "Hello 你好 😀";

// Length counts UTF-16 code units, not characters
text.length; // 11 (not 10!)

// Emoji takes 2 UTF-16 code units (surrogate pair)
'😀'.length; // 2

// Counting code points instead (string iteration is code-point aware)
[...text].length;        // 10
Array.from(text).length; // 10
```
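The byte widths in the table can be checked directly. The following is a minimal sketch; `encodingSizes` is a hypothetical helper, and it relies on the standard `TextEncoder` (which always encodes to UTF-8) being available as a global, as it is in current browsers and Node.js.

```javascript
// Report how much storage one string needs in each encoding form.
function encodingSizes(str) {
  const utf8Bytes = new TextEncoder().encode(str).length; // UTF-8 bytes
  const utf16Bytes = str.length * 2;                      // 2 bytes per UTF-16 code unit
  const utf32Bytes = [...str].length * 4;                 // 4 bytes per code point
  return { utf8Bytes, utf16Bytes, utf32Bytes };
}

console.log(encodingSizes('A'));      // { utf8Bytes: 1, utf16Bytes: 2, utf32Bytes: 4 }
console.log(encodingSizes('\u00e9')); // é:  { utf8Bytes: 2, utf16Bytes: 2, utf32Bytes: 4 }
console.log(encodingSizes('中'));     // { utf8Bytes: 3, utf16Bytes: 2, utf32Bytes: 4 }
console.log(encodingSizes('😀'));     // { utf8Bytes: 4, utf16Bytes: 4, utf32Bytes: 4 }
```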
Unicode Normalization
The same visual character can have multiple Unicode representations:
```javascript
// Composed form (single code point)
const composed = '\u00e9';    // é as U+00E9

// Decomposed form (base letter + combining acute accent)
const decomposed = 'e\u0301'; // é as U+0065 U+0301

// They look the same but aren't equal
composed === decomposed; // false
composed.length;         // 1
decomposed.length;       // 2

// Normalize for comparison
composed.normalize('NFC') === decomposed.normalize('NFC'); // true

// Normalization forms
decomposed.normalize('NFC');  // Canonical Composition (most common)
decomposed.normalize('NFD');  // Canonical Decomposition
decomposed.normalize('NFKC'); // Compatibility Composition
decomposed.normalize('NFKD'); // Compatibility Decomposition
```
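The difference between the canonical and compatibility forms is easiest to see with a ligature. This is a small sketch; the helper names `sameText` and `sameTextLoose` are illustrative assumptions, not part of any library.

```javascript
// Canonical comparison: treats composed and decomposed spellings as equal.
function sameText(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

// Compatibility comparison: additionally folds presentation variants
// such as ligatures, which may be too aggressive for some uses.
function sameTextLoose(a, b) {
  return a.normalize('NFKC') === b.normalize('NFKC');
}

console.log(sameText('\u00e9', 'e\u0301'));  // true  (é vs. e + combining accent)
console.log(sameText('\ufb01', 'fi'));       // false (the "fi" ligature is kept by NFC)
console.log(sameTextLoose('\ufb01', 'fi'));  // true  (NFKC expands the ligature)
```

Normalizing once at the input boundary, for example before storing or indexing, tends to be simpler than normalizing at every comparison.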
CLDR Data Structure
CLDR data is organized hierarchically by locale:
```
common/
  main/
    en/                       # English
      numbers.json            # Number formatting
      currencies.json         # Currency data
      dateFields.json         # Date field names
      ca-gregorian.json       # Calendar data
    de/                       # German
      numbers.json
      currencies.json
      ...
    ar/                       # Arabic
      numbers.json
      currencies.json
      ...
  supplemental/
    plurals.json              # Plural rules
    ordinals.json             # Ordinal rules
    likelySubtags.json        # Locale resolution
    numberingSystems.json     # Numbering systems
```
Example: CLDR Number Data
```json
{
  "main": {
    "de": {
      "numbers": {
        "symbols-numberSystem-latn": {
          "decimal": ",",
          "group": ".",
          "percentSign": "%",
          "plusSign": "+",
          "minusSign": "-"
        },
        "decimalFormats-numberSystem-latn": {
          "standard": "#,##0.###"
        },
        "currencyFormats-numberSystem-latn": {
          "standard": "#,##0.00 ¤"
        }
      }
    }
  }
}
```
This data tells us that in German:
- Decimal separator: `,` (comma)
- Thousands separator: `.` (period)
- Currency format: `1.234,56 €` (amount before symbol)
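To make the role of the symbols concrete, here is a deliberately simplified sketch that applies the German `decimal` and `group` symbols to a number. `formatWithSymbols` is a hypothetical helper; it hard-codes groups of three digits and a fixed number of fraction digits instead of parsing CLDR patterns such as `#,##0.###`, so it is an illustration only and not a replacement for Intl.NumberFormat.

```javascript
// Symbols copied from the "symbols-numberSystem-latn" entry for German.
const deSymbols = { decimal: ',', group: '.' };

// Toy formatter: groups the integer part in threes and swaps in the
// locale's separators. Real CLDR patterns carry more rules than this.
function formatWithSymbols(value, symbols, fractionDigits = 2) {
  const [intPart, fracPart] = Math.abs(value).toFixed(fractionDigits).split('.');
  const grouped = intPart.replace(/\B(?=(\d{3})+(?!\d))/g, symbols.group);
  const sign = value < 0 ? '-' : '';
  return fracPart === undefined
    ? `${sign}${grouped}`
    : `${sign}${grouped}${symbols.decimal}${fracPart}`;
}

console.log(formatWithSymbols(1234567.89, deSymbols)); // "1.234.567,89"
console.log(formatWithSymbols(-1234.5, deSymbols));    // "-1.234,50"
```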
BCP 47 Language Tags
BCP 47 defines the standard format for locale identifiers:
language[-script][-region][-variant][-extension]
Components
| Component | Examples | Description |
|---|---|---|
| Language | en, de, zh | ISO 639 language code |
| Script | Hans, Hant, Cyrl | ISO 15924 script code |
| Region | US, GB, CN | ISO 3166 country code |
| Variant | valencia, posix | Registered variant |
| Extension | u-ca-buddhist | Unicode extension |
Examples
```javascript
// Simple language
'en'         // English

// Language + region
'en-US'      // English (United States)
'en-GB'      // English (United Kingdom)
'pt-BR'      // Portuguese (Brazil)
'pt-PT'      // Portuguese (Portugal)

// Language + script + region
'zh-Hans-CN' // Chinese, Simplified script, China
'zh-Hant-TW' // Chinese, Traditional script, Taiwan
'sr-Cyrl-RS' // Serbian, Cyrillic script, Serbia
'sr-Latn-RS' // Serbian, Latin script, Serbia

// With Unicode extension
'de-DE-u-ca-gregory'          // German with Gregorian calendar
'ar-SA-u-nu-arab'             // Arabic with Arabic-Indic numerals
'th-TH-u-ca-buddhist-nu-thai' // Thai with Buddhist calendar and Thai numerals
```
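Tags that arrive from user input or configuration are often miscased or malformed. One way to check them, sketched below, is the standard `Intl.getCanonicalLocales`, which fixes casing and throws a RangeError for structurally invalid tags. The wrapper name `canonicalizeTag` is an illustrative assumption.

```javascript
// Return the canonical form of a BCP 47 tag, or null if it is malformed.
function canonicalizeTag(tag) {
  try {
    return Intl.getCanonicalLocales(tag)[0];
  } catch (err) {
    if (err instanceof RangeError) return null; // structurally invalid tag
    throw err;
  }
}

console.log(canonicalizeTag('EN-us'));      // "en-US"
console.log(canonicalizeTag('zh-hans-cn')); // "zh-Hans-CN"
console.log(canonicalizeTag('en_US'));      // null (underscore is not a valid separator)
```

Note that this checks structure only. A well-formed tag is not necessarily one the runtime has data for.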
Parsing Language Tags
```javascript
const locale = new Intl.Locale('zh-Hans-CN-u-ca-chinese');

locale.language; // "zh"
locale.script;   // "Hans"
locale.region;   // "CN"
locale.calendar; // "chinese"

// Full tag versus the tag without extensions
locale.toString(); // "zh-Hans-CN-u-ca-chinese"
locale.baseName;   // "zh-Hans-CN"
```
Likely Subtags
CLDR includes "likely subtags" data for resolving minimal locale identifiers:
```javascript
// Input: "en" (just language)
// CLDR adds likely script and region: "en-Latn-US"

// Input: "zh"
// CLDR adds: "zh-Hans-CN" (Simplified Chinese in China)

// Input: "zh-TW"
// CLDR adds: "zh-Hant-TW" (Traditional Chinese in Taiwan)

const locale = new Intl.Locale('zh-TW');

// maximize() returns a new Intl.Locale with likely subtags added
locale.maximize().toString(); // "zh-Hant-TW"

// minimize() returns a new Intl.Locale with redundant subtags removed
locale.minimize().toString(); // "zh-TW"
```
Practical Use Case
```javascript
function getCanonicalLocale(userInput) {
  const locale = new Intl.Locale(userInput);

  // Maximize to get full form
  const maximized = locale.maximize();

  return maximized.toString();
}

getCanonicalLocale('en');    // "en-Latn-US"
getCanonicalLocale('zh');    // "zh-Hans-CN"
getCanonicalLocale('zh-TW'); // "zh-Hant-TW"
getCanonicalLocale('ar');    // "ar-Arab-EG"
```
Locale Negotiation
Matching user preferences with available locales:
```javascript
function negotiateLocale(requested, available) {
  // Try exact match
  if (available.includes(requested)) {
    return requested;
  }

  // Try without region: compare the language subtag exactly,
  // so that "en" does not match an unrelated tag such as "enm"
  const lang = requested.split('-')[0];
  const langMatch = available.find(loc => loc.split('-')[0] === lang);
  if (langMatch) {
    return langMatch;
  }

  // Fall back to default
  return available[0];
}

// Examples
const available = ['en-US', 'es-ES', 'fr-FR', 'de-DE'];

negotiateLocale('en-GB', available); // "en-US" (same language)
negotiateLocale('es-MX', available); // "es-ES" (same language)
negotiateLocale('pt-BR', available); // "en-US" (fallback)
```
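Matching on the language subtag alone can pick the wrong script: a request for zh-TW would be served zh-Hans if that happens to come first in the list. A possible refinement, sketched here under the assumption that `Intl.Locale` is available, is to maximize both sides with likely subtags and compare language and script. The function name `negotiateWithLikelySubtags` is hypothetical.

```javascript
// Negotiate using likely subtags so that script differences are respected.
function negotiateWithLikelySubtags(requested, available, fallback = available[0]) {
  const want = new Intl.Locale(requested).maximize();
  const candidates = available.map(tag => ({
    tag,
    loc: new Intl.Locale(tag).maximize(),
  }));

  // 1. Same language, script, and region after maximizing
  const exact = candidates.find(c => c.loc.baseName === want.baseName);
  if (exact) return exact.tag;

  // 2. Same language and script, any region
  const sameScript = candidates.find(
    c => c.loc.language === want.language && c.loc.script === want.script
  );
  return sameScript ? sameScript.tag : fallback;
}

const offered = ['zh-Hans', 'zh-Hant', 'en-US'];
console.log(negotiateWithLikelySubtags('zh-TW', offered)); // "zh-Hant"
console.log(negotiateWithLikelySubtags('zh-HK', offered)); // "zh-Hant"
console.log(negotiateWithLikelySubtags('en-GB', offered)); // "en-US"
```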
Browser API
```javascript
// Get user's preferred locales
navigator.languages;
// e.g. ["en-US", "en", "es"]

// Negotiate best match
const supported = ['en', 'es', 'fr', 'de'];
const preferred = navigator.languages;

// Match on whole subtags so that "en" does not match e.g. "enm"
const match = preferred.find(locale =>
  supported.some(sup => locale === sup || locale.startsWith(sup + '-'))
);

console.log(match); // e.g. "en-US" (the user's tag, not the supported one)
```
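The runtime also exposes its own matcher: every Intl constructor has a static `supportedLocalesOf`, and `resolvedOptions().locale` reports which locale was actually chosen. The sketch below wraps both; `runtimeSupport` is a hypothetical helper, and the results depend on the locale data shipped with the runtime, so no fixed output is shown.

```javascript
// Ask the runtime which of the requested locales it has number data for,
// and which one it would actually use for formatting.
function runtimeSupport(requested) {
  return {
    supported: Intl.NumberFormat.supportedLocalesOf(requested),
    resolved: new Intl.NumberFormat(requested).resolvedOptions().locale,
  };
}

// In a browser, navigator.languages could be passed in directly.
console.log(runtimeSupport(['de-CH', 'en-GB']));
```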
Intl.DisplayNames for Negotiation
```javascript
const displayNames = new Intl.DisplayNames(['en'], { type: 'language' });

displayNames.of('en-US');   // "American English"
displayNames.of('es-ES');   // "European Spanish"
displayNames.of('zh-Hans'); // "Simplified Chinese"
displayNames.of('zh-CN');   // "Chinese (China)"
```
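A common use of these names is a language picker in which each entry is labeled in its own language. The following is a minimal sketch; `languageMenu` is a hypothetical helper, and the labels shown assume the runtime ships full locale data.

```javascript
// Build picker entries whose labels are written in the language they name.
function languageMenu(tags) {
  return tags.map(tag => ({
    tag,
    label: new Intl.DisplayNames([tag], { type: 'language' }).of(tag),
  }));
}

console.log(languageMenu(['de', 'ja', 'es']));
// [
//   { tag: 'de', label: 'Deutsch' },
//   { tag: 'ja', label: '日本語' },
//   { tag: 'es', label: 'español' }
// ]
```

Some languages write their own name in lowercase, so any capitalization for display is best handled in the UI layer.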
CLDR JSON Usage
Modern CLDR is available as JSON for easy consumption:
```shell
npm install cldr-data
```
Loading CLDR Data
```javascript
const cldrData = require('cldr-data');

// Supplemental data covers all locales in one file
const plurals = cldrData('supplemental/plurals');

// Locale-specific data lives under main/<locale>
const deNumbers = cldrData('main/de/numbers');
const deCurrencies = cldrData('main/de/currencies');

// Pick out the German cardinal plural rules
console.log(plurals.supplemental['plurals-type-cardinal'].de);
// {
//   "pluralRule-count-one": "i = 1 and v = 0 @integer 1",
//   "pluralRule-count-other": "@integer 0, 2~16, 100, ..."
// }
```
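The plural rules in that file are what `Intl.PluralRules` evaluates at runtime, so they can be observed without loading the JSON. Below is a small sketch; `pluralCategory`, `formatCount`, and the German message table are illustrative assumptions.

```javascript
// Look up the CLDR plural category for a number in a given locale.
function pluralCategory(locale, n) {
  return new Intl.PluralRules(locale).select(n);
}

console.log(pluralCategory('de', 1));   // "one"   (i = 1 and v = 0)
console.log(pluralCategory('de', 1.5)); // "other" (has visible fraction digits)
console.log(pluralCategory('ar', 2));   // "two"
console.log(pluralCategory('ar', 3));   // "few"

// Hypothetical message table keyed by plural category
const fileMessages = { one: '# Datei', other: '# Dateien' };

// Choose the message form for n, falling back to "other"
function formatCount(locale, n, forms) {
  const form = forms[pluralCategory(locale, n)] ?? forms.other;
  return form.replace('#', String(n));
}

console.log(formatCount('de', 1, fileMessages)); // "1 Datei"
console.log(formatCount('de', 5, fileMessages)); // "5 Dateien"
```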
Using CLDR with Globalize
```javascript
import Globalize from 'globalize';
import cldrData from 'cldr-data';

// Load CLDR data (number and currency formatting also need
// numberingSystems and currencyData from the supplemental set)
Globalize.load(
  cldrData('supplemental/likelySubtags'),
  cldrData('supplemental/numberingSystems'),
  cldrData('supplemental/currencyData'),
  cldrData('supplemental/plurals'),
  cldrData('main/de/numbers'),
  cldrData('main/de/currencies')
);

// Use with German locale
const de = Globalize('de');

de.formatNumber(1234567.89);
// "1.234.567,89"

de.formatCurrency(1234.56, 'EUR');
// "1.234,56 €"
```
ICU Library Data Sourcing
ICU (International Components for Unicode) uses CLDR as its data source:
ICU4J (Java)
```java
import com.ibm.icu.text.NumberFormat;
import com.ibm.icu.util.ULocale;

public class GermanNumbers {
    public static void main(String[] args) {
        // Uses CLDR data internally
        NumberFormat nf = NumberFormat.getInstance(new ULocale("de_DE"));
        System.out.println(nf.format(1234567.89));
        // "1.234.567,89"
    }
}
```
ICU4C (C/C++)
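```cpp
#include <unicode/numfmt.h>
#include <memory>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    // createInstance returns an owning pointer, so wrap it to avoid a leak
    std::unique_ptr<icu::NumberFormat> nf(
        icu::NumberFormat::createInstance(icu::Locale("de_DE"), status));
    if (U_FAILURE(status)) return 1;

    icu::UnicodeString result;
    nf->format(1234567.89, result);
    // result holds "1.234.567,89"
    return 0;
}
```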
Node.js Intl (V8)
Node.js uses ICU, which uses CLDR:
```javascript
// V8's Intl implementation uses ICU/CLDR
new Intl.NumberFormat('de-DE').format(1234567.89);
// "1.234.567,89"

// Full ICU included since Node v13
process.versions.icu;
// "72.1" (or newer)
```
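To find out what a given Node.js process actually ships, `process.versions` exposes the ICU, CLDR, and Unicode versions, and formatting a non-English month name is a common probe for full locale data. The sketch below assumes Node.js; `intlRuntimeInfo` is a hypothetical helper, and in other environments the version fields are simply reported as null.

```javascript
// Summarize the internationalization data available in this runtime.
function intlRuntimeInfo() {
  const versions = typeof process !== 'undefined' ? process.versions : {};

  // With full locale data, January formats as "enero" in Spanish;
  // a runtime limited to English data falls back to "January".
  const january = new Intl.DateTimeFormat('es', { month: 'long', timeZone: 'UTC' })
    .format(new Date(Date.UTC(2026, 0, 15)));

  return {
    icu: versions.icu ?? null,
    cldr: versions.cldr ?? null,
    unicode: versions.unicode ?? null,
    hasSpanishData: january === 'enero',
  };
}

console.log(intlRuntimeInfo());
```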
How Frameworks Use CLDR
React (via FormatJS)
```jsx
import { IntlProvider, FormattedNumber } from 'react-intl';

// FormatJS uses CLDR data via the Intl API
export function Price() {
  return (
    <IntlProvider locale="de">
      <FormattedNumber value={1234567.89} />
    </IntlProvider>
  );
}
// Renders: "1.234.567,89"
```
Angular (built-in i18n)
```typescript
import { registerLocaleData } from '@angular/common';
import localeDe from '@angular/common/locales/de';

// Angular's locale data is extracted from CLDR
registerLocaleData(localeDe);
```

```html
<!-- Use in templates -->
{{ 1234567.89 | number:'1.2-2':'de' }}
<!-- Renders: "1.234.567,89" -->
```
Vue (via vue-i18n)
```javascript
import { createI18n } from 'vue-i18n';

const i18n = createI18n({
  locale: 'de',
  numberFormats: {
    de: {
      currency: {
        style: 'currency',
        currency: 'EUR'
      }
    }
  }
});
```

```html
<!-- Uses Intl API (which uses CLDR) -->
{{ $n(1234.56, 'currency') }}
<!-- Renders: "1.234,56 €" -->
```
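Because FormatJS and vue-i18n delegate to the Intl API, their output can be reproduced without a framework, which is handy in unit tests. One detail worth knowing, sketched below, is that the separator before the currency symbol is typically a non-breaking space rather than a plain space, so exact string comparisons can fail unexpectedly. The variable names are illustrative.

```javascript
const eur = new Intl.NumberFormat('de-DE', { style: 'currency', currency: 'EUR' });

const formatted = eur.format(1234.56);

// Replace any Unicode whitespace (including non-breaking spaces)
// with a plain space before comparing against a literal.
const comparable = formatted.replace(/\s/g, ' ');
console.log(comparable); // "1.234,56 €"

// formatToParts shows how the pattern was assembled
const partTypes = eur.formatToParts(1234.56).map(part => part.type);
console.log(partTypes);
// ["integer", "group", "integer", "decimal", "fraction", "literal", "currency"]
```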
CLDR Data Updates
CLDR typically ships two major releases a year, one in spring (around April) and one in fall (around October).
Updating Dependencies
```shell
# Update CLDR data package
npm update cldr-data

# Update ICU data (Node.js)
# Official Node.js builds have bundled full ICU since v13,
# so installing a newer Node release also updates ICU/CLDR
nvm install node --latest-npm

# Update browser (automatic via browser updates)
# Chrome, Firefox, Safari update CLDR with browser releases
```
Version Compatibility
| CLDR Version | Release Date | Used By (approximate) |
|---|---|---|
| CLDR 44 | October 2023 | Node 20+, Chrome 120+ |
| CLDR 43 | April 2023 | Node 18+, Chrome 115+ |
| CLDR 42 | October 2022 | Node 16+, Chrome 110+ |
Contributing to CLDR
CLDR accepts contributions for locale data improvements:
Types of Contributions
- Translations: Language and region names
- Number/Date Formats: Locale-specific formatting rules
- Plural Rules: Grammatical plural categories
- Collation: Sorting rules for languages
- Time Zones: Timezone names and translations
Contribution Process
- Survey Tool: Use CLDR Survey Tool during data collection periods
- Tickets: File tickets at unicode.org/cldr for data errors
- Voting: Participate in locale data voting (requires Unicode membership for some locales)
https://cldr.unicode.org/
https://st.unicode.org/cldr-apps/
Example: Reporting Incorrect Data
```markdown
**Issue**: German currency format shows symbol after amount, but should be before

**Current**: 1.234,56 €
**Expected**: € 1.234,56

**Locale**: de-DE
**Data File**: main/de/numbers.json
**Field**: currencyFormats-numberSystem-latn
```
CLDR Tools and Resources
Official Resources
- CLDR Data: https://github.com/unicode-org/cldr-json
- Documentation: https://cldr.unicode.org/
- Specification: http://unicode.org/reports/tr35/
NPM Packages
```shell
npm install cldr-data             # Raw CLDR JSON data
npm install cldr-core             # Core CLDR data
npm install cldr-dates-full       # Date/time data
npm install cldr-numbers-full     # Number formatting data
npm install cldr-localenames-full # Locale display names
```
CLI Tools
```shell
# Install CLDR command-line tools
npm install -g cldr-data-downloader

# Download specific locale data
# (options differ between versions; confirm them against the package README)
cldr-data-downloader -l de,fr,es -d cldr-data
```
FAQ
Q: Do I need to bundle CLDR data with my app?
A: No. Modern browsers and Node.js include CLDR data via the Intl API. Just use Intl.DateTimeFormat, Intl.NumberFormat, etc.
Q: How do I add a locale not in CLDR?
A: You can't easily. CLDR requires linguistic expertise and community consensus. For minor dialects, use the closest standard locale as a base.
Q: Why does my app show different formatting than expected?
A: Check your browser/Node.js version. Older versions have outdated CLDR data. Update to get the latest locale data.
Q: Can I customize CLDR data for my app?
A: Yes, but it's complex. Libraries like Globalize allow custom data, but you lose automatic updates. Better to contribute corrections to CLDR.
Q: How do I know which CLDR version my platform uses?
A: Check Intl polyfill versions, Node.js ICU version (process.versions.icu), or browser release notes.
Q: Should I use CLDR directly or via Intl API?
A: Use the Intl API. It's a standard, well-supported interface to CLDR data. Direct CLDR usage is complex and unnecessary for most apps.
Q: How does IntlPull use CLDR?
A: IntlPull uses CLDR for locale validation, plural rule enforcement, format preview generation, and ensuring translations match locale-specific conventions.
