
Unicode & CLDR: The Developer's Deep Dive into Internationalization Data

Deep dive into Unicode CLDR for developers. Learn locale data structure, BCP 47 language tags, locale negotiation, and ICU library integration.

IntlPull Team
Feb 12, 2026

Understanding Unicode and CLDR

The Unicode Consortium and its Common Locale Data Repository (CLDR) form the foundation of modern internationalization, providing standardized data for displaying, formatting, and sorting text across the world's languages and regions. Unicode defines the character set: a universal mapping between characters and numeric code points that lets computers represent and process text from all writing systems, from Latin to Arabic to Chinese. CLDR builds on Unicode by providing locale-specific data: how to format dates, numbers, and currencies; plural rules for grammatical correctness; collation rules for sorting; translations of language, region, and currency names; and localized time zone names.

Virtually every modern operating system, browser, and programming language relies on CLDR data, usually through ICU. When JavaScript's Intl.DateTimeFormat formats a date as "12. Februar 2026" in German, it is using CLDR data. Understanding CLDR lets developers use battle-tested locale data instead of reinventing solutions, keeps formatting consistent across platforms that share the same authoritative source, and makes it possible to contribute corrections back when locale data is wrong. This deep dive explores CLDR structure, locale identifiers, data usage patterns, and how frameworks integrate CLDR into their internationalization APIs.
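
As a quick illustration of the locale data described above, the sketch below formats one fixed date in several locales and sorts the same words under two collations. It assumes a runtime with full ICU data (the default in current browsers and Node.js); the helper name `formatLongDate` is made up for this example.

```javascript
// A fixed instant, formatted in UTC so the output does not depend on the host time zone.
const date = new Date(Date.UTC(2026, 1, 12)); // 12 February 2026

// Hypothetical helper: long date style for a given locale.
function formatLongDate(locale) {
  return new Intl.DateTimeFormat(locale, { dateStyle: 'long', timeZone: 'UTC' }).format(date);
}

console.log(formatLongDate('de-DE')); // "12. Februar 2026"
console.log(formatLongDate('en-US')); // "February 12, 2026"
console.log(formatLongDate('fr-FR')); // "12 février 2026"

// Collation differs too: German sorts "ä" next to "a", Swedish sorts it after "z".
const words = ['z', 'ä', 'a'];
const sortedDe = [...words].sort(new Intl.Collator('de').compare); // a, ä, z
const sortedSv = [...words].sort(new Intl.Collator('sv').compare); // a, z, ä
console.log(sortedDe, sortedSv);
```

Nothing here ships its own locale tables; both the date patterns and the sort order come from the CLDR data bundled with the runtime.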

Unicode Character Encoding

Unicode assigns a unique code point to every character across all writing systems:

JavaScript
// Unicode code points
'A'.codePointAt(0).toString(16);  // "41"
'é'.codePointAt(0).toString(16);  // "e9"
'中'.codePointAt(0).toString(16);  // "4e2d"
'😀'.codePointAt(0).toString(16); // "1f600"

// String.fromCodePoint creates characters from code points
String.fromCodePoint(0x41);      // "A"
String.fromCodePoint(0xe9);      // "é"
String.fromCodePoint(0x4e2d);    // "中"
String.fromCodePoint(0x1f600);   // "😀"

UTF-8, UTF-16, UTF-32

Unicode can be encoded in different formats:

| Encoding | Characteristics | Use Case |
| --- | --- | --- |
| UTF-8 | Variable width (1-4 bytes), ASCII-compatible | Web, files, APIs (most common) |
| UTF-16 | Variable width (2 or 4 bytes) | JavaScript, Java, Windows |
| UTF-32 | Fixed width (4 bytes) | Internal processing |
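
To make the width differences concrete, here is a small sketch that measures the same characters in UTF-8 bytes and UTF-16 code units. `TextEncoder` always produces UTF-8; the helper name `utf8ByteLength` is invented for this example.

```javascript
const encoder = new TextEncoder(); // TextEncoder only ever encodes to UTF-8

// Hypothetical helper: number of bytes a string occupies in UTF-8.
function utf8ByteLength(text) {
  return encoder.encode(text).length;
}

const samples = ['A', '\u00e9', '\u4e2d', '\u{1F600}']; // A, é, 中, 😀

for (const ch of samples) {
  // UTF-8 bytes: 1, 2, 3, 4.  UTF-16 code units: 1, 1, 1, 2.
  console.log(ch, utf8ByteLength(ch), ch.length);
}
```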
JavaScript
// JavaScript strings are UTF-16
const text = "Hello 你好 😀";

// Length counts UTF-16 code units, not characters
text.length;  // 11 (not 10!)

// Emoji takes 2 UTF-16 code units (surrogate pair)
'😀'.length;  // 2

// Counting code points instead of code units
[...text].length;  // 10
Array.from(text).length;  // 10
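
Counting code points still differs from what a reader perceives as one character when a glyph is built from several code points. A minimal sketch using `Intl.Segmenter`, assuming a runtime that supports it (recent browsers, Node.js 16+); `countGraphemes` is a made-up helper name.

```javascript
// Hypothetical helper: count user-perceived characters (grapheme clusters).
function countGraphemes(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  return [...segmenter.segment(text)].length;
}

const flag = '\u{1F1E9}\u{1F1EA}'; // German flag: two regional-indicator code points
const accented = 'e\u0301';        // "e" followed by a combining acute accent

console.log(flag.length, [...flag].length, countGraphemes(flag));             // 4 2 1
console.log(accented.length, [...accented].length, countGraphemes(accented)); // 2 2 1
```

Which count to use depends on the job: code units for string indexing, bytes for storage limits, graphemes for anything shown to a user, such as a character counter.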

Unicode Normalization

The same visual character can have multiple Unicode representations:

JavaScript
// Composed form (single code point)
const composed = '\u00e9';  // "é" as U+00E9

// Decomposed form (base letter + combining acute accent)
const decomposed = 'e\u0301';  // "é" as U+0065 U+0301

// They look the same but aren't equal
composed === decomposed;  // false
composed.length;          // 1
decomposed.length;        // 2

// Normalize for comparison
composed.normalize('NFC') === decomposed.normalize('NFC');  // true

// Normalization forms
const text = 'caf\u00e9';
text.normalize('NFC');   // Canonical Composition (most common)
text.normalize('NFD');   // Canonical Decomposition
text.normalize('NFKC');  // Compatibility Composition
text.normalize('NFKD');  // Compatibility Decomposition
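
In practice the choice between canonical and compatibility forms matters most when comparing user input. The sketch below is one possible approach, not a prescribed one: NFC for equality checks, NFKC when lookalike compatibility characters should also be folded. The helper name `sameText` is hypothetical.

```javascript
// Hypothetical helper: compare two strings regardless of composed/decomposed form.
function sameText(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

console.log(sameText('caf\u00e9', 'cafe\u0301')); // true

// NFKC additionally folds compatibility characters into their plain equivalents.
const ligature = '\uFB01';  // "fi" ligature as a single code point
const fullwidthA = '\uFF21'; // fullwidth "A"

console.log(ligature.normalize('NFC') === 'fi');   // false: NFC leaves the ligature alone
console.log(ligature.normalize('NFKC') === 'fi');  // true
console.log(fullwidthA.normalize('NFKC') === 'A'); // true
```

NFKC discards distinctions that are sometimes meaningful, so it tends to suit search keys and identifiers better than stored display text.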

CLDR Data Structure

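CLDR data is organized hierarchically by locale, with locale-independent data kept under supplemental. The layout below is a simplified view of the JSON distribution; the upstream XML repository keeps one file per locale instead, such as common/main/de.xml: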

common/
  main/
    en/                   # English
      numbers.json        # Number formatting
      currencies.json     # Currency data
      dateFields.json     # Date field names
      ca-gregorian.json   # Calendar data
    de/                   # German
      numbers.json
      currencies.json
      ...
    ar/                   # Arabic
      numbers.json
      currencies.json
      ...
  supplemental/
    plurals.json          # Plural rules
    ordinals.json         # Ordinal rules
    likelySubtags.json    # Locale resolution
    numberingSystems.json # Numbering systems

Example: CLDR Number Data

JSON
{
  "main": {
    "de": {
      "numbers": {
        "symbols-numberSystem-latn": {
          "decimal": ",",
          "group": ".",
          "percentSign": "%",
          "plusSign": "+",
          "minusSign": "-"
        },
        "decimalFormats-numberSystem-latn": {
          "standard": "#,##0.###"
        },
        "currencyFormats-numberSystem-latn": {
          "standard": "#,##0.00 ¤"
        }
      }
    }
  }
}

This data tells us that in German:

  • Decimal separator: , (comma)
  • Thousands separator: . (period)
  • Currency format: 1.234,56 € (amount before symbol)
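
The same facts can be read back from a runtime without opening the JSON. This is a rough sketch that inspects `Intl.NumberFormat.prototype.formatToParts`; the helper names `numberSymbols` and `currencyPosition` are invented here, and the results assume full ICU data.

```javascript
// Hypothetical helper: recover the grouping and decimal symbols a locale uses.
function numberSymbols(locale) {
  const parts = new Intl.NumberFormat(locale).formatToParts(1234567.89);
  const valueOf = type => parts.find(part => part.type === type)?.value;
  return { group: valueOf('group'), decimal: valueOf('decimal') };
}

// Hypothetical helper: does the currency symbol come before or after the amount?
function currencyPosition(locale, currency) {
  const types = new Intl.NumberFormat(locale, { style: 'currency', currency })
    .formatToParts(1234.56)
    .map(part => part.type);
  return types.indexOf('currency') < types.indexOf('integer') ? 'before' : 'after';
}

console.log(numberSymbols('de-DE'));           // { group: '.', decimal: ',' }
console.log(numberSymbols('en-US'));           // { group: ',', decimal: '.' }
console.log(currencyPosition('de-DE', 'EUR')); // "after"
console.log(currencyPosition('en-US', 'USD')); // "before"
```

Working with parts rather than the formatted string avoids surprises such as the non-breaking space CLDR places between the amount and the symbol.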

BCP 47 Language Tags

BCP 47 defines the standard format for locale identifiers:

language[-script][-region][-variant][-extension]

Components

| Component | Examples | Description |
| --- | --- | --- |
| Language | en, de, zh | ISO 639 language code |
| Script | Hans, Hant, Cyrl | ISO 15924 script code |
| Region | US, GB, CN | ISO 3166 country code |
| Variant | valencia, posix | Registered variant |
| Extension | u-ca-buddhist | Unicode extension |

Examples

JavaScript
// Simple language
'en'           // English

// Language + region
'en-US'        // English (United States)
'en-GB'        // English (United Kingdom)
'pt-BR'        // Portuguese (Brazil)
'pt-PT'        // Portuguese (Portugal)

// Language + script + region
'zh-Hans-CN'   // Chinese, Simplified script, China
'zh-Hant-TW'   // Chinese, Traditional script, Taiwan
'sr-Cyrl-RS'   // Serbian, Cyrillic script, Serbia
'sr-Latn-RS'   // Serbian, Latin script, Serbia

// With Unicode extension
'de-DE-u-ca-gregory'         // German with Gregorian calendar
'ar-SA-u-nu-arab'            // Arabic with Arabic-Indic numerals
'th-TH-u-ca-buddhist-nu-thai' // Thai with Buddhist calendar and Thai numerals
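
To show what the Unicode extension subtags actually change, the following sketch formats the same values with different `nu` (numbering system) and `ca` (calendar) keys. It assumes full ICU data; `yearIn` is a hypothetical helper.

```javascript
// Numbering system: same locale, different digits.
const arabicIndic = new Intl.NumberFormat('ar-SA-u-nu-arab').format(123);   // "١٢٣"
const westernDigits = new Intl.NumberFormat('ar-SA-u-nu-latn').format(123); // "123"

// Calendar: same instant, different year numbering (Buddhist era = Gregorian + 543).
const instant = new Date(Date.UTC(2026, 1, 12));

// Hypothetical helper: the year field as a given locale tag renders it.
function yearIn(localeTag) {
  const parts = new Intl.DateTimeFormat(localeTag, { year: 'numeric', timeZone: 'UTC' })
    .formatToParts(instant);
  return parts.find(part => part.type === 'year').value;
}

console.log(arabicIndic, westernDigits);
console.log(yearIn('en-u-ca-gregory'));  // "2026"
console.log(yearIn('en-u-ca-buddhist')); // "2569"
```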

Parsing Language Tags

JavaScript
const locale = new Intl.Locale('zh-Hans-CN-u-ca-chinese');

locale.language;        // "zh"
locale.script;          // "Hans"
locale.region;          // "CN"
locale.calendar;        // "chinese"

// Full tag versus base name (the tag without extensions)
locale.toString();      // "zh-Hans-CN-u-ca-chinese"
locale.baseName;        // "zh-Hans-CN"
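
Tags from user input or configuration are often miscased or malformed. A minimal sketch of validating and building tags with the standard `Intl.getCanonicalLocales` function and the `Intl.Locale` options argument; `isValidTag` is a made-up helper name.

```javascript
// Canonicalization fixes casing and ordering of well-formed tags.
console.log(Intl.getCanonicalLocales('EN-us'));      // ["en-US"]
console.log(Intl.getCanonicalLocales('zh-hans-cn')); // ["zh-Hans-CN"]

// Hypothetical helper: structurally invalid tags make the call throw a RangeError.
function isValidTag(tag) {
  try {
    Intl.getCanonicalLocales(tag);
    return true;
  } catch (err) {
    if (err instanceof RangeError) return false;
    throw err;
  }
}

console.log(isValidTag('de-DE')); // true
console.log(isValidTag('en_US')); // false: underscores are not valid BCP 47 separators

// Building a tag from parts instead of concatenating strings.
const thai = new Intl.Locale('th', {
  region: 'TH',
  calendar: 'buddhist',
  numberingSystem: 'thai',
});
console.log(thai.toString()); // "th-TH-u-ca-buddhist-nu-thai"
```

Note that a structurally valid tag is not necessarily one the runtime has data for; validation and support are separate questions.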

Likely Subtags

CLDR includes "likely subtags" data for resolving minimal locale identifiers:

JavaScript
// Input: "en" (just language)
// CLDR adds likely script and region: "en-Latn-US"

// Input: "zh"
// CLDR adds: "zh-Hans-CN" (Simplified Chinese in China)

// Input: "zh-TW"
// CLDR adds: "zh-Hant-TW" (Traditional Chinese in Taiwan)

// maximize() and minimize() return new Locale objects; the original is unchanged
const locale = new Intl.Locale('zh-TW');
locale.maximize().toString();  // "zh-Hant-TW" (likely subtags added)
locale.minimize().toString();  // "zh-TW" (redundant subtags removed)
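
One common use of likely subtags is working out which script a short tag implies, for example to pick between Simplified and Traditional Chinese resources. A small sketch, with `likelyScript` as a hypothetical helper:

```javascript
// Hypothetical helper: the script CLDR considers most likely for a tag.
function likelyScript(tag) {
  return new Intl.Locale(tag).maximize().script;
}

console.log(likelyScript('zh'));    // "Hans"
console.log(likelyScript('zh-TW')); // "Hant"
console.log(likelyScript('zh-HK')); // "Hant"
console.log(likelyScript('sr'));    // "Cyrl"

// Two region-only tags can be compared by script even though neither names one.
console.log(likelyScript('zh-TW') === likelyScript('zh-HK')); // true
```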

Practical Use Case

JavaScript
function getCanonicalLocale(userInput) {
  const locale = new Intl.Locale(userInput);

  // Maximize to get full form
  const maximized = locale.maximize();

  return maximized.toString();
}

getCanonicalLocale('en');      // "en-Latn-US"
getCanonicalLocale('zh');      // "zh-Hans-CN"
getCanonicalLocale('zh-TW');   // "zh-Hant-TW"
getCanonicalLocale('ar');      // "ar-Arab-EG"

Locale Negotiation

Matching user preferences with available locales:

JavaScript
function negotiateLocale(requested, available) {
  // Try exact match
  if (available.includes(requested)) {
    return requested;
  }

  // Try matching on the language subtag alone.
  // Compare whole subtags: a plain startsWith('fi') would also match 'fil-PH'.
  const lang = requested.split('-')[0];
  const langMatch = available.find(loc => loc.split('-')[0] === lang);
  if (langMatch) {
    return langMatch;
  }

  // Fall back to default
  return available[0];
}

// Examples
const available = ['en-US', 'es-ES', 'fr-FR', 'de-DE'];

negotiateLocale('en-GB', available);  // "en-US" (same language)
negotiateLocale('es-MX', available);  // "es-ES" (same language)
negotiateLocale('pt-BR', available);  // "en-US" (fallback)
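
Matching on the language subtag alone can pair a Traditional Chinese reader with Simplified Chinese content. The sketch below is one possible refinement, not a standard algorithm: it uses likely subtags so that language and script must both agree. The function name `negotiateWithScript` and its `fallback` parameter are assumptions made for this example.

```javascript
// Hypothetical refinement: exact match, then same language+script+region after
// maximizing, then same language+script, then an explicit fallback.
function negotiateWithScript(requested, available, fallback) {
  if (available.includes(requested)) return requested;

  const want = new Intl.Locale(requested).maximize();
  const candidates = available.map(tag => ({
    tag,
    full: new Intl.Locale(tag).maximize(),
  }));

  const sameAll = candidates.find(c =>
    c.full.language === want.language &&
    c.full.script === want.script &&
    c.full.region === want.region);
  if (sameAll) return sameAll.tag;

  const sameScript = candidates.find(c =>
    c.full.language === want.language && c.full.script === want.script);
  if (sameScript) return sameScript.tag;

  return fallback;
}

const supportedTags = ['en-US', 'zh-Hans-CN', 'zh-Hant-TW', 'fi-FI'];

console.log(negotiateWithScript('zh-HK', supportedTags, 'en-US'));  // "zh-Hant-TW"
console.log(negotiateWithScript('zh', supportedTags, 'en-US'));     // "zh-Hans-CN"
console.log(negotiateWithScript('en-GB', supportedTags, 'en-US'));  // "en-US"
console.log(negotiateWithScript('fil-PH', supportedTags, 'en-US')); // "en-US" (fallback, not "fi-FI")
```

Production code would usually delegate this to a library that implements the CLDR language matching data, which also knows about related languages.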

Browser API

JavaScript
// Get user's preferred locales
navigator.languages;
// e.g. ["en-US", "en", "es"]

// Negotiate best match
const supported = ['en', 'es', 'fr', 'de'];
const preferred = navigator.languages;

// Reduce each preference to its language subtag and take the first supported one
const match = preferred
  .map(locale => new Intl.Locale(locale).language)
  .find(lang => supported.includes(lang));

console.log(match);  // "en" for the list above
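
On the server there is no `navigator.languages`; the same preference list arrives in the HTTP `Accept-Language` header. A minimal parsing sketch, assuming a simplified reading of the header grammar (quality values default to 1, wildcards are ignored); `parseAcceptLanguage` is a hypothetical helper, not a library function.

```javascript
// Hypothetical helper: turn an Accept-Language header into an ordered list of tags.
function parseAcceptLanguage(header) {
  return header
    .split(',')
    .map((item, index) => {
      const [tag, ...params] = item.trim().split(';');
      const qParam = params.map(p => p.trim()).find(p => p.startsWith('q='));
      const q = qParam === undefined ? 1 : Number(qParam.slice(2));
      return { tag: tag.trim(), q, index };
    })
    // Drop empty entries, the "*" wildcard, and anything with q=0 or a malformed q.
    .filter(entry => entry.tag !== '' && entry.tag !== '*' && Number.isFinite(entry.q) && entry.q > 0)
    // Highest quality first; keep header order among equal qualities.
    .sort((a, b) => b.q - a.q || a.index - b.index)
    .map(entry => entry.tag);
}

console.log(parseAcceptLanguage('fr-CH, fr;q=0.9, en;q=0.8, de;q=0.7, *;q=0.5'));
// ["fr-CH", "fr", "en", "de"]
console.log(parseAcceptLanguage('en;q=0.5, de'));
// ["de", "en"]
```

The resulting array can be fed into the same matching logic used for `navigator.languages`.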

Intl.DisplayNames for Negotiation

JavaScript
const displayNames = new Intl.DisplayNames(['en'], { type: 'language' });

displayNames.of('en-US');    // "American English"
displayNames.of('es-ES');    // "European Spanish"
displayNames.of('zh-Hans');  // "Simplified Chinese"
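
Language pickers usually show each language in its own language so that readers can find theirs without understanding the current UI. A short sketch of that idea, assuming full ICU data; `autonym` is a made-up helper name.

```javascript
// Hypothetical helper: the name of a language, written in that language.
function autonym(tag) {
  return new Intl.DisplayNames([tag], { type: 'language' }).of(tag);
}

console.log(autonym('de')); // "Deutsch"
console.log(autonym('fr')); // "français"
console.log(autonym('ja')); // "日本語"

// The same API covers regions and currencies.
const regionNamesEn = new Intl.DisplayNames(['en'], { type: 'region' });
const regionNamesDe = new Intl.DisplayNames(['de'], { type: 'region' });
const currencyNamesEn = new Intl.DisplayNames(['en'], { type: 'currency' });

console.log(regionNamesEn.of('DE'));    // "Germany"
console.log(regionNamesDe.of('DE'));    // "Deutschland"
console.log(currencyNamesEn.of('EUR')); // "Euro"
```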

CLDR JSON Usage

Modern CLDR is available as JSON for easy consumption:

Terminal
npm install cldr-data

Loading CLDR Data

JavaScript
const cldrData = require('cldr-data');

// Supplemental files cover all locales; main files are per locale
const plurals = cldrData('supplemental/plurals');
const deNumbers = cldrData('main/de/numbers');
const deCurrencies = cldrData('main/de/currencies');

// Pick the German cardinal rules out of the supplemental data
console.log(plurals.supplemental['plurals-type-cardinal'].de);
// {
//   "pluralRule-count-one": "i = 1 and v = 0 @integer 1",
//   "pluralRule-count-other": "@integer 0, 2~16, 100, ..."
// }
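
The rule strings above are what `Intl.PluralRules` evaluates at runtime. The sketch below shows the resulting categories for a few locales, plus an English ordinal suffix built on top of them; `ordinal` and the suffix table are hypothetical names for this example.

```javascript
const german = new Intl.PluralRules('de');
console.log(german.select(1));   // "one"   (i = 1 and v = 0)
console.log(german.select(0));   // "other"
console.log(german.select(1.5)); // "other" (visible fraction digits, so v != 0)

// Arabic uses all six CLDR categories.
const arabic = new Intl.PluralRules('ar');
console.log([0, 1, 2, 3, 11, 100].map(n => arabic.select(n)));
// ["zero", "one", "two", "few", "many", "other"]

// Ordinal rules are a separate rule set.
const englishOrdinals = new Intl.PluralRules('en', { type: 'ordinal' });
const suffixes = { one: 'st', two: 'nd', few: 'rd', other: 'th' };

// Hypothetical helper: 1 -> "1st", 22 -> "22nd", 11 -> "11th".
function ordinal(n) {
  return `${n}${suffixes[englishOrdinals.select(n)]}`;
}

console.log([1, 2, 3, 4, 11, 21, 22].map(ordinal));
// ["1st", "2nd", "3rd", "4th", "11th", "21st", "22nd"]
```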

Using CLDR with Globalize

JavaScript
import Globalize from 'globalize';
import cldrData from 'cldr-data';

// Load CLDR data. Number formatting also needs numberingSystems,
// and currency formatting also needs currencyData.
Globalize.load(
  cldrData('supplemental/likelySubtags'),
  cldrData('supplemental/numberingSystems'),
  cldrData('supplemental/currencyData'),
  cldrData('supplemental/plurals'),
  cldrData('main/de/numbers'),
  cldrData('main/de/currencies')
);

// Use with German locale
const de = Globalize('de');

de.formatNumber(1234567.89);
// "1.234.567,89"

de.formatCurrency(1234.56, 'EUR');
// "1.234,56 €"

ICU Library Data Sourcing

ICU (International Components for Unicode) uses CLDR as its data source:

ICU4J (Java)

JAVA
import com.ibm.icu.text.NumberFormat;
import com.ibm.icu.util.ULocale;

public class GermanNumbers {
    public static void main(String[] args) {
        // Uses CLDR data internally
        NumberFormat nf = NumberFormat.getInstance(new ULocale("de_DE"));
        System.out.println(nf.format(1234567.89));  // "1.234.567,89"
    }
}

ICU4C (C/C++)

CPP
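#include <unicode/numfmt.h>
#include <unicode/unistr.h>
#include <memory>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    // createInstance returns an owning pointer; wrap it so it gets freed
    std::unique_ptr<icu::NumberFormat> nf(
        icu::NumberFormat::createInstance(icu::Locale("de_DE"), status));
    if (U_FAILURE(status)) return 1;

    icu::UnicodeString result;
    nf->format(1234567.89, result);
    // result now holds "1.234.567,89"
    return 0;
}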

Node.js Intl (V8)

Node.js uses ICU, which uses CLDR:

JavaScript
// V8's Intl implementation uses ICU/CLDR
new Intl.NumberFormat('de-DE').format(1234567.89);
// "1.234.567,89"

// Full ICU included since Node v13
process.versions.icu;
// "72.1" (or newer)
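
When output looks wrong on a particular machine, it helps to confirm which data the runtime actually carries. A Node.js-only sketch: `process.versions` exposes the bundled ICU, CLDR, and Unicode versions, and formatting a Spanish month name is a common way to detect a build with English-only data. `hasFullIcu` is a made-up helper name.

```javascript
// Hypothetical helper: true when non-English locale data is available.
function hasFullIcu() {
  try {
    const january = new Date(Date.UTC(2020, 0, 15));
    const spanish = new Intl.DateTimeFormat('es', { month: 'long', timeZone: 'UTC' });
    // A build with English-only data falls back to "January" here.
    return spanish.format(january) === 'enero';
  } catch (err) {
    return false;
  }
}

console.log({
  icu: process.versions.icu,         // bundled ICU version
  cldr: process.versions.cldr,       // CLDR version that ICU build ships
  unicode: process.versions.unicode, // Unicode version
  fullIcu: hasFullIcu(),
});
```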

How Frameworks Use CLDR

React (via FormatJS)

JavaScript
import { IntlProvider, FormattedNumber } from 'react-intl';

// FormatJS uses CLDR data via Intl API
<IntlProvider locale="de">
  <FormattedNumber value={1234567.89} />
</IntlProvider>
// Renders: "1.234.567,89"

Angular (built-in i18n)

TypeScript
import { registerLocaleData } from '@angular/common';
import localeDe from '@angular/common/locales/de';

// Angular's locale data is extracted from CLDR
registerLocaleData(localeDe);

// Use in templates:
//   {{ 1234567.89 | number:'1.2-2':'de' }}
// Renders: "1.234.567,89"

Vue (via vue-i18n)

JavaScript
import { createI18n } from 'vue-i18n';

const i18n = createI18n({
  locale: 'de',
  numberFormats: {
    de: {
      currency: {
        style: 'currency',
        currency: 'EUR'
      }
    }
  }
});

// Uses Intl API (which uses CLDR). In a template:
//   {{ $n(1234.56, 'currency') }}
// Renders: "1.234,56 €"

CLDR Data Updates

CLDR typically ships two major releases a year, historically around April and October.

Updating Dependencies

Terminal
# Update CLDR data package
npm update cldr-data

# Update ICU data (Node.js)
# ICU and its CLDR data ship inside the Node.js binary (full ICU by default
# since v13), so installing a newer Node release is what updates them
nvm install node --latest-npm

# Update browser (automatic via browser updates)
# Chrome, Firefox, Safari update CLDR with browser releases

Version Compatibility

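| CLDR Version | Release Date | Used By (approximate) |
| --- | --- | --- |
| CLDR 44 | October 2023 | Node 20+, Chrome 120+ |
| CLDR 43 | April 2023 | Node 18+, Chrome 115+ |
| CLDR 42 | October 2022 | Node 16+, Chrome 110+ |

The runtime column is a rough guide only: point releases frequently bump the bundled ICU, so check process.versions.cldr in Node.js or the browser's release notes for the exact version.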

Contributing to CLDR

CLDR accepts contributions for locale data improvements:

Types of Contributions

  1. Translations: Language and region names
  2. Number/Date Formats: Locale-specific formatting rules
  3. Plural Rules: Grammatical plural categories
  4. Collation: Sorting rules for languages
  5. Time Zones: Timezone names and translations

Contribution Process

  1. Survey Tool: Use CLDR Survey Tool during data collection periods
  2. Tickets: File tickets at unicode.org/cldr for data errors
  3. Voting: Participate in locale data voting (requires Unicode membership for some locales)

CLDR project site: https://cldr.unicode.org/
Survey Tool: https://st.unicode.org/cldr-apps/

Example: Reporting Incorrect Data

MARKDOWN
**Issue**: German currency format shows symbol after amount, but should be before

**Current**: 1.234,56 €
**Expected**: € 1.234,56

**Locale**: de-DE
**Data File**: main/de/numbers.json
**Field**: currencyFormats-numberSystem-latn

CLDR Tools and Resources

Official Resources

NPM Packages

Terminal
npm install cldr-data              # Raw CLDR JSON data
npm install cldr-core              # Core CLDR data
npm install cldr-dates-full        # Date/time data
npm install cldr-numbers-full      # Number formatting data
npm install cldr-localenames-full  # Locale display names

CLI Tools

Terminal
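# Install CLDR command-line tools
npm install -g cldr-data-downloader

# Download and unpack a CLDR JSON archive into ./cldr-data
# (the tool takes an input URL and an output directory; confirm the flags
# against the package README for your installed version)
cldr-data-downloader -i <url-of-cldr-json-zip> -o cldr-data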

FAQ

Q: Do I need to bundle CLDR data with my app?
A: No. Modern browsers and Node.js include CLDR data via the Intl API. Just use Intl.DateTimeFormat, Intl.NumberFormat, etc.

Q: How do I add a locale not in CLDR?
A: You can't easily. CLDR requires linguistic expertise and community consensus. For minor dialects, use the closest standard locale as a base.

Q: Why does my app show different formatting than expected?
A: Check your browser/Node.js version. Older versions have outdated CLDR data. Update to get the latest locale data.

Q: Can I customize CLDR data for my app?
A: Yes, but it's complex. Libraries like Globalize allow custom data, but you lose automatic updates. Better to contribute corrections to CLDR.

Q: How do I know which CLDR version my platform uses?
A: In Node.js, check process.versions.icu and process.versions.cldr. In browsers, consult the release notes. If you ship an Intl polyfill, check which CLDR version it was built from.

Q: Should I use CLDR directly or via Intl API?
A: Use the Intl API. It's a standard, well-supported interface to CLDR data. Direct CLDR usage is complex and unnecessary for most apps.

Q: How does IntlPull use CLDR?
A: IntlPull uses CLDR for locale validation, plural rule enforcement, format preview generation, and ensuring translations match locale-specific conventions.
