
Character Encoding for Developers: UTF-8, Unicode, ASCII Explained (2026)

Stop encoding bugs before they happen. Learn UTF-8, Unicode, ASCII, and how to prevent mojibake, corrupted characters, and database encoding issues.

IntlPull Team · 03 Feb 2026, 11:44 AM PST

The Bug That Cost $100K

It was 2 AM. The CEO's email landed: "Why does our French site show gibberish?"

Instead of "Café", users saw "Café". Instead of "Résumé", they saw "Résumé".

Classic mojibake. The developer had set the database to Latin-1, the API to UTF-8, and the frontend assumed ASCII. Three different encodings, one completely broken user experience.

The fix? Five minutes. The damage? Lost sales, angry users, 2 weeks of emergency patches.

This guide prevents that. We'll cover what character encoding actually is, why it matters, and how to never screw it up again.

What Even Is Character Encoding?

Computers don't understand letters. They understand numbers.

Character encoding is the map: character → number.

Example:

  • The letter "A" needs to become a number computers can store
  • ASCII says: "A" = 65
  • Unicode says: "A" = U+0041
  • UTF-8 says: "A" = the byte 0x41

Simple, right? Except there are dozens of encoding standards, each with its own rules.
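
To make the map concrete, here are all three layers for one character in JavaScript (a quick sketch using the standard TextEncoder API):

JavaScript
// One character, three views: the character, its code point, its stored bytes
const ch = 'A';
console.log(ch.codePointAt(0)); // 65 (the number ASCII assigned)
console.log('U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')); // 'U+0041'
console.log(new TextEncoder().encode(ch)); // Uint8Array [ 65 ], i.e. the byte 0x41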

The Three You Actually Need to Know

1. ASCII (The Ancient One)

What it is: American Standard Code for Information Interchange
Invented: 1963
Characters: 128 (0-127)

What it covers:

  • English letters (A-Z, a-z)
  • Numbers (0-9)
  • Basic punctuation (.,!?)
  • Control characters (newline, tab)

What it doesn't cover:

  • Accented characters (é, ñ, ü)
  • Non-Latin scripts (中文, العربية, हिन्दी)
  • Emojis (💩)
  • Basically anything useful for i18n

When to use it: Never. It's 2026. Unless you're programming a 1980s terminal.

Example:

A → 65
B → 66
Z → 90
a → 97
0 → 48

2. Unicode (The Library)

What it is: A massive catalog of every character in every language
Scale: Unicode 15.1 (2023) defines 149,813 characters
Think of it as: The phonebook, not the phone

Important: Unicode is NOT an encoding. It's a character set.

Unicode assigns each character a code point (a number):

  • "A" = U+0041
  • "é" = U+00E9
  • "中" = U+4E2D
  • "🔥" = U+1F525

But it doesn't say how to store those numbers. That's where UTF-8 comes in.
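
You can see the division of labor in JavaScript: codePointAt returns the catalog number, while TextEncoder produces the bytes that actually get stored:

JavaScript
const fire = '🔥';
console.log(fire.codePointAt(0).toString(16)); // '1f525', code point U+1F525 (Unicode's job)
console.log(new TextEncoder().encode(fire));   // Uint8Array [240, 159, 148, 165] (UTF-8's job)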

Unicode planes:

  • BMP (Basic Multilingual Plane): U+0000 to U+FFFF (most common characters)
  • SMP (Supplementary Multilingual Plane): U+10000 to U+1FFFF (emojis, rare scripts)
  • SIP, TIP, SSP: Ancient scripts, math symbols, musical notation
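
Finding a character's plane is just a shift of its code point:

JavaScript
// Plane index = code point >> 16
const plane = (s) => s.codePointAt(0) >> 16;
console.log(plane('A'));  // 0, the BMP
console.log(plane('🔥')); // 1, the SMP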

3. UTF-8 (The One True Encoding)

What it is: A way to encode Unicode characters as bytes
Invented: 1992 by Ken Thompson and Rob Pike
Market share: 98% of all websites (as of 2026)

Why it won:

  1. Backward compatible with ASCII: First 128 characters are identical
  2. Variable width: Uses 1-4 bytes depending on character
  3. Self-synchronizing: If you jump into the middle of a UTF-8 stream, you can find the next character boundary (see the sketch after this list)
  4. Efficient: English text is the same size as ASCII, yet every language is still representable
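
Property 3 deserves a quick demo. Every UTF-8 continuation byte has the form 10xxxxxx, so from any offset you can scan forward to the next lead byte. A minimal sketch:

JavaScript
// Continuation bytes match 10xxxxxx; lead bytes never do
function nextBoundary(bytes, i) {
  while (i < bytes.length && (bytes[i] & 0b11000000) === 0b10000000) i++;
  return i;
}

const bytes = new TextEncoder().encode('a中b'); // [0x61, 0xE4, 0xB8, 0xAD, 0x62]
console.log(nextBoundary(bytes, 2)); // 4: lands on 'b', skipping the rest of 中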

How it works:

Character | Code Point | UTF-8 Bytes | Size
----------|-----------|-------------|-----
A         | U+0041    | 0x41        | 1 byte
é         | U+00E9    | 0xC3 0xA9   | 2 bytes
中        | U+4E2D    | 0xE4 0xB8 0xAD | 3 bytes
🔥        | U+1F525   | 0xF0 0x9F 0x94 0xA5 | 4 bytes

Other UTF encodings you'll see:

  • UTF-16: Uses 2 or 4 bytes. Common in Windows, Java, JavaScript internals (demo after this list)
  • UTF-32: Always 4 bytes. Wasteful but simple
  • UTF-7: Exists, but you'll never use it
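
JavaScript is a handy place to see UTF-16 in action, since its strings are sequences of UTF-16 code units:

JavaScript
// Astral characters take two UTF-16 code units (a surrogate pair)
console.log('🔥'.length);                      // 2 code units, not 1 "character"
console.log('🔥'.charCodeAt(0).toString(16));  // 'd83d' (high surrogate)
console.log('🔥'.charCodeAt(1).toString(16));  // 'dd25' (low surrogate)
console.log('🔥'.codePointAt(0).toString(16)); // '1f525' (the real code point)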

The rule: Use UTF-8 everywhere. Period.

Common Encoding Disasters

Disaster 1: Mojibake (文字化け)

Symptom: Text looks like random garbage.

Example:

Expected: "Café"
Actual:   "Café"

What happened:

  1. Text was encoded as UTF-8: Café → 0x43 0x61 0x66 0xC3 0xA9
  2. Reader interpreted it as Latin-1 (ISO-8859-1)
  3. Latin-1 doesn't understand multi-byte chars
  4. Each byte became a separate character: "C", "a", "f", "Ã", "©"
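
In Node, you can reproduce the whole failure in two lines:

JavaScript
// Encode as UTF-8, then (mis)read the same bytes as Latin-1
const bytes = Buffer.from('Café', 'utf8'); // 43 61 66 c3 a9
console.log(bytes.toString('latin1'));     // 'CafÃ©'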

Fix:

JavaScript
// Detect encoding (not 100% reliable, but helps)
const fs = require('fs');
const jschardet = require('jschardet');
const iconv = require('iconv-lite');

const buffer = fs.readFileSync('file.txt');
const detected = jschardet.detect(buffer);
console.log(detected.encoding); // 'UTF-8', 'ISO-8859-1', etc.

// Convert to UTF-8
const utf8String = iconv.decode(buffer, detected.encoding);

Disaster 2: Database Encoding Mismatch

Symptom: Data looks fine in code, broken in database (or vice versa).

Example (MySQL):

SQL
-- ❌ Wrong: latin1 database
CREATE DATABASE mydb CHARACTER SET latin1;

-- ✅ Right: utf8mb4 (supports emojis)
CREATE DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

The MySQL UTF-8 trap:

  • MySQL's utf8 charset is NOT real UTF-8 (max 3 bytes)
  • Emojis need 4 bytes → utf8 can't store them
  • Always use utf8mb4 (UTF-8, max 4 bytes)
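
Here's what the trap looks like in practice: a hypothetical repro using the mysql2 client, assuming a table named legacy_comments that was created with the 3-byte utf8 charset (the table and database names are made up for illustration):

JavaScript
const mysql = require('mysql2/promise');

async function demo() {
  // The connection charset is fine; the table's charset is the problem
  const conn = await mysql.createConnection({
    host: 'localhost',
    user: 'root',
    database: 'mydb',
    charset: 'utf8mb4'
  });

  try {
    // Fails if legacy_comments was created with 3-byte utf8
    await conn.execute('INSERT INTO legacy_comments (text) VALUES (?)', ['🔥']);
  } catch (e) {
    console.log(e.message); // typically an "Incorrect string value" error
  } finally {
    await conn.end();
  }
}

demo();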

Check your database:

SQL
SHOW VARIABLES LIKE 'character_set_%';
SHOW VARIABLES LIKE 'collation_%';

-- Should all be utf8mb4

Connection string must match:

JavaScript
// Node.js MySQL
const connection = mysql.createConnection({
  host: 'localhost',
  user: 'root',
  database: 'mydb',
  charset: 'utf8mb4' // ← CRITICAL
});

Postgres:

SQL
-- Check encoding
SHOW SERVER_ENCODING;
SHOW CLIENT_ENCODING;

-- Set to UTF-8 (usually default)
SET CLIENT_ENCODING TO 'UTF8';

Disaster 3: JSON Encoding Hell

Symptom: Special characters turn into \uXXXX escape sequences.

Example:

JavaScript
const data = { message: "Hello 世界" };
console.log(JSON.stringify(data));
// {"message":"Hello 世界"} (the characters survive as-is)

Why: JavaScript's JSON.stringify doesn't escape non-ASCII. Many serializers in other stacks do by default (Python's json.dumps with ensure_ascii=True, PHP's json_encode), so JSON from another backend often arrives full of \uXXXX escapes. Both forms are valid JSON.

Fix:

JavaScript
// \uXXXX escapes are valid JSON; any compliant parser restores the characters
const parsed = JSON.parse('{"message":"Hello \\u4e16\\u754c"}');
console.log(parsed.message); // "Hello 世界" ✅

Actual problem usually:

JavaScript
// ❌ Wrong: Sending JSON as Latin-1
res.setHeader('Content-Type', 'application/json; charset=ISO-8859-1');

// ✅ Right: UTF-8
res.setHeader('Content-Type', 'application/json; charset=UTF-8');

Disaster 4: CSV Export Corruption

Symptom: Export to CSV, open in Excel, all special characters broken.

Why: Excel defaults to your system encoding (often Windows-1252, not UTF-8).

Fix: Add BOM (Byte Order Mark)

JavaScript
// Add UTF-8 BOM so Excel knows it's UTF-8
const BOM = '\uFEFF';
const csv = BOM + 'Name,City\n' +
            'José,São Paulo\n' +
            '李明,北京\n';

fs.writeFileSync('export.csv', csv, 'utf8');

Or force UTF-8 on import: Excel → Data → From Text → File Origin: 65001 (UTF-8)

Disaster 5: URL Encoding Issues

Symptom: URLs with non-ASCII chars break.

Example:

Raw:     /search?q=café
Broken:  /search?q=caf�
Correct: /search?q=caf%C3%A9

Fix: Always encode URLs

JavaScript
// ❌ Wrong
const url = `/search?q=${query}`;

// ✅ Right
const url = `/search?q=${encodeURIComponent(query)}`;

// Examples
encodeURIComponent('café'); // 'caf%C3%A9'
encodeURIComponent('中文'); // '%E4%B8%AD%E6%96%87'

Decoding:

JavaScript
const query = new URLSearchParams(window.location.search).get('q');
// Automatically decoded ✅
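
If you're handling a raw encoded string yourself, decodeURIComponent is the inverse:

JavaScript
decodeURIComponent('caf%C3%A9');          // 'café'
decodeURIComponent('%E4%B8%AD%E6%96%87'); // '中文'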

How to Debug Encoding Issues

Step 1: Find Where Encoding Goes Wrong

Encoding issues happen at boundaries:

  • Reading files
  • Database queries
  • HTTP requests/responses
  • String concatenation from different sources

Debug script:

JavaScript
function debugEncoding(text) {
  console.log('Text:', text);
  console.log('Length:', text.length);
  console.log('Bytes:', Buffer.from(text, 'utf8'));
  console.log('Hex:', Buffer.from(text, 'utf8').toString('hex'));

  // Check each character
  for (let i = 0; i < text.length; i++) {
    const char = text[i];
    const code = text.charCodeAt(i);
    const unicode = 'U+' + code.toString(16).toUpperCase().padStart(4, '0');
    console.log(`[${i}] ${char} → ${code} (${unicode})`);
  }
}

debugEncoding('Café');
// [0] C → 67 (U+0043)
// [1] a → 97 (U+0061)
// [2] f → 102 (U+0066)
// [3] é → 233 (U+00E9)

Step 2: Inspect Byte Sequences

If you see mojibake, check the bytes:

JavaScript
const broken = 'CafÃ©';
console.log(Buffer.from(broken, 'utf8').toString('hex'));
// 43 61 66 c3 83 c2 a9

// Compare to correct:
const correct = 'Café';
console.log(Buffer.from(correct, 'utf8').toString('hex'));
// 43 61 66 c3 a9

Notice the double-encoding: c3 83 c2 a9 vs c3 a9.

What happened:

  1. "é" = UTF-8 bytes: c3 a9
  2. Those bytes were interpreted as Latin-1 → "Ã©"
  3. "Ã©" was re-encoded to UTF-8 → c3 83 c2 a9

Fix: Double decode

JavaScript
const broken = 'CafÃ©';
const fixed = Buffer.from(broken, 'latin1').toString('utf8');
console.log(fixed); // 'Café' ✅

Step 3: Check Every Layer

Web app checklist:

1. HTML:

HTML
<!-- ✅ Add this to every page -->
<meta charset="UTF-8">

2. HTTP Headers:

JavaScript
// ✅ Server response
res.setHeader('Content-Type', 'text/html; charset=UTF-8');

3. Database:

SQL
-- ✅ MySQL
CREATE TABLE users (
  name VARCHAR(255)
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- ✅ PostgreSQL (usually default)
CREATE DATABASE mydb ENCODING 'UTF8';

4. Database Connection:

JavaScript
// ✅ Specify in connection string
const pool = new Pool({
  connectionString: 'postgres://user:pass@localhost/mydb?client_encoding=UTF8'
});

5. File I/O:

JavaScript
// ✅ Explicitly set encoding
fs.writeFileSync('file.txt', content, 'utf8');
fs.readFileSync('file.txt', 'utf8');

6. API Requests:

JavaScript
// ✅ Set Content-Type header
fetch('/api/data', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json; charset=UTF-8'
  },
  body: JSON.stringify({ text: 'Café' })
});

Best Practices

1. UTF-8 Everywhere

Mantra: UTF-8 from storage to display.

Setup checklist:

  • ✅ Database: UTF-8 / utf8mb4
  • ✅ Database connection: charset=utf8mb4
  • ✅ Files: Save as UTF-8 (check your editor settings)
  • ✅ HTML: <meta charset="UTF-8">
  • ✅ HTTP: Content-Type: ...; charset=UTF-8
  • ✅ Code: Read/write files with encoding='utf8'

2. Validate Input

Reject invalid UTF-8 sequences:

JavaScript
function isValidUTF8(str) {
  try {
    // Try encoding round-trip
    const encoded = new TextEncoder().encode(str);
    const decoded = new TextDecoder('utf-8', { fatal: true }).decode(encoded);
    return decoded === str;
  } catch (e) {
    return false;
  }
}

// Usage in API
app.post('/api/comment', (req, res) => {
  const { text } = req.body;

  if (!isValidUTF8(text)) {
    return res.status(400).json({ error: 'Invalid UTF-8 encoding' });
  }

  // Continue...
});

3. Normalize Unicode

Problem: Multiple ways to encode the same character.

Example:

JavaScript
1// "é" can be:
2const composed = 'é';    // Single code point U+00E9
3const decomposed = 'é';  // e (U+0065) + ´ (U+0301)
4
5console.log(composed === decomposed); // false 😱
6console.log(composed.length); // 1
7console.log(decomposed.length); // 2

Fix: Normalize before comparing

JavaScript
const a = 'café'.normalize('NFC');
const b = 'cafe\u0301'.normalize('NFC');
console.log(a === b); // true ✅

// Forms:
// NFC (Canonical Composition) - use this for storage and display
// NFD (Canonical Decomposition) - handy for accent-insensitive search
// NFKC/NFKD (Compatibility) - fold lookalikes like ﬁ → fi for matching

Normalize in database queries:

JavaScript
// Normalize the search term (store and index values in the same form)
const searchTerm = userInput.normalize('NFD');
const results = await db.query(
  'SELECT * FROM products WHERE LOWER(name) LIKE LOWER($1)',
  [`%${searchTerm}%`]
);

4. Length Limits

Be careful with character limits:

JavaScript
// ❌ Wrong: Byte length != character length
const text = '中文测试';
console.log(text.length); // 4 characters
console.log(Buffer.from(text, 'utf8').length); // 12 bytes

// In databases with byte-based limits (e.g., Oracle's default VARCHAR2
// semantics), a 10-byte column fits only 3 Chinese characters!

Fix: Count characters, not bytes

JavaScript
function truncate(str, maxChars) {
  if (str.length <= maxChars) return str;
  return str.slice(0, maxChars) + '...';
}

// Array.from splits by code point, so surrogate pairs like 🔥 stay intact
// (ZWJ sequences like 👨‍👩‍👧‍👦 can still be cut; see the next section)
function truncateEmoji(str, maxChars) {
  const chars = Array.from(str);
  if (chars.length <= maxChars) return str;
  return chars.slice(0, maxChars).join('') + '...';
}

truncateEmoji('Hello 🔥 World', 8);
// "Hello 🔥 ..."

5. Handle Emojis Correctly

Problem: Emojis are complex.

JavaScript
const emoji = '👨‍👩‍👧‍👦'; // Family emoji
console.log(emoji.length); // 11 😱

// Why? It's four person emojis joined by three Zero-Width Joiners:
// 👨(2) + ZWJ(1) + 👩(2) + ZWJ(1) + 👧(2) + ZWJ(1) + 👦(2) = 11 UTF-16 units

Fix: Use proper Unicode segmentation

JavaScript
// Split by grapheme clusters
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const segments = Array.from(segmenter.segment('Hello 👨‍👩‍👧‍👦 World'));
console.log(segments.map(s => s.segment));
// ['H', 'e', 'l', 'l', 'o', ' ', '👨‍👩‍👧‍👦', ' ', 'W', 'o', 'r', 'l', 'd']

// Count "characters" correctly
const charCount = segments.length; // 13 ✅

Testing for Encoding Issues

Test Data

Use these strings to test encoding:

JavaScript
const testStrings = [
  'Hello World',                // ASCII baseline
  'Café résumé naïve',          // Latin-1 extensions
  'Привет мир',                 // Cyrillic
  '你好世界',                    // Chinese
  'مرحبا بالعالم',              // Arabic (RTL)
  '🔥💯👍',                      // Emojis
  '👨‍👩‍👧‍👦',                        // Complex emoji (ZWJ sequence)
  '\u0000\u0001\u001F',         // Control characters
  'test\r\nline\rbreaks\n',     // Line breaks
];

testStrings.forEach(str => {
  // Send through your system
  const result = yourFunction(str);
  assert(result === str, 'Encoding corruption detected');
});

Automated Testing

JavaScript
// Encoding round-trip test
describe('Encoding', () => {
  it('should preserve UTF-8 through database', async () => {
    const testString = 'Café 中文 🔥';

    await db.insert({ text: testString });
    const result = await db.query('SELECT text FROM table');

    expect(result[0].text).toBe(testString);
  });

  it('should handle API round-trip', async () => {
    const testString = 'Résumé 日本語';

    const response = await fetch('/api/echo', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json; charset=UTF-8' },
      body: JSON.stringify({ text: testString })
    });

    const data = await response.json();
    expect(data.text).toBe(testString);
  });
});

Platform-Specific Issues

Windows

Problem: Windows uses different encodings by default.

  • Command Prompt: Code page 437 or Windows-1252
  • PowerShell: UTF-16 LE

Fix:

POWERSHELL
# Set PowerShell to UTF-8
[Console]::OutputEncoding = [System.Text.Encoding]::UTF8

macOS/Linux

Problem: Usually UTF-8 by default, but check:

Terminal
locale
# Should show UTF-8

# If not:
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

Python

Python 3:

Python
# ✅ Default is UTF-8 (usually)
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()

Python 2 (legacy):

Python
# ❌ Default is ASCII (nightmare)
# Always specify encoding
import codecs
with codecs.open('file.txt', 'r', 'utf-8') as f:
    content = f.read()

Java

Problem: Java uses UTF-16 internally.

JAVA
// ✅ Read UTF-8 files
BufferedReader reader = new BufferedReader(
    new InputStreamReader(
        new FileInputStream("file.txt"),
        StandardCharsets.UTF_8
    )
);

// ✅ Write UTF-8 files
BufferedWriter writer = new BufferedWriter(
    new OutputStreamWriter(
        new FileOutputStream("file.txt"),
        StandardCharsets.UTF_8
    )
);

PHP

PHP
<?php
// ✅ Set default encoding
ini_set('default_charset', 'UTF-8');
mb_internal_encoding('UTF-8');

// ✅ Database connection
$pdo = new PDO(
    'mysql:host=localhost;dbname=mydb;charset=utf8mb4',
    'user',
    'password'
);

IntlPull's Encoding Validation

When you push translations to IntlPull, we automatically:

  • ✅ Validate UTF-8 encoding
  • ✅ Check for invalid byte sequences
  • ✅ Normalize Unicode (NFC form)
  • ✅ Detect encoding mismatches
  • ✅ Flag potential mojibake

Terminal
npx @intlpullhq/cli upload

# Output:
# ✅ All strings valid UTF-8
# ⚠️ Warning: String "CafÃ©" looks like double-encoded UTF-8
# 💡 Suggestion: Check database encoding

This catches encoding issues before they reach production.

The TL;DR

Rules to live by:

  1. Use UTF-8 everywhere. No exceptions.
  2. Set encoding explicitly at every boundary (files, DB, HTTP).
  3. Test with non-ASCII strings (Chinese, Arabic, emojis).
  4. Normalize before comparing (.normalize('NFC')).
  5. Count characters correctly (use Array.from() for emojis).

Common mistakes:

  • MySQL utf8 (use utf8mb4)
  • Forgetting charset=UTF-8 in HTTP headers
  • Comparing strings without normalizing first
  • Using .length for character limits (it counts UTF-16 code units)
  • Opening files on Windows without specifying an encoding

When you see mojibake:

  1. Check database encoding
  2. Check connection encoding
  3. Check HTTP headers
  4. Try double-decode (latin1 → utf8)

Need help managing multilingual content?

Try IntlPull. Automatically validates encoding, detects mojibake, and normalizes Unicode in your translations. Free tier available.

Or just remember: UTF-8 everywhere. You'll be fine.

Tags
character-encoding
utf-8
unicode
i18n
technical
debugging