The Bug That Cost $100K
It was 2 AM. The CEO's email landed: "Why does our French site show gibberish?"
Instead of "Café", users saw "Café". Instead of "Résumé", they saw "Résumé".
Classic mojibake. The developer had set the database to Latin-1, the API to UTF-8, and the frontend assumed ASCII. Three different encodings, one completely broken user experience.
The fix? Five minutes. The damage? Lost sales, angry users, 2 weeks of emergency patches.
This guide prevents that. We'll cover what character encoding actually is, why it matters, and how to never screw it up again.
What Even Is Character Encoding?
Computers don't understand letters. They understand numbers.
Character encoding is the map: character → number.
Example:
- The letter "A" needs to become a number computers can store
- ASCII says: "A" = 65
- Unicode says: "A" = U+0041
- UTF-8 says: "A" = the byte `0x41`
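To make this concrete, here's a quick sketch in JavaScript (Node or any modern browser), using the standard `codePointAt` and `TextEncoder` APIs:

```javascript
// Character → code point
'A'.codePointAt(0);        // 65 (U+0041)
'é'.codePointAt(0);        // 233 (U+00E9)

// Code point → character
String.fromCodePoint(65);  // 'A'

// Character → UTF-8 bytes
new TextEncoder().encode('A'); // Uint8Array [ 65 ]        (0x41)
new TextEncoder().encode('é'); // Uint8Array [ 195, 169 ]  (0xC3 0xA9)
```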
Simple, right? Except there are like 50 different encoding standards, each with different rules.
The Three You Actually Need to Know
1. ASCII (The Ancient One)
What it is: American Standard Code for Information Interchange
Invented: 1963
Characters: 128 (0-127)
What it covers:
- English letters (A-Z, a-z)
- Numbers (0-9)
- Basic punctuation (.,!?)
- Control characters (newline, tab)
What it doesn't cover:
- Accented characters (é, ñ, ü)
- Non-Latin scripts (中文, العربية, हिन्दी)
- Emojis (💩)
- Basically anything useful for i18n
When to use it: Never. It's 2026. Unless you're programming a 1980s terminal.
Example:

```
A → 65
B → 66
Z → 90
a → 97
0 → 48
```
2. Unicode (The Library)
What it is: A massive catalog of every character in every language
Current version: Unicode 15.1 (2023), 149,813 characters
Think of it as: The phonebook, not the phone
Important: Unicode is NOT an encoding. It's a character set.
Unicode assigns each character a code point (a number):
- "A" = U+0041
- "é" = U+00E9
- "中" = U+4E2D
- "🔥" = U+1F525
But it doesn't say how to store those numbers. That's where UTF-8 comes in.
Unicode planes:
- BMP (Basic Multilingual Plane): U+0000 to U+FFFF (most common characters)
- SMP (Supplementary Multilingual Plane): U+10000 to U+1FFFF (emojis, rare scripts)
- SIP, TIP, SSP: Ancient scripts, math symbols, musical notation
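You can observe the plane boundary from JavaScript, whose strings are UTF-16 code units under the hood: BMP characters take one code unit, anything above U+FFFF takes two (a surrogate pair). A small illustration:

```javascript
// BMP character (U+4E2D): one UTF-16 code unit
'中'.codePointAt(0).toString(16); // '4e2d'
'中'.length;                      // 1

// SMP character (U+1F525): above U+FFFF, so a surrogate pair
'🔥'.codePointAt(0).toString(16); // '1f525'
'🔥'.length;                      // 2 — two code units, one character
```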
3. UTF-8 (The One True Encoding)
What it is: A way to encode Unicode characters as bytes
Invented: 1992 by Ken Thompson and Rob Pike
Market share: 98% of all websites (as of 2026)
Why it won:
- Backward compatible with ASCII: First 128 characters are identical
- Variable width: Uses 1-4 bytes depending on character
- Self-synchronizing: If you jump into the middle of a UTF-8 stream, you can find the next character boundary
- Efficient: English text is the same size as ASCII, yet every language is supported
How it works:
Character | Code Point | UTF-8 Bytes | Size
----------|-----------|-------------|-----
A | U+0041 | 0x41 | 1 byte
é | U+00E9 | 0xC3 0xA9 | 2 bytes
中 | U+4E2D | 0xE4 0xB8 0xAD | 3 bytes
🔥 | U+1F525 | 0xF0 0x9F 0x94 0xA5 | 4 bytes
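You can verify the table yourself with the standard `TextEncoder` API; note that every continuation byte starts with the bits `10`, which is exactly what makes UTF-8 self-synchronizing:

```javascript
const enc = new TextEncoder();

for (const ch of ['A', 'é', '中', '🔥']) {
  const bytes = enc.encode(ch);
  const hex = [...bytes].map(b => b.toString(16).padStart(2, '0')).join(' ');
  console.log(ch, '→', hex, `(${bytes.length} byte${bytes.length > 1 ? 's' : ''})`);
}
// A → 41 (1 byte)
// é → c3 a9 (2 bytes)
// 中 → e4 b8 ad (3 bytes)
// 🔥 → f0 9f 94 a5 (4 bytes)

// Continuation bytes are always 0b10xxxxxx (0x80-0xBF),
// so from any offset you can scan forward to the next character boundary.
```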
Other UTF encodings you'll see:
- UTF-16: Uses 2 or 4 bytes. Common in Windows, Java, JavaScript internals
- UTF-32: Always 4 bytes. Wasteful but simple
- UTF-7: Exists, but you'll never use it
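To see the size trade-off, Node's `Buffer.byteLength` can measure the same string in different encodings; a quick comparison:

```javascript
const s = 'Hello 中文';

Buffer.byteLength(s, 'utf8');    // 12 bytes (6 ASCII + 2 × 3-byte CJK)
Buffer.byteLength(s, 'utf16le'); // 16 bytes (8 code units × 2)
// UTF-32 would be 8 × 4 = 32 bytes — always 4 bytes per code point
```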
The rule: Use UTF-8 everywhere. Period.
Common Encoding Disasters
Disaster 1: Mojibake (文字化け)
Symptom: Text looks like random garbage.
Example:
Expected: "Café"
Actual: "Café"
What happened:
- Text was encoded as UTF-8: "Café" → 0x43 0x61 0x66 0xC3 0xA9
- The reader interpreted those bytes as Latin-1 (ISO-8859-1)
- Latin-1 has no multi-byte characters
- Each byte became a separate character: "C", "a", "f", "Ã", "©"
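You can reproduce this failure in two lines of Node, by encoding as UTF-8 and deliberately decoding the bytes as Latin-1:

```javascript
// Encode as UTF-8, then (wrongly) decode the same bytes as Latin-1
const bytes = Buffer.from('Café', 'utf8'); // 43 61 66 c3 a9
console.log(bytes.toString('latin1'));     // 'CafÃ©' — instant mojibake
```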
Fix:
```javascript
// Detect encoding (not 100% reliable, but helps)
const fs = require('fs');
const jschardet = require('jschardet');
const iconv = require('iconv-lite');

const buffer = fs.readFileSync('file.txt');
const detected = jschardet.detect(buffer);
console.log(detected.encoding); // 'UTF-8', 'ISO-8859-1', etc.

// Convert to UTF-8
const utf8String = iconv.decode(buffer, detected.encoding);
```
Disaster 2: Database Encoding Mismatch
Symptom: Data looks fine in code, broken in database (or vice versa).
Example (MySQL):
```sql
-- ❌ Wrong: latin1 database
CREATE DATABASE mydb CHARACTER SET latin1;

-- ✅ Right: utf8mb4 (supports emojis)
CREATE DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
```
The MySQL UTF-8 trap:
- MySQL's `utf8` charset is NOT real UTF-8 (max 3 bytes)
- Emojis need 4 bytes → `utf8` can't store them
- Always use `utf8mb4` (real UTF-8, max 4 bytes)
Check your database:
```sql
SHOW VARIABLES LIKE 'character_set_%';
SHOW VARIABLES LIKE 'collation_%';

-- Should all be utf8mb4
```
Connection string must match:
```javascript
// Node.js MySQL
const mysql = require('mysql');

const connection = mysql.createConnection({
  host: 'localhost',
  user: 'root',
  database: 'mydb',
  charset: 'utf8mb4' // ← CRITICAL
});
```
Postgres:
```sql
-- Check encoding
SHOW SERVER_ENCODING;
SHOW CLIENT_ENCODING;

-- Set to UTF-8 (usually default)
SET CLIENT_ENCODING TO 'UTF8';
```
Disaster 3: JSON Encoding Hell
Symptom: Special characters turn into \uXXXX escape sequences.
Example:
```javascript
const data = { message: 'Hello 世界' };
console.log(JSON.stringify(data));
// JavaScript:           {"message":"Hello 世界"}
// Python's json.dumps:  {"message": "Hello \u4e16\u754c"}
```

Why: Many JSON serializers (Python's `json.dumps`, PHP's `json_encode`) escape non-ASCII to `\uXXXX` by default. JavaScript's `JSON.stringify` does not, but it parses the escapes just fine.
Fix:
```javascript
// \uXXXX escapes are valid JSON and parse back to the original characters
const parsed = JSON.parse('{"message":"Hello \\u4e16\\u754c"}');
console.log(parsed.message); // "Hello 世界" ✅

// The escapes are a cosmetic choice, not data loss — no fix needed on this end
```
Actual problem usually:
```javascript
// ❌ Wrong: Sending JSON as Latin-1
res.setHeader('Content-Type', 'application/json; charset=ISO-8859-1');

// ✅ Right: UTF-8
res.setHeader('Content-Type', 'application/json; charset=UTF-8');
```
Disaster 4: CSV Export Corruption
Symptom: Export to CSV, open in Excel, all special characters broken.
Why: Excel defaults to your system encoding (often Windows-1252, not UTF-8).
Fix: Add BOM (Byte Order Mark)
```javascript
// Add UTF-8 BOM so Excel knows it's UTF-8
const fs = require('fs');

const BOM = '\uFEFF';
const csv = BOM + 'Name,City\n' +
  'José,São Paulo\n' +
  '李明,北京\n';

fs.writeFileSync('export.csv', csv, 'utf8');
```
Or force UTF-8 on import: Excel → Data → From Text → File Origin: 65001 (UTF-8)
Disaster 5: URL Encoding Issues
Symptom: URLs with non-ASCII chars break.
Example:
Raw: /search?q=café
Broken: /search?q=caf�
Correct: /search?q=caf%C3%A9
Fix: Always encode URLs
```javascript
// ❌ Wrong
const url = `/search?q=${query}`;

// ✅ Right
const url = `/search?q=${encodeURIComponent(query)}`;

// Example
encodeURIComponent('café'); // 'caf%C3%A9'
encodeURIComponent('中文'); // '%E4%B8%AD%E6%96%87'
```
Decoding:
```javascript
const query = new URLSearchParams(window.location.search).get('q');
// Automatically decoded ✅
```
How to Debug Encoding Issues
Step 1: Find Where Encoding Goes Wrong
Encoding issues happen at boundaries:
- Reading files
- Database queries
- HTTP requests/responses
- String concatenation from different sources
Debug script:
```javascript
function debugEncoding(text) {
  console.log('Text:', text);
  console.log('Length:', text.length);
  console.log('Bytes:', Buffer.from(text, 'utf8'));
  console.log('Hex:', Buffer.from(text, 'utf8').toString('hex'));

  // Check each character
  for (let i = 0; i < text.length; i++) {
    const char = text[i];
    const code = text.charCodeAt(i);
    const unicode = 'U+' + code.toString(16).toUpperCase().padStart(4, '0');
    console.log(`[${i}] ${char} → ${code} (${unicode})`);
  }
}

debugEncoding('Café');
// [0] C → 67 (U+0043)
// [1] a → 97 (U+0061)
// [2] f → 102 (U+0066)
// [3] é → 233 (U+00E9)
```
Step 2: Inspect Byte Sequences
If you see mojibake, check the bytes:
```javascript
const broken = 'CafÃ©';
console.log(Buffer.from(broken, 'utf8').toString('hex'));
// 43 61 66 c3 83 c2 a9

// Compare to correct:
const correct = 'Café';
console.log(Buffer.from(correct, 'utf8').toString('hex'));
// 43 61 66 c3 a9
```
Notice the double-encoding: c3 83 c2 a9 vs c3 a9.
What happened:
- "é" = UTF-8 bytes `c3 a9`
- Those bytes were interpreted as Latin-1 → "Ã©"
- "Ã©" was re-encoded to UTF-8 → `c3 83 c2 a9`
Fix: Double decode
```javascript
const broken = 'CafÃ©';
const fixed = Buffer.from(broken, 'latin1').toString('utf8');
console.log(fixed); // 'Café' ✅
```
Step 3: Check Every Layer
Web app checklist:
1. HTML:
```html
<!-- ✅ Add this to every page -->
<meta charset="UTF-8">
```
2. HTTP Headers:
```javascript
// ✅ Server response
res.setHeader('Content-Type', 'text/html; charset=UTF-8');
```
3. Database:
```sql
-- ✅ MySQL
CREATE TABLE users (
  name VARCHAR(255)
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- ✅ PostgreSQL (usually default)
CREATE DATABASE mydb ENCODING 'UTF8';
```
4. Database Connection:
```javascript
// ✅ Specify in connection string
const { Pool } = require('pg');

const pool = new Pool({
  connectionString: 'postgres://user:pass@localhost/mydb?client_encoding=UTF8'
});
```
5. File I/O:
```javascript
// ✅ Explicitly set encoding
const fs = require('fs');

fs.writeFileSync('file.txt', content, 'utf8');
fs.readFileSync('file.txt', 'utf8');
```
6. API Requests:
```javascript
// ✅ Set Content-Type header
fetch('/api/data', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json; charset=UTF-8'
  },
  body: JSON.stringify({ text: 'Café' })
});
```
Best Practices
1. UTF-8 Everywhere
Mantra: UTF-8 from storage to display.
Setup checklist:
- ✅ Database: UTF-8 / utf8mb4
- ✅ Database connection: `charset=utf8mb4`
- ✅ Files: Save as UTF-8 (check your editor settings)
- ✅ HTML: `<meta charset="UTF-8">`
- ✅ HTTP: `Content-Type: ...; charset=UTF-8`
- ✅ Code: Read/write files with `encoding='utf8'`
2. Validate Input
Reject invalid UTF-8 sequences:
```javascript
function isValidUTF8(str) {
  try {
    // Round-trip: encode to UTF-8 bytes, decode strictly
    const encoded = new TextEncoder().encode(str);
    const decoded = new TextDecoder('utf-8', { fatal: true }).decode(encoded);
    return decoded === str;
  } catch (e) {
    return false;
  }
}

// Usage in API
app.post('/api/comment', (req, res) => {
  const { text } = req.body;

  if (!isValidUTF8(text)) {
    return res.status(400).json({ error: 'Invalid UTF-8 encoding' });
  }

  // Continue...
});
```
3. Normalize Unicode
Problem: Multiple ways to encode the same character.
Example:
JavaScript1// "é" can be: 2const composed = 'é'; // Single code point U+00E9 3const decomposed = 'é'; // e (U+0065) + ´ (U+0301) 4 5console.log(composed === decomposed); // false 😱 6console.log(composed.length); // 1 7console.log(decomposed.length); // 2
Fix: Normalize before comparing
```javascript
const a = 'café'.normalize('NFC');
const b = 'café'.normalize('NFC');
console.log(a === b); // true ✅

// Forms:
// NFC (Canonical Composition) - use this for storage and display
// NFD (Canonical Decomposition) - handy for accent-stripping search
// NFKC/NFKD (Compatibility) - fold visual variants (ﬁ → fi, ① → 1)
```
Normalize in database queries:
```javascript
// Normalize the search term to the same form the data was stored in (NFC here)
const searchTerm = userInput.normalize('NFC');
const results = await db.query(
  'SELECT * FROM products WHERE LOWER(name) LIKE LOWER($1)',
  [`%${searchTerm}%`]
);
```
4. Length Limits
Be careful with character limits:
```javascript
// ❌ Wrong: Byte length != character length
const text = '中文测试';
console.log(text.length); // 4 characters
console.log(Buffer.from(text, 'utf8').length); // 12 bytes

// A VARCHAR(10) measured in bytes fits only 3 of these characters!
```
Fix: Count characters, not bytes
```javascript
function truncate(str, maxChars) {
  if (str.length <= maxChars) return str;
  return str.slice(0, maxChars) + '...';
}

// Array.from splits by code point, so surrogate-pair emojis like 🔥 stay intact
function truncateEmoji(str, maxChars) {
  const chars = Array.from(str);
  if (chars.length <= maxChars) return str;
  return chars.slice(0, maxChars).join('') + '...';
}

truncateEmoji('Hello 🔥 World', 8); // "Hello 🔥 ..."
// Note: ZWJ sequences like 👨‍👩‍👧‍👦 still split — see Intl.Segmenter below
```
5. Handle Emojis Correctly
Problem: Emojis are complex.
```javascript
const emoji = '👨‍👩‍👧‍👦'; // Family emoji
console.log(emoji.length); // 11 😱

// Why? It's four code points joined with Zero-Width Joiners,
// stored as surrogate pairs: 4 × 2 + 3 ZWJ = 11 UTF-16 code units
```
Fix: Use proper Unicode segmentation
```javascript
// Split by grapheme clusters
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const segments = Array.from(segmenter.segment('Hello 👨‍👩‍👧‍👦 World'));
console.log(segments.map(s => s.segment));
// ['H', 'e', 'l', 'l', 'o', ' ', '👨‍👩‍👧‍👦', ' ', 'W', 'o', 'r', 'l', 'd']

// Count "characters" correctly
const charCount = segments.length; // 13 ✅
```
Testing for Encoding Issues
Test Data
Use these strings to test encoding:
```javascript
const assert = require('node:assert');

const testStrings = [
  'Hello World',            // ASCII baseline
  'Café résumé naïve',      // Latin-1 extensions
  'Привет мир',             // Cyrillic
  '你好世界',                // Chinese
  'مرحبا بالعالم',          // Arabic (RTL)
  '🔥💯👍',                  // Emojis
  '👨‍👩‍👧‍👦',                // Complex emoji (ZWJ sequence)
  '\u0000\u0001\u001F',     // Control characters
  'test\r\nline\rbreaks\n', // Line breaks
];

testStrings.forEach(str => {
  // Send through your system
  const result = yourFunction(str);
  assert(result === str, 'Encoding corruption detected');
});
```
Automated Testing
```javascript
// Encoding round-trip test
describe('Encoding', () => {
  it('should preserve UTF-8 through database', async () => {
    const testString = 'Café 中文 🔥';

    await db.insert({ text: testString });
    const result = await db.query('SELECT text FROM table');

    expect(result[0].text).toBe(testString);
  });

  it('should handle API round-trip', async () => {
    const testString = 'Résumé 日本語';

    const response = await fetch('/api/echo', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json; charset=UTF-8' },
      body: JSON.stringify({ text: testString })
    });

    const data = await response.json();
    expect(data.text).toBe(testString);
  });
});
```
Platform-Specific Issues
Windows
Problem: Windows uses different encodings by default.
- Command Prompt: Code page 437 or Windows-1252
- PowerShell: UTF-16 LE
Fix:
```powershell
# Set PowerShell to UTF-8
[Console]::OutputEncoding = [System.Text.Encoding]::UTF8
```
macOS/Linux
Problem: Usually UTF-8 by default, but check:
```bash
locale
# Should show UTF-8

# If not:
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
```
Python
Python 3:
```python
# ✅ Default is UTF-8 (usually) — specify it anyway
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()
```
Python 2 (legacy):
```python
# ❌ Default is ASCII (nightmare)
# Always specify encoding
import codecs
with codecs.open('file.txt', 'r', 'utf-8') as f:
    content = f.read()
```
Java
Problem: Java uses UTF-16 internally.
```java
import java.io.*;
import java.nio.charset.StandardCharsets;

// ✅ Read UTF-8 files
BufferedReader reader = new BufferedReader(
    new InputStreamReader(
        new FileInputStream("file.txt"),
        StandardCharsets.UTF_8
    )
);

// ✅ Write UTF-8 files
BufferedWriter writer = new BufferedWriter(
    new OutputStreamWriter(
        new FileOutputStream("file.txt"),
        StandardCharsets.UTF_8
    )
);
```
PHP
```php
<?php
// ✅ Set default encoding
ini_set('default_charset', 'UTF-8');
mb_internal_encoding('UTF-8');

// ✅ Database connection
$pdo = new PDO(
    'mysql:host=localhost;dbname=mydb;charset=utf8mb4',
    'user',
    'password'
);
```
IntlPull's Encoding Validation
When you push translations to IntlPull, we automatically:
- ✅ Validate UTF-8 encoding
- ✅ Check for invalid byte sequences
- ✅ Normalize Unicode (NFC form)
- ✅ Detect encoding mismatches
- ✅ Flag potential mojibake
```bash
npx @intlpullhq/cli upload

# Output:
# ✅ All strings valid UTF-8
# ⚠️ Warning: String "CafÃ©" looks like double-encoded UTF-8
# 💡 Suggestion: Check database encoding
```
This catches encoding issues before they reach production.
The TL;DR
Rules to live by:
- Use UTF-8 everywhere. No exceptions.
- Set encoding explicitly at every boundary (files, DB, HTTP).
- Test with non-ASCII strings (Chinese, Arabic, emojis).
- Normalize before comparing (`.normalize('NFC')`).
- Count characters correctly (use `Array.from()` or `Intl.Segmenter` for emojis).
Common mistakes:
- MySQL `utf8` (use `utf8mb4`)
- Forgetting `charset=UTF-8` in HTTP headers
- Comparing without normalization
- Using string length for character limits
- Opening Windows files without specifying an encoding
When you see mojibake:
- Check database encoding
- Check connection encoding
- Check HTTP headers
- Try a double-decode (`latin1` → `utf8`)
Need help managing multilingual content?
Try IntlPull. Automatically validates encoding, detects mojibake, and normalizes Unicode in your translations. Free tier available.
Or just remember: UTF-8 everywhere. You'll be fine.
