Fundamentals

Character Encoding (UTF-8)

The system for representing text characters as bytes, with UTF-8 being the universal standard.

Definition

Character encoding defines how text characters are represented as bytes in computers. UTF-8 is the dominant encoding for the web and modern software, capable of representing every character in Unicode (including all human languages, emoji, and symbols) while being backward-compatible with ASCII. Using UTF-8 consistently prevents 'mojibake' (garbled text) and encoding errors.
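As a rough illustration of the definition above, the following Python sketch shows that UTF-8 is a variable-width encoding (one to four bytes per character) and that pure ASCII text produces the same bytes under both encodings. The sample characters and the variable names are chosen here for illustration only.

```python
# Sample characters: ASCII letter, accented Latin letter, CJK character, emoji.
# Escapes are used so the example does not depend on this file's own encoding.
samples = ["A", "\u00e9", "\u65e5", "\U0001F600"]

# Number of bytes each character occupies once encoded as UTF-8.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_lengths.items():
    print(f"U+{ord(ch):04X}: {n} byte(s)")

# ASCII-only text is byte-for-byte identical in ASCII and UTF-8.
ascii_text = "Hello"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print("ASCII-compatible:", same_bytes)
```

Code points up to U+007F take one byte, which is what makes existing ASCII files valid UTF-8 without conversion.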

Examples

  • UTF-8 encoding for 'Hello': 48 65 6C 6C 6F (ASCII-compatible)
  • UTF-8 for Japanese '日本語': E6 97 A5 E6 9C AC E8 AA 9E
  • Mojibake example: 'Ã©' appearing instead of 'é' (UTF-8 bytes decoded as ISO-8859-1)
  • BOM (Byte Order Mark): optional three-byte prefix (EF BB BF) at the start of a UTF-8 file; UTF-8 does not need it, and some tools mishandle it
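The byte sequences listed above can be checked with a few lines of Python. This is a minimal sketch; the helper name utf8_hex is hypothetical, and the 'utf-8-sig' codec is Python's standard way to write or strip the BOM.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated uppercase hex."""
    return text.encode("utf-8").hex(" ").upper()

hello_hex = utf8_hex("Hello")
japanese_hex = utf8_hex("\u65e5\u672c\u8a9e")  # the three characters of '日本語'
print(hello_hex)
print(japanese_hex)

# 'utf-8-sig' prepends the BOM (EF BB BF) when encoding
# and removes it again when decoding.
with_bom = "Hello".encode("utf-8-sig")
has_bom = with_bom.startswith(b"\xef\xbb\xbf")
without_bom = with_bom.decode("utf-8-sig")
print(has_bom, without_bom)
```

Decoding with 'utf-8-sig' is a safe default when reading files of unknown origin, since it accepts input both with and without a BOM.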

Frequently Asked Questions

Why should I always use UTF-8?

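UTF-8 can encode every Unicode character, so it covers all languages. It is the web standard (used by more than 98% of websites), it is backward-compatible with ASCII, and using a single encoding everywhere avoids the mismatches that cause encoding errors. For new projects there is rarely a reason to choose another encoding.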

What causes mojibake (garbled text)?

Mojibake occurs when text encoded in one character set is decoded using another. For example, UTF-8 text read as ISO-8859-1 turns 'é' into 'Ã©'. To fix it, make sure the database, files, HTTP headers, and HTML meta tags all declare and use UTF-8.
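The mismatch described above can be reproduced, and in simple cases reversed, in Python. This is a sketch under the assumption that the text was garbled exactly once by an ISO-8859-1 decode; the function name repair_mojibake is hypothetical, and text garbled through other encodings or several rounds needs different handling.

```python
original = "caf\u00e9"  # 'café'

# Correct bytes on disk or on the wire.
utf8_bytes = original.encode("utf-8")

# The bug: the reader assumes ISO-8859-1 instead of UTF-8.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)  # 'é' has become two characters

def repair_mojibake(text: str) -> str:
    """Undo a single UTF-8-read-as-ISO-8859-1 mix-up.

    Returns the input unchanged if it does not fit that pattern.
    """
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

repaired = repair_mojibake(garbled)
print(repaired == original)
```

Repair of this kind is a last resort. Passing an explicit encoding wherever text crosses a boundary, for example open(path, encoding="utf-8"), avoids the problem at the source.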
