Character encoding | Human <-> PC

Farhan Tanvir Utshaw
2 min read · Oct 2, 2023


Computers only understand 0s and 1s. Decimal numbers can be represented in base 2, but what about characters? We need an agreed-upon mapping between characters and numeric values.

Key takeaways:

  • One or more characters (code points) make up a grapheme.
  • To decode bytes into graphemes, we need to know which encoding was used. Otherwise, we get mojibake.
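Mojibake is easy to reproduce: encode text with one encoding and decode the bytes with another. A small Python sketch:

```python
# "café" encoded as UTF-8 produces two bytes for "é" (0xC3 0xA9).
data = "café".encode("utf-8")
print(data)                    # b'caf\xc3\xa9'

# Decoding with the right encoding recovers the text.
print(data.decode("utf-8"))    # café

# Decoding with the wrong encoding (Latin-1) misreads each byte
# as its own character: classic mojibake.
print(data.decode("latin-1"))  # cafÃ©
```

The bytes never change; only our interpretation of them does.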

Grapheme: Unit for human writing system vs Character: code point

A grapheme is the smallest meaningful unit of writing in a language: it is what you see on screen. A character (code point) is how that is stored in memory. One or more code points combine to represent a grapheme.

Figure: characters combining to form a grapheme
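This split between graphemes and code points is visible in Python, where `len()` counts code points, not graphemes. Here "é" is built from two code points ('e' plus a combining accent) yet renders as one grapheme:

```python
import unicodedata

# One grapheme "é" made of two code points:
# U+0065 (e) followed by U+0301 (combining acute accent).
g = "e\u0301"
print(g)        # é — displayed as a single grapheme
print(len(g))   # 2 — but stored as two code points

# The same grapheme also exists as a single precomposed code point, U+00E9.
single = "\u00e9"
print(len(single))  # 1

# Unicode normalization (NFC) maps the two-code-point form
# to the precomposed one, confirming they are the same grapheme.
print(unicodedata.normalize("NFC", g) == single)  # True
```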

Unicode code points

ASCII | 7bits | 128 possible values | # of characters = # of bytes

ASCII has a single encoding technique: map each character to a decimal value between 0 and 127, which is then represented in binary for storage on the machine.
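Because every ASCII character fits in one byte, the encoded length always equals the character count. A quick check in Python:

```python
text = "Hi!"
encoded = text.encode("ascii")

# Each character maps to one byte holding its ASCII value.
print(list(encoded))              # [72, 105, 33]
print(len(text) == len(encoded))  # True: # of characters == # of bytes

# ord() gives the ASCII value of a character, chr() reverses it.
print(ord("A"))   # 65
print(chr(65))    # A
```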

UTF-32 | All code points are of 4 bytes | rarely used

Takes a code point and converts it to binary using a fixed 4 bytes. Wasteful, as it allocates the same 4 bytes, mostly zeros, even for small code points, e.g., A or B.
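The waste is visible in the raw bytes. Encoding "AB" with UTF-32 (big-endian, no byte-order mark) spends 8 bytes on 2 characters, most of them zero:

```python
# UTF-32 big-endian, without a BOM, so the padding zeros are easy to see.
data = "AB".encode("utf-32-be")

print(data.hex())  # 0000004100000042 — three zero bytes per letter
print(len(data))   # 8 bytes for just 2 characters
```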

UTF-8 | Most adopted | Code points get varying bytes

Code points with lower values map to 1 byte, while higher values get 2 to 4 bytes. English letters have the same mapping in ASCII and UTF-8, which makes UTF-8 backward compatible: any valid ASCII file is also a valid UTF-8 file. So English letters get fewer bytes, while characters from other scripts are assigned more.
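The variable-length scheme can be sketched by encoding one character from each byte-length tier (the Bengali letter and the emoji are illustrative picks):

```python
# UTF-8 byte counts grow with the code point value.
for ch in ["A", "é", "অ", "😀"]:
    print(ch, len(ch.encode("utf-8")))
# A → 1 byte, é → 2 bytes, অ → 3 bytes, 😀 → 4 bytes

# Backward compatibility: pure-ASCII text encodes to
# identical bytes under ASCII and UTF-8.
print("Hello".encode("ascii") == "Hello".encode("utf-8"))  # True
```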
