Character encoding

Farhan Tanvir Utshaw
1 min read · Oct 2, 2023


Computers only understand 0s and 1s. We can represent numbers in base 2 (binary). But what about characters? We need an agreed-upon mapping between characters and numeric values.
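You can see this mapping directly in, for example, Python (a minimal sketch; the idea is the same in any language). The built-ins ord() and chr() convert between a character and its assigned number:

```python
# Every character is backed by an agreed-upon number (a code point).
print(ord("A"))   # 65  -> the number assigned to 'A'
print(chr(65))    # 'A' -> the character assigned to 65
```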

ASCII encoding works like this: each character is converted into a corresponding binary number. One character -> 1 byte. It works only for English, since standard ASCII uses just 7 bits (128 symbols), and even a full 8 bits can represent at most 256 symbols.
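A quick sketch of both sides of that limitation, again in Python:

```python
# ASCII text: each character fits in a single byte.
data = "hello".encode("ascii")
print(list(data))  # [104, 101, 108, 108, 111]
print(len(data))   # 5 characters -> 5 bytes

# Characters outside ASCII's range simply cannot be encoded.
try:
    "héllo".encode("ascii")
except UnicodeEncodeError as e:
    print(e)  # 'ascii' codec can't encode character '\xe9' ...
```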

UTF-32

One code point -> 4 bytes, always. That makes it wasteful: even a character that only “needs” 1 byte, e.g. an English letter, is assigned 4 bytes.
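You can verify the fixed width in Python ("utf-32-le" is used here only to avoid the extra byte-order mark that plain "utf-32" prepends):

```python
# UTF-32 spends a fixed 4 bytes per code point, even for plain ASCII.
print(len("A".encode("utf-32-le")))      # 4 bytes for one English letter
print(len("hello".encode("utf-32-le")))  # 20 bytes for 5 letters
```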

UTF-8 | The most widely adopted encoding scheme

UTF-8 solves the issues with UTF-32: it maps each code point to between 1 and 4 bytes.

Code points with lower values are mapped to 1 byte, while larger ones are given more. English letters get the same mapping in ASCII and UTF-8. As a result, UTF-8 is backward compatible: any valid ASCII file is also valid UTF-8, so programs expecting ASCII text can read it without change. So, English letters get fewer bytes, while characters from other scripts are assigned more.
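Both properties are easy to demonstrate in Python:

```python
# UTF-8 is variable-width: lower code points get fewer bytes.
for ch in ["A", "é", "中", "😀"]:
    print(ch, len(ch.encode("utf-8")))  # 1, 2, 3, 4 bytes respectively

# Backward compatibility: ASCII bytes are already valid UTF-8.
ascii_bytes = "hello".encode("ascii")
print(ascii_bytes.decode("utf-8"))  # 'hello'
```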
