Character encoding | Human <-> PC
Computers only understand 0s and 1s. We can represent numbers in base 2, but what about characters? We need an agreed-upon mapping between characters and numeric values.
Key takeaways:
- One or more characters (code points) make up a grapheme
- To decode bytes back into graphemes, we need to know which encoding was used; otherwise we get mojibake (garbled text), as the sketch below shows.
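A minimal Python 3 sketch of mojibake: bytes written as UTF-8 but decoded as Latin-1 come out garbled.

```python
# Encode text as UTF-8, then decode the same bytes with the wrong encoding.
data = "café".encode("utf-8")   # b'caf\xc3\xa9'
print(data.decode("utf-8"))     # café   -> correct
print(data.decode("latin-1"))   # cafÃ©  -> mojibake
```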
Grapheme: unit of a human writing system vs. Character: code point
A grapheme is the smallest meaningful unit of writing in a language; it is what you see on screen. A character (code point) is how that text is stored in memory. One or more code points combine to represent a single grapheme.
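A quick Python sketch of the distinction: the accented "é" below is one grapheme built from two code points ("e" plus a combining acute accent).

```python
# One grapheme, two code points: "e" (U+0065) + combining acute accent (U+0301).
s = "e\u0301"
print(s)                         # é  -> one grapheme on screen
print(len(s))                    # 2  -> two code points in memory
print([hex(ord(c)) for c in s])  # ['0x65', '0x301']
```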
Unicode code points
ASCII | 7 bits | 128 possible values | # of characters = # of bytes
ASCII has a single encoding scheme: each character maps to a decimal value between 0 and 127, which is then represented in binary for storage on the machine.
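A small sketch using Python's built-in `ascii` codec, which implements exactly this mapping:

```python
# Each ASCII character maps to a value in 0..127 and occupies exactly one byte.
text = "Hi!"
data = text.encode("ascii")
print(list(data))            # [72, 105, 33]
print(len(text), len(data))  # 3 3 -> one byte per character
```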
UTF-32 | Every code point takes 4 bytes | Rarely used
UTF-32 converts each code point to a fixed 4-byte binary value. This is wasteful: even small code points, e.g. A or B, occupy the same 4 bytes, mostly filled with zeros.
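A sketch of that waste, using Python's `utf-32-be` codec (big-endian, so no byte-order mark is prepended):

```python
# Even "A" (code point 65) consumes 4 bytes in UTF-32.
data = "AB".encode("utf-32-be")
print(data.hex(" "))  # 00 00 00 41 00 00 00 42  (separator arg needs Python 3.8+)
print(len(data))      # 8 bytes for just 2 characters
```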
UTF-8 | Most widely adopted | Code points take a varying number of bytes
Code points with lower values map to 1 byte, while higher values take 2 to 4 bytes. English letters get the same mapping in ASCII and UTF-8, so UTF-8 is backward compatible: any valid ASCII text is also valid UTF-8, and programs expecting ASCII can read it unchanged. As a result, English letters take fewer bytes while characters from other scripts take more.
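To see the varying widths, a quick Python sketch encoding one character from each byte-length class:

```python
# UTF-8 width grows with the code point: 1 byte for ASCII, up to 4 for emoji.
for ch in ["A", "é", "€", "😀"]:
    data = ch.encode("utf-8")
    print(ch, f"U+{ord(ch):04X}", len(data), "byte(s):", data.hex(" "))
# A U+0041 1 byte(s): 41
# é U+00E9 2 byte(s): c3 a9
# € U+20AC 3 byte(s): e2 82 ac
# 😀 U+1F600 4 byte(s): f0 9f 98 80
```

Note the first line: the UTF-8 bytes for "A" are identical to its single ASCII byte, which is exactly the backward compatibility described above.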