Hello! I'm excited about this article. We're going to learn about an important concept: character encoding, which acts as a bridge between human language and computer language.

Let's say you and I speak different languages. I understand only numbers, and you understand only written symbols. How can we talk to each other? This is exactly the challenge people faced when computers were first created. Fundamentally, computers only understand numbers, specifically patterns of electrical signals that represent 0s (off) and 1s (on). Still, we use computers with human-readable symbols, characters, and letters. Clearly, something acts as a bridge or translator, converting human-readable symbols into binary numbers and back again. That translator is character encoding.

Wait, before we can understand character encoding, we first need to understand what a character is, so let's do that.

The Foundations of Human Communication

Before explaining encoding, let's first understand what we're encoding. A "character" is the smallest meaningful unit in a written system. In English, characters include the letters A through Z, the digits 0-9, punctuation marks like periods and commas, and special symbols like @ and #.

We also need to realize that different writing systems have different characters:

- Latin script has 26 basic letters
- Hindi (Devanagari script) has 47 primary characters (33 consonants and 14 vowels) plus various modifiers and conjunct forms
- Chinese has tens of thousands of unique characters
- Russian Cyrillic has 33 letters
- Japanese uses thousands of kanji characters plus the hiragana and katakana syllabaries

One more thing to note: we humans make sense of these symbols visually. We see an "A" or any other symbol and immediately recognize it, but that's not the case with computers, because they can't "see" or understand these symbols directly.

Now that we know what we mean by a "character", let's focus on the word "encoding". Let's start by understanding why we even need encoding at all.

Why Encoding is Necessary

As we know, computers work using electricity. Specifically, they control the flow of electrical current through millions of tiny switches called transistors. Each switch can be in one of two states: on or off, which we represent as 1 or 0 respectively.

This is the system computers use, just like we humans use different systems such as English and Hindi. Computers use a binary (two-state) system.

Computers can only store and process sequences of 0s and 1s. They can't directly store the curved line of an "S" or the three horizontal strokes of an "E." They can only store numbers. This creates a huge problem: how do we represent human text using only numbers that computers can process? This is exactly what character encoding solves.

So, to recap: character encoding is necessary because computers only understand binary (0s and 1s), while humans communicate with complex visual symbols. We need a translation system between these two languages.
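Here's a minimal sketch of that translation idea, using Python purely as an illustration (any language with functions like ord and chr would work the same way):

```python
# To a computer, the letter 'A' is not a shape but a number stored as bits.
print(ord("A"))                  # 65  -> the number assigned to 'A'
print(format(ord("A"), "08b"))   # '01000001' -> that number written out as bits
print(chr(65))                   # 'A' -> decoding: from the number back to the symbol
```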
Now we can start looking at the encodings that were invented to solve this problem.

Morse Code: The First Practical Character Encoding

Back in the 1830s and 1840s, people faced the same challenge when trying to send messages over telegraph wires. The telegraph could only transmit pulses: either the circuit was connected (on) or disconnected (off). Samuel Morse and Alfred Vail developed Morse code to solve this problem. They created a system where each letter of the alphabet was represented by a unique combination of short pulses (dots) and long pulses (dashes):

- A: ·− (dot-dash)
- B: −··· (dash-dot-dot-dot)
- E: · (a single dot)
- S: ··· (three dots)

This was essentially a character encoding system: it translated human-readable characters into patterns that could be transmitted electronically and then decoded back into letters by the receiver. It went on to influence the encoding systems used by computers.

ASCII: The First Universal Standard

ASCII (American Standard Code for Information Interchange) was created in 1963, and it was revolutionary because it became the first widely adopted standard for exchanging text between different computers. ASCII used 7 bits per character, allowing for $2^7 = 128$ different characters. Here are some common characters and their codes:

- 'A' = 65 (binary: 1000001)
- 'B' = 66 (binary: 1000010)
- 'a' = 97 (binary: 1100001)
- '1' = 49 (binary: 0110001)
- Space = 32 (binary: 0100000)

The design of ASCII was quite thoughtful:

- Control characters were assigned values 0-31
- The space, digits, and most punctuation marks got values 32-64
- Uppercase letters ran from 65-90
- Lowercase letters ran from 97-122

Notice the pattern: lowercase 'a' (97) is exactly 32 greater than uppercase 'A' (65). This holds for every letter, making it easy to convert between uppercase and lowercase by simply adding or subtracting 32.

Okay, we understand how ASCII maps characters to binary sequences, but let's walk through an example to see the entire flow of the conversion.

How ASCII Works in Practice

When you type the word "Hello" on a keyboard using an ASCII-based system:

1. The computer registers that you pressed the 'H' key.
2. It looks up the ASCII code for 'H', which is 72.
3. It stores the number 72 in binary: 1001000.
4. It continues for each letter: 'e' (101), 'l' (108), 'l' (108), 'o' (111).
5. The complete word "Hello" becomes the sequence 72, 101, 108, 108, 111, or in binary: 1001000, 1100101, 1101100, 1101100, 1101111.

These binary numbers can be stored in computer memory, transmitted over networks, or saved to files. When another computer needs to display the text, it performs the reverse process, converting each number back to its corresponding character.
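We can reproduce that flow in a few lines. The sketch below uses Python only as a demonstration; the point is the number-per-character mapping, not the language:

```python
# "Hello" as ASCII codes and as 7-bit binary, exactly as described above.
word = "Hello"
codes = [ord(ch) for ch in word]
print(codes)                              # [72, 101, 108, 108, 111]
print([format(c, "07b") for c in codes])  # ['1001000', '1100101', '1101100', '1101100', '1101111']

# The fixed offset of 32 between cases makes conversion a simple add/subtract.
print(chr(ord("A") + 32))                 # 'a'
print(chr(ord("g") - 32))                 # 'G'
```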
So, ASCII created the first standardized character encoding, allowing different computers to exchange text. It assigned a unique number (0-127) to each English letter, digit, punctuation mark, and control symbol. But there's a big problem with ASCII. Can you guess it?

Extended ASCII: The First Limitations

ASCII was great, but its 128-character limit became a serious problem as computing spread globally. It couldn't represent accented characters (like é or ñ), much less non-Latin scripts like Cyrillic, Greek, Arabic, or Asian writing systems.

To address this limitation, extended ASCII encodings were developed. Since computers were increasingly built to handle data in 8-bit bytes (which can store values from 0-255), it was natural to extend ASCII by using the 8th bit. This allowed for an additional 128 characters (values 128-255).

Because of this, different countries and regions created their own extensions, resulting in a collection of incompatible encoding systems:

- Code page 437: the original IBM PC character set, with box-drawing characters and some European letters
- ISO 8859-1 (Latin-1): Western European languages
- ISO 8859-2: Central and Eastern European languages
- ISO 8859-5: Cyrillic script
- Windows-1252: Microsoft's slightly modified version of Latin-1

Each of these encodings used the same values (0-127) for standard ASCII characters but assigned different characters to the extended values (128-255). This created another problem, known as "mojibake."

The Problem of Mojibake

When text encoded in one system is interpreted using another encoding, the result is garbled text known as "mojibake" (a Japanese term meaning "character transformation"). Imagine sending a message in one secret code, but the person receiving it assumes it's in a completely different code!

For example, if the German word "Grüße" (greetings) was encoded using ISO 8859-1 and then viewed on a system using ISO 8859-5 (Cyrillic), the "ü" and "ß" would appear as completely different, likely nonsensical, characters. This is because the same number (in the range 128-255) represents different symbols in each encoding.

This was especially problematic for:

- Email messages sent between different countries
- Websites viewed on computers with different language settings
- Documents shared between different operating systems
- Software used in international contexts

So, extended ASCII attempted to solve the problem of representing non-English characters, but it created fragmentation, with different countries adopting incompatible standards. This led to text displaying incorrectly whenever it was viewed with the wrong encoding.
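You can reproduce mojibake yourself. The sketch below (Python, purely illustrative) writes the bytes with one extended-ASCII encoding and then reads the very same bytes back with another:

```python
text = "Grüße"

# Encode with a Western European code page...
raw = text.encode("iso-8859-1")    # b'Gr\xfc\xdfe'

# ...then decode the same bytes as if they were ISO 8859-5 (Cyrillic).
garbled = raw.decode("iso-8859-5")
print(garbled)   # the 'ü' and 'ß' come back as unrelated Cyrillic letters
```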
Unicode: A Character Set for All Languages

To fix the fragmentation of extended ASCII, engineers from Apple and Xerox started working on a new approach. Their goal was to create a single standard that could represent every character from every writing system ever used by humans.

What is Unicode?

Unicode isn't actually an encoding system; it's a character set that assigns a unique identification number, known as a code point, to every character. These code points are usually written with the prefix "U+" followed by a hexadecimal number:

- Latin capital letter A: U+0041
- Greek capital letter alpha (Α): U+0391
- Hebrew letter alef (א): U+05D0
- Arabic letter alef (ا): U+0627
- Devanagari letter A (अ): U+0905
- Chinese character for "person" (人): U+4EBA
- Emoji grinning face (😀): U+1F600

You can browse the full list of code points in the official Unicode code charts.

Initially, Unicode used 16 bits per character, which allowed for 65,536 different characters. However, it soon became clear that 16 bits wouldn't be enough for all the world's writing systems. The standard has since expanded to cover code points from U+0000 to U+10FFFF, room for over 1.1 million characters. Unicode currently assigns about 150,000 of them, covering more than 150 modern and historic writing systems, plus symbols, emojis, and other special characters. So it's safe to say we probably won't run out of characters anytime soon!

Unicode solved the fragmentation problem by creating a single universal character set that can represent all human writing systems. It assigns each character a unique code point, regardless of the language or script it comes from.
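Most programming languages expose code points directly. As a quick illustration (Python again, where ord returns the code point), we can recover the values listed above:

```python
# Print each character with its Unicode code point in the usual U+XXXX notation.
for ch in ["A", "Α", "א", "अ", "人", "😀"]:
    print(ch, f"U+{ord(ch):04X}")
# Prints: U+0041, U+0391, U+05D0, U+0905, U+4EBA, U+1F600
```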
Unicode Transformation Formats (UTFs)

Remember, computers can only store 0s and 1s. Unicode gives every character a code point, but it doesn't say how to store those code points in memory or in files. For that we have several encoding systems for Unicode. These encoding methods define how to convert Unicode code points into binary sequences that computers can store and process.

UTF-32: Simple but Inefficient

The simplest approach is UTF-32, which uses exactly 4 bytes (32 bits) for every character. This makes processing simple: each character takes the same amount of space, and you can jump to the nth character in a string by multiplying n by 4.

But UTF-32 is very inefficient for most text. Since the vast majority of commonly used characters have code points that fit in 2 bytes or even 1 byte, using 4 bytes for everything wastes a lot of space. It's like using a huge truck to deliver a single letter!

UTF-16: A Compromise

UTF-16 tries to balance simplicity and efficiency by using 2 bytes (16 bits) for the most common characters (those in the Basic Multilingual Plane, with code points up to U+FFFF) and 4 bytes for the less common ones. This was the encoding used by early adopters of Unicode, including Windows NT, Java, and JavaScript. But it still has drawbacks: it isn't compatible with ASCII, and it has complications related to "byte order" (whether the most significant byte comes first or last).

UTF-8: The Elegant Solution

UTF-8, designed by Ken Thompson and Rob Pike in 1992, has become the dominant encoding on the modern web and in many operating systems. Its design is remarkably elegant:

- It uses a variable number of bytes per character:
  - 1 byte for code points 0-127 (ASCII characters)
  - 2 bytes for code points 128-2047 (most Latin-script languages, Greek, Cyrillic, Hebrew, Arabic, etc.)
  - 3 bytes for code points 2048-65535 (most Chinese, Japanese, and Korean characters)
  - 4 bytes for code points above 65535 (rare characters, historical scripts, emojis)
- The bytes are structured to make error detection and synchronization possible:
  - Single-byte characters start with a 0 bit: 0xxxxxxx
  - The first byte of a multi-byte sequence indicates its length with the number of leading 1 bits, followed by a 0: 110xxxxx for 2 bytes, 1110xxxx for 3 bytes, etc.
  - Continuation bytes always start with the pattern 10xxxxxx
- It's backward compatible with ASCII: any ASCII text is already valid UTF-8, without any changes. This is a huge advantage!
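A quick way to see the size trade-offs between the three formats is to encode the same characters with each and count the bytes. This is only a sketch; the "-be" codec variants are used just to skip the byte-order mark:

```python
# Compare how many bytes each UTF encoding needs for the same characters.
for ch in ["A", "é", "अ", "😀"]:
    sizes = [len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")]
    print(ch, sizes)   # A [1, 2, 4]   é [2, 2, 4]   अ [3, 2, 4]   😀 [4, 4, 4]

# Backward compatibility: plain ASCII text is byte-for-byte identical in UTF-8.
print("Hello".encode("ascii") == "Hello".encode("utf-8"))   # True
```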
Let's break down how the character "é" (Latin small letter e with acute accent), with Unicode code point U+00E9, is encoded in UTF-8:

1. First, we convert the code point to binary: U+00E9 = 233 in decimal = 11101001.
2. Since 233 is greater than 127, we need more than one byte. The value falls in the range 128-2047, so we need 2 bytes.
3. The 2-byte pattern in UTF-8 is 110xxxxx 10xxxxxx, where the x's will be replaced by our actual bits. That gives us 11 payload positions (5 in the first byte, 6 in the second), so we pad our value to 11 bits: 00011101001.
4. Working from right to left, we split those 11 bits: the last 6 bits (101001) go into the second byte's xxxxxx positions, giving 10101001, and the remaining 5 bits (00011) go into the first byte's xxxxx positions, giving 11000011.

So, the UTF-8 encoding of "é" is the two bytes 11000011 10101001, which in hexadecimal is C3 A9.
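We can double-check that hand calculation, and even reproduce the bit-shuffling ourselves, with a small illustrative snippet (the manual part below handles only the 2-byte case, code points 0x80-0x7FF):

```python
# The built-in codec agrees with the hand calculation: C3 A9.
print("é".encode("utf-8").hex())          # 'c3a9'

# Hand-rolled 2-byte UTF-8 encoding, following the steps above.
cp = ord("é")                             # 0xE9 = 233
byte1 = 0b11000000 | (cp >> 6)            # leading byte:      110xxxxx (top 5 bits)
byte2 = 0b10000000 | (cp & 0b00111111)    # continuation byte: 10xxxxxx (low 6 bits)
print(f"{byte1:08b} {byte2:08b}")         # 11000011 10101001
print(f"{byte1:02X} {byte2:02X}")         # C3 A9
```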
UTF-8 has become the dominant encoding method for Unicode because of this elegant design. It uses a variable number of bytes, is backward compatible with ASCII, and efficiently represents characters from all languages.

How Text Gets from Keyboard to Screen

Let's trace the journey of character encoding through a simple example: typing the letter 'A' on your keyboard and seeing it appear on screen.

1. Input: you press the 'A' key on your keyboard.
2. Keyboard controller: sends a scan code to the computer.
3. Operating system: translates the scan code to a character based on your keyboard layout.
4. Text processing: the application receives this as the character 'A'.
5. Unicode mapping: the application maps 'A' to Unicode code point U+0041.
6. Encoding: if the text needs to be stored or transmitted, it's encoded (most likely as UTF-8).
7. Storage: the encoded bytes are written to memory or disk.
8. Rendering: when the text is displayed, the process reverses. The bytes are read, decoded back to the code point U+0041, and rendered as the glyph 'A' using a font.

Conclusion: The Evolution of Character Encoding

The journey from ASCII to Unicode and the UTF formats is really fascinating. As technology spread worldwide, the need to represent diverse writing systems became crucial. ASCII served us well for English text, but its limitations became apparent as computing went global. Extended ASCII attempted to address this but created fragmentation with incompatible standards. Unicode solved that fragmentation by creating a universal character set that can represent all human writing systems, and the UTF encoding formats, particularly UTF-8, provided efficient ways to implement Unicode in actual computer systems.

UTF-8 has become the dominant encoding standard because:

- It's backward compatible with ASCII
- It efficiently represents characters from all languages
- It uses a variable number of bytes, saving storage space
- It's designed for error detection and synchronization

Today, character encoding continues to evolve as new symbols and writing systems are added to Unicode. The next time you type in any language or use an emoji, remember the complex system of character encoding that makes it possible for computers to understand these human symbols.