A process of converting data from one format to another is called encoding. In this process the data and format both can be variable, means for any kind of data to any kind of format, if we do any conversion, the process will be called encoding. Based on the nature of data there are quite a few terms that are popular like media encoding, character encoding. Media Encoding refers to converting audio, video files to different formats e.g. mp3, avi, wav, etc. Similarly, the character encoding is also very popular which refers to representing character data to machine-readable binary format e.g. ASCII, US-ASCII, UTF-8. Although term encoding refers to a verb or process, sometimes used as a noun meant encoding itself. People tend to say mp3 encoding, or mp4 encoding, where they meant that data or media file is encoding in that particular format. Integer Encoding In this article, we will be discussing a particular encoding for integer values. We try to understand how integer values can be represented in binary and specifically how effectively it can be done. Everyone must have heard and read terms bytes and bits. A bit is an atomic unit a computer can measure or understand. Computers read a bit either as value 1 or value 0. The collection of 8 bits together is called a byte where each bit can be represented as either 0 or 1. So there are total combinations of those bits. Computers and programming languages are designed to read and write the collection of those eight bits a.k.a Byte together instead of reading/writing individual bits. 2 ^ 8 = 256 Each combination of those bits represents a decimal value. e.g. represents zero, and represents one. Was not that complex 😄. But it gets complicated with other discrete values as bit combination represents number and represents . Understanding this representation is not that complex and can be demystified with the following table. Where each bit represents a particular decimal value and we add up all values for which bits are turned . 00000000 00000001 00000101 5 100000001 129 1 Encoding Large Integers — The problem With the help of the above diagram, it’s easy to convert any decimal number to binary representation and vice versa. But what about numbers greater than 256 or larger than the capacity of storage in a single byte. You may have thought obviously more bytes and you are not wrong. We can store larger integer values say in two bytes which will be . So the binary representation of will be which will be stored in two bytes and can be stored as and . And decimal value larger than can also be stored in three bytes and so on. There is no limitation. 2^16 = 65536 7896 0001111011011000 00011110 11011000 65536 But there is a problem since bytes are read and write in a sequence to the memory while reading if found two bytes and we can’t be sure if these are and or two bytes combined as . 00011110 11011000 30 216 7896 There can be different solutions to the problem, one is to pre-define the size of the value you wrote and then read that number of bytes back. This concept is called primitive types in any programming language. Programming languages define different types and have associated fixed number of bytes, e.g. in C type take 2 bytes and take 8 bytes. When we store and read those values, compiler already aware of their type and read that much byte sequence. short long Another solution can be defined as a delimiter or an identifier when a particular byte sequence defines the boundary. For data types with variable length, this solution is mostly used. For example in C and other languages string values are delimited by a null byte. So while reading a , the compiler keeps reading the bytes until found a null byte. string This solution can also be used for larger integers or varying size integer encoding but will cause wastage storage because most of the computer operations are involved in a huge number of computations on numbers. Base 128 Varints — The solution To overcome the problem of representing varying length integer values s technique is used. This does not require a delimiter byte, instead use a single bit for this purpose. With this technique we can save a lot of storage as well we store any number of bytes associated with an integer value. Base 128 Varint This encoding technique is one from many of and another common name associated with Base 128 Varints is . This encoding is used DWARF file format, WebAssembly and Google Protobuf to represent integer values in smaller bytes possible. variable-length encoding LEB128 — Little Endian 128 The process of encoding any integer to is not that complex and consists of the following steps. LEB128 Convert the number to binary presentation Arrange bits in the group of 7 bits keeping 8th bit empty Fill up every 8th bit of each byte with 1 for least significant bytes (LSB) Fill most significant byte (MSB) remaining bits with 0 Reverse the order of bytes You have LEB128 encoded binary representation While reading the byte sequence: Read one byte from the sequence of bytes Check if its 8th bit of that byte is set to 1 keep it in a buffer and then repeat step 1. When reading a byte of which 8th bit is set to 0 we know that all bytes associated with one integer value are completed. Reverse the order of bytes in the buffer Now we have to remove 8th bit of each byte in the buffer Join remaining 7 bits of each byte together And here we have our original integer value Was not that so easy, Take the example of a number which have a binary representation . Let’s encode it with with the help of this diagram and the above steps. 963412 111010110011 01010100 LEB128 What about LEB128 for signed integers Signed numbers are the ones where you have to store the information if numbers were positive or negative. This information is stored in the most significant bit. While the whole binary representation is done as scheme. So normal binary representation of in one byte is but for we will do two’s complement which will be . That’s how it is calculated: two’s compliment 12 00001100 -12 11110100 Decimal Binary 12 00001100 11110011 1 's compliment (invert the bits) 11110100 2 's compliment (add 1) The rest of the calculations for LEB128 remains the same on top of this resulted binary (2’s complement of same positive number). For decoding it back to an integer you must already aware if it was a signed integer while encoding or not. This is the reason all programming languages treat signed and unsigned data types separately. And then you just decode normally and take two’s complement of the output. That will give absolute value that you can negate to get a signed number. In case you want to be generic signed encoding algorithm for positive or negative values you can do by taking two’s complement of any value. Is that all Data encoding is a vast field and the topic of integer encoding itself has a lot of dimensions. Base 128 Varint is not the only integer encoding technique. There are a lot more, few popular ones are , , . Fibonacci Encoding Elias gamma coding HPACK If you know the distribution of input values and use of the output that you can decide upon which encoding is best for you. You can use any, develop new for yourself, or even can optimize existing for your need. Implementation Here is one implementation of LEB128 algorithm. Please bear in mind there can be many different approaches to solve this problem or to implement this algorithm, so feel free to develop your own. result = bytearray() number.bit_length() > : single_byte = (number & ) | result.append(single_byte) number = number >> result.append(number) result needle = offset pair_count = result = : single_byte = byte_array[needle] single_byte & == : single_byte = single_byte & single_byte = single_byte << (pair_count * ) result = result | single_byte needle = needle + pair_count = pair_count + single_byte = single_byte << (pair_count * ) result = result | single_byte result # unsigned_leb128_encode_decode.py # Python 3.7.4 : def encode_unsigned_leb128 (number) # While number is greater than a byte while 7 # Get first 7 bits and append 1 on 8th bit 0b01111111 0b10000000 # Append the byte to result # Truncate right 7 bits 7 # Append remaining byte to result # As we appended earlier no need to reverse the bytes return : def decode_unsigned_leb128 (byte_array, offset= ) 0 0 0 while True # If first bit is 1 if 0b10000000 0 break # Remove first bit 0b01111111 # Push number of bits we already have calculated 7 # Merge byte with result 1 1 # Merge last byte with result 7 return unsigned_leb128_encode_decode decode_unsigned_leb128, encode_unsigned_leb128 bits_multiple_of_7 = ((number.bit_length() + ) // ) * twos_complement = ( << bits_multiple_of_7) - number encode_unsigned_leb128(twos_complement) number = decode_unsigned_leb128(byte_array, offset) bits_multiple_of_7 = (number.bit_length() // ) * twos_complement = ( << bits_multiple_of_7) - number twos_complement # signed_leb128_encode_decode.py # Python 3.7.4 from import : def encode_signed_leb128 (number) # Higher multiple of 7 7 7 7 1 return : def decode_signed_leb128 (byte_array, offset= ) 0 # Lower multiple of 7 7 7 1 return You can download the . Share your thoughts and specially if your feedback if you found any issue with above code. gist form this link Previously published at https://basicdrift.com/explore-encoding-base-128-varints-41665a0dca36