Programmer and technical designer. Currently work on Lisk, a blockhain application platform.
A process of converting data from one format to another is called encoding. In this process the data and format both can be variable, means for any kind of data to any kind of format, if we do any conversion, the process will be called encoding. Based on the nature of data there are quite a few terms that are popular like media encoding, character encoding. Media Encoding refers to converting audio, video files to different formats e.g. mp3, avi, wav, etc.
Similarly, the character encoding is also very popular which refers to representing character data to machine-readable binary format e.g. ASCII, US-ASCII, UTF-8. Although term encoding refers to a verb or process, sometimes used as a noun meant encoding itself. People tend to say mp3 encoding, or mp4 encoding, where they meant that data or media file is encoding in that particular format.
In this article, we will be discussing a particular encoding for integer values. We try to understand how integer values can be represented in binary and specifically how effectively it can be done.
Everyone must have heard and read terms bytes and bits. A bit is an atomic unit a computer can measure or understand. Computers read a bit either as value 1 or value 0. The collection of 8 bits together is called a byte where each bit can be represented as either 0 or 1. So there are total
combinations of those bits. Computers and programming languages are designed to read and write the collection of those eight bits a.k.a Byte together instead of reading/writing individual bits.
2 ^ 8 = 256
Each combination of those bits represents a decimal value. e.g.
represents zero, and
represents one. Was not that complex 😄. But it gets complicated with other discrete values as bit combination
. Understanding this representation is not that complex and can be demystified with the following table. Where each bit represents a particular decimal value and we add up all values for which bits are turned
With the help of the above diagram, it’s easy to convert any decimal number to binary representation and vice versa. But what about numbers greater than 256 or larger than the capacity of storage in a single byte. You may have thought obviously more bytes and you are not wrong.
We can store larger integer values say in two bytes which will be
. So the binary representation of
2^16 = 65536
which will be stored in two bytes and can be stored as
. And decimal value larger than
can also be stored in three bytes and so on. There is no limitation.
But there is a problem since bytes are read and write in a sequence to the memory while reading if found two bytes
we can’t be sure if these are
or two bytes combined as
There can be different solutions to the problem, one is to pre-define the size of the value you wrote and then read that number of bytes back. This concept is called primitive types in any programming language. Programming languages define different types and have associated fixed number of bytes, e.g. in C
type take 2 bytes and
take 8 bytes. When we store and read those values, compiler already aware of their type and read that much byte sequence.
Another solution can be defined as a delimiter or an identifier when a particular byte sequence defines the boundary. For data types with variable length, this solution is mostly used. For example in C and other languages string values are delimited by a null byte. So while reading a
, the compiler keeps reading the bytes until found a null byte.
This solution can also be used for larger integers or varying size integer encoding but will cause wastage storage because most of the computer operations are involved in a huge number of computations on numbers.
To overcome the problem of representing varying length integer values Base 128 Varints technique is used. This does not require a delimiter byte, instead use a single bit for this purpose. With this technique we can save a lot of storage as well we store any number of bytes associated with an integer value.
This encoding technique is one from many of variable-length encoding and another common name associated with Base 128 Varints is LEB128 — Little Endian 128. This encoding is used DWARF file format, WebAssembly and Google Protobuf to represent integer values in smaller bytes possible.
The process of encoding any integer to
is not that complex and consists of the following steps.
While reading the byte sequence:
Was not that so easy, Take the example of a number
which have a binary representation
. Let’s encode it with
with the help of this diagram and the above steps.
Signed numbers are the ones where you have to store the information if numbers were positive or negative. This information is stored in the most significant bit. While the whole binary representation is done as two’s complimentscheme. So normal binary representation of
in one byte is
we will do two’s complement which will be
. That’s how it is calculated:
12 Decimal 00001100 Binary 11110011 1's compliment (invert the bits) 11110100 2's compliment (add 1)
The rest of the calculations for LEB128 remains the same on top of this resulted binary (2’s complement of same positive number).
For decoding it back to an integer you must already aware if it was a signed integer while encoding or not. This is the reason all programming languages treat signed and unsigned data types separately. And then you just decode normally and take two’s complement of the output. That will give absolute value that you can negate to get a signed number.
In case you want to be generic signed encoding algorithm for positive or negative values you can do by taking two’s complement of any value.
Data encoding is a vast field and the topic of integer encoding itself has a lot of dimensions. Base 128 Varint is not the only integer encoding technique. There are a lot more, few popular ones are Fibonacci Encoding, Elias gamma coding, HPACK.
If you know the distribution of input values and use of the output that you can decide upon which encoding is best for you. You can use any, develop new for yourself, or even can optimize existing for your need.
Here is one implementation of LEB128 algorithm. Please bear in mind there can be many different approaches to solve this problem or to implement this algorithm, so feel free to develop your own.
# unsigned_leb128_encode_decode.py # Python 3.7.4 def encode_unsigned_leb128(number): result = bytearray() # While number is greater than a byte while number.bit_length() > 7: # Get first 7 bits and append 1 on 8th bit single_byte = (number & 0b01111111) | 0b10000000 # Append the byte to result result.append(single_byte) # Truncate right 7 bits number = number >> 7 # Append remaining byte to result result.append(number) # As we appended earlier no need to reverse the bytes return result def decode_unsigned_leb128(byte_array, offset=0): needle = offset pair_count = 0 result = 0 while True: single_byte = byte_array[needle] # If first bit is 1 if single_byte & 0b10000000 == 0: break # Remove first bit single_byte = single_byte & 0b01111111 # Push number of bits we already have calculated single_byte = single_byte << (pair_count * 7) # Merge byte with result result = result | single_byte needle = needle + 1 pair_count = pair_count + 1 # Merge last byte with result single_byte = single_byte << (pair_count * 7) result = result | single_byte return result
# signed_leb128_encode_decode.py # Python 3.7.4 from unsigned_leb128_encode_decode import decode_unsigned_leb128, encode_unsigned_leb128 def encode_signed_leb128(number): # Higher multiple of 7 bits_multiple_of_7 = ((number.bit_length() + 7) // 7) * 7 twos_complement = (1 << bits_multiple_of_7) - number return encode_unsigned_leb128(twos_complement) def decode_signed_leb128(byte_array, offset=0): number = decode_unsigned_leb128(byte_array, offset) # Lower multiple of 7 bits_multiple_of_7 = (number.bit_length() // 7) * 7 twos_complement = (1 << bits_multiple_of_7) - number return twos_complement
You can download the gist form this link. Share your thoughts and specially if your feedback if you found any issue with above code.
Previously published at https://basicdrift.com/explore-encoding-base-128-varints-41665a0dca36