A process of converting data from one format to another is called encoding. In this process the data and format both can be variable, means for any kind of data to any kind of format, if we do any conversion, the process will be called encoding. Based on the nature of data there are quite a few terms that are popular like media encoding, character encoding. Media Encoding refers to converting audio, video files to different formats e.g. mp3, avi, wav, etc.
Similarly, the character encoding is also very popular which refers to representing character data to machine-readable binary format e.g. ASCII, US-ASCII, UTF-8. Although term encoding refers to a verb or process, sometimes used as a noun meant encoding itself. People tend to say mp3 encoding, or mp4 encoding, where they meant that data or media file is encoding in that particular format.
In this article, we will be discussing a particular encoding for integer values. We try to understand how integer values can be represented in binary and specifically how effectively it can be done.
Everyone must have heard and read terms bytes and bits. A bit is an atomic unit a computer can measure or understand. Computers read a bit either as value 1 or value 0. The collection of 8 bits together is called a byte where each bit can be represented as either 0 or 1. So there are total
2 ^ 8 = 256
combinations of those bits. Computers and programming languages are designed to read and write the collection of those eight bits a.k.a Byte together instead of reading/writing individual bits.Each combination of those bits represents a decimal value. e.g.
00000000
represents zero, and 00000001
represents one. Was not that complex 😄. But it gets complicated with other discrete values as bit combination 00000101
represents number 5
and 100000001
represents 129
. Understanding this representation is not that complex and can be demystified with the following table. Where each bit represents a particular decimal value and we add up all values for which bits are turned 1
.With the help of the above diagram, it’s easy to convert any decimal number to binary representation and vice versa. But what about numbers greater than 256 or larger than the capacity of storage in a single byte. You may have thought obviously more bytes and you are not wrong.
We can store larger integer values say in two bytes which will be
2^16 = 65536
. So the binary representation of 7896
will be 0001111011011000
which will be stored in two bytes and can be stored as 00011110
and 11011000
. And decimal value larger than 65536
can also be stored in three bytes and so on. There is no limitation.But there is a problem since bytes are read and write in a sequence to the memory while reading if found two bytes
00011110
and 11011000
we can’t be sure if these are 30
and 216
or two bytes combined as 7896
.There can be different solutions to the problem, one is to pre-define the size of the value you wrote and then read that number of bytes back. This concept is called primitive types in any programming language. Programming languages define different types and have associated fixed number of bytes, e.g. in C
short
type take 2 bytes and long
take 8 bytes. When we store and read those values, compiler already aware of their type and read that much byte sequence.Another solution can be defined as a delimiter or an identifier when a particular byte sequence defines the boundary. For data types with variable length, this solution is mostly used. For example in C and other languages string values are delimited by a null byte. So while reading a
string
, the compiler keeps reading the bytes until found a null byte. This solution can also be used for larger integers or varying size integer encoding but will cause wastage storage because most of the computer operations are involved in a huge number of computations on numbers.
To overcome the problem of representing varying length integer values Base 128 Varints technique is used. This does not require a delimiter byte, instead use a single bit for this purpose. With this technique we can save a lot of storage as well we store any number of bytes associated with an integer value.
This encoding technique is one from many of variable-length encoding and another common name associated with Base 128 Varints is LEB128 — Little Endian 128. This encoding is used DWARF file format, WebAssembly and Google Protobuf to represent integer values in smaller bytes possible.
The process of encoding any integer to
LEB128
is not that complex and consists of the following steps.While reading the byte sequence:
Was not that so easy, Take the example of a number
963412
which have a binary representation 111010110011 01010100
. Let’s encode it with LEB128
with the help of this diagram and the above steps.Signed numbers are the ones where you have to store the information if numbers were positive or negative. This information is stored in the most significant bit. While the whole binary representation is done as two’s complimentscheme. So normal binary representation of
12
in one byte is 00001100
but for -12
we will do two’s complement which will be 11110100
. That’s how it is calculated:12 Decimal
00001100 Binary
11110011 1's compliment (invert the bits)
11110100 2's compliment (add 1)
The rest of the calculations for LEB128 remains the same on top of this resulted binary (2’s complement of same positive number).
For decoding it back to an integer you must already aware if it was a signed integer while encoding or not. This is the reason all programming languages treat signed and unsigned data types separately. And then you just decode normally and take two’s complement of the output. That will give absolute value that you can negate to get a signed number.
In case you want to be generic signed encoding algorithm for positive or negative values you can do by taking two’s complement of any value.
Data encoding is a vast field and the topic of integer encoding itself has a lot of dimensions. Base 128 Varint is not the only integer encoding technique. There are a lot more, few popular ones are Fibonacci Encoding, Elias gamma coding, HPACK.
If you know the distribution of input values and use of the output that you can decide upon which encoding is best for you. You can use any, develop new for yourself, or even can optimize existing for your need.
Here is one implementation of LEB128 algorithm. Please bear in mind there can be many different approaches to solve this problem or to implement this algorithm, so feel free to develop your own.
# unsigned_leb128_encode_decode.py
# Python 3.7.4
def encode_unsigned_leb128(number):
result = bytearray()
# While number is greater than a byte
while number.bit_length() > 7:
# Get first 7 bits and append 1 on 8th bit
single_byte = (number & 0b01111111) | 0b10000000
# Append the byte to result
result.append(single_byte)
# Truncate right 7 bits
number = number >> 7
# Append remaining byte to result
result.append(number)
# As we appended earlier no need to reverse the bytes
return result
def decode_unsigned_leb128(byte_array, offset=0):
needle = offset
pair_count = 0
result = 0
while True:
single_byte = byte_array[needle]
# If first bit is 1
if single_byte & 0b10000000 == 0:
break
# Remove first bit
single_byte = single_byte & 0b01111111
# Push number of bits we already have calculated
single_byte = single_byte << (pair_count * 7)
# Merge byte with result
result = result | single_byte
needle = needle + 1
pair_count = pair_count + 1
# Merge last byte with result
single_byte = single_byte << (pair_count * 7)
result = result | single_byte
return result
# signed_leb128_encode_decode.py
# Python 3.7.4
from unsigned_leb128_encode_decode import decode_unsigned_leb128, encode_unsigned_leb128
def encode_signed_leb128(number):
# Higher multiple of 7
bits_multiple_of_7 = ((number.bit_length() + 7) // 7) * 7
twos_complement = (1 << bits_multiple_of_7) - number
return encode_unsigned_leb128(twos_complement)
def decode_signed_leb128(byte_array, offset=0):
number = decode_unsigned_leb128(byte_array, offset)
# Lower multiple of 7
bits_multiple_of_7 = (number.bit_length() // 7) * 7
twos_complement = (1 << bits_multiple_of_7) - number
return twos_complement
You can download the gist form this link. Share your thoughts and specially if your feedback if you found any issue with above code.
Previously published at https://basicdrift.com/explore-encoding-base-128-varints-41665a0dca36