We all use JPG photos on our websites and apps. In this post, I’ll show how you can reduce image sizes by an additional 20–50% with a single line of code. This is accomplished by carefully analyzing the way JPG works and changing its logic.
Can you tell the difference between the two bear photos below? The first is a photo compressed with JPG at quality=90. The second is still a JPG photo that, for the life of me, looks identical, yet takes up half the space.
You can download them to see for yourself. Show them in full screen and check their file sizes. No trickery involved: the output is still a JPG image that looks the same and takes up much less space. If you want to see that I’m not playing tricks with some cherry-picked example, you can play around and upload your own photos right here.
I remember, almost 15 years ago when I was an undergrad in Electrical Engineering, sitting in a classroom being taught how JPEG works. I was in awe at the ingenuity of the engineers who handcrafted a method that takes an image and describes its content with far fewer numbers than you’d get by, say, enumerating the values of each pixel independently.
I wasn’t wrong in being impressed. Despite the provocative title of this post, I’m actually still impressed with the engineers who, back in the 1980s, figured out a pretty good way to shrink an image’s file size while making only visually minor changes. But they left a lot of optimization opportunities untouched. And to see that, we need to understand how it works.
You can skip this portion if you’re not interested in a trip down memory lane, but in order to improve JPG, you really need to understand how it works. So I’m going to briefly skim over what happens in regular old JPG compression. All the numeric examples are ripped straight off of Wikipedia, which has an excellent worked example if you want more detail.
JPG applies the same logic to every 8x8 block of pixel values, and it applies that logic to each color channel independently. So you can understand JPG just by looking at what it does to a single 8x8 block of pixel values.
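To make the block structure concrete, here is a minimal sketch in Python of how an image gets carved into 8x8 blocks per channel. The function name and the random test image are just illustrative; real encoders also pad the image so its dimensions are multiples of 8.

```python
import numpy as np

def split_into_blocks(channel: np.ndarray, block_size: int = 8) -> np.ndarray:
    """Split a single color channel (H x W array) into 8x8 blocks."""
    h, w = channel.shape
    # Crop to a multiple of the block size; a real encoder would pad instead.
    h_trim, w_trim = h - h % block_size, w - w % block_size
    channel = channel[:h_trim, :w_trim]
    return (channel
            .reshape(h_trim // block_size, block_size, w_trim // block_size, block_size)
            .swapaxes(1, 2))  # shape: (block_rows, block_cols, 8, 8)

# Each color channel is processed independently.
image = np.random.randint(0, 256, size=(64, 48, 3), dtype=np.uint8)
for c in range(image.shape[2]):
    blocks = split_into_blocks(image[:, :, c].astype(float))
    print(blocks.shape)  # (8, 6, 8, 8)
```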
For example, let’s look at one block, and pretend these are its pixel values:
The DCT transform is a very close cousin of the Fourier transform. It turns pixel intensity values into spatial frequency coefficients. You can read about it in more detail elsewhere if you’re curious.
After applying the DCT transform, you end up with another 8x8 block of values, but they tend to have higher magnitudes in the upper left part of the block. For the example above, you’d get the following 8x8 block of values:
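If you want to reproduce this step yourself, here is a small sketch using scipy’s DCT routines on a made-up block (not the Wikipedia values). The level shift by 128 is a standard part of JPEG encoding that centers the samples around zero before the transform.

```python
import numpy as np
from scipy.fft import dctn

# A made-up 8x8 block of 8-bit pixel values.
block = np.random.randint(0, 256, size=(8, 8)).astype(float)

# JPEG level-shifts the samples by 128 and then applies a 2-D type-II DCT.
coeffs = dctn(block - 128, norm='ortho')

# The top-left ("DC") coefficient and its low-frequency neighbors usually
# carry most of the energy; the bottom-right ones are typically small.
print(np.round(coeffs).astype(int))
```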
Not only do the values on the bottom right tend to be smaller, but they also tend to describe information about the block that is often less visually noticeable when omitted!
This is because these values describe changes in the pixel intensity that go back and forth very quickly between neighboring pixels, and our visual system tends to “average them out”.
So engineers in the 1980s sat and came up with something called a “quantization table”. It’s just a bunch of constants, that express our belief that keeping track of the values in the upper left corner of the DCT coefficient block is more important than keeping information about stuff that goes on in the bottom right.
We’re just going to divide each value in the block by its corresponding quantization table value, and round the result. We use smaller values on the top left, and higher values on the bottom right.
This process is done regardless of the content of the image, and regardless of how visually noticeable this rounding is in context. It just captures the engineers’ intuition that some coefficients will be less noticeable than others, and it doesn’t take into account the actual contents of this specific image.
Quantization table. Take all the values in the above block, and divide by these numbers
Here’s the result when you do this division:
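In code, that division-and-rounding step looks roughly like the sketch below. I’m using the standard JPEG luminance quantization table (the quality-50 baseline); real encoders scale this table up or down according to the quality setting you pick.

```python
import numpy as np

# Standard JPEG luminance quantization table (the quality-50 baseline).
QUANT_TABLE = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
])

def quantize(dct_coeffs: np.ndarray) -> np.ndarray:
    """Divide each DCT coefficient by its table entry and round.

    Most high-frequency (bottom-right) coefficients round to zero, which
    is where the bulk of the size savings comes from.
    """
    return np.round(dct_coeffs / QUANT_TABLE).astype(int)

def dequantize(quantized: np.ndarray) -> np.ndarray:
    """What the decoder does: multiply back, accepting the rounding loss."""
    return quantized * QUANT_TABLE

# e.g. quantized = quantize(coeffs), using the coefficients from the DCT sketch above.
```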
Get it? You get a bunch of 0s, which you can get away with not saving. Also, the few values that aren’t 0 tend to repeat or be very small, which takes fewer bits to save. There’s only a little more to it, like run-length encoding and Huffman encoding, but those are details that exceed the scope of this post. You now know pretty much how JPG works.
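For the curious, here is a toy sketch of why all those zeros are cheap: the quantized coefficients are read out in zigzag order and runs of zeros get collapsed. The real format stores Huffman-coded run/size symbols rather than the literal pairs shown here, so treat this only as an illustration of the idea.

```python
def zigzag(block):
    """Read an 8x8 block in JPEG zigzag order (low frequencies first)."""
    n = 8
    order = sorted(
        ((i, j) for i in range(n) for j in range(n)),
        key=lambda ij: (ij[0] + ij[1],
                        ij[0] if (ij[0] + ij[1]) % 2 else -ij[0]),
    )
    return [block[i][j] for i, j in order]

def run_length(values):
    """Collapse runs of zeros into (zero_run, value) pairs, a simplified
    stand-in for JPEG's Huffman-coded run/size symbols."""
    pairs, zeros = [], 0
    for v in values:
        if v == 0:
            zeros += 1
        else:
            pairs.append((zeros, v))
            zeros = 0
    pairs.append('EOB')  # one end-of-block marker covers all trailing zeros
    return pairs

# e.g. run_length(zigzag(quantized)) yields a handful of (zero_run, value)
# pairs followed by 'EOB', far fewer symbols than 64 raw coefficients.
```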
I think that just from describing how the process works, it’s fairly obvious what JPG misses out on.
I might write a follow-up blog post on the details of the model we’ve trained.
But the upshot is that each of the weaknesses of the original JPG I’ve described above is an opportunity to improve results.
Modern convolutional neural nets can do a great job of approximating the way we perceive images visually (with some strong disclaimers around adversarial examples, which fool them but not humans).
Once we can determine whether a change to the image content is visually noticeable, you can think of the very broad goal of “improve JPG” as a well-defined optimization process: “tweak every pixel in the image, and check that the result is visually unnoticeable and yields smaller DCT coefficients in a given block”.
Let’s focus on how the first part, the visual check, is achievable; the tweaking itself can be treated as a standard numerical optimization procedure. What can we do to look at two patches of an image and ask, “do they look the same to the human visual system?”
So, how do we use deep learning models to tell us how visually similar two image patches are? It turns out that if you take a convolutional neural net that was trained on, say, image classification, and chop it off midway through, you get internal representations of the image that are useful for visual comparison.
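Here is a rough sketch of what “chopping a network off midway” can look like, using a pretrained VGG16 from torchvision. The choice of network and cut point here is mine, purely for illustration; the model we actually trained is different, and I’ll cover it in the follow-up post.

```python
import torch
import torchvision.models as models
import torchvision.transforms.functional as TF

# Load a CNN pretrained on image classification and keep only its early layers.
backbone = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features[:16].eval()

def patch_features(patch: torch.Tensor) -> torch.Tensor:
    """Map a 3x128x128 patch (values in [0, 1]) to a 256-dimensional descriptor."""
    patch = TF.normalize(patch, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    with torch.no_grad():
        feats = backbone(patch.unsqueeze(0))   # shape: (1, 256, 32, 32)
    # Average over spatial positions to get one fixed-length descriptor.
    return feats.mean(dim=(2, 3)).squeeze(0)   # shape: (256,)
```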
So, think of a small patch of an image, say 128x128 pixels. You can tweak its values as you see fit to make the JPG take up less space. Say you have some guess about a change to the image that would make it take up less space, like zeroing out a bunch of coefficients in a certain block.
Run both the original patch and the tweaked patch through the chopped-off network, and you end up with two internal representations, say 256 numbers each:
Now you can take some distance measure, like the Euclidean distance between them, and call that your distortion score. That already works pretty well, but you can do even better: have humans annotate “on a scale of 1 to 5, how different are these two patches?”, and train a secondary model that learns to predict a single similarity score from these two sets of 256 numbers.
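As a sketch, the plain-distance version and the learned-head version might look like this. The head’s architecture below is made up; it only illustrates the idea of mapping a pair of descriptors to one score.

```python
import torch
import torch.nn as nn

def perceptual_distance(feat_a: torch.Tensor, feat_b: torch.Tensor) -> float:
    """Euclidean distance between two 256-dimensional patch descriptors."""
    return torch.dist(feat_a, feat_b).item()

class SimilarityHead(nn.Module):
    """Hypothetical secondary model: trained on human 1-to-5 ratings to map
    a pair of descriptors to a single similarity score."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([feat_a, feat_b], dim=-1))

# Usage sketch: accept a candidate tweak only if the predicted visual
# difference between the original and tweaked patch stays below a threshold.
```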
This post is getting a bit long, so I’ll write up further details on how we optimize the image in the future. If you don’t care about the details and just want to fiddle with a live demo, you can try out the result for yourself right here.