Here at Mozilla, we’ve been hard at work on the new AV1 video codec. AV1 aims to improve coding efficiency by 25% over HEVC (H.265) and VP9, and is developed by the Alliance for Open Media, of which Mozilla is a member.
AV1 is a derivative of VP9 and now includes a large number of additional coding tools and experiments imported from Daala, Thor and VP10. These experiments interact with each other in intricate and unexpected ways, and must be carefully tested on a wide variety of content. This can take a long time to complete. A single video frame can sometimes take more than an hour to encode, and as part of our routine testing we encode 30 clips, each containing 60 frames. The encoding process is massively parallel and runs on a large number of AWS instances, but even with all that hardware, it can take hours or even days to run a single test job.
This computational cost makes it inconvenient for developers to encode and analyze videos locally. But it also raises an interesting question: if all of our testing and build infrastructure runs in the cloud, why not run all of our analysis tools in the cloud (or browser) as well?
Our first attempt at this is the bitstream analyzer. The analyzer decodes AV1 bitstreams and displays a variety of details about a bitstream. This information can help codec engineers more easily identify and fix bugs. The input to the analyzer is usually small (an encoded bitstream), but the output is very large. For instance, a single 1080p video frame produces 4MB of raw image data and a large amount of analyzer metadata. This is usually not a problem if the analyzer runs locally, but if the analyzer runs remotely on a server, then bandwidth and latency become big concerns.
The analyzer is made up of two components: decoder.js, which is the Emscripten-compiled version of the codec, and an HTML-based UI front-end.
To analyze a video, all we need to do is specify a video file (in *.ivf format) and an appropriate decoder.js file to decode it.
In the link above, analyzer.html loads the decoder and uses it to decode two bitstreams, a.ivf and b.ivf. Alternatively, multiple decoders can be used to analyze videos:
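As an aside on the file format: IVF is a very simple container, a 32-byte little-endian file header followed by length-prefixed frames. The sketch below parses only the file header. It is not part of the analyzer, and the name `parse_ivf_header` is made up for illustration.

```python
import struct

# 32-byte little-endian IVF file header layout.
IVF_HEADER = struct.Struct("<4sHH4sHHIIII")

def parse_ivf_header(data: bytes) -> dict:
    """Parse the IVF file header (hypothetical helper, not analyzer code)."""
    if len(data) < IVF_HEADER.size:
        raise ValueError("IVF header needs 32 bytes")
    (signature, version, header_size, fourcc, width, height,
     timebase_den, timebase_num, frame_count, _unused) = IVF_HEADER.unpack_from(data)
    if signature != b"DKIF":
        raise ValueError("missing DKIF signature, not an IVF file")
    return {
        "version": version,
        "header_size": header_size,
        "fourcc": fourcc.decode("ascii", errors="replace"),
        "width": width,
        "height": height,
        "timebase": (timebase_num, timebase_den),
        "frame_count": frame_count,
    }
```

Each frame that follows the file header carries its own 12-byte header (a 4-byte payload size and an 8-byte timestamp), so walking the file is a matter of reading a size and then skipping that many bytes.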
Or, of course you could:
1. Follow the directions here: http://aomedia.org/contributor-guide/ (which I recommend you do if you want to help out).
2. Check out a particular revision of the codec that is compatible with the encoded video you want to analyze.
3. Build and run the hypothetical local analyzer.
4. If you want to share some of the analysis results, take a screenshot and share it, or ask a colleague to repeat steps 1 through 3. Which, of course, they are less likely to do, because it takes more than a click.
I work at a browser company, so I may be biased, but I think this is the web at its finest.
Emscripten is often used to port games or C/C++ libraries to the web, and this is really no different, but it’s a slightly different use case that I haven’t seen before. We use Emscripten to make our continuous integration build artifacts runnable and shareable. How cool is that? Beat that convenience, native!
Playing with the Analyzer
So let’s take the bitstream analyzer out for a spin. Below, we’ll be comparing two bitstreams: crosswalk_10.ivf and crosswalk_60.ivf. These two videos are encoded with the same encoder version, but at two different quality levels, 10 and 60 QP respectively (the lower the number, the higher the quality). The analyzer visualizes block details as a set of layers stacked on top of each other.
Block Split Layer
The maximum block size in AV1 is 64x64, while the minimum is 4x4. (There are experiments to extend this range.) The encoder uses a large number of factors to decide how to recursively partition 64x64 blocks. In general, though, we can see from the images below that areas with more detail have smaller block sizes, while areas with less detail have larger block sizes. At lower quality settings, the average block size is larger, but the same general rule applies. The size of a block is important because that’s the level at which the encoder signals skip information, motion vectors, prediction modes, transform types and other types of information. The smaller the block size, the more details the encoder can signal, but that also means the encoder spends more bits signaling those details. The two figures below show the block split decisions for the first frame at two quality levels.
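To make the recursion concrete, here is a toy quadtree splitter. It is only a sketch: it splits a block whenever the pixel variance exceeds a threshold, which is an assumption chosen for illustration. The real encoder’s decision is far more involved, and `split_blocks` is a made-up name.

```python
def split_blocks(image, x, y, size, threshold, min_size=4):
    """Return leaf blocks as (x, y, size) for one square region.

    Toy rule (an assumption, not the encoder's logic): keep splitting
    into four quadrants while the pixel variance is above `threshold`.
    """
    pixels = [image[j][i] for j in range(y, y + size) for i in range(x, x + size)]
    mean = sum(pixels) / len(pixels)
    variance = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    if size <= min_size or variance <= threshold:
        return [(x, y, size)]
    half = size // 2
    blocks = []
    for dy in (0, half):
        for dx in (0, half):
            blocks += split_blocks(image, x + dx, y + dy, half, threshold, min_size)
    return blocks

# A flat 64x64 image with a checkerboard in its top-left 8x8 corner.
image = [[255 if (i < 8 and j < 8 and (i + j) % 2 == 0) else 0
          for i in range(64)] for j in range(64)]
leaves = split_blocks(image, 0, 0, 64, threshold=100)
```

In this example only the detailed corner ends up with 4x4 blocks, while the flat regions stay as large as the recursion allows, which mirrors the pattern visible in the block split layer.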
The first frame of the video is an intra frame, meaning that each block is predicted spatially from blocks around it (top and to the left). The second frame of the video is an inter frame, meaning that it’s predicted temporally, from frames before (or after) it. The block split decisions for the second frame (below) are interesting: they only reflect areas of the image that change between the two frames. The heads of the two people in the foreground pan towards the right, so the areas that change are around the contour of the heads. The face, although it moves to the right, doesn’t need fine-grained blocks because it can be predicted from the previous frame at a coarse level, while the areas around the head can’t.
The analyzer can plot the area covered by each block size as a stacked bar chart. The first frame in this video sequence is unique because it’s an intra frame. It uses a roughly equal number of 16x16, 32x32 and 64x64 blocks. The remaining frames are all inter frames and use mostly 64x64 blocks. Interestingly enough, there’s a recurring pattern here: there appear to be no 32x32 blocks in frames 8, 16, 24, etc. I wonder why? These are the kinds of questions the analyzer aims to uncover. This may be normal codec behavior, or it could be a bug.
At 10 QP, the chart differs in its details but follows a similar pattern.
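The per-frame numbers behind such a stacked bar chart are easy to derive from a list of block dimensions. The following is a minimal sketch, assuming blocks arrive as (width, height) pairs; the input shape and the name `area_share_by_block_size` are assumptions, not the analyzer’s actual data model.

```python
from collections import Counter

def area_share_by_block_size(blocks):
    """Map each block size ("WxH") to the fraction of frame area it covers."""
    area = Counter()
    for width, height in blocks:
        area[f"{width}x{height}"] += width * height
    total = sum(area.values())
    return {size: covered / total for size, covered in area.items()}

# One 64x64 block plus four 32x32 blocks cover equal areas.
shares = area_share_by_block_size([(64, 64)] + [(32, 32)] * 4)
```

Weighting by area rather than by block count matters here: four 32x32 blocks are needed to cover what a single 64x64 block does.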
Prediction Modes Layer
Each block has a prediction mode. For intra frames, these include directional prediction modes which are drawn as thin lines within each block. The colored blocks use DC_PRED (pink) and TM_PRED (blue).
If we zoom in on the eye of the lady in the center by clicking on it, we can clearly see the prediction modes and the coding artifacts that are ultimately produced by these coding decisions.
Inter frames have no directional prediction modes. Instead, their blocks are colored by inter prediction mode: white (NEWMV), blue (NEARMV), burgundy (NEARESTMV) and purple (ZEROMV).
Block Info Details
You can get additional information on a block by clicking on it. For instance, clicking on the top left block (0x0) above shows the following block details.
Block Size: BLOCK_64X64
Transform Size: TX_32X32
Transform Type: DCT_DCT
Skip: NO SKIP
Motion Vectors: 0,0 | 0,0
Reference Frame: ALTREF_FRAME, N/A
This is a convenient way to figure out what the colors mean.
Motion Vectors Layer
Blocks in inter frames can be predicted from other frames. Each block may have up to two motion vectors, shown here as a combination of red and blue lines. The intensity of the color represents the magnitude of the vector. Each vector is an offset to the location in a reference frame from which the contents of the block are predicted. The longer the vector, the more motion there is.
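As a rough illustration of what a motion vector means, the sketch below copies a block from a reference frame at the offset the vector gives, and computes the vector length that a layer like this could map to color intensity. It assumes whole-pixel vectors and in-bounds references; the codec itself also handles sub-pixel motion, and both function names are made up.

```python
import math

def predict_block(reference, x, y, size, mv):
    """Fetch the size x size block that (dx, dy) points to in `reference`."""
    dx, dy = mv
    src_x, src_y = x + dx, y + dy
    # Simplification: reject vectors that point outside the reference frame.
    if (src_x < 0 or src_y < 0 or src_y + size > len(reference)
            or src_x + size > len(reference[0])):
        raise ValueError("motion vector points outside the reference frame")
    return [row[src_x:src_x + size] for row in reference[src_y:src_y + size]]

def mv_magnitude(mv):
    """Length of the motion vector in pixels."""
    return math.hypot(mv[0], mv[1])

# An 8x8 reference frame whose pixel value encodes its position.
reference = [[j * 8 + i for i in range(8)] for j in range(8)]
block = predict_block(reference, 2, 2, 2, (1, -1))
```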
Bit Accounting Layer
In AV1, whenever a symbol is read from the bitstream, the decoder keeps track of the number of bits used to represent that symbol. Bit accounting information has a block level context, which means the analyzer can figure out exactly how many bits are spent in a block on each of the symbol types. In the table below, this bit accounting information is aggregated over the entire frame: 458 read_mv_component symbols were read, totaling 537 bits or 28.5% of all the bits spent to encode the frame.
| Symbol | Bits | % | Samples |
| --- | --- | --- | --- |
| read_mv_component | 537 | 28.5 | 458 |
| decode_coefs | 393 | 20.8 | 431 |
| read_inter_mode | 315 | 16.7 | 538 |
| read_partition | 177 | 9.4 | 269 |
| read_mv | 126 | 6.7 | 77 |
| read_ref_frames | 115 | 6.1 | 261 |
| read_drl_idx | 72 | 3.8 | 73 |
| read_block_reference_mode | 67 | 3.6 | 230 |
| read_is_inter_block | 40 | 2.1 | 236 |
| read_mb_interp_filter | 28 | 1.5 | 230 |
| read_tx_type | 7 | 0.4 | 7 |
| read_intra_mode | 6 | 0.3 | 12 |
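A table like this can be produced by tallying every symbol read. The sketch below assumes the decoder reports each read as a (symbol name, bits) pair; that input shape and the name `summarize_bits` are assumptions for illustration, not the decoder’s actual interface.

```python
from collections import defaultdict

def summarize_bits(reads):
    """Aggregate (symbol, bits) reads into (symbol, bits, percent, samples) rows."""
    bits = defaultdict(int)
    samples = defaultdict(int)
    for symbol, n_bits in reads:
        bits[symbol] += n_bits
        samples[symbol] += 1
    total = sum(bits.values())
    rows = [(symbol, bits[symbol], 100 * bits[symbol] / total, samples[symbol])
            for symbol in bits]
    # Largest consumers first, as in the table.
    return sorted(rows, key=lambda row: row[1], reverse=True)

rows = summarize_bits([("read_mv", 3), ("decode_coefs", 12), ("read_mv", 1)])
```

Keeping the same tallies per block instead of per frame would give the block-level view described above.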
The analyzer can also show aggregate bit information over several frames. This is useful when comparing two different bitstreams. These charts are purposely laid out so that they don’t move when toggling between videos in order to make it easier to spot differences.
Bit accounting information can also be displayed as a layer. The highlighted purple areas indicate the distribution of bits within a frame.
Disabling the image makes the bit accounting layer a bit more visible.
If we go to the second frame, we see brighter colored areas. By default, the color scale and intensity are computed based on the number of bits/pixel relative to the maximum number of bits/pixel in the frame. The Bit Scale is adjustable:
- Frame Relative: default, this is useful when analyzing bit distribution within a single frame.
- Video Relative: the maximum number of bits/pixel is computed over all frames in a video sequence. This is useful when analyzing bit distribution within an entire sequence.
- Video Relative (all): same as Video Relative, but all currently loaded video sequences are used to compute the scale. This is useful when comparing frames between two sequences.
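The three modes differ only in the set of frames over which the maximum is taken. Here is a minimal sketch, assuming each loaded video is a list of frames and each frame is a list of per-block bits/pixel values; the mode strings and the name `bit_scale_max` are made up for illustration.

```python
def bit_scale_max(videos, mode, video_index=0, frame_index=0):
    """Return the bits/pixel value that maps to full intensity."""
    if mode == "frame":       # Frame Relative
        return max(videos[video_index][frame_index])
    if mode == "video":       # Video Relative
        return max(max(frame) for frame in videos[video_index])
    if mode == "video_all":   # Video Relative (all)
        return max(max(frame) for video in videos for frame in video)
    raise ValueError(f"unknown bit scale mode: {mode}")

# Two loaded videos: the first has two frames, the second has one.
videos = [[[0.1, 0.4], [0.2, 0.8]], [[1.6, 0.3]]]
intensity = 0.4 / bit_scale_max(videos, "video")
```

A block at 0.4 bits/pixel is drawn at full intensity under the frame-relative scale but only at half intensity under the video-relative one, so the same block looks dimmer as the scope widens.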
If we look at the second frame, we see that it has brighter colored areas. This doesn’t mean it uses more bits; it just means that most of the bits in the frame are spent in a smaller area of the image.
Color scales can also be adjusted. By default, the analyzer uses a heat map scale with transparency: blue areas are mostly translucent, and red areas are opaque.
- Single Color: single color with transparency.
- Heat Map: default, heat map color scale with transparency.
- Heat Map (Opaque): heat map color scale without transparency.
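One simple way to build such a scale is to blend from blue to red and tie the alpha channel to the same value. The ramp below is an assumed two-color blend for illustration, not the analyzer’s actual palette, and `heat_map_rgba` is a made-up name.

```python
def heat_map_rgba(t, opaque=False):
    """Map a normalized value t in [0, 1] to an (r, g, b, a) color."""
    t = max(0.0, min(1.0, t))
    red = round(255 * t)
    blue = round(255 * (1 - t))
    # With transparency, low values fade out and high values are solid.
    alpha = 255 if opaque else round(255 * t)
    return (red, 0, blue, alpha)
```

With `opaque=True` the same ramp corresponds to the Heat Map (Opaque) option, where low values stay visible as solid blue instead of fading into the image.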
The bit accounting layer also lets you filter based on a symbol type. This can be useful to dive into the bit distribution of a specific symbol. For instance, below we can see the bit distribution of the “read_mv” (read motion vectors) symbol.
Skip Flags Layer
Skip flags are used to indicate that a block has no coefficients. Skipped blocks are drawn as blue, and it is evident from the image below that skipped blocks appear in areas of the image that are mostly empty. If we also overlay the bit accounting layer, we can see that most bits are spent in non-skipped areas, which is to be expected.
The Emscripten decoder is fast enough to be usable, but of course it could always be faster. In high bit depth mode, the codec uses 64-bit math, which has to be emulated because asm.js lacks 64-bit integer math. This emulation currently costs 10% to 20% in performance. WebAssembly supports 64-bit math, and once it’s ready we’ll be switching over to it.
As you can imagine, AV1 has a ton of SIMD code paths. Currently, we disable all SIMD in the analyzer build.
If you want to measure raw decoding performance in your browser, you can try this benchmark link. On my machine, Firefox needs 512 milliseconds to decode 15 frames, Chrome 719 milliseconds and Safari 1044 milliseconds.
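To see why emulation costs something, consider a single 64-bit addition carried out with 32-bit pieces. The sketch below is illustrative only and is not how Emscripten actually lowers 64-bit operations; the masks stand in for 32-bit registers, since Python integers are unbounded.

```python
MASK32 = 0xFFFFFFFF

def add64(a_lo, a_hi, b_lo, b_hi):
    """Add two unsigned 64-bit values stored as (low, high) 32-bit words."""
    lo = a_lo + b_lo
    carry = lo >> 32                      # 1 if the low words overflowed
    hi = (a_hi + b_hi + carry) & MASK32   # wraps around like a 64-bit register
    return lo & MASK32, hi
```

One native instruction becomes two additions, a carry computation and some masking, and multiplication or division fare worse.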
Another performance problem is YUV2RGB conversion. This code uses floating point math and needs to be optimized.
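A common way to speed up this kind of conversion is to replace the floating point multiplies with scaled integer math. The sketch below compares the two approaches, assuming full-range BT.601 coefficients; which matrix the analyzer actually uses is not stated here, and both function names are made up.

```python
def clamp8(value):
    """Clamp to the 0..255 range of an 8-bit channel."""
    return max(0, min(255, value))

def yuv_to_rgb_float(y, cb, cr):
    """Floating point conversion (assumed full-range BT.601)."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return tuple(clamp8(int(round(v))) for v in (r, g, b))

def yuv_to_rgb_fixed(y, cb, cr):
    """Same conversion with coefficients scaled by 2**16 and integer math."""
    d, e = cb - 128, cr - 128
    r = y + ((91881 * e + 32768) >> 16)
    g = y - ((22554 * d + 46802 * e + 32768) >> 16)
    b = y + ((116130 * d + 32768) >> 16)
    return tuple(clamp8(v) for v in (r, g, b))
```

The integer version can differ from the floating point one by a single step in a channel because of rounding, which is usually acceptable for display.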
If you want to catch up on the progress in the Alliance for Open Media, watch this: