I remember a funny old ad in which Godzilla and a robot fall in love and have a Hummer as a baby!
I am working on apps where users can interact with live video using chat, polls, and mini-games. The technology feels like live streaming and video chat had a baby: you want the scalability and quality of live streaming combined with the interactivity of a video chat.
In this post, I want to give you a simple overview of the video technology used in these apps. The idea is to provide the intuition behind the technology and share the resources for you to dig deeper. But before that, check the ad out.
Let us start with the live streaming Godzilla.
Here is a quick introduction to live streaming technology. If you know this, you can skip to the next section.
I will use a popular streaming technology by Apple called HLS (HTTP Live Streaming) to explain the basic concepts.
It works by:

- splitting the video into small files of a few seconds each, called segments
- listing those segments, in playback order, in a plain-text index file called a playlist (.m3u8)
- serving both the playlist and the segments over standard HTTP

A player on any device, such as a mobile phone or a laptop, can download this playlist and play the video using standard HTTP calls.
Here is the sequence used by the device player to play the video:

1. Download the playlist.
2. Download the segments listed in it, in order.
3. Buffer a few segments, then start playback.
4. For a live stream, re-fetch the playlist periodically to discover newly added segments.
A simplified view of HLS Live Streaming
This is a simplified version of what happens. For playback to happen without re-buffering, multiple versions of the segments are created for different network speeds: smaller, lower-quality ones for slower networks and bigger, higher-quality ones for faster networks.
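The switching between versions can be sketched as a simple selection rule. The bitrate ladder below is hypothetical, and real players use more sophisticated adaptation logic, but the core idea is just this:

```python
# Hypothetical bitrate ladder: (bandwidth in bits/s, rendition name).
RENDITIONS = [(800_000, "360p"), (2_500_000, "720p"), (6_000_000, "1080p")]

def pick_rendition(measured_bps, ladder=RENDITIONS):
    """Pick the highest-quality rendition whose bandwidth fits the
    measured network throughput; fall back to the lowest if none fit."""
    fitting = [r for r in ladder if r[0] <= measured_bps]
    return max(fitting)[1] if fitting else ladder[0][1]

print(pick_rendition(3_000_000))  # "720p"
print(pick_rendition(500_000))    # "360p"
```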
CDNs (Content Delivery Networks) are used to cache and serve the streams from the nearest location to the player. This enables streaming to millions of viewers.
You can get a detailed understanding of the HLS specification here.
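To make the playlist format concrete, here is a minimal sketch of parsing an HLS media playlist. The playlist text and segment names below are made up for illustration; a real playlist carries more tags than this parser handles:

```python
# A hypothetical live media playlist, as a server might return it.
PLAYLIST = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:100
#EXTINF:6.0,
segment100.ts
#EXTINF:6.0,
segment101.ts
#EXTINF:6.0,
segment102.ts
"""

def parse_media_playlist(text):
    """Return (target segment duration, list of segment URIs)."""
    target, segments = None, []
    expect_uri = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#EXT-X-TARGETDURATION:"):
            target = int(line.split(":", 1)[1])
        elif line.startswith("#EXTINF:"):
            expect_uri = True  # the next non-tag line is a segment URI
        elif line and not line.startswith("#") and expect_uri:
            segments.append(line)
            expect_uri = False
    return target, segments

target, segments = parse_media_playlist(PLAYLIST)
print(target, segments)  # 6 ['segment100.ts', 'segment101.ts', 'segment102.ts']
```

A live player would re-download this playlist every few seconds and fetch whichever segments it has not played yet.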
For live streaming, the video stream comes from a camera and is compressed by an encoder. Apart from this, the flow remains the same.
As we have seen above, the video data flows through a pipeline from the Camera -> Encoder -> Server -> CDN -> Device Player, and each step is done sequentially, one after another.
For example, the encoder would wait to accumulate enough video data to create a segment. The player would wait for 3 segments to be downloaded before starting playback to avoid re-buffering. This results in the real world action captured by the camera showing up delayed on the video player. This delay between the real world and video on your app is called latency.
~4 seconds latency between the live stream host (real world) and the audience (live video)
It is common to have a latency that is 5X the segment duration. So for a 6-second segment, the latency is 30 seconds.
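The 5X rule of thumb can be written out as a small calculation. The split between player buffering and pipeline overhead below is an illustrative assumption (the post only states that players buffer about 3 segments), but it shows where the multiplier comes from:

```python
def estimated_latency(segment_duration_s,
                      player_buffer_segments=3,
                      pipeline_overhead_segments=2):
    """Rule-of-thumb latency estimate: ~3 segments buffered by the player
    before playback starts, plus roughly 2 segments' worth of encoding and
    transfer delay through the pipeline, giving ~5x the segment duration."""
    return segment_duration_s * (player_buffer_segments + pipeline_overhead_segments)

print(estimated_latency(6))  # 30 seconds
print(estimated_latency(2))  # 10 seconds
```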
Effects of Latency
For a one-way broadcast, latency is not a big issue. You are watching a live event a few seconds delayed in exchange for a good quality playback experience. The only downside is that the live video might be a little behind a TV broadcast, so your neighbors watching the same telecast on TV get to see that goal before you do.
But for interactivity, it becomes a bigger problem. Any audience feedback received by the live show host(s) is delayed. If the host(s) have to respond to a question or interact with the audience, any delay beyond a few seconds destroys the user experience.
The acceptable latency depends on the application. A video chat requires sub-second latency. But an interactive live stream quiz or live shopping might be able to work with a few seconds of latency.
One obvious way to reduce latency is to reduce the segment duration. And it works until a point.
The video encoder requires a set of video frames to efficiently compress video data. Reducing the segment duration reduces the number of video frames available for compression, resulting in sub-optimal compression: poorer video quality for the same number of bits.
For this and network stability reasons, reducing the segment duration below about 2 seconds does not work very well.
I describe three standards-based technology options for achieving lower latency.
Chunked Transfer Encoding
The key cause of latency is the delay in the accumulation and transfer of video data through each step in the live video pipeline. So latency can be reduced if instead of waiting for the completion of each step, the transmission through the pipeline starts earlier.
In this technique, each segment is divided into smaller pieces called chunks by the encoder. This process is called chunked encoding.
The chunks are then sent on the video pipeline (Encoder -> Server -> CDN -> Player) as soon as they are created.
This is done by an HTTP 1.1 mechanism called chunked transfer encoding. This mechanism allows the transfer of an object whose size is unknown. So you request the object and it keeps coming in chunks till completion.
You can imagine chunked transfer as a large order (a segment) placed with Amazon that comes one item (a chunk) at a time.
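The chunked framing itself is simple. This sketch builds an HTTP/1.1 chunked-encoded body by hand: each chunk is prefixed with its size in hexadecimal and terminated by CRLF, and a zero-length chunk marks the end of the stream (in practice your HTTP server or CDN does this for you):

```python
def chunked_body(chunks):
    """Frame byte chunks using HTTP/1.1 chunked transfer encoding:
    hex size + CRLF + data + CRLF per chunk, then a zero-length
    terminating chunk. The total size is never declared up front."""
    out = b""
    for chunk in chunks:
        out += f"{len(chunk):X}\r\n".encode() + chunk + b"\r\n"
    return out + b"0\r\n\r\n"

# Three pieces of a video segment sent as they become available.
print(chunked_body([b"abc", b"defgh"]))
```

Because no total size is declared, the server can start sending the first chunk of a segment while the encoder is still producing the rest of it.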
So how is this different from just having a smaller segment duration, you ask? Without going into the details: the encoder, web server, and CDN still work at a segment level, which ensures efficient video compression and optimal network behavior.
Chunked transfer encoding can be used with an MPEG standard for storing audio, video, and text data called CMAF to achieve low latency broadcasts. It is supported by many encoder, CDN, and player vendors.
You can get more details about achieving low latency using CMAF in this outstanding paper by Will Law @ Akamai.
Apple LL-HLS (Low Latency HLS)
This is the approach taken by Apple and is part of the recently updated HLS specifications. It is not very different from chunked CMAF, in that you break down the segments into smaller parts called partial segments or parts. However, the player and the server interaction is a little different.
This specification does not use chunked transfer encoding. Instead, the server advertises an upcoming segment part and the player can put a blocking request for the upcoming part.
It also adds other functionality such as delta playlists to optimize the data transfer for downloading the playlist. The player downloads the playlist several times during playback, so transferring only the updates helps.
To provide an intuitive understanding of the difference between chunked CMAF and LL-HLS, imagine you go to a restaurant that is about to serve a sumptuous dinner, and you want to eat it all. The dinner here represents a segment of video data.
You put a request for the whole dinner and each item in the dinner is brought to you as soon as it is ready. This is how chunked CMAF works.
The Apple variant is as follows. The chef announces that the first course of the meal is being prepared, and you ask for it to be brought to you along with the updates to the menu. You receive the new additions to the menu. Once all the items in the first course are ready, they are brought to you all at once. The chef announces the second course, you order that, and so forth.
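In protocol terms, the player "orders the announced course" by adding the LL-HLS blocking directives `_HLS_msn` (media sequence number) and `_HLS_part` to the playlist request; the server holds the response until that part exists. The URL below is hypothetical:

```python
from urllib.parse import urlencode

def blocking_playlist_url(base_url, msn, part):
    """Build an LL-HLS blocking playlist request: the server delays its
    response until part `part` of media segment `msn` is available."""
    return f"{base_url}?{urlencode({'_HLS_msn': msn, '_HLS_part': part})}"

# Hypothetical playlist URL; a real player derives msn/part from the
# previous playlist response.
print(blocking_playlist_url("https://example.com/live.m3u8", 273, 2))
# https://example.com/live.m3u8?_HLS_msn=273&_HLS_part=2
```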
WebRTC

WebRTC is built for real-time communication and uses a peer-to-peer connection for video communication. It prioritizes low latency over quality, and it is possible to get sub-second latencies.
It’s the robot side of things, to quote my Hummer ad example. What! Did you already forget my clickbaity analogy :-)?
As WebRTC is based on peer-to-peer communication, scaling it to very large broadcasts becomes challenging and expensive. It is a good option for highly interactive broadcasts with a relatively small audience.
Choosing between these technologies is a trade-off between latency, quality, and scale.
If your application is closer to broadcast with a high number of viewers and acceptable latency in the range of 3–8 seconds, then go with LL-HLS and/or chunked CMAF. If it requires sub-second latencies and a relatively smaller number of viewers, go with WebRTC.
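That decision rule can be summed up in a few lines. The thresholds here are illustrative, taken from the ranges mentioned in this post rather than from any specification:

```python
def pick_technology(latency_budget_s, audience_size):
    """Illustrative decision sketch for interactive live video:
    sub-second budgets point to WebRTC (best for smaller audiences);
    a few seconds of budget lets HTTP-based delivery scale via CDNs."""
    if latency_budget_s < 1:
        return "WebRTC"
    return "LL-HLS / chunked CMAF"

print(pick_technology(0.5, 50))           # WebRTC
print(pick_technology(5, 1_000_000))      # LL-HLS / chunked CMAF
```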
Whether to use chunked CMAF or LL-HLS or both will depend on factors such as device and browser support and DRM requirements.
The CMAF standard is a little older, so implementations are already available from encoder, CDN, and video player vendors, unlike LL-HLS.
But that might change quickly, with LL-HLS now part of the HLS specification and supported in iOS 14, tvOS 14, watchOS 7, and macOS.
So LL-HLS seems a good bet for future-proofing your technology. Finding LL-HLS compliant players for various devices might be a bit of a challenge though.
Here is a summary comparison of the three technology options:
I would love to get your feedback on my post.