Developing a Video Conference Call App: Protocols, Architecture, and Considerations

A surge in video communication tools has become a catalyst for telemedicine, entertainment, e-learning, fitness, e-commerce, and other online businesses. If you want to add real-time video communication to your product, integrating a ready-to-use solution is the easiest path. However, everything changes drastically if you decide to develop a custom solution in which video streaming is a key feature.

Classification of Video Service Platforms

All live-streaming business cases can be divided into three groups of technological solutions:

One-to-many: These are services such as broadcasting (e.g., IPTV), VOD (Video on demand), and live streaming.
One-to-one: One-on-one video calls and video chat.
Many-to-many: Group calls, conferences with multiple hosts.

The chosen technological solution impacts the product’s architecture, defining the software roadmap and recommended real-time communication protocol.

Real-Time Communication Protocols for Building Video Services

The key step in planning video streaming app development is selecting a communication protocol, which defines how data travels from one device or system to another over the Internet. From the perspective of content delivery, there are three main approaches, each associated with different transport protocols:

MPEG-DASH/HLS – HTTP-based media protocols used for cross-platform delivery of live or on-demand video content. For example, they can be used for TV streaming.
WebRTC – a low-latency technology designed for real-time peer-to-peer communication, such as one-to-one video calls. It can also be used in other cases where low latency is required. It needs a more substantial server infrastructure than the first option. A WebRTC solution has to be developed for a specific business case, since group calls, broadcast streaming, and one-on-one calls have significantly different architectures. However, if we’re talking about video calls, WebRTC app development is the way to go.
RTMP – the Real-Time Messaging Protocol, which can be tuned for low latency. It has several implementation options, but it can only be played by dedicated applications (browsers needed a plugin for playback). RTMP splits streams into small chunks, which lets it multiplex audio, video, and data efficiently over a single connection. It is primarily used for sending live streams from the publisher to platforms like YouTube.
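
To make the trade-off between these approaches concrete, here is a minimal sketch of how the choice could be expressed in code. The `Requirements` type, the `suggest_protocol` function, and the 10-second threshold are my own illustrative assumptions (the threshold loosely reflects the typical delay of segment-based delivery), not a rule taken from any standard.

```python
from dataclasses import dataclass

# Assumption: segment-based delivery (HLS/MPEG-DASH) rarely gets below ~10 s of delay.
SEGMENTED_MIN_LATENCY_S = 10.0


@dataclass(frozen=True)
class Requirements:
    topology: str         # "one-to-many", "one-to-one", or "many-to-many"
    max_latency_s: float  # largest delay the product can tolerate, in seconds


def suggest_protocol(req: Requirements) -> str:
    """Return a rough protocol suggestion for the viewer-facing side."""
    if req.topology in ("one-to-one", "many-to-many"):
        # Calls and conferences are interactive, so low latency is mandatory.
        return "WebRTC"
    if req.topology == "one-to-many":
        if req.max_latency_s < SEGMENTED_MIN_LATENCY_S:
            return "WebRTC"
        return "MPEG-DASH/HLS"
    raise ValueError(f"unknown topology: {req.topology}")


broadcast = suggest_protocol(Requirements("one-to-many", 30.0))
auction = suggest_protocol(Requirements("one-to-many", 0.5))
call = suggest_protocol(Requirements("one-to-one", 0.3))
```

A real decision involves more inputs (audience size, device support, budget), so treat this only as a way to see how topology and latency drive the choice.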

Let’s now look at the types of solutions these protocols can be used for.

Video Streaming/Conferencing App Architecture Patterns

Architecture must be selected in accordance with the business case and functionality you want to incorporate into your product. Below, I will outline the characteristics of the architecture for different business cases.

LIVE STREAMING/VIDEO-ON-DEMAND APPLICATIONS LIKE NETFLIX OR HULU

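Live streaming projects in this group typically use the RTMP protocol to deliver the publisher’s stream to the streaming server, which then re-encodes it into the broadcast protocol used for viewers (for example, HLS or MPEG-DASH). The system should include several essential components: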

Streaming Server: This server is responsible for handling incoming streams from publishers and distributing them to viewers. It often includes a built-in transcoder.
Transcoder: The transcoder is a crucial part of the streaming server, responsible for re-encoding the stream from the publisher into the broadcast protocol used for streaming. This ensures compatibility and optimization for viewers.
CDN (Content Delivery Network): The CDN is essential for caching and delivering content to viewers. Without a CDN, the quality of the output can fluctuate depending on the network conditions, leading to an inconsistent user experience. Choosing the right CDN ensures the availability and performance of the live stream.
Business Logic and Billing Server: This component manages the business-related aspects of the streaming service. It handles user authentication, authorization, billing, and other business logic. It’s crucial for monetization and user management.
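
As an illustration of what the transcoder hands over to the CDN, the sketch below builds an HLS master playlist that lists several renditions of the same stream. The `Rendition` type, the bitrate ladder, and the `<name>/index.m3u8` path layout are hypothetical values chosen for the example; only the `#EXTM3U` and `#EXT-X-STREAM-INF` tags come from the HLS format itself.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Rendition:
    name: str           # hypothetical directory name for this quality level
    width: int
    height: int
    bandwidth_bps: int  # peak bitrate advertised to the player


def master_playlist(renditions: Iterable[Rendition]) -> str:
    """Build an HLS master playlist pointing at one media playlist per rendition."""
    lines = ["#EXTM3U"]
    for r in sorted(renditions, key=lambda item: item.bandwidth_bps):
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={r.bandwidth_bps},"
            f"RESOLUTION={r.width}x{r.height}"
        )
        lines.append(f"{r.name}/index.m3u8")
    return "\n".join(lines) + "\n"


# Hypothetical ladder; real values depend on content and audience.
ladder = [
    Rendition("720p", 1280, 720, 3_000_000),
    Rendition("360p", 640, 360, 800_000),
    Rendition("1080p", 1920, 1080, 6_000_000),
]
playlist = master_playlist(ladder)
```

The player picks a rendition based on measured bandwidth, which is why the transcoder usually produces several of them rather than a single output.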

Other system elements are optional and depend on the specific functionality you want to implement. Typical live streaming apps rely on NGINX, Amazon services, or NodeMediaServer. The best fit depends on the business requirements. For instance, ready-to-use solutions like NodeMediaServer may suit products that won’t be used by a large audience. However, branding and scaling will require assembling the product from different parts.

ONE-TO-ONE VIDEO CHAT APPS

One-to-one video chat apps are the simplest option if no additional functionality is required. This functionality is used in chat roulette services, dating apps, and corporate systems. For example, we implemented one-to-one calls in an enterprise communication system.

If clients are located on the same network (except for 3G), the following parts of the backend infrastructure are required:

Signaling Server: This server lets clients exchange addresses and session parameters so that they know whom to call and how.
Business Logic Server: This server handles the business-related aspects of the service.
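
The role of the signaling server can be sketched as a simple message relay. The `SignalingRelay` class below is a hypothetical in-memory stand-in: a production server would usually push messages over WebSocket or a similar channel, and the payloads would be real session descriptions rather than placeholder strings.

```python
from collections import deque
from typing import Any, Deque, Dict, List


class SignalingRelay:
    """In-memory stand-in for a signaling server (illustrative only)."""

    def __init__(self) -> None:
        self._inboxes: Dict[str, Deque[dict]] = {}

    def register(self, client_id: str) -> None:
        # A client becomes reachable once it has an inbox.
        self._inboxes.setdefault(client_id, deque())

    def send(self, sender: str, recipient: str, kind: str, payload: Any) -> None:
        # kind is "offer", "answer", or "candidate" in a typical WebRTC exchange.
        if recipient not in self._inboxes:
            raise KeyError(f"unknown recipient: {recipient}")
        self._inboxes[recipient].append(
            {"from": sender, "type": kind, "payload": payload}
        )

    def poll(self, client_id: str) -> List[dict]:
        # Return and clear everything waiting for this client.
        inbox = self._inboxes[client_id]
        messages = list(inbox)
        inbox.clear()
        return messages


relay = SignalingRelay()
relay.register("alice")
relay.register("bob")

relay.send("alice", "bob", "offer", "<SDP offer placeholder>")
bob_inbox = relay.poll("bob")

relay.send("bob", "alice", "answer", "<SDP answer placeholder>")
alice_inbox = relay.poll("alice")
```

Note that the relay never touches audio or video: once the two clients have exchanged this information, media flows between them directly.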

Unfortunately, such cases are extremely rare. In real-world scenarios, clients sit behind NAT, so two additional types of servers are necessary: STUN (Session Traversal Utilities for NAT), which lets a client discover its public address, and TURN (Traversal Using Relays around NAT), which relays media when a direct connection cannot be established. There is no need to develop these servers from scratch, as open-source implementations under permissive licenses already exist and can be easily deployed.

The key point is that TURN requires setting up a pool of UDP ports for relaying media. Additionally, communication in cellular networks won’t work without TURN because they often use symmetric NAT and address falsification protection. In sectors like banking and healthcare, address falsification protection is also common, so the use of TURN servers is necessary.
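
On the client side, STUN and TURN servers are passed to WebRTC as a list of ICE servers. The sketch below builds such a configuration on the backend so it can be sent to the client as JSON. The host names and credentials are placeholders, and `build_rtc_configuration` is a hypothetical helper; the `iceServers` and `iceTransportPolicy` keys follow the shape of the browser’s `RTCConfiguration` dictionary.

```python
import json


def build_rtc_configuration(
    stun_url: str,
    turn_url: str,
    turn_username: str,
    turn_credential: str,
    force_relay: bool = False,
) -> dict:
    """Return a dict shaped like the browser's RTCConfiguration."""
    return {
        "iceServers": [
            {"urls": stun_url},
            {
                "urls": turn_url,
                "username": turn_username,
                "credential": turn_credential,
            },
        ],
        # "relay" forces all media through TURN; "all" also allows direct paths.
        "iceTransportPolicy": "relay" if force_relay else "all",
    }


# Placeholder hosts and credentials, for illustration only.
config = build_rtc_configuration(
    stun_url="stun:stun.example.org:3478",
    turn_url="turn:turn.example.org:3478?transport=udp",
    turn_username="demo-user",
    turn_credential="demo-secret",
    force_relay=True,
)
config_json = json.dumps(config)
```

Forcing relay mode is one way to handle strict networks, at the price of paying for all relayed traffic.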

VIDEO CONFERENCING APPS LIKE ZOOM OR GOOGLE MEET

WebRTC is commonly used for video conferencing apps like Zoom or Google Meet. There are three types of network organization for group calls in video conferencing systems (mesh, MCU, and SFU), and the choice determines the quality and functionality of group calls.

A mesh network is a decentralized network where devices are interconnected, forming a mesh-like structure. Unlike traditional networks with a central hub, mesh networks enable direct communication between devices. This architecture needs no media server and has no single point of failure, but it does not handle a large number of participants: each client must send its stream to every other client, which requires significant participant bandwidth.

An MCU (Multipoint Control Unit), commonly used in video conferencing, manages multiple audiovisual streams in multipoint conference calls. It combines individual participants’ audio and video data into a single stream sent to all participants. This centralized approach relies on the MCU for media stream processing and distribution.

In contrast, an SFU (Selective Forwarding Unit) doesn’t merge media streams. It receives each participant’s stream once and selectively forwards specific streams to the other participants based on their needs, considering network conditions and device capabilities. Because it avoids both the upload cost of a mesh and the server-side mixing cost of an MCU, an SFU is usually the best choice for video conferencing applications.
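
The difference between the three topologies is easiest to see by counting streams per participant. The function below is a simplified model that assumes every participant publishes exactly one stream and receives everyone else; the function name and the 1.5 Mbps bitrate in the usage example are illustrative assumptions.

```python
from typing import Tuple


def streams_per_participant(topology: str, participants: int) -> Tuple[int, int]:
    """Return (uploaded, downloaded) stream counts for one participant."""
    if participants < 2:
        raise ValueError("a call needs at least two participants")
    others = participants - 1
    if topology == "mesh":
        # Every client sends to and receives from every other client.
        return (others, others)
    if topology == "mcu":
        # One stream up, one mixed stream down.
        return (1, 1)
    if topology == "sfu":
        # One stream up, one forwarded stream per other participant down.
        return (1, others)
    raise ValueError(f"unknown topology: {topology}")


# Upload load for a 6-person call at a hypothetical 1.5 Mbps per stream.
BITRATE_MBPS = 1.5
mesh_up, _ = streams_per_participant("mesh", 6)
sfu_up, _ = streams_per_participant("sfu", 6)
mesh_upload_mbps = mesh_up * BITRATE_MBPS
sfu_upload_mbps = sfu_up * BITRATE_MBPS
```

In this model a six-person mesh call needs five uploads per client, while an SFU keeps the upload at a single stream regardless of call size.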

However, in some cases where video conferencing is not a key feature, it’s justified to integrate ready-made tools such as Zoom. For example, this is a common solution for telemedicine platforms.

Five Things to Consider When Planning a Video Communication Product

Before embarking on the video service development process, you need to conduct an analysis. This analysis will allow you to find out specific requirements and identify potential challenges.

For example, suppose you need MPEG-DASH/HLS broadcasting for compatibility with any browser or application and also require low-latency streaming with delays under a second. These two requirements conflict: segment-based standards cannot deliver sub-second delays, so the goal becomes unattainable because of an incorrectly chosen broadcasting standard.
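
A rough calculation shows why segment-based delivery cannot reach sub-second delays. The sketch below assumes that a player buffers about three segments before it starts playback, which is a common rule of thumb rather than a guarantee; the function names and the optional pipeline overhead are my own simplifications.

```python
def estimate_segmented_latency(
    segment_s: float,
    buffered_segments: int = 3,       # assumption: typical player start-up buffer
    pipeline_overhead_s: float = 0.0,  # encoding, packaging, CDN propagation
) -> float:
    """Estimate end-to-end delay for segment-based streaming, in seconds."""
    return segment_s * buffered_segments + pipeline_overhead_s


def meets_target(segment_s: float, target_s: float) -> bool:
    """Check whether the estimated delay fits the latency target."""
    return estimate_segmented_latency(segment_s) <= target_s


four_second_segments = estimate_segmented_latency(4.0)
six_second_with_overhead = estimate_segmented_latency(6.0, 3, 2.0)
sub_second_possible = meets_target(4.0, 1.0)
```

With 4-second segments the estimate is already 12 seconds, which is why a sub-second target points to WebRTC rather than to tuning the segment size.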

The architectural design and technology selection stage (Technical Analysis) is crucial for the project’s success. You need to thoroughly examine all the criteria before starting, not only refining the requirements but also prioritizing each one for a specific client or project. This ensures you choose the most suitable solution, optimal in terms of cost versus requirements, without overlooking something essential.

FEATURES AND INTEGRATIONS

Understanding what functionality is needed right now and may be required in the future allows for designing the right technical solution. It’s important to take into account all limitations and assess the need for additional services provided during video streaming. Such services can include recording, screenshot generation, AR/VR functionality, and machine learning capabilities (background blurring, face recognition, etc.).

NUMBER OF VIDEO SESSION PARTICIPANTS

Important factors to consider at the planning stage are the number of participants in media sessions and how many of them are sources and receivers of video streams. This affects the choice of technologies that will allow the transmission of video of satisfactory quality for all users.

LATENCY

By latency, I mean what “real-time” means for your product. All real-time services have some delay. For example, in online conferences or live streams delivered over broadcast standards, there’s typically a delay inherent to the standard itself. Even in advanced implementations, it starts at around 10-12 seconds and can go up to a minute. In practice, feedback usually comes through chat, and delays in responses are often attributed to the publisher’s own human delays (not having read a message in time, not responding in time, etc.).

DIGITAL RIGHTS MANAGEMENT (DRM) AND REGULATORY COMPLIANCE

DRM is necessary because securing the transmission channel alone does not protect the content: there’s no foolproof protection against hacking, even if the channel is fully secured. Any protection measure needs to be carefully evaluated in terms of the trade-off between implementation time and the level of protection, because implementation time increases significantly with the level of protection.

Industry-specific requirements can impose additional demands. For example, in healthcare, complying with HIPAA involves implementing specific security measures and adhering to guidelines to protect sensitive health information.

INFRASTRUCTURE COSTS

Video services incur ongoing maintenance costs that depend on the type of service and the workload. They consume significant processing power and bandwidth, which has to be paid for, either to service providers or directly for your own servers and bandwidth.

Service providers might charge around 4-5 cents per minute of service (per publisher). In theory, self-implementation could reduce costs to about 1 cent per minute, but this depends on the services provided. Some services have non-standard billing strategies: with Zoom, for example, you pay per host rather than for time, and other services charge fixed fees after a certain number of minutes. This makes it possible to choose different third-party services for different business cases.
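
To compare the two options, it helps to put the per-minute prices into a monthly estimate. The sketch below uses the per-minute figures mentioned above; the 100,000 minutes of monthly usage and the fixed monthly cost of running your own infrastructure are hypothetical numbers added only to show the break-even logic.

```python
def monthly_cost_dollars(minutes: float, cents_per_minute: float) -> float:
    """Usage-based monthly cost in dollars."""
    return minutes * cents_per_minute / 100


def break_even_minutes(
    fixed_monthly_dollars: float,  # assumption: servers, DevOps, maintenance
    provider_cents: float,
    self_hosted_cents: float,
) -> float:
    """Minutes per month above which self-hosting becomes cheaper."""
    saving_per_minute_cents = provider_cents - self_hosted_cents
    if saving_per_minute_cents <= 0:
        raise ValueError("self-hosting must be cheaper per minute to break even")
    return fixed_monthly_dollars * 100 / saving_per_minute_cents


provider_bill = monthly_cost_dollars(100_000, 4)     # 4 cents per minute
self_hosted_bill = monthly_cost_dollars(100_000, 1)  # 1 cent per minute
threshold = break_even_minutes(3_000, 4, 1)          # hypothetical $3,000 fixed cost
```

Under these assumed numbers, self-hosting pays off only above roughly 100,000 minutes per month, so low-volume products are usually better served by a provider.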

Wrapping Up

The effectiveness of video streaming and conferencing solutions relies on their specific characteristics. Choosing the right architecture is crucial in determining how well the product will perform. It is important to carefully consider the following factors: desired features and integrations, the number of participants in the video session, acceptable latency, digital rights management and regulatory compliance, and the cost of the infrastructure.
