The majority of WebRTC-related material is about the application level of code writing and doesn’t help to understand the technology. Let’s dive deeper into the topic and find out how the connection is established, why we need the TURN and STUN servers, and what a session descriptor and candidates are.

What is WebRTC for?

WebRTC is a browser-oriented technology that allows us to connect two clients to transmit video data. Built-in browser support (external technologies, such as Adobe Flash, aren’t needed) and the ability to connect clients without using any additional servers (a p2p connection) are the main features of WebRTC.

Establishing a p2p connection is complicated, as computers don’t always possess public IPs (their internet addresses). Because IPv4 addresses are scarce, and for security’s sake, NAT was invented. It allows creating private networks, for instance, for home use. Many home routers support NAT, so all devices connected to the router have internet access, although service providers usually allocate only one IP address. Public IPs are unique, whereas private ones aren’t, hence a p2p connection is difficult.

To better understand the concept, let’s take a look at three scenarios:

1. Both nodes are within the same network
2. Both nodes are within different networks (private and public)
3. Both nodes are within different private networks with the same IPs

The first letter in the images above represents a node type: r for a router, p for a peer.

Image one shows a nice situation. Nodes within their network are identified by their network IP addresses, and they can connect with each other directly.

Image two shows two different networks with similarly numbered nodes. We introduce routers here, and they have two network interfaces: one inside and one outside their network. Hence, they have two IPs.
Usually, nodes have only one interface, which they use to interact within their network. If they transmit data to something outside their network, they do it only with the help of NAT inside a router. That’s why these nodes appear to the outside under the router’s IP address: it’s their external IP. Therefore, the node p1 has an external IP (10.50.200.5) and an internal one (192.168.0.200), with the first address also being external for all other nodes within the same network. The node p2 is in a similar situation, so the two can’t connect as long as only their internal addresses are used. It’s possible to go with external IPs, but it poses a challenge, since all nodes within the same private network are behind the same external address. NAT solves this problem.

What happens if we decide to connect nodes via their internal addresses? The data won’t leave the network. To magnify the effect, imagine the situation from the third image, where both nodes have the same internal addresses. If they use those addresses to communicate with one another, each of them will be communicating with itself. Here is where WebRTC steps in. To solve these problems, WebRTC uses the ICE protocol, which requires additional STUN and TURN servers.

The two phases of WebRTC

To connect two nodes with the WebRTC protocol (or just RTC if there are two iPhones), it’s necessary to complete some preliminary steps to establish a connection. That’s the first phase. The 2nd phase is video data transmission.

Although WebRTC uses several means of communication (TCP and UDP) and can flexibly switch between them, this technology does not possess a protocol for transmitting connection data, which is not surprising, as connecting two p2p nodes isn’t a simple task. That is why we need an additional way of transmitting data that is not related to WebRTC. It can be the HTTP protocol, socket transmission, or the SMTP protocol. This way of sending the initial data is called a signaling mechanism. Not much information is transmitted.
The data is transmitted as text and is split into two categories: SDP and Ice Candidate. SDP is used to establish a logical connection, Ice Candidate is for a physical connection. It’s important to remember that WebRTC gives you the information that needs to be passed on to the other node. As soon as we transmit the necessary information, the nodes will be able to connect, and our help won’t be needed anymore. Therefore, the signaling mechanism, which we need to create separately, will be used only upon connection and not while we transmit video data.

So, let’s take a look at the first phase. It consists of several steps. First, let’s look at it from the side of the node that initiates the connection, and then from the side of the node that receives it.

Initiator (caller):

1. Receiving a local media stream and establishing its transmission (getUserMediaStream)
2. An offer to begin video data transmission (createOffer)
3. Receiving its own SDP object and sending it via the signaling mechanism (SDP)
4. Receiving its own Ice candidate objects and sending them via the signaling mechanism (Ice candidate)
5. Receiving a remote media stream and showing it on the screen (onAddStream)

Receiver (callee):

1. Getting a local media stream and establishing its transmission (getUserMediaStream)
2. Receiving an offer to begin video data transmission and creating an answer (createAnswer)
3. Receiving its own SDP object and sending it via the signaling mechanism (SDP)
4. Receiving its own Ice candidate objects and sending them via the signaling mechanism (Ice candidate)
5. Receiving a remote media stream and showing it on the screen (onAddStream)

Only the 2nd step is different. However complicated these steps might seem, there are really just three of them: sending a local media stream (step 1), establishing the connection parameters (steps 2-4), and receiving a remote media stream (step 5).
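
The order of steps 2-4 can be traced with a small toy model. This is only a sketch: it does not use any real WebRTC API, and the names Peer, run_first_phase, and the in-memory queue that plays the role of the signaling mechanism are all hypothetical. The media stream steps (1 and 5) are left out, and the descriptors and candidates are placeholder strings.

```python
from collections import deque


class Peer:
    """Toy stand-in for one node; it only records what it sends and receives."""

    def __init__(self, name, wire):
        self.name = name
        self.wire = wire              # hypothetical signaling mechanism: a shared queue
        self.local_sdp = None
        self.remote_sdp = None
        self.remote_candidates = []

    def _signal(self, kind, payload):
        # Everything leaves the node as text through the signaling mechanism.
        self.wire.append((self.name, kind, payload))

    def create_offer(self):
        self.local_sdp = "offer from " + self.name
        self._signal("sdp", self.local_sdp)

    def create_answer(self):
        # An answer is a reply, so it needs the other side's offer first.
        if self.remote_sdp is None:
            raise RuntimeError("no offer received yet")
        self.local_sdp = "answer from " + self.name
        self._signal("sdp", self.local_sdp)

    def send_candidate(self, candidate):
        self._signal("candidate", candidate)

    def receive(self, kind, payload):
        if kind == "sdp":
            self.remote_sdp = payload
        else:
            self.remote_candidates.append(payload)


def run_first_phase():
    wire = deque()
    caller = Peer("caller", wire)
    callee = Peer("callee", wire)
    other_side = {"caller": callee, "callee": caller}
    log = []

    def deliver():
        # Hand every queued message to the node on the other side.
        while wire:
            sender, kind, payload = wire.popleft()
            other_side[sender].receive(kind, payload)
            log.append((sender, kind))

    caller.create_offer()                          # caller, steps 2-3
    deliver()
    callee.create_answer()                         # callee, steps 2-3
    deliver()
    caller.send_candidate("candidate of caller")   # step 4 on both sides
    callee.send_candidate("candidate of callee")
    deliver()
    return caller, callee, log


caller, callee, log = run_first_phase()
for sender, kind in log:
    print(sender, "sent", kind)
```

The point of the model is that nothing except short text messages travels through the signaling queue, and that once both sides hold the other side’s descriptor and candidates, the queue is no longer needed.
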
The 2nd step is the most difficult one, as it consists of two parts: we need to establish the logical and the physical connection. The latter determines the path that packets follow to get from one node to the other, and the former settles the video and audio parameters: what quality and codecs to use.

Mentally link the createOffer and createAnswer steps with the steps that transmit the SDP and Ice Candidate objects.

Now we are going to take a look at some entities, such as MediaStream, SDP, and Ice Candidate.

Main entities

MediaStream

MediaStream is a basic entity that consists of video and audio data streams. There are two types of media streams, local and remote. Local streams receive data from the input devices (camera, mic), remote streams receive data from the network. Therefore, every node has a local and a remote stream. In WebRTC, there is a MediaStream interface for these streams, as well as a LocalMediaStream sub-interface, which exists specifically for a local stream. In JavaScript, you will only come across the former, but if you use libjingle, you can also encounter the latter.

WebRTC suggests a rather complex hierarchy within a stream. Every stream consists of several media tracks (MediaTrack), which can consist of several media channels (MediaChannel). There can also be several media streams.

For example, we not only want to transmit a video of ourselves but also our table with a piece of paper on it, as we are about to write something on that piece of paper. We’ll need two videos (of us and of the table) and one audio (us). Obviously, we and the table should be divided into different streams, as they aren’t really dependent on each other. That’s why we’ll have two MediaStreams: one for us and one for the table. The first one will have video and audio data, and the 2nd one video data only.

The media stream has to be able to hold different types of data, namely video and audio.
This is accounted for in the technology, so every data type is implemented through a MediaTrack. A MediaTrack has a special property called kind, which determines whether we are dealing with video or audio.

So how does everything happen inside the program? We create two media streams. Then we create two video tracks and one audio track. We get access to the cameras and the microphone and tell every track which device it needs to use. We add a video track and an audio track to the first media stream, and the video track from the 2nd camera to the 2nd media stream.

How do we distinguish media tracks on the other end? By the label property that every media stream has. Media tracks have the same property.

So, if we can identify the media tracks by a label, why do we need to use two media streams instead of one in this example? After all, you can transmit one media stream and use different tracks within it. Here we reach an important property of media streams: they synchronize media tracks. Different media streams aren’t synced between each other, but all tracks within each media stream are played simultaneously.

Therefore, if we want our words, facial expressions, and the piece of paper to be played at the same time, we need to use the same media stream. If it’s not too important, it’s better to use different media streams, so the picture is smoother.

If a track needs to be switched off during the transmission, we can use the enabled property of a media track.

Finally, let’s think about stereo sound. Stereo is two different sounds, and they have to be transmitted separately. MediaChannel is used for that. A media track can have several channels (for instance, 6 if we need 5+1 sound). The channels inside a media track are also synced. When a video is played, usually one channel is used, but it’s possible to use several of them, for example, to overlay advertisements.

To summarize: we use a media stream to transmit video and audio data. The data is synced inside each media stream.
We could use different media streams if we don’t aim for synchronization. There are two media tracks inside each stream, for video and audio. There can be more tracks if we need to transmit different videos (the interlocutor and their table). Every track can consist of different channels, but this is usually used for stereo sound only.

In the simplest video chat scenario, there’ll be only one local media stream with two tracks, audio and video. Each track will consist of one primary channel. The video track is responsible for the camera, while the audio track is responsible for the microphone. The media stream is a container for both of them.

Session descriptor (SDP)

Different computers have different cameras, mics, graphics cards, etc. They have a multitude of parameters. All of this needs to be coordinated for media data transmission between two network nodes. WebRTC does it automatically and creates a special object, SDP. Transmit SDP to the other node, and you can transmit the video data. There is no connection with the other node yet, though.

Any signaling mechanism can help here. SDP can be sent via sockets, humans (tell the SDP to the other node via the phone), or.. well, the post office. You get a ready SDP, and it needs to be sent out, as simple as that. When the other side receives the SDP, they need to pass it to WebRTC. It is stored as text and can be changed by the application, but that’s rarely needed. As an example, with a desktop <-> phone connection, it’s sometimes necessary to force the choice of the right audio codec.

Usually, when a connection is established, it’s obligatory to mention an address, such as a URL. There is no necessity to do it here, as you yourself will send the data via the signaling mechanism. To tell WebRTC that we want to establish a p2p connection, the createOffer function has to be invoked. After this function is called with a special callback specified, a new SDP object will be created and passed to that callback.
All you need to do is transmit this object to the other node (interlocutor) via the network.

The signaling mechanism will help the data, this SDP object, to arrive. This session descriptor is alien for the receiving node, therefore it bears useful information. Receiving this object is a signal to start the connection. So, you have to agree with it and call the createAnswer function. It is an exact analog of createOffer. Your callback will again receive a local session descriptor, which then needs to be transmitted back via the signaling mechanism.

It’s worth mentioning that calling the createAnswer function is only possible after receiving the alien SDP object. That’s because the local SDP object that is generated upon calling createAnswer has to rely on the remote SDP object. Only then is it possible to coordinate your video settings with those of your interlocutor. Also, don’t call createAnswer and createOffer before receiving a local media stream, as they will have nothing to write to the SDP object.

Since WebRTC allows you to edit an SDP object, you will need to install the local descriptor after receiving it. Passing back to WebRTC what it has just given us might seem strange, but that’s the protocol. A remote descriptor also needs to be installed upon receiving.

After this handshake of sorts, the nodes will learn about each other’s wishes. For example, if node 1 supports codecs A and B, and node 2 supports codecs B and C, they both will choose codec B, because each node knows both the local and the alien descriptor. The connection logic has been established, and it’s possible to send media streams now. There is another problem, though: the nodes are still connected only by the signaling mechanism.

Ice candidates

Upon establishing a connection, the address of the node that you need to connect with isn’t mentioned. First the logical connection is established, then the physical one, although normally it is the other way around.
It won’t seem so strange, however, if we keep in mind that we use an external signaling mechanism.

So, the logical connection has been established, but there’s no path yet that the nodes can use to transmit data. Not everything is simple here, but let’s start with the simple things. Imagine that the nodes are within the same private network. As we know, they can easily connect with each other via their internal IPs (or other addresses if TCP/IP is not in use).

WebRTC tells us the Ice candidate objects through some callbacks. They too arrive in the form of text, and they too need to be sent through a signaling mechanism, just like the session descriptors. If the session descriptor contained information about our settings at the camera and microphone level, candidates contain information about our placement inside a network. Send them to the other node, and it will be able to physically connect with us. As it already has a session descriptor, the data will start flowing. If it doesn’t forget to send us its candidate object (information on where it’s placed inside the network), we’ll be able to connect with it.

There is another difference from a classical client-server interaction. Communication with an HTTP server goes as request-answer. The client sends data to the server, the server processes it and sends the reply to the address mentioned in the request packet. In WebRTC, it’s obligatory to know two addresses and connect them from both sides.

The difference from session descriptors is that only remote candidates have to be installed. Editing is prohibited here and wouldn’t be of use anyway. In some WebRTC implementations, candidates can be installed only after the session descriptors have been installed.

So, why can there be one session descriptor but lots of candidates? Because placement within a network can be determined not only by the node’s own internal IP address but also by an external router address (one or more) and by TURN server addresses.

So, we have two candidates within one network (picture below).
How to identify them? With the help of IP addresses only. Of course, different transports can be used (TCP and UDP), as well as different ports. This is the information that’s contained inside the candidate object: IP, PORT, TRANSPORT, etc. For instance, let’s take port 531 and the UDP transport.

So, when we’re inside the p1 node, WebRTC will send us this as a candidate object: [10.50.200.5, 531, udp]. It’s not the exact format, just a scheme. If we’re inside the p2 node, the candidate will change to [10.50.150.3, 531, udp]. P1 will receive p2’s IP and PORT through a signaling mechanism and will be able to connect to p2 directly. In fact, p1 will send data to 10.50.150.3:531, hoping that it will reach p2. Whether that address is owned by p2 or by an intermediary is not important. What is important is that the data will be sent to this address and will be able to reach p2.

While the nodes are inside the same network, everything is a piece of cake, as every node has only one candidate object (its own, which is its placement in the network). But the number of candidates grows considerably if the nodes are in different networks.

Let’s take a look at a more complicated case. One node is behind a router (NAT), and the 2nd node is in the same network as that router (for example, on the internet).

This case has its own solution. A home router usually has a NAT table. This mechanism is designed so that the nodes inside the router’s private network can communicate with, for example, websites.

Let’s assume that a web server is connected to the internet directly, meaning it has a public IP. Let it be the p2 node. Then, the p1 node (web client) sends a request to the address 10.50.200.10. First, the data arrives at the router r1 or, to be precise, at its internal interface 192.168.0.1. After that, the router memorizes the source address (p1) and puts it in the NAT table. Then, the router changes the source address to its own (p1 -> r1). Then, using its external interface, the router sends the data to the web server p2.
The web server processes the data and generates an answer, which it sends back to the router. When the router receives the data, it checks the NAT table and sends the data over to the p1 node. The router here is an intermediary.

Well, what if several nodes from the internal network send a request to the external network? How does the router know where to send the answer? This problem is solved with the help of ports. When the router substitutes the node address with its own, it also substitutes the port. If two nodes access the internet, the router substitutes their source ports with different ones. Then, when a packet from the web server returns to the router, the router will recognize the recipient of the packet by the port. The example is down below.

Let’s go back to WebRTC and the part where it uses the ICE protocol (hence Ice candidates). The node p2 has one candidate (its placement inside the network, 10.50.200.10), and the node p1, which is behind the router with NAT, has 2 candidates: a local one (192.168.0.200) and a router candidate (10.50.200.5). The first one isn’t of much use here; however, it is generated because WebRTC knows nothing about the remote node yet: it may or may not be within the same network. The second candidate is useful and, as we know, the port will play an important role in getting through NAT.

The entry in the NAT table is generated only when data leaves the internal network. That’s why the node p1 has to send its data first, and only then can the data from p2 reach p1.

In practice, both nodes will be behind NAT. To create an entry in every router’s NAT table, the nodes have to send something to the remote node, but this time neither will be able to reach the other. That’s because the nodes don’t know their external IP addresses, and sending data to the internal addresses is pointless.

However, if the external addresses are known, the connection will be easily established.
If the first node sends data to the second node’s router, that router will ignore the data, as its NAT table is empty at that moment. However, the first node’s router now has the necessary entry in its NAT table. Now, as soon as the 2nd node sends data to the first node’s router, that router will successfully forward it to the 1st node. Now, the NAT table of the 2nd router has the needed data as well.

The problem is that, to find out an external IP, we need a node that is inside a public network. To deal with this problem, additional servers are used that are connected to the internet directly. They also help create those entries in the NAT table.

STUN and TURN servers

Available STUN and TURN servers must be mentioned upon WebRTC initialization, and we’ll be calling them ICE servers from now on. If the servers aren’t mentioned, only nodes from the same network will be able to connect (those that are connected to the network without NAT). It’s important to mention that 3g networks require you to use TURN servers to be operational.

The STUN server is a server on the internet that sends a return address (the source address of the node) back. The node behind the router communicates with a STUN server to bypass NAT. A packet that arrived at the STUN server contains a source address. It is the router address, in other words, the external address of our node. This is the address that the STUN server returns. Therefore, the node receives its external IP and port, which make it reachable from the network. Then, WebRTC creates an additional candidate with this address (external router address and port). Now the NAT table has an entry that lets packets sent to the router on the correct port through to our node.

A STUN server example: how it works

The STUN server will be s1. The router and the node stand as r1 and p1 respectively. We will also need to keep an eye on the NAT table, let’s call it r1_nat. That table usually holds a lot of entries from different subnetwork nodes; we won’t mention them.

Let’s start with an empty r1_nat:

Internal IP      Internal PORT    External IP      External PORT

There are 4 columns in the table. It maps each pair from the first two columns (IP, PORT) to its counterpart from the last two (IP, PORT).

P1 sends a packet to s1. We see four interesting fields in the table down below; they’re in the header of a transport packet (TCP or UDP): the IP and PORT of the source and of the receiver. Let us imagine that these are the addresses.

Src IP           Src PORT    Dest IP          Dest PORT
192.168.0.200    35777       12.62.100.200    6000

P1 sends this packet to r1. The router will need to substitute the source address, Src IP, as the address that’s mentioned in the packet won’t work for an external network. Furthermore, addresses from that range are reserved, and no node on the internet has such an address. The router substitutes the address in the packet and creates a new entry in r1_nat. For that, it needs to come up with a port number. Since different nodes within a subnetwork can call out to an external network, the NAT table has to contain additional information, so that the router can determine which node is the recipient of the return packet from the server. Let’s imagine that the router came up with port 888.

The changed packet header:

Src IP           Src PORT    Dest IP          Dest PORT
10.50.200.5      888         12.62.100.200    6000

10.50.200.5 is the router’s external address.

r1_nat:

Internal IP      Internal PORT    External IP      External PORT
192.168.0.200    35777            10.50.200.5      888

The IP address and the port for the subnetwork are the same as in the initial packet. Indeed, when the reply is sent back, we need to have a way to completely restore them. The IP for the external network is the router address, and the port changes to the one created by the router.

The actual port on which node p1 accepts the connection is, indeed, 35777, but the server sends data to a dummy port, 888. It will later be changed by the router to the real one, 35777.
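
The outgoing substitution described above can be sketched as a toy model. This is a simplification under stated assumptions, not how a real router is implemented: the class name NatRouter and the method send_out are hypothetical, and the idea that ports are handed out sequentially starting at 888 is assumed only so that the numbers match the example.

```python
class NatRouter:
    """Toy model of the router r1; it only handles outgoing packets."""

    def __init__(self, external_ip, first_port=888):
        self.external_ip = external_ip
        self.next_port = first_port   # assumption: ports are handed out one after another
        self.table = {}               # (internal IP, internal PORT) -> external PORT

    def send_out(self, packet):
        """Rewrite the source of an outgoing packet and remember the mapping."""
        internal = (packet["src_ip"], packet["src_port"])
        if internal not in self.table:
            # First packet from this node and port: create the NAT entry.
            self.table[internal] = self.next_port
            self.next_port += 1
        changed = dict(packet)        # the original packet is left untouched
        changed["src_ip"] = self.external_ip
        changed["src_port"] = self.table[internal]
        return changed


r1 = NatRouter("10.50.200.5")
packet = {"src_ip": "192.168.0.200", "src_port": 35777,
          "dest_ip": "12.62.100.200", "dest_port": 6000}
changed = r1.send_out(packet)
print(changed)
print(r1.table)
```

In this model, a packet from another internal node would simply get the next free port, which is what lets the router tell the return packets apart.
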
So, the router has substituted the source address and port in the packet header and added an entry to the NAT table. Now the packet is sent via the network to the server, the s1 node. S1 receives a packet like this:

Src IP           Src PORT    Dest IP          Dest PORT
10.50.200.5      888         12.62.100.200    6000

So, the STUN server knows that it received a packet from 10.50.200.5:888. The server now sends this address back. It’s worth stopping here for a moment to look once again at what we have seen.

The tables above are a piece of the packet header, not of its content. We haven’t discussed the content since it’s not so important: it’s described in the STUN protocol. Now, however, we will also be looking at the content. It will be simple and will contain the router address, 10.50.200.5:888, even though we took it from the packet header. This isn’t done often, as protocols don’t usually care about node addresses. The only important thing is that the packets are delivered as intended. But here we are looking at a protocol that establishes a path between two nodes.

Now we have the 2nd packet, which goes in the reverse direction:

Src IP           Src PORT    Dest IP          Dest PORT
12.62.100.200    6000        10.50.200.5      888

The header has changed because the source and the receiver swapped places, which is logical, as the packet’s destination is different now.

Content
10.50.200.5:888

This is the content of the packet. In reality, it could contain a lot of information, but only what’s important for understanding how the STUN server works is mentioned here.

Then the packet travels through the network until it ends up on the external interface of r1. The router understands that the packet isn’t meant for the router itself. How? It can be determined by the port. Port 888 isn’t used by the router for its own purposes but for the NAT mechanism. That’s why the router looks at that table. It looks at the External PORT column and searches for a row that matches the Dest PORT of the arriving packet, which is 888.

Internal IP      Internal PORT    External IP      External PORT
192.168.0.200    35777            10.50.200.5      888

We’re lucky that this row exists. If we weren’t so lucky, the packet would simply be dropped.

Now we need to understand which subnetwork node to send the packet to. Don’t hurry, let’s remember how important ports are in this mechanism. Two nodes in the subnetwork could be sending requests to an external network. Then, if the router created port 888 for the first node, it would create port 889 for the 2nd one. Let’s assume that that’s the case, and r1_nat looks like this:

Internal IP      Internal PORT    External IP      External PORT
192.168.0.200    35777            10.50.200.5      888
192.168.0.173    35777            10.50.200.5      889

We can tell by port 888 that the needed internal address is 192.168.0.200:35777. The router changes the receiver’s address from

Src IP           Src PORT    Dest IP          Dest PORT
12.62.100.200    6000        10.50.200.5      888

to

Src IP           Src PORT    Dest IP          Dest PORT
12.62.100.200    6000        192.168.0.200    35777

The packet successfully reaches node p1 and, upon looking at the packet content, the node finds out its external IP address, that is, the address of r1 in the external network. It also knows the port through which it can be reached across NAT.

What’s next? How is it useful? The usefulness lies in the entry in the r1_nat table. If anyone sends a packet with port 888 to r1, the packet will be redirected to p1. Thus, a narrow passage to the hidden node p1 is created.

From the example above you can imagine how NAT and the STUN server work. In fact, ICE and STUN/TURN servers are there to bypass NAT restrictions.

Between the node and the server, there can be several routers. In that case, the node receives the address of the router that is the first one in the same network as the server. In other words, we’ll receive the address of the router that’s connected to the STUN server. It is exactly what we need for p2p communication if we keep in mind the fact that each router along the way will have the necessary row added to its NAT table.
That’s why the way back will be as smooth as silk.

The TURN server is an upgraded STUN server, therefore each TURN server can work as a STUN server. However, there are advantages to the TURN server. If p2p communication is impossible (as in 3g networks), the server becomes a relay and starts working as an intermediary. Of course, true p2p is out of the question then, but outside of the ICE mechanism, the nodes think that they are interacting directly.

When is the TURN server a must? Why is the STUN server not enough? Because there are different kinds of NAT. They substitute the IP address and the port in the same manner, but some of them have built-in falsification protection. For example, in the symmetric NAT table, two more parameters are stored: the IP and the port of the remote node. A packet from an external network goes through NAT to the internal network only when the address and port of its source match those mentioned in the table. That’s why the trick with the STUN server doesn’t work: the NAT table stores the address and port of the STUN server. When the router receives a packet from a WebRTC interlocutor, it drops it, as it deems the packet falsified: the packet has not arrived from the STUN server.

Therefore, the TURN server is needed when the two interlocutors are behind a symmetric NAT (each behind their own).

TL;DR

Media stream

- Video and audio data are packed into media streams
- Media streams synchronize the media tracks that they consist of
- Different media streams aren’t synced between themselves
- Media streams can be either local or remote. Local ones are in charge of the camera and microphone, whereas remote ones receive encoded data from the network
- There are two types of media tracks: for video and for audio
- Media tracks can be turned on or off
- Media tracks consist of media channels
- Media tracks synchronize the media channels they consist of
- Media streams and media tracks have labels that help to distinguish them from one another

Session descriptor

- The session descriptor is used for the logical connection of two nodes within a network
- The session descriptor stores information about the available ways to encode audio and video data
- WebRTC uses an external signaling mechanism. Transferring session descriptors (SDP) becomes the application’s task
- The mechanism of logical connection consists of two steps: offer and answer
- Session descriptor generation is impossible without a local media stream in the case of an offer. In the case of an answer, it’s also impossible without a remote session descriptor
- A received descriptor must be given to the WebRTC implementation, regardless of whether this descriptor was received remotely or locally from the same WebRTC implementation
- There is also an opportunity to slightly change the session descriptor

Candidates

- An Ice candidate is a node’s address within a network
- The address can be the node’s own, the router’s, or the TURN server’s
- There are many candidates
- A candidate consists of an IP address, a port, and a transport type (TCP or UDP)
- Candidates are used to establish a physical connection between two nodes within a network
- Candidates need to be sent via a signaling mechanism
- Only remote candidates should be passed to a WebRTC implementation
- In some WebRTC implementations, candidates can be passed only after a session descriptor is installed

STUN/TURN/ICE/NAT

- NAT is a mechanism that allows access to an external network
- Home routers support a special NAT table
- Routers substitute addresses in packets. The source address becomes the router’s own if the packet goes to an external network, and the destination address becomes a node address within the internal network if the packet arrived from an external network
- NAT uses ports to allow multi-channel access to an external network
- ICE is a mechanism to bypass NAT
- STUN and TURN servers help to bypass NAT
- The STUN server allows the creation of the necessary entries in a NAT table and returns the external node address
- The TURN server generalizes the STUN mechanism and makes it work in all cases
- In the worst-case scenarios, a TURN server is used as a relay, so a p2p connection turns into a client-server-client connection

Previously published at https://forasoft.com/blog/article/what-is-webrtc-156

This article was written by Dmitry K., Fora Soft Senior web and video developer