WebRTC tutorials always have something in them like “look what you can do in fifteen minutes!” It is a testament to WebRTC’s design that this is even possible: fifteen minutes to get a working peer-to-peer demo is pretty astounding. And by and large, these tutorials are legit: you can get some pretty awesome crap working in well under an hour.
That said, the difference between a working demo and something ready for production is significant. For example,
- What browsers are you going to support?
- Will it be deployed on corporate networks? Will the solution have to handle proxies and firewalls?
- Is it 1–1 chat, or will it be a multi-party conference?
- How will you help users configure their microphones, speakers and cameras?
- Will you support iOS and Android clients, possibly through a native app?
…these are… not simple questions. In the most trivial case, everybody is using Chrome, it’s all one-on-one chats, and it’s generally fairly easy. But the reality is you’re probably going to have a Product Manager of Doom (PMoD) insist you support IE11, Safari, a gaggle of mobile phones, and a room with 2 to 10 concurrent parties. And that’s where stuff gets… hard.
Let’s start with browsers. Maybe you get lucky and your PMoD only wants Chrome. Yay! Skip to the next section.
Or not. They want IE11 and Safari. This means: a plug-in. Your options are: some GitHub projects that wrap a fairly old version of WebRTC, paying Temasys, or rolling your own unholy union of open source software, like FireBreath plus OpenWebRTC or Chrome’s WebRTC implementation.
This also likely means supporting an installer (possibly for both Windows and OS X), potential MPEG-LA licensing issues if you include H.264 support, and a doubling/quadrupling of QA effort, since plug-ins frequently behave differently than native implementations in some rather crucial ways. You might also get to deal with other fun stuff, like cross-platform drawing models, if you opt for the home-grown solution.
Lastly, even with an installer, users inside certain companies are going to be blocked from installing anything at all. So you’ll need a way to deploy internally in those companies, and to handle updates.
Let’s say your PMoD primarily wants to use your product with Fortune 500 companies. Now your problem is: every one of those companies is going to have some god-awful proxy/firewall infrastructure, combined with a healthy dose of random anti-virus software running locally, that is going to stymie every effort to seamlessly deploy your product.
You will absolutely need a TURN server, like coturn, and you will absolutely need to support both HTTP CONNECT proxies and TURN over TCP on port 443, because certain network security types don’t seem to understand that A) there’s a whole wide world of networking beyond HTTPS and B) UDP isn’t a fad.
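On the client side, pointing WebRTC at that kind of relay might look something like this sketch — the hostname and credentials are placeholders, not a real deployment:

```javascript
// Hypothetical ICE configuration forcing media through a TURN relay
// speaking TLS on port 443, so it looks like ordinary HTTPS traffic
// to proxies and firewalls. Hostname and credentials are placeholders.
const iceConfig = {
  iceServers: [
    {
      urls: ["turns:turn.example.com:443?transport=tcp"],
      username: "webrtc-user",
      credential: "not-a-real-secret",
    },
  ],
  // Skip host/server-reflexive candidates and go straight to the relay,
  // useful on networks you already know will block direct UDP.
  iceTransportPolicy: "relay",
};

// In the browser you would hand this to the peer connection:
// const pc = new RTCPeerConnection(iceConfig);
```

Forcing `relay` everywhere wastes relay bandwidth on friendly networks, so in practice you’d only fall back to it after normal ICE fails.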
Even with a TURN server and 443 access, there’s going to be networks that simply won’t cooperate. For those, you’ll have to track down some person inside this monolithic company and convince them to grant you access. Good times.
Lastly, you’ll have to scatter these TURN servers around the world, because your users are world-wide.
You start to build up a tiny army of servers in AWS.
Your PMoD isn’t done with you yet. Now they want more than two people in the same conference. The problem is: as you add people to a conference, you’re going to run into the handshake problem. Basically, with n participants, the number of connections is dictated by n(n−1)/2.
So if you have four people, that’s 4(4−1)/2, or six connections. Going to eight suddenly means you have 8(8−1)/2, or twenty-eight connections. Your PMoD thinks ten sounds good; that puts you at 45 connections to be established, worst case.
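That growth curve is easy to sanity-check; this is just the formula above, plus the per-participant upload count that bites you next (the function names are mine):

```javascript
// Connections in a full mesh of n participants: each of the n peers
// connects to the other n - 1, and every connection is shared by two.
function meshConnections(n) {
  return (n * (n - 1)) / 2;
}

// Streams each participant must push upstream in that mesh:
function upstreamStreams(n) {
  return n - 1;
}

console.log(meshConnections(4));  // 6
console.log(meshConnections(8));  // 28
console.log(meshConnections(10)); // 45
```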
Sadly, this won’t scale. At ten people, each person in the room has to push data to nine other participants. Most people lack the upstream bandwidth to do this. You’ve got a Real Problem™.
So how do we get around this? We use a Selective Forwarding Unit (SFU) or a Multipoint Control Unit (MCU), or we tell our PMoD they’re on drugs and we can support four (maybe five??) people in a P2P full mesh of doom.
Of course, suddenly you need infrastructure — a lot of it, and likely redundant in case some server dies. And that server needs to be rock solid, since it’s a single point of failure for a conference. And you likely need it deployed all around the world, because there’s a serious latency penalty for a user in Australia trying to connect to your media server in the eastern United States. And that infrastructure is going to have to work with your TURN servers. And you need to make a decision about SFU or MCU, both of which have really distinct tradeoffs.
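A back-of-the-envelope way to see the SFU/MCU tradeoff, counting one media stream per sender and ignoring simulcast and bandwidth adaptation entirely (this function is illustrative, not any real API):

```javascript
// Per-client stream counts for a room of n participants under each
// topology. "sent" is what the client uploads; "received" what it pulls.
function streamsPerClient(n, topology) {
  switch (topology) {
    case "mesh": // every peer talks to every other peer directly
      return { sent: n - 1, received: n - 1 };
    case "sfu":  // client uploads once; the SFU forwards everyone else's
      return { sent: 1, received: n - 1 };
    case "mcu":  // client uploads once; the MCU mixes into one stream
      return { sent: 1, received: 1 };
    default:
      throw new Error("unknown topology: " + topology);
  }
}

console.log(streamsPerClient(10, "mesh")); // { sent: 9, received: 9 }
console.log(streamsPerClient(10, "sfu"));  // { sent: 1, received: 9 }
console.log(streamsPerClient(10, "mcu"));  // { sent: 1, received: 1 }
```

The SFU fixes the upstream problem but leaves the client decoding n−1 streams; the MCU fixes both at the cost of server-side transcoding and losing per-stream layout control.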
Your PMoD sends you a frantic email; the CEO tried out the product, and her camera didn’t work right and she couldn’t hear anyone else. Turns out the speakers were turned down and the camera was in use with Skype. It also turns out This Is Your Problem™.
You need to provide ways for people to enumerate, configure and test their hardware. This sucks pretty bad, for a number of reasons:
- Windows kinda sucks at sharing between apps. Example: camera is claimed by Skype. Chrome fails to get camera.
- Windows really sucks at managing audio devices. You have your default microphone, but you also have a default communications device, and nobody seems to understand this very well because on some basic level it’s incomprehensible.
- Power users always want to configure their speaker output. They want it to come through their headset, not through their computer speakers (duh), so you’re gonna have to be able to configure that.
- Except Firefox doesn’t allow that. No speaker configuration for you.
- And Firefox’s getUserMedia implementation is insane. Every time you request access to mic and camera, it throws up a permission prompt. Thanks a lot, Firefox.
- Which reminds me: god forbid the user who accidentally denied that permissions prompt in Chrome or Firefox. That’s a support call. Crappo.
- Oh, also, your Fortune 500 customers keep installing Chrome/Firefox with camera and mic access disabled by an administrator. Thanks a lot guys. Thanks for everything.
…you’ll also need some way of measuring microphone strength, deciding the video stream is legit, and allowing users to configure their speakers correctly. For a serious project, these aren’t niceties; they’re crucial to actual success in the real world.
And then come all the various other junk: muting audio/video, allowing people to swap hardware mid-conference, providing people tools to debug their devices, etc.
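Much of this boils down to two chores: sorting what `enumerateDevices()` gives you into pickable menus, and translating `getUserMedia()` failures into something a human can act on. A sketch, with the browser-only calls left as comments so the logic stays testable (the helper names are mine):

```javascript
// Group a device list (the shape returned by
// navigator.mediaDevices.enumerateDevices()) into per-kind menus.
function groupDevices(devices) {
  const groups = { microphones: [], cameras: [], speakers: [] };
  for (const d of devices) {
    if (d.kind === "audioinput") groups.microphones.push(d);
    else if (d.kind === "videoinput") groups.cameras.push(d);
    else if (d.kind === "audiooutput") groups.speakers.push(d);
  }
  return groups;
}

// Map a getUserMedia rejection's error name to a support-friendly hint.
function explainGetUserMediaError(name) {
  switch (name) {
    case "NotAllowedError":  // user denied the prompt, or admin policy
      return "Permission denied -- check browser and admin settings.";
    case "NotFoundError":    // no matching device plugged in
      return "No camera or microphone found.";
    case "NotReadableError": // e.g. Skype already claimed the camera
      return "Device is in use by another application.";
    default:
      return "Unknown media error: " + name;
  }
}

// In the browser:
//   const devices = await navigator.mediaDevices.enumerateDevices();
//   const menus = groupDevices(devices);
// Speaker selection only works where audio elements expose setSinkId()
// (not Firefox, at the time of writing):
//   if (el.setSinkId) await el.setSinkId(speakerDeviceId);
```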
PMoD doesn’t understand why they can’t use iOS. No problem, you say. I’ll just port some WebRTC library to mobile, create a native app, implement all of the signaling necessary for the SFU/MCU solution that we ended up selecting, and voila: done. Piece of cake!
Oh, and then I’ll do most of that all over again, but for Android.
This requires an intimate knowledge of stuff like SDP offer/answer exchanges, trickle ICE, DTLS negotiation, native code and so forth. Your project takes a three-month hit. It also tightly couples you to whatever SFU/MCU you opted for, because signaling isn’t part of the WebRTC standard.
…that damn fifteen minute demo. Look what it got you.
I shouldn’t complain. I really am not complaining. It used to be that it’d take a sizable team (teams??) of engineers to launch a solution like this. Now, with WebRTC, it’s entirely possible to have a small team manage all of this infrastructure for a single company.
It isn’t for the faint of heart: you’ll want some media and networking gurus, someone with SIP/SDP experience who can really save your butt, and someone who is fearless in the face of some ungodly native code. But it’s within the realm of reason.
Just beware the fifteen-minute demo.