Multiple successful tech startups, done some big company time too. Based in Edinburgh, Scotland.
A job with 50 flights a year, a life in back-to-back audio and video call meetings, and the same question over and over again in my mind: why are all these conferencing platforms so shit? We’ve got AI making deepfakes of dead people in 4K with Dolby surround, and my colleague is still an unintelligible, delayed single pixel falling off the edge of the screen, out of sync, and no one ever seems able to share the right screen. No, I’m still seeing the Presenter View.
I quit a dream job with a multinational technology company last year to try and fix this. This is not my first rodeo, but I quickly found out the problem was a lot deeper and more fundamental than I expected. Then Covid happened, and now everyone’s asking the same question I was asking: why are all these conferencing platforms so shit?
Let me take you down the rabbit hole I find myself in. It’s all about choosing where you attack on the value stack, the market dynamics of an oddly sized, fragmented market, and what I think could be one of the greatest business opportunities I’ve seen in ages.
There have been numerous surveys on the problems with video conferencing, and you’ll know the list: audio issues, sync issues between video and voice, poor picture quality, latency, echo problems and hearing yourself, people not hearing you and not being able to interrupt them. The list goes on, long and familiar.
And then there have been many psychological studies on why we have such a negative experience, from the lack of eye contact suggesting dishonesty, to the overloading of the brain caused by dividing people into little boxes on the screen, to the detrimental effects of hearing yourself. So why haven’t the titans of technology, the Googles and the Microsofts of this world, or the younger, focused companies like Zoom, fixed these already?
The reason these tools are not great is actually really obvious if you’ve ever used specialised commercial software or services, like your company’s own homemade HR or holiday booking system, or one of those websites for a small local government service. They’re all terrible to use, full of bugs, and don’t work well or as expected. The reason is that there’s just not a big enough user base to justify the investment to build something better. Building robust, reliable services with well thought out UX takes a lot of investment.
Before Covid hit, the whole of the video conferencing market in the US was only worth about $3.85B (USD). Which isn’t that big a number if you think about it. It was split about 50/50 between HW and SW/Service. Also, corporate conferencing, the bit you first think of, is only about a quarter of the market again, with other submarkets like Education, Healthcare, Government and Defence making the other pieces. If you think the corporate platforms are not great, you should see the other specialised ones, they’re shockingly bad. So we’ve started with not that great a number, and divided it by 2, and then divided by 4, you’re now in the 100s of millions of dollars range.
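As a rough back-of-the-envelope sketch of that arithmetic (the function and variable names are made up for illustration, and the even 50/50 and one-quarter splits are just the approximations quoted above, not precise figures):

```python
def narrow_market(total_usd: float, hw_or_sw_share: float, segment_share: float) -> float:
    """Narrow a total market figure down to one half (HW or SW/Service) of one segment."""
    return total_usd * hw_or_sw_share * segment_share


# Pre-Covid US video conferencing market, as quoted above.
us_market_usd = 3.85e9

# Roughly half is SW/Service, and roughly a quarter of that is corporate.
corporate_software_usd = narrow_market(us_market_usd, 0.5, 0.25)

print(f"${corporate_software_usd / 1e6:.0f}M")  # prints "$481M"
```

And that is before you split it again between all the companies competing for the same customers.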
Then of course the market is fragmented, and each company has a smaller piece of this. When you get to those levels of revenue, quite frankly the return on investment on building, say, a new advanced echo cancellation system isn’t great.
If you look at the teams behind many of these products, a large part of the operational cost is the sales organisation, winning and keeping those customers, and the engineering teams are grossly under-resourced and overloaded.
Which brings me to the other characteristic of this market: the need for extremely high reliability. Imagine you were using one of those virtual background capabilities that are very fashionable at the moment, the ones that cut people out of their background and make them look like they’re on the beach (for some reason), and it didn’t properly recognise the face of a colleague with darker skin, and they disappeared. You can imagine that happening, and that’s the exact reason you wouldn’t use that feature in a serious meeting.
For a capability to reach that enterprise-level quality it needs to be truly well qualified for all edge scenarios, with all biases removed, and again that adds significant cost and reduces the ROI further. And that’s the main reason these tools aren’t great. Making them better isn’t outside the realm of human capability, it’s just not worth it for the companies to do. All the best engineers in the world are off pushing the envelope on deepfakes, they don’t have time for this stuff.
To find the solution we need to go back to those budget local government websites and free service apps again. In the last 5 years some of them have got a lot better. Why? In big part this is due to the availability of various intermediate technology stacks that you can build your service on top of.
Everyone knows you’re not gonna go out there now and buy your own servers, you’ll use AWS/GCP/Azure for the compute and serving capability. But a lot of the underlying SW you need for capabilities like accounts, billing, data management, security, and UI frameworks is also available, either for free or as great paid-for services, and it’s so much better than anything you could ever make yourself, because the providers have the economy of scale. You’re left to make only the application SW, so your investment goes a lot further.
The world of video and audio conferencing doesn’t have this, you can’t buy in all those well-developed underlying capabilities you need, well not yet anyway. You’d have to build a lot of it yourself.
If you set out to build your own video conferencing platform tomorrow, say with a more usable screen sharing flow, or with features suited to remote exercise classes, you’d likely get the HW on AWS and use open source for the OS and server infrastructure. You’d even be able to find a great white-label video conferencing SW platform, free or paid, to take and tailor. But if you used these pieces and built your application on top of them, you’d likely have poor audio quality, echo problems, and disappointing video performance.
If you wanted better quality there, even on par with the Zooms and Microsofts of this world, you’d have to sink millions into making your own audio/video signal processing, since there isn’t anything out there, free or paid for, that’s any good and properly qualified. Yes, big guys like Microsoft, Google and Zoom have their own, but they’re not gonna sell just that capability to you to build a platform that competes with them, and theirs isn’t even that good, because they couldn’t justify the investment needed in the first place (as explained above).
Go a bit more application specific: that background swap feature. You’d have to build your own and fully qualify it to make sure it works for all kinds of webcams, room shapes, backgrounds, skin tones, etc. That’s a reasonable investment, and that’s only to get you on par with the incumbents.
So unfortunately, despite your best intentions, your platform will be just as bad, if not worse, and even though you have a better screen sharing flow, or features that suit your market, people will not want to use your product.
At this point, the solution should be obvious: to break this situation we need independent technology companies that want to invest in doing the audio and video signal processing well, with a business model set up to sell these capabilities to the conferencing platform providers, new and old.
The benefit of this approach is that the ROIs start making sense. If you develop a great real time noise cancelling capability, AI powered, highly qualified, you can sell this to all the major conferencing providers, and not just corporate, but also education, healthcare, and even other markets outside of conferencing. Now the market size is big enough to justify the investment, and you’re not then trying to build, market and sell a whole conferencing platform to end users, but just this capability.
If you look at the conferencing market, and want to make it better, I believe the worst thing you could do is to try and make another conferencing platform. It won’t be better, you just won’t have pockets deep enough, and even if you did, you’ll never get a reasonable return on it. Try not to be the hero and household name, instead identify the underlying technologies that would make the existing platforms better or enable better new platforms, develop them, and find a business model that allows you to sell them to many.
If I were to describe a world where video conferencing isn’t rubbish, there would be major independent technology companies at various levels of the value stack. There would be 2 or 3 companies offering great noise suppression capabilities, some of which would also have echo cancelling. There would be a number of technology companies offering great real-time video processing capabilities, and there would be startups breaking through with new advanced capabilities for this market, some quite high up the value stack with very application-specific features.
As an end user of conferencing you wouldn’t be buying these capabilities from these companies, in fact you would likely never hear the name of these companies at all, but the conferencing platform providers you use and know, would buy these capabilities from the companies mentioned above, and this would free the platform companies to use those sales organisations to listen to the end customers, figure out what they need at the application level, and tailor products to work the way the customers want, with access to rich powerful underlying technologies from various suppliers.
To fix this market we need an ecosystem of players, that is the solution to the problem, not one vertical behemoth owning the whole stack. We’ve seen that doesn’t work.
With the explosion in video conferencing due to Covid-19, the market has obviously grown, but it’s too early to know how much of this increase is here to stay. Besides the growth finally allowing the big players to justify their investments in technology, we’re also, independently, starting to see more companies trying to build the underlying technologies with a sell-to-all business model, rather than making their own platform.
On noise suppression, a great example is a startup called Krisp, building an AI-powered noise suppressor. Another recent example is a tool in beta called mmhmm from All Turtles, a set of features mostly around video and image processing to make video calls more productive.
Today these companies are offering their capabilities as plugins that the end user can use with any conferencing platform. I think it feels inevitable that for these startups to succeed they will have to evolve out of offering plugins to end users and become technology suppliers to the platform providers, but that in itself is a journey they will have to go through.
You might be thinking some of these real-time audio signal processing capabilities must have existed for years. My phone does it, and surely not every phone manufacturer is trying to build their own capabilities from scratch?
If you go back 5 years, the majority of the audio signal processing was happening on the communication devices themselves, like your phone, the physical conference phone sitting in the conference room, and even your laptop. Traditional integrated circuit and signal processing companies like Analog Devices, Texas Instruments and STMicroelectronics have been providing real-time signal processing capabilities for decades.
But two things have happened recently that change things up:
- While the initial signal processing in the devices remains, we’re seeing the more advanced signal processing capabilities move away from the devices themselves to the cloud, where there’s more information, or information from multiple devices, enabling much better processing opportunities for certain problems.
- Advances in AI for both audio and video processing, combined with the vast increase in centralised processing capability, have increased the possibilities, but in many cases have also pushed the signal processing further up from the embedded devices and away from the offerings of the traditional incumbents in the real-time signal processing space.
I believe, and am hopeful, that we’ll witness one of 2 things over the next decade in this space of real-time signal processing. Either the traditional powerhouses of signal processing, predominantly the IC manufacturers, will grow their AI-driven capability, transition further up the signal processing chain and offer more application-specific audio and video processing capability, or we’ll see a new breed of major company emerge, filling that gap in the ecosystem and providing application-specific real-time signal processing capabilities. There’s a real need for this.
I have a new startup, but we’re still in stealth, so we’re not talking much about what we’re doing at the moment. We do have a couple of cool bits of technology, and our goal is to build state-of-the-art, fully qualified audio and video signal processing capabilities, with a focus on the voice and video communication and conferencing market. No surprises there.
We want to make video conferencing better, a lot better.
We’re experimenting with a sort of Digital Signal Processing as a Service (DSPaaS) business model, where you pay by the minute for the real-time processing of audio and video feeds in the cloud. This way, regardless of your size and application, you can get hold of the best signal processing for your needs, and you don’t need to worry about the infrastructure or latency, it’s all taken care of. We’re hoping to flip this market on its head, and if we do our job right, you’ll never hear our name, but your video conference platform, whichever one you use, and whatever you use it for, should suddenly get a whole lot better.
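To make the pay-by-the-minute idea a bit more concrete, here’s a minimal sketch of how the metering might work. Everything in it is hypothetical: the rate, the names and the round-up-to-a-whole-minute rule are assumptions for illustration only, not our actual pricing or API.

```python
import math
from dataclasses import dataclass

# Hypothetical price per processed stream-minute, purely for illustration.
RATE_PER_STREAM_MINUTE_USD = 0.002


@dataclass
class StreamSession:
    """One audio or video feed processed in the cloud."""
    stream_id: str
    seconds_processed: float

    def billable_minutes(self) -> int:
        # Assumption: partial minutes are rounded up to a whole minute.
        return math.ceil(self.seconds_processed / 60)


def invoice_total(sessions: list[StreamSession],
                  rate: float = RATE_PER_STREAM_MINUTE_USD) -> float:
    """Sum the cost of every processed feed at a flat per-minute rate."""
    minutes = sum(s.billable_minutes() for s in sessions)
    return round(minutes * rate, 6)


# A 45-minute call with three participants, each sending one feed.
call = [StreamSession(f"participant-{i}", 45 * 60) for i in range(3)]
print(invoice_total(call))
```

The point of a flat rate like this is that a two-person startup and a big platform provider would pay for the same processing in the same way, just in different volumes.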
If you want to learn and read more on the space, here are some links to the studies, market analysis and companies I talked about in the article above:
OWLLabs — State of Video Conferencing 2018
A nice survey by OWLLabs on the state of video conferencing in 2018, and some commentary on what some of the biggest challenges in using video conferencing are
Ksenia Klykova — The impact of videoconferencing on business travel: an historical point of view
Some commentary in the article on a study of the effect of lack of eye contact in video conferencing, and how it causes lack of trust
Tom Warren — Microsoft Teams’ new Together Mode is designed for pandemic-era meetings
A new feature from Microsoft Teams, but the interesting part is their study finding that having people in different boxes on the screen seems to make you more tired
Grand View Research — Video Conferencing Market Size, Share & Trends Analysis Report
If you want to find out more information about the video conferencing market sizes and segmentations
Fortune Business Insights — Video Conferencing Market Size, Share & Industry Analysis, By Type, By Application, By Enterprise Size and Regional Forecast
Another look at the video conferencing market, with sub market breakdown — Corporate, Education, Healthcare, etc
OpenVidu and Kurento
There are a few, but OpenVidu, for example, is a free or paid-for white-label video conferencing platform, based on Kurento, that you can take and build your own video conferencing platform on top of pretty quickly
Krisp is that noise suppression startup I was talking about above. They have a free plugin you can try
mmhmm is that beta tool that helps with screen sharing and processes the video on calls. On their website, at the moment at least, you can find a video explaining what they’re doing. I’ve added the link for the video as well.
This is our new startup company working in this space, there’s almost no information on the website at the moment, sorry.