Chris Ward

@chris_82106

Video Conferencing without the Video?

In this post, we demonstrate an idea for wear-your-pajamas-enabled video conferencing.

We all know how significant non-verbal cues are for effective communication. Yet audio-only calls still dominate conference calling. With today’s tech, why isn’t video the default?

I suspect the answer is a combination of tech shortcomings, cultural norms, and personal preferences. I discussed some of the tech aspects I found lacking in my previous article, “It’s 2016, why is video conferencing still terrible.” Today, I want to focus on two shortcomings of video conferencing:

  • Video requires a robust internet connection: even when our own connection is sufficient, we may worry whether the other people on the call will have a good experience.
  • Video is less private — we don’t always want to show our state of dress, our messy space, our multitasking.

These issues can make us feel uncertain and uncomfortable, which are not emotions we tend to actively seek out.

What if we could keep many of the body-language benefits of video conferencing, but significantly reduce these negatives? Video conferencing without video!

What?

Actually, video conferencing with avatars. Instead of transmitting video of ourselves, a conferencing system could transmit a representation of ourselves. Using a webcam video as input, it is possible to track facial feature motion in real time. Those facial feature positions would be transmitted to other users, who would see a reconstructed animated avatar. Your webcam video never needs to leave the comfort of your own computer!
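
To make the data flow concrete, here is a minimal Python sketch of what one frame of transmitted facial feature positions might look like on the wire. This is an illustration only, not the format Locus uses: the function names, the packet layout (a 5-byte header followed by 16-bit coordinates), and the convention that the tracker outputs coordinates normalized to the range 0 to 1 are all assumptions. The face tracker itself is out of scope; the sketch starts from the points it would produce.

```python
import struct
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y), each assumed normalized to [0.0, 1.0]

# Hypothetical packet layout (an assumption, not a real protocol):
#   uint32 frame index, uint8 point count, then one uint16 pair per point.
HEADER_FORMAT = "<IB"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 5 bytes, no padding with "<"


def encode_landmarks(frame_index: int, points: List[Point]) -> bytes:
    """Quantize normalized landmark positions and pack them into bytes."""
    if len(points) > 255:
        raise ValueError("point count must fit in one byte")
    quantized = []
    for x, y in points:
        for value in (x, y):
            clamped = min(max(value, 0.0), 1.0)  # guard against tracker overshoot
            quantized.append(round(clamped * 65535))
    header = struct.pack(HEADER_FORMAT, frame_index, len(points))
    body = struct.pack(f"<{len(quantized)}H", *quantized)
    return header + body


def decode_landmarks(packet: bytes) -> Tuple[int, List[Point]]:
    """Recover the frame index and approximate landmark positions."""
    frame_index, count = struct.unpack_from(HEADER_FORMAT, packet, 0)
    values = struct.unpack_from(f"<{2 * count}H", packet, HEADER_SIZE)
    points = [
        (values[i] / 65535, values[i + 1] / 65535)
        for i in range(0, len(values), 2)
    ]
    return frame_index, points


if __name__ == "__main__":
    sender_points = [(0.31, 0.42), (0.69, 0.42), (0.50, 0.75)]  # made-up values
    packet = encode_landmarks(1, sender_points)
    print(len(packet), decode_landmarks(packet))
```

With this layout, a frame of 68 points packs into 277 bytes (5 header bytes plus 68 × 4). The receiver would feed the decoded points into whatever avatar renderer it prefers; only these numbers cross the network, never the webcam image.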

Proposed methodology

It’s been said, “If a picture is worth a thousand words, then a video is worth a million.” I certainly don’t want to write a million words. We’ve created an example of this idea by converting the demo videos from our existing conferencing system, Locus (in public beta), into simple avatars. This is a first hack demonstration, not a polished work of art.

Locus with videos (left) vs. avatars (right)
Locus demo using avatars. *Best experienced with headphones* For reference, original demo with videos here.

Interesting, right? But why would we want such a system?

  1. Reduced bandwidth. Video conferences require a data rate roughly 100 times higher than audio-only calls, so it’s no surprise video can be unreliable! If we transmit only facial feature positions, we can reduce the “video” data rate to below that of an audio stream. If connectivity is good enough for an audio call, it would generally be sufficient for audio + avatar.
  2. Increased privacy. Only your body language is transmitted, not your webcam.
  3. Appear together more naturally. With Locus we’re working on translating the in-person experience of being together in a room into a conferencing system. Using video feeds, we currently break the metaphor with each participant appearing in a box with their own background from their own physical space. With avatars, conference participants can all appear together in one space.
  4. Fun. Who wouldn’t want to show up to their next meeting looking like Superman?
  5. Reduced CPU. This will depend on the implementation details of both the avatar system and the traditional video conferencing system, but eliminating (sometimes multiple) video transcoding steps could result in lower computation and therefore increased battery life. Reconstruction could also take place on each user’s device, allowing that user to choose their own trade-off between CPU and avatar quality.
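
The bandwidth point (item 1) can be sanity-checked with a back-of-envelope calculation. The Python sketch below is an estimate under stated assumptions, not a measurement: the 68-point face model, the bytes per coordinate, and the frame rates are illustrative choices, and the comparison target is a 64 kbit/s G.711 audio stream (8,000 samples per second at 8 bits each). It counts raw, uncompressed payload only and ignores packet overhead.

```python
def landmark_bitrate_bps(
    num_points: int,
    bytes_per_coord: int,
    fps: float,
    header_bytes: int = 0,
) -> float:
    """Raw bit rate of a stream of 2D landmark frames (no compression)."""
    bytes_per_frame = header_bytes + num_points * 2 * bytes_per_coord
    return bytes_per_frame * 8 * fps


# Reference audio rate: G.711 is 8,000 samples/s x 8 bits = 64,000 bit/s.
G711_AUDIO_BPS = 64_000

if __name__ == "__main__":
    # Assumed configurations; real systems may use different point counts.
    compact = landmark_bitrate_bps(num_points=68, bytes_per_coord=1, fps=15)
    detailed = landmark_bitrate_bps(num_points=68, bytes_per_coord=2, fps=30)
    for name, rate in (("compact", compact), ("detailed", detailed)):
        ratio = rate / G711_AUDIO_BPS
        print(f"{name}: {rate:.0f} bit/s ({ratio:.2f}x the G.711 audio rate)")
```

Under these assumptions the compact configuration comes to 16,320 bit/s, about a quarter of the audio stream, while the detailed one comes to 65,280 bit/s, roughly on par with it. How far below audio the avatar stream lands therefore depends on the point count, precision, and frame rate chosen, and on any delta coding or compression applied on top.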

Sounds pretty great. Why wouldn’t we want this system? There could be a few shortcomings:

  1. Reduced communication. In counterpoint to the privacy benefit, information about our environment, the bags under our eyes, etc. can at times provide useful context, which may be lost.
  2. Tracking errors. Tracking facial features from webcam video is not always 100% accurate and can depend on lighting quality among other factors. Although video sharing is dependent on similar factors, an avatar system might degrade less gracefully.
  3. Creepiness / uncanny valley. Perfecting avatars to avoid the creepy factor could be a ton of work, and may never get to 100%.
  4. Multiple users per webcam. Tracking several people who share a single webcam stream may be a much harder challenge to address.
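
One possible way to soften the tracking-error problem (item 2) is to smooth the landmark positions over time and to hold the last trusted pose when the tracker is unsure. The Python sketch below shows that idea only; it is not part of the demo. The class name, the `alpha` and `min_confidence` parameters, and the assumption that the tracker reports a confidence score between 0 and 1 are all hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


class LandmarkSmoother:
    """Exponential smoothing with a hold on low-confidence frames."""

    def __init__(self, alpha: float = 0.5, min_confidence: float = 0.6) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha                    # weight given to the newest frame
        self.min_confidence = min_confidence  # below this, the frame is ignored
        self.state: Optional[List[Point]] = None

    def update(self, points: List[Point], confidence: float) -> Optional[List[Point]]:
        """Return the pose to render, or None if no trusted frame exists yet."""
        if confidence < self.min_confidence:
            # Untrusted frame: keep showing the last trusted pose.
            return self.state
        if self.state is None or len(self.state) != len(points):
            # First trusted frame (or point count changed): start fresh.
            self.state = list(points)
        else:
            a = self.alpha
            self.state = [
                (a * x + (1 - a) * px, a * y + (1 - a) * py)
                for (x, y), (px, py) in zip(points, self.state)
            ]
        return self.state


if __name__ == "__main__":
    smoother = LandmarkSmoother(alpha=0.5, min_confidence=0.6)
    print(smoother.update([(0.0, 0.0)], confidence=0.9))
    print(smoother.update([(1.0, 1.0)], confidence=0.9))
    print(smoother.update([(9.0, 9.0)], confidence=0.1))  # glitch frame is ignored
```

The trade-off is latency: a lower `alpha` hides jitter but makes the avatar lag behind quick expressions, so a real system would need to tune it, or use a more adaptive filter.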

The avatars in such a system could be designed in many different ways. Some ideas:

  • Simple 2D cartoons with customization. Similar to Chris & JD in the demo.
  • 3D rendered characters. Similar to Lois in the demo.
  • Facial manipulation from an image. Use a still photo of the subject and manipulate it to match the subject’s live facial expressions. See the work of Thies et al. for a somewhat creepy but amazing example of this idea.

Trade-offs on avatar types

We’ve created examples of these options below. The simple cartoon in its current form does lose some expression information, but could be improved. We do like the simplicity of this option, and such cartoons could be generated automatically from webcam images.

Avatar option examples.

The tech required to realize this proposal appears feasible. What do you think? Could avatar conference calls be the future?
