Interactive avatars: The face as the next interface

Thanks for reading! If you enjoyed this, share it with a friend or colleague, or you can follow me on Twitter @SashaKaletsky. Thanks to Adam Turaev (Praktika CEO) for his feedback on this post.

You have probably heard that screens are addictive. Maybe you’ve even expressed that view yourself. But have you asked yourself the logical next question: what are you actually looking at on your screens? 

The answer for most people, most of the time, is human faces.

TikTok feed? Mostly faces. Instagram? Faces. Sports game? Close-up on the players’ faces after every key moment. Dating apps? 99% faces.

From an evolutionary perspective this makes sense. A Homo erectus more preoccupied by a nearby cloud, insect, or rock than by their fellow hominids would not last very long. (My undergraduate degree was in human sciences, so I have been making unprovable claims about this sort of thing for almost 15 years.)

So imagine if a product could make a face the application’s whole interface. You just imagined interactive avatars.

At Creator Ventures, we believe that interactive avatars are the next interface. We are proud to have recently led the $2.5m seed round of avatar-based English language learning app Praktika, followed by their $30m Series A with Blossom Capital. They now have over a million users and are already one of the largest avatar companies in the world (as well as the best team). We have also previously invested in a number of other companies that are building tools for avatars, including ElevenLabs and Sync Labs, and our ambition is to be invested in as many of the leaders in interactive avatar components, across the technology stack, as we can.

So clearly we are bullish. This article will explain why.

What are interactive avatars?

When most people think of avatars, they might think of talking head videos, looking squarely at the screen, perhaps with a nodding head and gesturing hands. There are many amazing founders working on those types of video avatars, and I’m sure the opportunity is big, particularly in L&D, sales enablement, video translation and some forms of marketing. It’s an interesting space.

But that is not what this specific thesis is about. We are particularly excited about a categorically different form of digital human experience: interactive avatars.

So what’s the difference? An interactive avatar is, well, interactive. At the lower end of interactivity, it can respond to your instructions via point and click. As interactivity increases, perhaps it will have a voice interface, so you speak to it and it responds. The most interactive avatars possible will listen to you, look at you, and respond to you in real time, in a totally dynamic way. This is a very different experience to a talking head video avatar.
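To make that contrast concrete, here is a deliberately toy sketch of the loop behind the most interactive end of the spectrum. Everything in it is a stand-in: there is no real speech model here, and none of these function names (transcribe, generate_reply, speak, avatar_session) are any vendor’s API. The point is just the shape of the turn cycle: listen, understand, respond, render, continuously.

```python
# A toy sketch of a fully interactive avatar loop.
# Every component is a stand-in, not a real API; the point is the
# shape of the turn cycle: listen -> understand -> respond -> render.

def transcribe(audio: str) -> str:
    """Stand-in for a speech-to-text model."""
    return audio  # pretend the captured audio is already text

def generate_reply(persona: str, text: str) -> str:
    """Stand-in for an LLM conditioned on a character persona."""
    return f"[{persona}] You said {text!r}. Tell me more!"

def speak(reply: str) -> None:
    """Stand-in for text-to-speech plus lip-synced rendering."""
    print(reply)

def avatar_session(persona: str) -> None:
    """One continuous two-way session, not a pre-rendered video."""
    while True:
        utterance = input("You: ")  # 'listening' for the next turn
        if not utterance:
            break  # the user ended the conversation
        speak(generate_reply(persona, transcribe(utterance)))

avatar_session("friendly history teacher")
```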

|  | Interactive avatar | Video avatar |
| --- | --- | --- |
| Communication | Two-way | One-way |
| Latency | Zero or close to zero | Less important; videos are pre-made |
| Personalization | More dynamic, can change with discussion | More static, fixed by set parameters |
| Realism | Usually less realistic | Usually more realistic |
| Use-cases | Learning, friendship, games, conversation practice, customer support, advanced L&D | Personalised marketing and sales, general L&D, content translation, realistic novelty |
| Technical architecture | Optimized for speed and flexibility | Optimized for realism and fidelity |

These are two very different products, and we are excited about both.

But, as implied by the title of this post, when we think about what the next interface is going to be, we are talking about interactive avatars.

Interactive avatars are predictably unpredictable

Decades of using computers have trained us to expect consistent inputs to drive consistent outputs. But, as our post on live content last year argued, sometimes unpredictability can be interesting. Sometimes people want to be unsure of the output that will come out of the machine. This is partly why increasing temperature can have a positive impact on perceived LLM creativity.
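For readers unfamiliar with temperature: it rescales the model’s next-token scores before sampling, so higher values flatten the distribution and make surprising tokens more likely. A self-contained toy illustration, with made-up logits rather than any real model’s:

```python
import math
import random

def sample_with_temperature(logits: dict[str, float], temperature: float) -> str:
    """Sample one token from softmax(logits / temperature).

    Higher temperature flattens the distribution, so unlikely
    (i.e. surprising) tokens get sampled more often.
    """
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract max for numerical stability
    weights = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    r = random.random() * sum(weights.values())
    for tok, w in weights.items():
        r -= w
        if r <= 0:
            return tok
    return tok  # fallback for floating-point edge cases

# Made-up next-token logits: "hello" is the safe, predictable choice.
logits = {"hello": 4.0, "greetings": 2.0, "ahoy": 0.5}
print(sample_with_temperature(logits, 0.2))  # almost always "hello"
print(sample_with_temperature(logits, 2.0))  # "ahoy" appears far more often
```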

Interactive avatars take this a step further.

In Praktika’s interface, for example, instead of learning from puzzle games like Duolingo or pre-set voice interfaces like other language learning apps, users learn from characters. These characters can have personalities, interests, flaws, and most importantly, they might respond to you in unpredictable ways. That is part of the appeal.

When a text- or voice-based interface does something unpredictable, that’s usually seen as a bug.

When an avatar surprises you, it’s fun.

Avatars create a blank canvas that gives app developers far more flexibility to take risks and push features forward than anything before them. There’s a reason Clippy had eyes.

Interactive avatars are a platform, not a feature

One objection I sometimes hear as somebody who invests in avatar companies is “isn’t this just a feature? Why can’t another company just add avatars?” This is much harder than you might think.

To be realistic and engaging, avatars need:

  • Screen space: To have their intended impact, avatars need to be the thing that you’re looking at on-screen. This means they need to take over the whole interface (hence the title of this post). Injecting a small avatar into the corner of a text-based interface might trick a few users for a bit, but it’s not scratching the same itch.

  • Two-way small talk: Most of our interactions with each other are pretty trivial. A history-teacher interactive avatar might say: “We’re going to talk about the Battle of Hastings, does that work for you?” and you’d respond in voice: “Sounds good”. That is 10 seconds of basically wasted time. No app developer would be willing to sacrifice 10 seconds of critical retention time for that. But for an interactive avatar app, building rapport with the character is a critical part of the experience.

  • Compromise: Most avatars look pretty unrealistic (more on this in the next section). This is for a host of structural and technological reasons, and won’t change anytime soon. App developers will have to be comfortable with users complaining about the low quality of their avatars’ design.

These are the building blocks of real human conversational “UI”. We have got pretty used to them after more than 200,000 years of verbal communication.

But the challenge for incumbents is that most users don’t know they want avatars. Nobody says “I love this sales enablement Q&A app, but I wish there was a human avatar to guide me through it”. It’s a bit like gamification: it doesn’t come up in consumer surveys. It’s a matter of revealed preference.

So a PM of a text- or voice-based interface is unlikely to suggest an avatar interface. Most likely, adding one would nuke their core metrics: retention, time spent in app, monetization. For avatars to work properly, they need to take over the whole experience. And most incumbents aren’t ready for that.

What we need from avatars

We are at v0.5 for interactive avatars. They are not at all realistic, and in some cases they’re unsettling. There are a few things we need from them before they actually start working.

In order of importance, the uncanny valley drivers are as follows:

  1. Voice: Voice is by far the most important engagement driver for avatars. No matter how realistic an avatar looks, if the voice is robotic or the tone doesn’t work, the game is up: any chance of emotional connection is gone.

  2. Lip sync: Once the voice works (e.g. using ElevenLabs), lip sync is the second most important factor in eliminating the uncanny valley. We are so used to seeing real human speech that we have become incredibly attuned to whether the lip sync matches the speech. And we are unforgiving when it doesn’t. This was the basis for our investment in Sync Labs.

  3. Latency and interruption management: Interrupting is a core part of human dialogue. Growing up in the family I did, I learned that intimately. Plausible human dialogue cannot be based on a walkie-talkie system. Feedback needs to be instant, and nobody has cracked this yet (see the sketch after this list).

  4. Photorealism: Surprisingly, this is the least important factor. A photorealistic avatar with a robotic voice, a mismatched lip sync and high latency is almost worse than a cartoonish one. In a perfect avatar, photorealism is great, but if anything else about the avatar is imperfect, photorealism digs an even deeper trench in the uncanny valley.
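On point 3, here is a toy illustration of what “barge-in” handling means: the avatar must keep listening while it speaks and cut itself off the instant the user talks over it, rather than taking walkie-talkie turns. All events and timings below are simulated stand-ins, not a real audio stack:

```python
import asyncio

async def avatar_speaks(sentence: str) -> None:
    """Simulated audio playback, one word at a time."""
    for word in sentence.split():
        print(f"Avatar: {word}")
        await asyncio.sleep(0.3)

async def detect_interruption(after: float, interrupted: asyncio.Event) -> None:
    """Stand-in for voice-activity detection firing mid-playback."""
    await asyncio.sleep(after)
    interrupted.set()

async def speak_with_barge_in(sentence: str, interrupt_after: float) -> None:
    interrupted = asyncio.Event()
    speech = asyncio.create_task(avatar_speaks(sentence))
    detector = asyncio.create_task(detect_interruption(interrupt_after, interrupted))
    waiter = asyncio.create_task(interrupted.wait())
    # Keep 'listening' (the waiter) while speaking; react to whichever wins.
    await asyncio.wait({speech, waiter}, return_when=asyncio.FIRST_COMPLETED)
    for task in (speech, detector, waiter):
        task.cancel()  # tidy up whatever is still running
    if interrupted.is_set():
        print("Avatar: (stops mid-sentence and listens)")

asyncio.run(speak_with_barge_in(
    "The Battle of Hastings took place in ten sixty six and", 0.8))
```

A walkie-talkie system would let `avatar_speaks` run to completion before checking the microphone; the difference between the two is exactly the latency and interruption problem described above.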

The interactive avatars of recent years are only possible because of R&D breakthroughs (including ElevenLabs) that allow for 1 and 2.

3 is a work in progress. 4 is the nice-to-have, and is further away.

We are only a few years from super high quality outputs on all four factors. When we achieve this, we will be in the interactive avatar promised land.

Where do we go from here?

It might start with learning and gaming. Then sales and support. Following that, more and more interfaces will inject interactive avatars, until we’re there: avatars have become the next interface.

A big new computing interface doesn’t come along very often, and when one does, it’s a trillion-dollar opportunity. I think we might be at the very start of something similar.

So we are leaning in. If you agree that interactive avatars are the next interface, and are building with avatars (whether app layer, tools, B2B, consumer), please get in touch via DM. We’d be excited to meet you.
