June 4, 2026 · 6 min · engineering · api

What is a realtime avatar API? A builder's guide

From a single portrait to a character that answers in live, audio-synced video: the moving parts of a realtime avatar API and what to evaluate before you build on one.

A realtime avatar API turns a still face into a live one. You provide an image (or a short video), attach a voice and a persona, and the API streams back video of that character speaking — fast enough to hold a conversation. It is the difference between generating a clip and talking to someone.

The pipeline under the hood

Every realtime avatar stack is some arrangement of four stages:

  1. Character registration. A portrait or reference video is processed once into a reusable identity — facial geometry, idle motion, the material the renderer animates. Good platforms cache this so later sessions start warm.
  2. Language and persona. An LLM produces the reply in character. This stage is usually yours to configure: the persona, the memory, the model.
  3. Speech synthesis. The reply becomes audio in the character's chosen voice, streamed rather than rendered whole.
  4. Audio-clocked video rendering. The renderer generates frames slaved to the audio timeline, so lips, breath, and micro-expressions land on the syllable. This is the stage that separates realtime systems from offline lip-sync tools.
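
To make the four stages concrete, here is a minimal sketch with stubs in place of the real services. Every name in it (`registerCharacter`, `composeReply`, `synthesize`, `createRenderer`, `runTurn`) is hypothetical and not taken from any particular SDK. The fixed 200 ms audio chunk per word and the 25 fps default are assumptions chosen to keep the arithmetic easy to follow. The part worth reading closely is stage 4: frames are stamped from the audio timeline, not from a free-running video clock.

```typescript
// Hypothetical types and stubs for illustration; this is not a real avatar SDK.
interface Identity { id: string; source: string }
interface AudioChunk { startMs: number; durationMs: number }
interface Frame { ptsMs: number; identityId: string }

// Stage 1: process a portrait once, then serve it from a cache so sessions start warm.
const identityCache: Record<string, Identity> = {};
let registrations = 0;

function registerCharacter(source: string): Identity {
  const cached = identityCache[source];
  if (cached) return cached;
  registrations += 1; // stands in for the expensive one-time processing
  const identity: Identity = { id: "char-" + registrations, source: source };
  identityCache[source] = identity;
  return identity;
}

// Stage 2: a stub in place of the LLM call that writes the in-character reply.
function composeReply(persona: string, userText: string): string {
  return persona + " heard: " + userText;
}

// Stage 3: streamed synthesis, stubbed as one fixed-length audio chunk per word.
function synthesize(
  text: string,
  onChunk: (chunk: AudioChunk) => void,
  chunkMs: number = 200
): void {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  words.forEach((_word, i) => onChunk({ startMs: i * chunkMs, durationMs: chunkMs }));
}

// Stage 4: frames are emitted only up to the end of the audio received so far,
// so video timestamps never run ahead of the audio timeline.
function createRenderer(
  identity: Identity,
  fps: number,
  onFrame: (frame: Frame) => void
): (chunk: AudioChunk) => void {
  const frameMs = 1000 / fps;
  let nextPtsMs = 0;
  return (chunk) => {
    const audioEndMs = chunk.startMs + chunk.durationMs;
    while (nextPtsMs < audioEndMs) {
      onFrame({ ptsMs: nextPtsMs, identityId: identity.id });
      nextPtsMs += frameMs;
    }
  };
}

// One conversational turn: register (or reuse) the character, reply, speak, render.
function runTurn(source: string, persona: string, userText: string, fps: number = 25): Frame[] {
  const frames: Frame[] = [];
  const identity = registerCharacter(source);
  const renderChunk = createRenderer(identity, fps, (frame) => { frames.push(frame); });
  synthesize(composeReply(persona, userText), renderChunk);
  return frames;
}

const frames = runTurn("ada.png", "Ada", "hello there");
console.log(frames.length + " frames for one turn");
```

The stubs are synchronous to keep the sketch short. In a real stack each stage would presumably be asynchronous and streamed over the network, but the ordering constraint stays the same: the renderer consumes audio chunks and derives frame timestamps from them.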

The metrics that actually matter

  • Time to first frame. The pause between the user finishing a sentence and the avatar visibly responding. Under a second feels like a reaction; over two seconds feels like a loading screen.
  • Sync drift. Audio/video offset across a long turn. Drift is more damaging than latency — users forgive a beat of silence, not a mouth that lies.
  • Cadence stability. Frame pacing on imperfect networks. A steady 24 fps beats a jittery 40 fps.
  • Identity persistence. The same face across sessions, devices, and re-renders. Critical for companions and brand characters.
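
The first three metrics above are simple to compute once you log timestamps on the client. The helpers below are a rough sketch, not a standard: the function names, the `SyncSample` shape, and the choice of standard deviation as the jitter measure are all assumptions made for illustration. Timestamps are assumed to be in milliseconds from a single shared clock.

```typescript
// Hypothetical measurement helpers; timestamps are milliseconds from one shared clock.
interface SyncSample { audioMs: number; videoMs: number }

// Time to first frame: gap between the user going quiet and the first rendered frame.
function timeToFirstFrameMs(userStoppedAtMs: number, firstFrameAtMs: number): number {
  return firstFrameAtMs - userStoppedAtMs;
}

// Sync drift: the worst absolute audio/video offset seen during a turn.
function maxSyncDriftMs(samples: SyncSample[]): number {
  return samples.reduce(
    (worst, s) => Math.max(worst, Math.abs(s.videoMs - s.audioMs)),
    0
  );
}

// Cadence stability: standard deviation of the gaps between frame arrivals.
// 0 means perfectly even pacing, whatever the frame rate is.
function frameGapJitterMs(arrivalsMs: number[]): number {
  const gaps: number[] = [];
  arrivalsMs.forEach((t, i) => {
    if (i > 0) gaps.push(t - arrivalsMs[i - 1]);
  });
  if (gaps.length === 0) return 0;
  const mean = gaps.reduce((sum, g) => sum + g, 0) / gaps.length;
  const variance =
    gaps.reduce((sum, g) => sum + (g - mean) * (g - mean), 0) / gaps.length;
  return Math.sqrt(variance);
}

const steady = frameGapJitterMs([0, 40, 80, 120]); // even 40 ms gaps
const jittery = frameGapJitterMs([0, 10, 50, 60]); // same frame count, uneven gaps
console.log(steady < jittery); // true
```

Measuring jitter on the gaps and not on the raw frame rate is what lets a slower but even stream score better than a faster, uneven one.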

What to look for in the API surface

Beyond the demo, evaluate the boring parts: a typed SDK (ours is TypeScript, generated from an OpenAPI spec), explicit session lifecycle so you can meter and cap usage, avatar caching so characters start warm, and usage-based pricing in minutes — the only unit that maps to what your users actually consume. (For reference, our plans anchor live avatar time at about $5 per hour, with a free 5-minute monthly sandbox.) If AI agents are part of your stack, check for an MCP server and an llms.txt; an avatar platform your agents can operate is worth more than one with a prettier dashboard.
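
As a rough illustration of why an explicit session lifecycle and minute-based pricing go together, here is a small client-side meter. `SessionMeter` and `estimateCostUsd` are hypothetical names, not part of any SDK. The rate of 5 USD per hour and the 5-minute cap come from the figures quoted above; the injected clock is an assumption that keeps the sketch testable.

```typescript
// Hypothetical client-side meter; the rate is the approximate figure quoted in the article.
const USD_PER_HOUR = 5;

function estimateCostUsd(minutes: number, usdPerHour: number = USD_PER_HOUR): number {
  return (minutes * usdPerHour) / 60;
}

class SessionMeter {
  private capMinutes: number;
  private now: () => number; // injected clock, returns milliseconds
  private startedAtMs: number | null = null;
  private usedMs = 0;

  constructor(capMinutes: number, now: () => number) {
    this.capMinutes = capMinutes;
    this.now = now;
  }

  // Open a session, refusing if one is already running or the cap is spent.
  start(): void {
    if (this.startedAtMs !== null) throw new Error("session already running");
    if (this.remainingMinutes() <= 0) throw new Error("usage cap reached");
    this.startedAtMs = this.now();
  }

  // Close the session and return total minutes used so far.
  stop(): number {
    if (this.startedAtMs === null) throw new Error("no session running");
    this.usedMs += this.now() - this.startedAtMs;
    this.startedAtMs = null;
    return this.usedMinutes();
  }

  usedMinutes(): number {
    return this.usedMs / 60000;
  }

  remainingMinutes(): number {
    return Math.max(0, this.capMinutes - this.usedMinutes());
  }
}

let clockMs = 0;
const sandbox = new SessionMeter(5, () => clockMs); // 5-minute monthly sandbox
sandbox.start();
clockMs = 3 * 60000; // three minutes of live avatar time pass
console.log(sandbox.stop()); // 3
console.log(estimateCostUsd(90)); // 7.5
```

This sketch checks the cap only when a session starts. A production meter would also need to end a running session once the cap is hit, and the platform's own usage records should stay the source of truth for billing.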

Where builders take it

The use cases we see most: consumer companion apps, sales and corporate training roleplay, livestream hosts that run for hours, support agents with a face, game NPCs with persistent personas, and tutors that teach out loud. One character, designed once, reused across every surface — that reuse is where the economics of a realtime avatar API quietly compound.
