Multimodal AI for Conversational Human Motion


Written by Christian Safka and Keyu Chen

In this exploration we’ll look at how multimodal models are changing the game for conversational AI agents, and how we can enable seamless interaction in various environments using perception, memory, behavior modeling, and rendering in real-time.

The outline of this one-pager:

  • Why multimodal?
  • Deep dive into human motion pipeline
  • Challenges in training
  • Current use cases and the future

Why multimodal?

At a high level, the three “layers” we need to achieve life-like human conversation are input perception, motion planning, and avatar rendering. As of this writing, most pipelines in academia separate these layers with text as an intermediate.

What multimodal models unlock is a decrease in information loss between these layers.

Deep dive into the human motion pipeline

Generating human-like actions and reactions is a difficult problem. It requires a pipeline that processes real-time cues from multiple sources, then interprets, translates, and generates synchronized responses. All stages are critical for creating avatars that can engage in fluid, contextual conversations.

We talked about the three layers:

1. Input Perception — Gathering multimodal cues from visual, auditory, and text-based sources.

2. Motion Planning — Determining appropriate actions or reactions based on these inputs.

3. Avatar Output — Rendering these planned actions with an avatar in real-time.
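As a rough sketch of how these three layers might hand off to one another in a streaming setting, here is a minimal loop in Python. The names (Cues, perceive, plan, render) are placeholders for this post, not an actual API from any library:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Cues:
    """Multimodal observations for one time step (illustrative fields)."""
    video_frame: Optional[bytes] = None
    audio_chunk: Optional[bytes] = None
    transcript: str = ""

def perceive(cues: Cues) -> list:
    """Encode raw inputs into a shared token sequence (stub)."""
    return []

def plan(tokens: list) -> dict:
    """Planner (LLM) stub: map perception tokens to a speech + motion plan."""
    return {"speech": [], "motion": []}

def render(planned: dict) -> None:
    """Renderer stub: play audio and drive the avatar's face and body."""
    pass

def conversation_loop(cue_stream):
    # The three layers run continuously, not once per utterance, so the
    # avatar can react while the user is still speaking.
    for cues in cue_stream:
        render(plan(perceive(cues)))
```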

Now let's break down how each layer plays a crucial role in creating life-like conversations.

Perception of Multimodal Input

Effective human motion synthesis begins with understanding multimodal cues, much like humans rely on sight, sound, and language for communication. In digital applications, this process can replicate the complex ways in which humans gather and respond to information:

  • Visual Inputs: Images and video streams capture elements like facial expressions, gaze direction, and hand gestures
  • Auditory Inputs: Audio signals provide essential information, such as tone, intonation, and rhythm, enabling us to interpret the emotional context of speech
  • Text Inputs: Text-based prompts or conversation logs can guide the avatar’s actions by providing semantic context — knowing what’s being discussed allows the avatar to respond appropriately to the nuances of conversation

Integrating these modalities creates a holistic understanding of the conversational setting, providing a foundation for how the system interprets and maps the world.
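One simple way to picture this integration (an illustration we are assuming for this post, not a description of any particular model) is to encode each modality into tokens and interleave them by capture time, so the downstream planner sees a single time-ordered sequence:

```python
from dataclasses import dataclass

@dataclass
class Token:
    modality: str   # "video", "audio", or "text"
    t: float        # capture time in seconds
    value: int      # id in that modality's codebook (illustrative)

def interleave(*streams):
    """Merge per-modality token streams into one time-ordered sequence."""
    merged = [tok for stream in streams for tok in stream]
    return sorted(merged, key=lambda tok: tok.t)

# Toy example: a smile seen at 0.4 s lands between two audio tokens, so the
# planner can associate the expression with what was being said at that moment.
sequence = interleave(
    [Token("video", 0.4, 17)],                           # facial expression code
    [Token("audio", 0.3, 90), Token("audio", 0.5, 91)],  # speech codec tokens
    [Token("text", 0.5, 1024)],                          # ASR word piece
)
print([(tok.modality, tok.t) for tok in sequence])
```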

Motion Planning with LLMs

In multimodal AI, the interaction layer — often powered by large language models (LLMs) — acts as the avatar’s “brain.” This layer processes the synthesized multimodal cues from the perception stage, determines the most contextually relevant response, and translates it into a planned motion or verbal response.

Using both speech and visual features as input allows the model to handle:

  • Contextual Motion Planning: The model can pick up on conversational cues, matching them to actions that are contextually appropriate. For instance, if an avatar detects enthusiasm in a user’s speech, it might adopt an open, engaging posture or facial expression
  • Sequential Interaction Control: The model can learn to interpret sequences of cues, allowing it to handle nuances like turn-taking, active listening gestures, and pauses, which are all critical for a natural conversation

Previous works such as Zhou et al. [0] or Pereira et al. [1] would output text from this layer: emotion labels like “happy” that can then be used for conditional expression generation. This is very lossy, and the resulting expressions will never be fully aligned with the output speech.

The beauty of multimodality in motion planning lies in both the input and the output. On the input side, we can draw on the language model's vast world knowledge, even as it is trained to align multimodal tokens. On the output side, we can reduce the information loss between the desired behavior and the final rendered output.
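To make the contrast concrete, here is a toy sketch (our own illustration, not the method of [0] or [1]) of the two output interfaces: a single emotion label versus interleaved speech and motion tokens that preserve timing for the renderer:

```python
# Label-based interface: everything about *how* to respond collapses into
# one string, and all timing information is lost.
def plan_with_label(perception_tokens) -> str:
    return "happy"  # downstream must guess when and how strongly to smile

# Multimodal interface: the planner emits a time-ordered mix of speech and
# motion tokens, so the renderer knows exactly where the smile lands
# relative to the audio.
def plan_multimodal(perception_tokens) -> list:
    return [
        ("speech", 411),  # codec token for the next slice of audio
        ("speech", 87),
        ("motion", 12),   # smile onset, aligned with the surrounding speech
        ("speech", 903),
        ("motion", 13),   # slight nod
    ]
```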

To summarize, the interaction layer enables the avatar to respond to both explicit and implicit conversational cues, bridging the gap between multimodal perception and human-like interaction.

Avatar Generation

To achieve empathic conversational AI or a human-level information flow, the rendered actions and reactions need to go beyond static, pre-planned motions. The goal is to create a system that can interpret and adjust to subtle conversational cues almost instantaneously.

In this context, the avatar layer acts as the output rendering mechanism. It takes the actions planned by the interaction layer and translates them into smooth, real-time behaviors. This layer focuses on low-latency response generation, prioritizing rapid and accurate alignment between desired actions and visual/audio output.

The primary objective can be described as Synchronized Speech and Motion — the avatar must coordinate facial expressions, body language, and lip movements using the auditory output and behavioral signals, ensuring that all elements stay in sync.

Maintaining temporal consistency and synchronization is vital, as any delays or mismatches in behavior can quickly break immersion.
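A minimal way to think about that synchronization (illustrative only, with made-up frame sizes) is to drive both audio and motion from a single shared clock, waiting when the pipeline runs ahead and dropping motion frames when it falls behind:

```python
import time

FRAME_S = 0.02  # assume 20 ms of audio per step, with one motion frame each

def play_audio(chunk):    # stand-in for a real audio sink
    pass

def apply_motion(frame):  # stand-in for the avatar renderer
    pass

def stream(audio_chunks, motion_frames):
    """Emit audio and motion against one clock so the streams never drift."""
    start = time.monotonic()
    for i, (chunk, frame) in enumerate(zip(audio_chunks, motion_frames)):
        lag = time.monotonic() - (start + i * FRAME_S)
        if lag > FRAME_S:
            # More than one frame behind: skip this pose rather than let
            # the face trail further behind the audio.
            play_audio(chunk)
            continue
        if lag < 0:
            time.sleep(-lag)  # ahead of schedule: wait for the shared clock
        play_audio(chunk)
        apply_motion(frame)
```

Prioritizing audio continuity over individual motion frames is a common trade-off here: listeners notice stuttering speech immediately, while a single dropped pose is usually invisible.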

Challenges in training

Some of the active R&D areas in industry and academia are:

  • Token Alignment Across Modalities: Aligning modalities like visual cues and audio intonations without losing contextual or semantic meaning is complex, and the model must learn to represent them in a unified way to produce consistent responses
  • Latency Management: Real-time responsiveness requires the entire multimodal pipeline to operate with low latency, which becomes challenging as complexity increases (see the rough budget sketch after this list)
  • Personality and Memory: Consistent personality traits are essential for avatars, especially in prolonged interactions, and proper handling of memory and personality can be essential for maintaining coherent responses in some use-cases
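As a back-of-the-envelope sketch of the latency point, the numbers below are invented placeholders rather than measurements, but they show how quickly per-stage costs eat into a conversational turn-taking budget:

```python
# Hypothetical per-stage latencies in milliseconds; real numbers depend
# entirely on the models, hardware, and network involved.
stage_ms = {
    "capture + perception encoding": 60,
    "planner (LLM) time to first token": 180,
    "speech + motion decoding": 120,
    "rendering + network delivery": 80,
}

BUDGET_MS = 500  # assumed target for a natural-feeling conversational turn

total = sum(stage_ms.values())
print(f"total {total} ms vs. budget {BUDGET_MS} ms "
      f"({'within budget' if total <= BUDGET_MS else 'over budget'})")
```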

Current use-cases and the future

First, a few examples of current use cases we’re seeing:

  • Healthcare: Imagine an empathetic avatar as a virtual health coach that provides guidance, responds in real-time, and adapts its tone and expressions to suit the user’s mood
  • Customer Support: Support avatars can interpret vocal cues and body language, and even view the user's technical problem via a screenshare or live video, offering responses that feel attentive and personalized and reducing user frustration
  • Educational Tools: Tutors with real-time interaction capabilities can engage with students, display attentive gestures, and modulate their expressions to reinforce encouragement or correction

As research advances, these applications will expand, allowing digital humans to be deployed in increasingly nuanced, high-stakes environments. Human-level conversational avatars will unlock both empathic use-cases and a high-information-flow human-computer interface.

If tackling challenges like modality alignment, latency, and contextual coherence sounds interesting to you, we're hiring! Check us out at https://tavus.io

References

[0] Zhou, Hao, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. “Emotional chatting machine: Emotional conversation generation with internal and external memory.” In Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1. 2018.

[1] Pereira, Patrícia, Helena Moniz, and Joao Paulo Carvalho. “Deep emotion recognition in textual conversations: A survey.” Artificial Intelligence Review 58, no. 1 (2025): 1–37.
