Gloria: Consistent Character Video Generation via Content Anchors

Yuhang Yang1, Fan Zhang2, Huaijin Pi3, Shuai Guo1, Guowei Xu4, Wei Zhai1†, Yang Cao1, Zheng-Jun Zha1

1USTC    2UNSW    3HKU    4UESTC    †Corresponding Author

CVPR 2026
TL;DR We train a video foundation model to generate highly expressive human videos with long-term identity consistency through an anchor-based mechanism, producing character videos exceeding 10 min without noticeable drift.

Abstract

Digital characters are central to modern media, yet generating long-duration character videos with consistent multi-view appearance and expressive identity remains challenging. Existing approaches either provide insufficient context to preserve identity or leverage non-character-centric information as the "memory", leading to suboptimal consistency. Recognizing that character video generation inherently resembles an "outside-looking-in" scenario, we propose representing the character's visual attributes through a compact set of anchor frames. This design provides stable references for consistency, but reference-based video generation inherently faces two challenges: copy-pasting and multi-reference conflicts. To address these, we introduce two mechanisms: Superset Content Anchoring, which provides intra- and extra-training-clip cues to prevent duplication, and RoPE as Weak Condition, which encodes positional offsets to distinguish multiple anchors. Furthermore, we construct a scalable pipeline to extract these anchors from large-scale video data. Experiments show that our method generates high-quality character videos exceeding 10 minutes and achieves expressive identity and appearance consistency across views, surpassing existing methods.

Method

Method Overview
Overview of Gloria. Given the initial image, text and audio prompts, and a set of content anchors (global, viewpoint, and expression anchors), Gloria generates long-duration character videos with consistent identity. The framework incorporates Superset Content Anchoring to prevent copy-paste artifacts and RoPE as Weak Condition to disambiguate multiple anchor references, enabling stable appearance and identity preservation across extended sequences.
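This page does not spell out how the positional offsets in RoPE as Weak Condition are assigned, so the following is only a minimal sketch of the general idea, not Gloria's implementation. It assumes a 1-D temporal rotary position embedding and a hypothetical scheme in which each anchor is placed at its own index beyond the end of the clip. The helper `anchor_positions` and its `gap` parameter are illustrative assumptions.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply a standard 1-D rotary position embedding to an even-length vector."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even feature dimension")
    out = []
    for i in range(0, dim, 2):
        # Each feature pair (i, i + 1) is rotated at its own frequency.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention logit."""
    return sum(x * y for x, y in zip(a, b))


def anchor_positions(num_video_frames, num_anchors, gap=4):
    """Hypothetical offset scheme (an assumption, not from the paper):
    anchor k sits at a distinct temporal index past the last video frame."""
    return [num_video_frames + gap * (k + 1) for k in range(num_anchors)]


# Two anchors with identical content features but different offsets.
anchor_feature = [0.3, -1.2, 0.8, 0.5]
positions = anchor_positions(num_video_frames=16, num_anchors=2)
key_a = rope_rotate(anchor_feature, positions[0])
key_b = rope_rotate(anchor_feature, positions[1])

# A query from video frame 3 now scores the two anchors differently,
# even though their content is the same.
query = rope_rotate([1.0, 0.2, -0.4, 0.9], 3)
score_a = dot(query, key_a)
score_b = dot(query, key_b)
```

Because the rotation only changes the relative phase between query and key, the offset acts as a soft hint about which anchor is which rather than a hard constraint, which is one plausible reading of "weak condition".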

Long-Duration Generation Gallery

The gallery showcases videos generated with global, multi-view, and expression anchors. The character maintains a consistent identity over the long term, even as expressions and viewpoints change throughout the sequence.

[Video gallery, Case 1 of 1. Panels: Text Prompt, First Frame, Global Anchor, Viewpoint Anchor (Viewpoint 1), Expression Anchor (Expression 1), Generated Video.]

Single-Anchor Generation Gallery

Given only one type of content anchor, Gloria is still capable of generating long-duration videos (1-2), videos with diverse expressions (3-6), and videos with multi-view appearances (7-10), showcasing the flexibility and generalization of our anchor-based framework.

[Video gallery, Case 1 of 1. Panels: Text Prompt, First Frame, Global Anchor, Viewpoint Anchor (Viewpoint 1), Expression Anchor (Expression 1), Our Result.]

Long-term Consistency Comparison

We compare Gloria against state-of-the-art baselines on long-term character consistency. All methods share identical inputs. Results demonstrate that Gloria significantly reduces identity drift over long video sequences.

[Video comparison, Case 1 of 1. Inputs: Text Prompt, First Frame, Global Anchor, Viewpoint Anchors (Viewpoint 1 to 4), Expression Anchors (Expression 1 to 4). Methods compared: Hunyuan, Kling, Ours.]

Expressive Identity Consistency Comparison

This section evaluates how well each method preserves character identity under varying facial expressions. All methods receive identical inputs without audio. Gloria maintains expressive fidelity while keeping appearance consistent across emotion transitions.

[Video comparison, Case 1 of 1. Inputs: Text Prompt, First Frame, Global Anchor, Viewpoint Anchors (Viewpoint 1 to 3), Expression Anchors (Expression 1 to 6). Methods compared: Pika, Gen-2, Ours.]

Multi-view Appearance Consistency Comparison (w/o audio input for fairness)

We benchmark Gloria and competing methods on multi-view appearance consistency. All methods use the same inputs without audio. Gloria leverages viewpoint anchors to accurately preserve character appearance across different camera angles and head poses.

[Video comparison, Case 1 of 1. Inputs: Text Prompt, First Frame, Global Anchor, Viewpoint Anchors (Viewpoint 1 to 2), Expression Anchors (Expression 1 to 2). Methods compared: Sora, Runway, Ours.]

Subjective Evaluation Results

User Study Chart 1: Foundation model performance comparison.
User Study Chart 2: Reference-based model performance comparison. (a) Expressive identity consistency; (b) multi-view appearance consistency.

Citation
