¹USTC  ²UNSW  ³HKU  ⁴UESTC  †Corresponding Author
Digital characters are central to modern media, yet generating character videos with long-duration, consistent multi-view appearance and expressive identity remains challenging. Existing approaches either provide insufficient context to preserve identity or rely on non-character-centric information as "memory", leading to suboptimal consistency. Recognizing that character video generation inherently resembles an "outside-looking-in" scenario, we propose representing a character's visual attributes through a compact set of anchor frames. This design provides stable references for consistency, but reference-based video generation inherently faces two challenges: copy-pasting artifacts and conflicts among multiple references. To address these, we introduce two mechanisms: Superset Content Anchoring, which provides intra- and extra-training-clip cues to prevent duplication, and RoPE as Weak Condition, which encodes positional offsets to distinguish multiple anchors. Furthermore, we construct a scalable pipeline to extract these anchors from massive video collections. Experiments show that our method generates high-quality character videos exceeding $10$ minutes in length, and achieves expressive identity and appearance consistency across views, surpassing existing methods.
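To illustrate the "RoPE as Weak Condition" idea at a high level: rotary position embedding (RoPE) rotates feature pairs by position-dependent angles, so assigning each anchor frame a distinct positional offset lets attention tell anchors apart without imposing a hard content constraint. The sketch below is a minimal, hypothetical illustration of this principle, not the paper's implementation; the offset values and dimensions are assumptions for demonstration.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding (RoPE) to feature vectors.

    x:   (seq_len, dim) features, dim must be even
    pos: (seq_len,) positions; distinct offsets distinguish anchors
    """
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = np.outer(pos, freqs)               # (seq_len, half) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) feature pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# Hypothetical usage: two anchors receive different positional offsets
# (values chosen arbitrarily here), so their encoded features differ
# even when their raw content is identical.
anchor_a = rope(np.ones((4, 8)), pos=np.arange(4) + 1000)
anchor_b = rope(np.ones((4, 8)), pos=np.arange(4) + 2000)
```

Because rotation preserves vector norms, the offset acts as a "weak" signal: it changes relative phase (and hence attention scores) without altering feature magnitude.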
The gallery showcases generated videos with global, multi-view, and expression anchors, in which the character maintains long-term consistent identity even as expressions transition and viewpoints change throughout the sequence.
Given only one type of content anchor, Gloria is still capable of generating long-duration videos (1-2), videos with diverse expressions (3-6), and videos with multi-view appearances (7-10), demonstrating the flexibility and generalization of our anchor-based framework.
We compare Gloria against state-of-the-art baselines on long-term character consistency. All methods share identical inputs. Results demonstrate that Gloria significantly reduces identity drift over long video sequences.
This section evaluates how well each method preserves character identity under varying facial expressions. All methods receive identical inputs without audio. Gloria maintains expressive fidelity while keeping appearance consistent across emotion transitions.
We benchmark Gloria and competing methods on multi-view appearance consistency. All methods use the same inputs without audio. Gloria leverages viewpoint anchors to accurately preserve character appearance across different camera angles and head poses.