¹USTC  ²UNSW  ³HKU  ⁴UESTC  †Corresponding Author
Digital characters are central to modern media, yet generating character videos with long-duration, consistent multi-view appearance and expressive identity remains challenging. Existing approaches either provide insufficient context to preserve identity or rely on non-character-centric information as "memory", leading to suboptimal consistency. Recognizing that character video generation inherently resembles an "outside-looking-in" scenario, we propose representing a character's visual attributes through a compact set of anchor frames. This design provides stable references for consistency, but reference-based video generation inherently faces two challenges: copy-pasting artifacts and conflicts among multiple references. To address these, we introduce two mechanisms: Superset Content Anchoring, which provides intra- and extra-training-clip cues to prevent duplication, and RoPE as Weak Condition, which encodes positional offsets to distinguish multiple anchors. Furthermore, we construct a scalable pipeline to extract these anchors from massive video collections. Experiments show that our method generates high-quality character videos exceeding $10$ minutes in length, and achieves expressive identity and appearance consistency across views, surpassing existing methods.
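To illustrate the "RoPE as Weak Condition" idea at a high level: rotary position embedding (RoPE) rotates feature pairs by position-dependent angles, so assigning each anchor frame a distinct positional offset lets attention tell anchors apart without imposing a hard content constraint. The sketch below is a minimal, hypothetical illustration of this principle, not the paper's implementation; the offset values and dimensions are assumptions for demonstration.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding (RoPE) to feature vectors.

    x:   (seq_len, dim) features, dim must be even
    pos: (seq_len,) positions; distinct offsets distinguish anchors
    """
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = np.outer(pos, freqs)               # (seq_len, half) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) feature pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# Hypothetical usage: two anchors receive different positional offsets
# (values chosen arbitrarily here), so their encoded features differ
# even when their raw content is identical.
anchor_a = rope(np.ones((4, 8)), pos=np.arange(4) + 1000)
anchor_b = rope(np.ones((4, 8)), pos=np.arange(4) + 2000)
```

Because rotation preserves vector norms, the offset acts as a "weak" signal: it changes relative phase (and hence attention scores) without altering feature magnitude.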
The gallery showcases generated videos with global, multi-view, and expression anchors, in which the character maintains long-term consistent identity even as expressions transition and viewpoints change throughout the sequence.
Given only one type of content anchor, Gloria is still capable of generating long-duration videos (1-2), videos with diverse expressions (3-6), and videos with multi-view appearances (7-10), demonstrating the flexibility and generalization of our anchor-based framework.
We compare Gloria against state-of-the-art baselines on long-term character consistency. All methods share identical inputs. Results demonstrate that Gloria significantly reduces identity drift over long video sequences.
This section evaluates how well each method preserves character identity under varying facial expressions. All methods receive identical inputs without audio. Gloria maintains expressive fidelity while keeping appearance consistent across emotion transitions.
We benchmark Gloria and competing methods on multi-view appearance consistency. All methods use the same inputs without audio. Gloria leverages viewpoint anchors to accurately preserve character appearance across different camera angles and head poses.