SIGMAN: Scaling 3D Human Gaussian Generation with Millions of Assets

1 USTC 2 Shanghai AI Lab 3 SJTU 4 CMU
* Equal Contribution   † Corresponding Author

In this work, we construct (a) HGS-1M, a large-scale, unified 3D human Gaussian dataset, which supports (b) a large-scale model for 3D human Gaussian generation. (c) This paradigm, combined with large-scale data, produces high-quality 3D human Gaussians that exhibit complex textures, facial details, and realistic deformation of loose clothing.

Abstract

3D human digitization has long been a highly pursued yet challenging task. Existing methods aim to generate high-quality 3D digital humans from single or multiple views, but remain primarily constrained by current paradigms and the scarcity of 3D human assets. Specifically, recent approaches fall into two paradigms: optimization-based and feed-forward (the latter covering single-view regression and multi-view generation with reconstruction). However, they are limited by slow speed, low quality, cascaded reasoning, and the ambiguity of mapping low-dimensional planes to high-dimensional space under occlusion and invisibility, respectively. Furthermore, existing 3D human assets remain small in scale and insufficient for large-scale training. To address these challenges, we propose a latent-space generation paradigm for 3D human digitization: multi-view images are compressed into Gaussians via a UV-structured VAE, and a DiT performs conditional generation in this latent space. This transforms the ill-posed low-to-high-dimensional mapping problem into a learnable distribution shift and supports end-to-end inference. In addition, we combine a multi-view optimization approach with synthetic data to construct the HGS-1M dataset, which contains 1 million 3D Gaussian assets to support large-scale training. Experimental results demonstrate that our paradigm, powered by large-scale training, produces high-quality 3D human Gaussians with intricate textures, facial details, and loose-clothing deformation.
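To make the paradigm concrete, below is a minimal, heavily simplified sketch of the inference flow the abstract describes: a conditional DiT denoises a Gaussian latent, and a UV-structured VAE decoder maps the clean latent to per-Gaussian parameters. All module names, shapes, and the Euler-style sampling loop are illustrative assumptions, not the actual SIGMAN implementation.

```python
# Illustrative sketch only: tiny stand-ins for the DiT denoiser and the
# UV-structured VAE decoder; shapes and the sampler are assumptions.
import torch
import torch.nn as nn

class TinyDiT(nn.Module):
    """Stand-in for the conditional DiT denoiser."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 2 + 1, 128), nn.GELU(),
                                 nn.Linear(128, dim))
    def forward(self, z_t, cond, t):
        # Predict noise from the noisy latent, condition tokens, and timestep.
        t_emb = t.expand(z_t.shape[0], z_t.shape[1], 1)
        return self.net(torch.cat([z_t, cond, t_emb], dim=-1))

class TinyGaussianDecoder(nn.Module):
    """Stand-in for the VAE decoder: latent -> per-Gaussian parameters."""
    def __init__(self, dim=64):
        super().__init__()
        # 3 (xyz) + 4 (rotation) + 3 (scale) + 1 (opacity) + 3 (color)
        self.head = nn.Linear(dim, 14)
    def forward(self, z):
        return self.head(z)  # one Gaussian per latent token

@torch.no_grad()
def generate(dit, decoder, cond, steps=50, tokens=1024, dim=64):
    """Denoise the Gaussian latent end-to-end, then decode it."""
    z = torch.randn(1, tokens, dim)
    for i in reversed(range(steps)):
        t = torch.full((1, 1, 1), i / steps)
        eps = dit(z, cond, t)
        z = z - eps / steps          # crude Euler step, illustrative only
    return decoder(z)                # (1, tokens, 14) Gaussian parameters

cond = torch.randn(1, 1024, 64)      # e.g. image-encoder tokens (assumed)
gaussians = generate(TinyDiT(), TinyGaussianDecoder(), cond)
print(gaussians.shape)               # torch.Size([1, 1024, 14])
```

The point of the sketch is the single forward path: once the latent is denoised, decoding to Gaussians is one pass, with no cascaded multi-view reconstruction stage.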

HGS-1M Dataset

The constructed HGS-1M dataset contains 1 million 3D Gaussian human assets spanning diverse ages, ethnicities, appearances, and poses, and supports free-view rendering.
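As a rough illustration of what a single asset in such a dataset might hold, the sketch below uses the standard 3D Gaussian Splatting parameterization (centers, rotations, scales, opacities, and color coefficients). The field layout, counts, and dtypes are assumptions based on common 3DGS conventions, not a documented HGS-1M schema.

```python
# Assumed per-asset layout, following standard 3D Gaussian Splatting fields.
from dataclasses import dataclass
import numpy as np

@dataclass
class HumanGaussianAsset:
    xyz: np.ndarray        # (N, 3)  Gaussian centers in canonical space
    rotation: np.ndarray   # (N, 4)  unit quaternions
    scale: np.ndarray      # (N, 3)  per-axis log scales
    opacity: np.ndarray    # (N, 1)  pre-sigmoid opacities
    sh: np.ndarray         # (N, 3)  degree-0 SH (RGB) coefficients

def random_asset(n=100_000):
    """Create a dummy asset with the assumed layout, for illustration."""
    return HumanGaussianAsset(
        xyz=np.random.randn(n, 3).astype(np.float32),
        rotation=np.tile([1, 0, 0, 0], (n, 1)).astype(np.float32),
        scale=np.zeros((n, 3), np.float32),
        opacity=np.zeros((n, 1), np.float32),
        sh=np.random.rand(n, 3).astype(np.float32),
    )

asset = random_asset()
print(asset.xyz.shape, asset.sh.shape)  # (100000, 3) (100000, 3)
```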


3D Human Gaussian Assets



4D Human Gaussian Assets


Pipeline

(a) The UV-structured VAE uses human priors to define learnable tokens in UV space, which query multi-view contexts to model the Gaussian latent; the latent is then decoded into human Gaussians in canonical space, which can be driven by differentiable LBS to obtain the final posed human Gaussians. (b) The MM-DiT architecture treats the conditional sequence and the noise as a single sequence to perform controllable 3D human Gaussian generation.
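The sketch below illustrates the two mechanisms named in (a), under assumed shapes and module choices: learnable UV tokens cross-attending to multi-view image features to form the Gaussian latent, and differentiable LBS posing the decoded canonical Gaussian centers. It is a minimal stand-in, not the paper's architecture; part (b) is a standard joint-sequence DiT and is omitted here.

```python
# Illustrative sketch: UV-token cross-attention encoder + differentiable LBS.
# All shapes, head counts, and joint counts are assumptions.
import torch
import torch.nn as nn

class UVQueryEncoder(nn.Module):
    def __init__(self, n_tokens=1024, dim=64):
        super().__init__()
        # Learnable tokens laid out on the human UV map (assumed: one token
        # per UV cell) that query the multi-view context.
        self.uv_tokens = nn.Parameter(torch.randn(1, n_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, mv_feats):
        # mv_feats: (B, V*HW, dim) flattened multi-view image features.
        q = self.uv_tokens.expand(mv_feats.shape[0], -1, -1)
        latent, _ = self.attn(q, mv_feats, mv_feats)  # cross-attention
        return latent  # (B, n_tokens, dim) Gaussian latent

def lbs(xyz_canonical, skin_weights, joint_transforms):
    """Differentiable LBS: blend per-joint rigid transforms.

    xyz_canonical:    (N, 3) canonical Gaussian centers
    skin_weights:     (N, J) per-Gaussian skinning weights (rows sum to 1)
    joint_transforms: (J, 4, 4) posed joint transforms
    """
    # Blend the 4x4 transforms with the skinning weights, then apply them
    # to homogeneous coordinates of the canonical centers.
    T = torch.einsum('nj,jab->nab', skin_weights, joint_transforms)  # (N,4,4)
    homo = torch.cat([xyz_canonical,
                      torch.ones_like(xyz_canonical[:, :1])], dim=-1)
    return torch.einsum('nab,nb->na', T, homo)[:, :3]  # posed centers

enc = UVQueryEncoder()
latent = enc(torch.randn(2, 4 * 256, 64))     # 4 views, 16x16 features each
xyz = torch.randn(1024, 3)
w = torch.softmax(torch.randn(1024, 24), -1)  # 24 joints, e.g. SMPL-like
T = torch.eye(4).repeat(24, 1, 1)             # identity pose
print(latent.shape, lbs(xyz, w, T).shape)     # (2,1024,64) (1024,3)
```

Because the blending is a plain einsum over transforms, gradients flow from the posed Gaussians back to the canonical ones, which is what lets posed renderings supervise the canonical-space decoder.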


Results Gallery (3D Gaussian)



Input images (Examples 1 and 2) and the corresponding generated 3D Gaussians, rendered from multiple viewpoints.

Results Gallery (Driven Sequence)



Gaussian sequences (four examples) driven from the generated assets.


Text-to-3D Gaussian



The female has dark hair styled simply. She is wearing a dark, slightly oversized hoodie. Her lower body is clad in a long, black skirt that is smooth in texture. She is wearing black sandals.


The young male has a short hairstyle and a standard build. He is wearing a light blue t-shirt made of plain fabric. His pants are light gray and have a comfortable, relaxed fit.


The young male has a medium skin tone and a short hairstyle styled upward. He is wearing a long-sleeved, light blue raglan t-shirt. The pants are blue-and-white checkered plaid.


The male's hairstyle is short and neat. He is wearing a checkered shirt paired with light blue jeans. On his feet, he wears brown shoes.

BibTeX