In this work, we construct (a) a large-scale, unified 3D human Gaussian dataset, HGS-1M, to support large-scale training, and (b) a large-scale generation model for 3D human Gaussians. (c) This paradigm, combined with large-scale data, produces high-quality 3D human Gaussians that exhibit complex textures, facial details, and realistic deformation of loose clothing.
3D human digitization has long been a highly pursued yet challenging task. Existing methods aim to generate high-quality 3D digital humans from single or multiple views, but remain primarily constrained by current paradigms and the scarcity of 3D human assets. Specifically, recent approaches fall into several paradigms: optimization-based methods and feed-forward methods (both single-view regression and multi-view generation followed by reconstruction). These are limited, respectively, by slow speed, low quality, cascaded inference, and the ambiguity of mapping low-dimensional planes to high-dimensional space caused by occlusion and invisibility. Furthermore, existing 3D human assets remain small in scale and insufficient for large-scale training. To address these challenges, we propose a latent space generation paradigm for 3D human digitization. By compressing multi-view images into Gaussians via a UV-structured VAE and performing DiT-based conditional generation, we transform the ill-posed low-to-high-dimensional mapping problem into a learnable distribution shift, which also supports end-to-end inference. In addition, we employ a multi-view optimization approach combined with synthetic data to construct HGS-1M, a dataset containing one million 3D Gaussian assets to support large-scale training. Experimental results demonstrate that our paradigm, powered by large-scale training, produces high-quality 3D human Gaussians with intricate textures, facial details, and loose clothing deformation.
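As a rough, self-contained illustration of this paradigm, the PyTorch-style sketch below shows how a conditional DiT could shift noise toward a Gaussian latent that a VAE decoder then maps to 3D human Gaussians in a single pass. All names (UVLatentDecoder, CondDiT, generate), shapes, and the Euler-style denoising loop are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of the latent-space generation paradigm
# (assumed names and shapes; not the authors' actual code).
import torch
import torch.nn as nn

LATENT_TOKENS, LATENT_DIM, GAUSS_PARAMS = 1024, 64, 14  # assumed sizes

class UVLatentDecoder(nn.Module):
    """Stand-in for the VAE decoder: latent tokens -> per-Gaussian parameters
    (e.g., position, scale, rotation, opacity, color) in canonical space."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(LATENT_DIM, GAUSS_PARAMS)

    def forward(self, z):
        return self.head(z)  # (B, LATENT_TOKENS, GAUSS_PARAMS)

class CondDiT(nn.Module):
    """Stand-in for the conditional DiT: condition tokens and noisy latent
    tokens are processed as one joint sequence (MM-DiT style); the timestep
    embedding is omitted for brevity."""
    def __init__(self):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(LATENT_DIM, nhead=4, batch_first=True)

    def forward(self, z_t, cond):
        seq = torch.cat([cond, z_t], dim=1)        # one joint sequence
        return self.block(seq)[:, cond.shape[1]:]  # keep only latent tokens

@torch.no_grad()
def generate(dit, decoder, cond, steps=50):
    # Generation as a learnable distribution shift: start from pure noise in
    # latent space and iteratively move toward the human-Gaussian latent.
    z = torch.randn(cond.shape[0], LATENT_TOKENS, LATENT_DIM)
    for _ in range(steps):
        v = dit(z, cond)   # predicted update direction (flow-style assumption)
        z = z - v / steps  # naive Euler step
    return decoder(z)      # 3D human Gaussians, decoded end to end

gaussians = generate(CondDiT(), UVLatentDecoder(), cond=torch.randn(1, 77, LATENT_DIM))
print(gaussians.shape)  # torch.Size([1, 1024, 14])
```

Note how the condition and noisy latent tokens form a single sequence inside the DiT stand-in, and how decoding happens in one pass rather than through a cascade, mirroring the end-to-end inference claimed above.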
The constructed HGS-1M dataset contains one million 3D Gaussian human assets spanning different ages, races, appearances, and poses, and supports free-view rendering.
(a) The UV-structured VAE, which uses human priors to define learnable tokens in UV space and uses them to query multi-view contexts to model the Gaussian latent; the latent is then decoded into human Gaussians in canonical space and can be driven by differentiable LBS to obtain the final posed human Gaussian. (b) The MM-DiT architecture, which treats the conditional sequence and the noise as a single joint sequence to achieve controllable 3D human Gaussian generation.
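To make component (a) concrete, the hedged PyTorch sketch below shows learnable UV-space tokens cross-attending to multi-view features to form the Gaussian latent, plus a differentiable LBS function that poses canonical Gaussian centers. All module names, tensor shapes, and the SMPL-style joint count are assumptions for illustration only; component (b)'s joint-sequence treatment is already sketched in the CondDiT stand-in above.

```python
# Illustrative sketch of the UV-structured encoder and differentiable LBS
# (assumed names, shapes, and joint count; not the authors' actual code).
import torch
import torch.nn as nn

N_TOKENS, DIM, N_JOINTS = 1024, 64, 24  # assumed: UV tokens, width, SMPL-style joints

class UVStructuredEncoder(nn.Module):
    """(a) Learnable tokens defined in UV space (a human prior) cross-attend
    to multi-view image features to produce the Gaussian latent."""
    def __init__(self):
        super().__init__()
        self.uv_tokens = nn.Parameter(torch.randn(1, N_TOKENS, DIM))
        self.cross_attn = nn.MultiheadAttention(DIM, num_heads=4, batch_first=True)

    def forward(self, mv_feats):  # mv_feats: (B, V*H*W, DIM) multi-view context
        q = self.uv_tokens.expand(mv_feats.shape[0], -1, -1)
        latent, _ = self.cross_attn(q, mv_feats, mv_feats)
        return latent  # Gaussian latent, ordered by UV-space token layout

def lbs_pose(canonical_xyz, skin_weights, joint_transforms):
    """Differentiable LBS: blend per-joint rigid transforms by skinning
    weights to pose canonical Gaussian centers.
    canonical_xyz: (B, N, 3); skin_weights: (B, N, J);
    joint_transforms: (B, J, 4, 4) homogeneous transforms."""
    T = torch.einsum('bnj,bjrc->bnrc', skin_weights, joint_transforms)  # (B, N, 4, 4)
    homo = torch.cat([canonical_xyz, torch.ones_like(canonical_xyz[..., :1])], dim=-1)
    return torch.einsum('bnrc,bnc->bnr', T, homo)[..., :3]  # posed centers
```

Because posing is a weighted blend of rigid transforms, gradients can flow from posed renderings back to the canonical Gaussians, which is what allows the decoded canonical representation to be driven to arbitrary poses.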