Learning the 3D human-object interaction relation is pivotal to embodied AI and interaction modeling. Most existing methods approach the goal by learning to predict isolated interaction elements, e.g., human contact, object affordance, and human-object spatial relation, primarily from the perspective of either the human or the object. This underexploits the correlations between the interaction counterparts (human and object) and struggles to address the uncertainty in interactions. In fact, an object's functionality potentially affects the human's interaction intention, which reveals what the interaction is; meanwhile, the interacting human and object exhibit matching geometric structures, which indicate how to interact. In light of this, we propose harnessing these inherent correlations between the interaction counterparts to mitigate the uncertainty and jointly anticipate the above interaction elements in 3D space. To achieve this, we present LEMON (LEarning 3D huMan-Object iNteraction relation), a unified model that mines the interaction intentions of both counterparts and employs curvatures to guide the extraction of geometric correlations, combining them to anticipate the interaction elements. In addition, the 3D Interaction Relation dataset (3DIR) is collected to serve as the test bed for training and evaluation. Extensive experiments demonstrate the superiority of LEMON over methods that estimate each element in isolation.
LEMON pipeline. Initially, modality-specific backbones extract the respective features Fh, Fo, Fi, which are then utilized to excavate the intention features (To, Th) of the interaction. With To, Th as conditions, LEMON integrates curvatures (Co, Ch) to model geometric correlations and reveal the contact feature ϕc and affordance feature ϕa. Subsequently, ϕc is injected into the calculation of the object spatial feature ϕp. Eventually, the decoder projects ϕc, ϕa, ϕp to the final outputs.
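To make the data flow concrete, here is a minimal PyTorch sketch of this pipeline. The module names, feature dimensions, attention-based intention mining, and fusion/pooling choices below are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class LEMONSketch(nn.Module):
    """Rough sketch of the LEMON flow described in the caption above.

    All layer choices (shared cross-attention, linear heads, mean pooling)
    are hypothetical placeholders standing in for the actual modules.
    """

    def __init__(self, dim=256):
        super().__init__()
        # Intention mining: condition human/object tokens on the image feature F_i.
        self.intent = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Curvature-guided heads: curvature enters as one extra channel.
        self.contact_head = nn.Linear(dim + 1, 1)   # phi_c: per-vertex human contact
        self.afford_head = nn.Linear(dim + 1, 1)    # phi_a: per-point object affordance
        # Spatial head: contact-conditioned features -> proxy sphere (center, radius).
        self.spatial_head = nn.Linear(dim * 2, 4)

    def forward(self, F_i, F_h, F_o, C_h, C_o):
        # F_i: (B, N_i, dim) image tokens; F_h: (B, V, dim) human vertex features;
        # F_o: (B, P, dim) object point features; C_h: (B, V, 1), C_o: (B, P, 1) curvatures.
        T_h, _ = self.intent(F_h, F_i, F_i)          # human intention feature T_h
        T_o, _ = self.intent(F_o, F_i, F_i)          # object intention feature T_o
        # Curvature-guided correlation: concatenate curvature and predict per-element maps.
        phi_c = torch.sigmoid(self.contact_head(torch.cat([T_h, C_h], dim=-1)))
        phi_a = torch.sigmoid(self.afford_head(torch.cat([T_o, C_o], dim=-1)))
        # Inject the contact evidence into the spatial feature: pool contact-weighted
        # human tokens and concatenate with pooled object tokens.
        h_ctx = (T_h * phi_c).mean(dim=1)
        o_ctx = T_o.mean(dim=1)
        sphere = self.spatial_head(torch.cat([h_ctx, o_ctx], dim=-1))  # (cx, cy, cz, r)
        return phi_c, phi_a, sphere
```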
3DIR Dataset. (a) The quantity of images and point clouds for each object, and a data sample containing the image, mask, dense human contact annotation, 3D object with affordance annotation, and the fitted human mesh with the object proxy sphere. (b) The proportion of our contact annotations across the 24 body parts of SMPL, and distributions of contact vertices for certain HOIs. (c) The ratio of annotated affordance regions to the whole object geometry, and the distribution of this ratio for some categories. (d) Mean distances (unit: m) between annotated object centers and human pelvis joints, and directional projections of annotated centers for several objects.
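For illustration, one annotated 3DIR sample could be organized as below; the field names, shapes, and types are assumptions made for this sketch and do not reflect the released data format.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class ThreeDIRSample:
    """Hypothetical container mirroring one annotated 3DIR sample."""
    image: np.ndarray          # (H, W, 3) RGB interaction image
    mask: np.ndarray           # (H, W) binary mask of the interacting object
    contact: np.ndarray        # (6890,) dense per-vertex contact labels on the SMPL mesh
    object_points: np.ndarray  # (N, 3) object point cloud
    affordance: np.ndarray     # (N,) per-point affordance annotation
    smpl_params: np.ndarray    # fitted SMPL pose/shape parameters for the human mesh
    proxy_center: np.ndarray   # (3,) object proxy sphere center relative to the pelvis
    proxy_radius: float        # object proxy sphere radius
```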
Comparison with SOTA methods that estimate each interaction element in isolation. (a) Results of estimating the human vertices in contact with objects; the estimated contact vertices are shown in yellow. (b) Anticipations of 3D object affordance; the depth of red represents the probability of the anticipated affordance. (c) Two views of the predicted spatial relation; translucent spheres are the object proxies.
LEMON is trained on the 3DIR dataset and tested on the unseen BEHAVE dataset, anticipating human contact, object affordance, and human-object spatial relation.
Multiple Interactions. LEMON anticipates the interaction elements of multiple interactions with the same object.
Multiple Objects. LEMON anticipates distinct results according to the interacting object.
@inproceedings{yang2024lemon,
title={LEMON: Learning 3D Human-Object Interaction Relation from 2D Images},
author={Yang, Yuhang and Zhai, Wei and Luo, Hongchen and Cao, Yang and Zha, Zheng-Jun},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={16284--16295},
year={2024}
}