Learning the 3D human-object interaction relation is pivotal to embodied AI and interaction modeling. Most existing methods approach the goal by learning to predict isolated interaction elements, e.g., human contact, object affordance, and human-object spatial relation, primarily from the perspective of either the human or the object. This underexploits the correlations between the interaction counterparts (human and object) and struggles to address the uncertainty in interactions. In fact, an object's functionality potentially affects the human's interaction intention, which reveals what the interaction is; meanwhile, the interacting human and object exhibit matching geometric structures, which indicate how to interact. In light of this, we propose harnessing these inherent correlations between the interaction counterparts to mitigate the uncertainty and jointly anticipate the above interaction elements in 3D space. To achieve this, we present LEMON (LEarning 3D huMan-Object iNteraction relation), a unified model that mines the interaction intentions of both counterparts and employs curvatures to guide the extraction of geometric correlations, combining them to anticipate the interaction elements. In addition, the 3D Interaction Relation dataset (3DIR) is collected to serve as the test bed for training and evaluation. Extensive experiments demonstrate the superiority of LEMON over methods that estimate each element in isolation.
LEMON pipeline. Initially, modality-specific backbones extract the respective features Fh, Fo, Fi, which are then utilized to excavate the intention features (To, Th) of the interaction. With To, Th as conditions, LEMON integrates curvatures (Co, Ch) to model geometric correlations and reveal the contact feature ϕc and affordance feature ϕa. Subsequently, ϕc is injected into the calculation of the object spatial feature ϕp. Eventually, the decoder projects ϕc, ϕa, ϕp to the final outputs.
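To make the data flow concrete, here is a minimal PyTorch sketch of this pipeline. The module names, feature dimensions, attention-based intention mining, and fusion/pooling choices below are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class LEMONSketch(nn.Module):
    """Rough sketch of the LEMON flow described in the caption above.

    All layer choices (shared cross-attention, linear heads, mean pooling)
    are hypothetical placeholders standing in for the actual modules.
    """

    def __init__(self, dim=256):
        super().__init__()
        # Intention mining: condition human/object tokens on the image feature F_i.
        self.intent = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Curvature-guided heads: curvature enters as one extra channel.
        self.contact_head = nn.Linear(dim + 1, 1)   # phi_c: per-vertex human contact
        self.afford_head = nn.Linear(dim + 1, 1)    # phi_a: per-point object affordance
        # Spatial head: contact-conditioned features -> proxy sphere (center, radius).
        self.spatial_head = nn.Linear(dim * 2, 4)

    def forward(self, F_i, F_h, F_o, C_h, C_o):
        # F_i: (B, N_i, dim) image tokens; F_h: (B, V, dim) human vertex features;
        # F_o: (B, P, dim) object point features; C_h: (B, V, 1), C_o: (B, P, 1) curvatures.
        T_h, _ = self.intent(F_h, F_i, F_i)          # human intention feature T_h
        T_o, _ = self.intent(F_o, F_i, F_i)          # object intention feature T_o
        # Curvature-guided correlation: concatenate curvature and predict per-element maps.
        phi_c = torch.sigmoid(self.contact_head(torch.cat([T_h, C_h], dim=-1)))
        phi_a = torch.sigmoid(self.afford_head(torch.cat([T_o, C_o], dim=-1)))
        # Inject the contact evidence into the spatial feature: pool contact-weighted
        # human tokens and concatenate with pooled object tokens.
        h_ctx = (T_h * phi_c).mean(dim=1)
        o_ctx = T_o.mean(dim=1)
        sphere = self.spatial_head(torch.cat([h_ctx, o_ctx], dim=-1))  # (cx, cy, cz, r)
        return phi_c, phi_a, sphere
```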
3DIR Dataset. (a) The quantity of images and point clouds for each object, and a data sample containing the image, mask, dense human contact annotation, 3D object with affordance annotation, and the fitted human mesh with the object proxy sphere. (b) The proportion of our contact annotations across the 24 body parts of SMPL, and distributions of contact vertices for certain HOIs. (c) The ratio of annotated affordance regions to the whole object geometry, and the distribution of this ratio for some categories. (d) Mean distances (unit: m) between annotated object centers and human pelvis joints, and directional projections of annotated centers for several objects.
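For illustration, one annotated 3DIR sample could be organized as below; the field names, shapes, and types are assumptions made for this sketch and do not reflect the released data format.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class ThreeDIRSample:
    """Hypothetical container mirroring one annotated 3DIR sample."""
    image: np.ndarray          # (H, W, 3) RGB interaction image
    mask: np.ndarray           # (H, W) binary mask of the interacting object
    contact: np.ndarray        # (6890,) dense per-vertex contact labels on the SMPL mesh
    object_points: np.ndarray  # (N, 3) object point cloud
    affordance: np.ndarray     # (N,) per-point affordance annotation
    smpl_params: np.ndarray    # fitted SMPL pose/shape parameters for the human mesh
    proxy_center: np.ndarray   # (3,) object proxy sphere center relative to the pelvis
    proxy_radius: float        # object proxy sphere radius
```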
Comparison with SOTA methods that estimate each interaction element in isolation. (a) Results of estimating the human vertices in contact with objects; the estimated contact vertices are shown in yellow. (b) Anticipations of 3D object affordance; the depth of red represents the probability of the anticipated affordance. (c) Two views of the predicted spatial relation; translucent spheres are the object proxies.
LEMON is trained on the 3DIR dataset and tested on the unseen BEHAVE dataset, anticipating human contact, object affordance, and human-object spatial relation.
Multiple Interactions. LEMON anticipates the interaction elements of multiple interactions with the same object.
Multiple Objects. LEMON anticipates distinct results according to the interacting object.
@inproceedings{yang2024lemon,
title={LEMON: Learning 3D Human-Object Interaction Relation from 2D Images},
author={Yang, Yuhang and Zhai, Wei and Luo, Hongchen and Cao, Yang and Zha, Zheng-Jun},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={16284--16295},
year={2024}
}