UM6P

Mohammed VI Polytechnic University is an institution dedicated to research and innovation in Africa and aims to position itself among world-renowned universities in its fields.

The University is engaged in economic and human development and puts research and innovation at the forefront of African development. This enables it to consolidate Morocco’s frontline position in these fields through a unique partnership-based approach and by boosting skills training relevant to the future of Africa.

Located in the municipality of Benguerir, in the very heart of the Green City, Mohammed VI Polytechnic University aspires to leave its mark nationally, continentally, and globally.

Keywords: Hand-Object Interaction, 3D Grasp Generation, 3D Motion Generation, Deep Generative Models

Research Motivation

Hands are dexterous and versatile manipulators essential to human interaction with objects and the environment. Accurately modeling realistic hand-object interactions, including the subtle movements of individual fingers, therefore has the potential to aid robots in learning human-robot interactions through simulation and to improve the realism of virtual manipulation experiences. In virtual reality, for example, hand interactions still often depend on heuristics and controllers that attach objects to the hand based on predefined grasps. Faithfully reproducing object manipulation from input signals, such as natural language or a few previous frames of hand and object poses, could significantly boost the immersiveness of these interactions.

In this context, we propose a thesis that will contribute to the task of generating human hand-object interactions. Our aim is to leverage deep generative models, reinforcement learning, and recent advances in large language models to investigate new solutions to the problem of synthesizing 3D hand-object interactions controlled by text prompts and the geometry of the object.

Problem Statement

Synthesizing realistic hand-object interactions in 3D comes with various challenges, given that the resulting motions should satisfy several constraints: (1) The motions must be geometrically plausible, minimizing hand-object intersections and ensuring that the grasp appears stable. (2) The motions must be semantically plausible, with hands respecting natural object affordances (e.g., grasping a cup by its handle rather than flipping it upside down). (3) The motions must be temporally consistent, with hand and object movements synchronized and the dynamics appearing natural. Addressing these challenges requires finding a suitable representation that better models interaction, contact, and collision, and that avoids artifacts such as hand-object interpenetration or implausible contact points.
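As an illustration of constraint (1), the sketch below shows one simple way geometric plausibility could be quantified. It is not taken from any of the cited works. It assumes the object is given as a point cloud with outward unit normals and the hand as a set of 3D points, and it approximates the signed distance by a brute-force nearest-neighbor lookup. All function names and the 5 mm contact threshold are hypothetical choices made for this example.

```python
import math

def signed_distance(point, cloud, normals):
    """Approximate signed distance from `point` to the object surface.

    Assumption: `normals[i]` is the outward unit normal at `cloud[i]`.
    The result is negative when the point lies inside the object.
    """
    # Brute-force nearest surface sample (fine for small illustrative inputs).
    i = min(range(len(cloud)), key=lambda k: math.dist(point, cloud[k]))
    offset = [p - c for p, c in zip(point, cloud[i])]
    dist = math.dist(point, cloud[i])
    # The point is inside if it lies behind the outward normal.
    inside = sum(o * n for o, n in zip(offset, normals[i])) < 0.0
    return -dist if inside else dist

def plausibility_metrics(hand_points, cloud, normals, contact_eps=0.005):
    """Return (max penetration depth, fraction of hand points in contact).

    `contact_eps` (in meters) is a hypothetical contact threshold.
    """
    dists = [signed_distance(p, cloud, normals) for p in hand_points]
    max_penetration = max(0.0, -min(dists))
    contact_ratio = sum(abs(d) <= contact_eps for d in dists) / len(dists)
    return max_penetration, contact_ratio

# Toy object: six samples of a sphere of radius 5 cm, with outward normals.
axes = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
cloud = [tuple(0.05 * a for a in axis) for axis in axes]
normals = [tuple(float(a) for a in axis) for axis in axes]

# Toy hand: one point on the surface, one 1 cm inside, one far outside.
hand = [(0.05, 0.0, 0.0), (0.04, 0.0, 0.0), (0.2, 0.0, 0.0)]
max_pen, ratio = plausibility_metrics(hand, cloud, normals)
print(round(max_pen, 4), round(ratio, 2))
```

A low penetration depth combined with a reasonable contact ratio is only a rough proxy for a stable grasp. In practice, a dense mesh or a learned signed distance field would replace the brute-force lookup used here.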

The limited scale of existing hand-object datasets is another challenge in generating hand-object interactions. This limited data also affects the ability of trained models to generalize to unseen objects. Generalization is both critical and challenging, since different object shapes require different types of interaction and hand grasps, such as a power grasp of an apple, a delicate three-finger pinch of a cup handle, or a bimanual grasp of binoculars.

To tackle the above challenges, recent studies exploit diffusion models [1, 2], while others turn to reinforcement learning to learn from physical simulation [3]. Rather than focusing on hands alone, other recent studies tackle hand interaction while considering the motion of the whole human body [4, 5, 6]. However, all these approaches still encounter various issues, such as limited generalization ability, high computation time, the need for an initial hand pose, or modeling only single-hand interactions.

In this thesis, we aim to propose new models and approaches to understand and synthesize human hand interactions. More specifically, we aim to address the following questions: (1) Given a 3D point cloud of an object, how can we generate a plausible hand pose that grasps and handles the object correctly? This requires understanding the object’s shape and environment and generalizing to new, unseen objects. (2) How can we generate physically plausible 3D human hand motion that moves the object to a target location and pose? This involves generating continuous interaction with the object while maintaining a stable grasp throughout. We also aim to leverage recent advances in large language models to guide the hand-object interaction with natural language and allow fine-grained control over the motion.

Research Scope

The aim of this thesis is to explore new approaches to model and generate realistic human hand-object interactions. First, a state-of-the-art review should be performed in order to understand the advances achieved so far, the existing challenges, and the promising directions that can be investigated. Next, we aim to propose new generative model architectures that synthesize the human hand pose and motion needed to correctly grasp and manipulate a given 3D object. Our goal is to publish these contributions in high-impact computer vision conferences (e.g., ICCV, CVPR, ECCV) and journals.

Admission Criteria

The PhD position is offered by the International Center of Artificial Intelligence of Morocco at Mohammed VI Polytechnic University. Applicants must have an excellent academic record and hold a Master’s degree, an engineering degree, or an equivalent recognized degree in Computer Science or Applied Mathematics. In addition, they should have programming skills (Python and C++) and good communication skills in English. Particular attention will be given to the fit between this research project and the applicant’s background.

References

[1] Sammy Christen, Shreyas Hampali, Fadime Sener, Edoardo Remelli, Tomas Hodan, Eric Sauser, Shugao Ma, and Bugra Tekin. DiffH2O: Diffusion-based synthesis of hand-object interactions from textual descriptions. arXiv preprint arXiv:2403.17827, 2024.
[2] Yufei Ye, Xueting Li, Abhinav Gupta, Shalini De Mello, Stan Birchfield, Jiaming Song, Shubham Tulsiani, and Sifei Liu. Affordance diffusion: Synthesizing hand-object interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22479-22489, 2023.
[3] Sammy Christen, Muhammed Kocabas, Emre Aksan, Jemin Hwangbo, Jie Song, and Otmar Hilliges. D-Grasp: Physically plausible dynamic grasp synthesis for hand-object interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20577-20586, 2022.
[4] Omid Taheri, Yi Zhou, Dimitrios Tzionas, Yang Zhou, Duygu Ceylan, Soren Pirk, and Michael J Black. GRIP: Generating interaction poses using spatial cues and latent consistency. In International Conference on 3D Vision (3DV), 2024.
[5] Jona Braun, Sammy Christen, Muhammed Kocabas, Emre Aksan, and Otmar Hilliges. Physically plausible full-body hand-object interaction synthesis. arXiv preprint arXiv:2309.07907, 2023.
[6] Keyang Zhou, Bharat Lal Bhatnagar, Jan Eric Lenssen, and Gerard Pons-Moll. GEARS: Local geometry-aware hand-object interaction synthesis. arXiv preprint arXiv:2404.01758, 2024.

UM6P.

CEDoc-UM6P-AI MOV : Modeling and Synthesizing Hand-Object Interactions Full-time
