Learning Physics-Based Full-Body Human Reaching and Grasping from Brief Walking References

Yitang Li1,2,4*, Mingxian Lin2*, Zhuo Lin1, Yipeng Deng1,2, Yue Cao1,2, Li Yi1,3,2†
1Tsinghua University, 2Shanghai Qi Zhi Institute, 3Shanghai AI Laboratory, 4Galbot
CVPR 2025

Video

Abstract

Existing motion generation methods based on mocap data are often limited by data quality and coverage. In this work, we propose a framework that generates diverse, physically feasible full-body human reaching and grasping motions using only brief walking mocap data. Our approach builds on two observations: walking data captures valuable movement patterns that transfer across tasks, and advanced kinematic methods can generate diverse grasping poses, which can then be interpolated into motions that serve as task-specific guidance. The framework incorporates an active data generation strategy to maximize the utility of the generated motions, along with a local feature alignment mechanism that transfers natural movement patterns from walking data to improve both the success rate and the naturalness of the synthesized motions. By combining the fidelity and stability of natural walking with the flexibility and generalizability of task-specific generated data, our method demonstrates strong performance and robust adaptability in diverse scenes and with unseen objects.
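The interpolation step mentioned in the abstract can be pictured with a minimal sketch. It assumes a pose is a flat list of joint values and that interpolation is linear in that space; the page does not state the actual pose representation or interpolation scheme, and `interpolate_poses` is a hypothetical name used only for illustration.

```python
from typing import List


def interpolate_poses(
    start_pose: List[float], grasp_pose: List[float], num_frames: int
) -> List[List[float]]:
    """Linearly blend from start_pose to grasp_pose over num_frames frames.

    Assumption: poses are flat lists of joint values. A real pipeline would
    likely interpolate joint rotations (e.g. with slerp) instead.
    """
    if len(start_pose) != len(grasp_pose):
        raise ValueError("poses must have the same number of joint values")
    if num_frames < 2:
        raise ValueError("need at least two frames (start and end)")

    frames = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # goes from 0.0 (start) to 1.0 (grasp)
        frames.append(
            [(1.0 - t) * s + t * g for s, g in zip(start_pose, grasp_pose)]
        )
    return frames
```

The first frame equals the start pose and the last frame equals the generated grasping pose, so the resulting clip can act as a coarse kinematic reference toward the target grasp.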

Reaching and Grasping Compared to Baselines

Grasping Object on High & Low Table.



Ours

Fullbody PPO

ASE

AMP

AMP*
(Adding generated data)

PMP(2-Part)
(Upper and Lower)

PMP(5-Part)
(Torso and Five Limbs)

PSE(2-Part)
(Upper and Lower)

PSE(5-Part)
(Torso and Five Limbs)

Generalization ability

Different Table height & width & Goal position

Our method generates diverse reaching and grasping motions across scenes with different table heights (0.0-1.6 m), table widths (0.6-1.2 m), and initial positions, achieving a high success rate and natural movement. This generalization across scenes, particularly across table heights, largely stems from the diversity of our generated data: the dataset covers almost all possible table heights, providing task-specific guidance that facilitates grasping in various scenarios.
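As a rough illustration of how scenes in these ranges could be drawn, the sketch below samples table height and width from the intervals stated above. Uniform sampling, the `SceneConfig` type, and the `sample_scene` helper are assumptions made for this example; the page does not describe the actual sampling procedure.

```python
import random
from dataclasses import dataclass

# Ranges taken from the text above, in meters.
TABLE_HEIGHT_RANGE_M = (0.0, 1.6)
TABLE_WIDTH_RANGE_M = (0.6, 1.2)


@dataclass
class SceneConfig:
    """Hypothetical container for one sampled scene."""
    table_height_m: float
    table_width_m: float


def sample_scene(rng: random.Random) -> SceneConfig:
    """Draw one scene, assuming independent uniform sampling per parameter."""
    return SceneConfig(
        table_height_m=rng.uniform(*TABLE_HEIGHT_RANGE_M),
        table_width_m=rng.uniform(*TABLE_WIDTH_RANGE_M),
    )
```

Calling `sample_scene(random.Random(0))` repeatedly yields scenes that span the whole height range, which is the kind of coverage the paragraph credits for generalization.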



Various Object Instances & Categories

Our policy successfully generalizes to various objects, including unseen categories, producing natural movements with a high success rate.




Effects of Data Ratio

At low data ratios, task completion improves rapidly as the ratio increases. However, once the ratio exceeds 100%, the character struggles to turn naturally, and beyond 200% it shifts its focus to balancing between generated demos, which hinders effective walking.
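To make the ratio concrete, here is a minimal sketch of mixing the two data sources. It assumes the data ratio is the number of generated demo clips relative to the number of walking reference clips, which this page does not define explicitly; `mix_references` and its parameters are hypothetical names.

```python
import random
from typing import List, Sequence


def mix_references(
    walking: Sequence[str],
    generated: Sequence[str],
    data_ratio: float,
    rng: random.Random,
) -> List[str]:
    """Combine all walking clips with a random subset of generated clips.

    Assumption: data_ratio = (#generated clips used) / (#walking clips),
    so 1.0 corresponds to 100% and 2.0 to 200%.
    """
    if data_ratio < 0:
        raise ValueError("data_ratio must be non-negative")
    n_generated = round(data_ratio * len(walking))
    if n_generated > len(generated):
        raise ValueError("not enough generated clips for this ratio")
    return list(walking) + rng.sample(list(generated), n_generated)
```

Under this reading, a ratio of 0% trains on walking data alone, while 200% means generated demos outnumber walking clips two to one, consistent with the observation that walking quality degrades at high ratios.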



Data Ratio: 0%

Data Ratio: 5%

Data Ratio: 10%

Data Ratio: 20%

Data Ratio: 50%

Data Ratio: 100%

Data Ratio: 200%

Ablation Study on Feature Alignment

We conduct various ablation studies to validate the effectiveness of feature alignment. The results show that it improves motion naturalness and stability during recovery.


Pose Refine

Local feature alignment enhances the refinement of the grasping pose. For better comparison, we include a pause in the video during the grasping phase. A more detailed explanation is presented in the figure below.


w/o features

Zero+First-layer features

Pose Refine Illustration

More detailed comparison


w/o features

Torso-feature only

Limb-features only

Torso-feature + First-layer feature

Zero+First-layer features

Zero+First+Second-layer features


Improve Stability


Feature alignment enhances overall stability: with feature alignment, the agent swiftly raises its left hand and quickly steps back with its left foot to maintain balance when grasping low objects. This coordinated movement is crucial for dynamic recovery.


w/o feature align

Adding Zero/First-Layer feature align

Contact

Please contact us at liyitang22@mails.tsinghua.edu.cn if you have any questions.