Hand-Focused Reconstruction of Monocular RGB Clothed Humans
Introduction
Reconstructing 3D human bodies from monocular RGB images is a fundamental technology for computational modeling of human behavior. Human actions are often conveyed through subtle cues such as facial expressions, hand gestures, body posture, and clothing. Among these, hands serve as a primary medium for human interaction with the world, making hand reconstruction crucial for applications in VR/AR, robotics, and human-computer interaction. However, traditional full-body reconstruction methods often struggle to capture fine hand details, while standalone hand reconstruction lacks contextual information about the overall body structure and motion. Recent research has shifted toward localized refinement within full-body reconstruction to achieve more complete and accurate human models. Compared to complex multi-view camera systems, single RGB image-based reconstruction is more economical and convenient.
Parametric human models such as SMPL, STAR, and SMPL-X have facilitated full-body reconstruction, including hands. However, methods like ICON, ECON, and NeRF-based approaches typically train a single model to estimate the entire body, inherently limited by the lack of accurate and diverse full-body motion datasets. While these methods ensure consistency and coordination across body parts, their hand reconstruction performance remains inadequate for specialized applications such as behavioral analysis, animation, and medical fields.
Existing hand reconstruction methods, including those based on the MANO hand model, directly regress shape and pose parameters, but this high-dimensional estimation problem often ignores spatial correlations. Some approaches use voxel representations or implicit functions to improve accuracy, but these methods demand significant computational resources. Recent models like NIMBLE achieve higher precision but are computationally expensive, making them unsuitable for real-time applications.
Full-body reconstruction tends to lose hand details, while standalone hand reconstruction lacks motion dependencies. The latest work, ECON, which reconstructs clothed humans from single RGB images, also suffers from hand distortions. Although ECON provides a hand replacement option, it struggles with occlusions and complex poses. Since ECON embeds the SMPL-X model, integrating more advanced hand reconstruction techniques becomes feasible.
To address these challenges, this paper presents H-ECON, a method combining ECON with the MANO hand model for efficient and refined hand reconstruction in clothed humans. H-ECON introduces a type-independent hand detector, attention mechanisms, and dilated spiral convolution to enhance hand perception and feature extraction. A dedicated fusion module ensures seamless integration between hand reconstruction and the full-body model. Experiments on the FreiHAND and HanCo datasets demonstrate that H-ECON significantly outperforms existing methods, narrowing the gap between 2D image observations and 3D human mesh reconstruction.
Related Work
Statistical Body Modeling
Statistical human models like SCAPE and SMPL simplify body deformation by separating identity-dependent and pose-dependent variations. SCAPE decomposes geometric deformations into pose-related shape changes, individual morphological differences, and rigid transformations based on skeletal structure. However, its reliance on triangle rotations introduces memory overhead and limits compatibility with animation software.
SMPL, a vertex-based linear model, combines identity deformation with skeletal skinning, making it the most widely used human model. Extensions like SMPL+H enhance hand details, while SMIL introduces a statistical infant model. STAR reduces parameter complexity, and SMPL-X incorporates facial expressions, finger movements, and eye details, improving surface detail representation. SMPL-X supports differentiable rendering, facilitating deep neural network-based parameter regression.
MANO Hand Model
Most hand reconstruction methods rely on MANO, a deformable hand mesh model. However, MANO-based approaches require estimating low-dimensional pose and shape parameters, where minor changes can propagate through the kinematic tree, causing significant hand pose variations. Model-free methods directly predict 3D hand vertex positions using deep neural networks but face challenges in dense surface estimation.
CNN-based vertex regression disrupts spatial relationships in input images, while heatmap-based methods preserve spatial correlations at a lower computational cost. This paper adopts a graph-based approach, treating the hand mesh as a structured graph and using SE-ResNet and dilated spiral convolution to avoid spatial relationship degradation. This method progressively refines mesh vertices, ensuring robust hand reconstruction.
Clothed Human Reconstruction
SMPL-based geometric priors enhance pose stability and lay the foundation for clothed human reconstruction. The SMPL+D model adds vertex displacements to simulate tight clothing but struggles with topology mismatches. Implicit representations offer flexibility but lack interpretability.
Recent advances like ECON combine explicit parametric models with deep implicit functions, using SMPL-X depth as a geometric constraint. ECON integrates depth-aware bilateral normal integration (d-BiNI) to maintain surface continuity, supporting animation-ready avatars. However, ECON’s reliance on PIXIE for SMPL-X parameter regression leads to hand distortions. H-ECON addresses this by employing a type-independent hand detector and direct vertex regression for precise hand reconstruction.
H-ECON Model Design
The H-ECON pipeline begins with a monocular RGB image input, processed by PIXIE to generate an SMPL-X body model. The image and SMPL-X output are fed into ECON to obtain a preliminary clothed human model. The hand region is then cropped and resized to 224×224 pixels, with left hands flipped to right-hand orientation. An encoder-decoder network regresses MANO hand vertices, and a fusion module replaces ECON’s hand model with the refined MANO output.
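The sketch below illustrates this flow under stated assumptions: every helper name (`run_pixie`, `run_econ`, `detect_hands`, `crop_and_resize`, `regress_mano_vertices`, `fuse_hand`) is a hypothetical placeholder for the corresponding stage, since the paper does not publish an API.

```python
# Illustrative end-to-end flow of the H-ECON pipeline (all helper functions are hypothetical).
import cv2

def reconstruct(image_bgr):
    smplx_params = run_pixie(image_bgr)                # SMPL-X body parameters from PIXIE
    clothed_mesh = run_econ(image_bgr, smplx_params)   # preliminary clothed human from ECON

    for hand in detect_hands(image_bgr):               # type-independent hand detector
        crop = crop_and_resize(image_bgr, hand.box, size=(224, 224))
        if hand.side == "left":                        # mirror left hands to right-hand orientation
            crop = cv2.flip(crop, 1)
        mano_verts = regress_mano_vertices(crop)       # encoder-decoder MANO vertex regression
        if hand.side == "left":
            mano_verts[:, 0] *= -1.0                   # undo the mirroring in 3D
        clothed_mesh = fuse_hand(clothed_mesh, mano_verts, smplx_params, hand.side)

    return clothed_mesh
```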
Hand Detection Module
Hand detection relies on a pose estimator and a type-independent hand detector. The pose estimator locates body keypoints (e.g., wrists), and the hand detector precisely identifies hand positions and orientations within the target region.
Hand Vertex Regression
The hand module uses an SE-ResNet encoder with attention mechanisms and a dilated spiral convolution decoder to directly predict 3D vertex coordinates.
SE-ResNet Encoder
The encoder employs SE-ResNet-50, integrating SENet’s attention mechanism with ResNet’s residual connections. SENet enhances feature perception by weighting channels via global average pooling and two convolutional layers. SE-ResNet-50 achieves performance close to ResNet-101 with half the computational cost.
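A minimal sketch of the SE channel-attention block described here, assuming the two layers are 1×1 convolutions (the original SENet formulation uses fully connected layers; the two are equivalent after global pooling) and a reduction ratio of 16:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention, as used inside SE-ResNet bottlenecks."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                    # squeeze: global average pooling
        self.fc = nn.Sequential(                               # excitation: two 1x1 convolutions
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(self.pool(x))                              # per-channel weights in (0, 1)
        return x * w                                           # reweight feature channels
```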
Dilated Spiral Decoder
The decoder uses spiral patch operators to define spatial neighborhoods, ensuring fixed-length vertex sequences. Dilated spiral convolution expands the receptive field without losing resolution, capturing multi-scale hand features.
Upsampling increases vertex density using a mesh hierarchy, where discarded vertices are projected onto downsampled triangles. The decoder alternates between upsampling and spiral convolution, progressively refining hand mesh details.
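A minimal sketch of a dilated spiral convolution layer and a mesh-hierarchy upsampling step, following the SpiralNet++ formulation that this description matches; the spiral orderings, dilation rates, and layer widths are assumptions:

```python
import torch
import torch.nn as nn

class DilatedSpiralConv(nn.Module):
    """SpiralNet++-style convolution: each vertex gathers a fixed-length, optionally
    dilated, spiral of neighbours and mixes their features with one linear layer."""
    def __init__(self, in_dim, out_dim, spiral_indices, dilation=1):
        super().__init__()
        # spiral_indices: LongTensor (V, L), precomputed neighbour ordering per vertex
        self.register_buffer("indices", spiral_indices[:, ::dilation].contiguous())
        self.linear = nn.Linear(in_dim * self.indices.shape[1], out_dim)

    def forward(self, x):                                  # x: (B, V, in_dim)
        B, V, _ = x.shape
        nbrs = x[:, self.indices.reshape(-1)]              # (B, V*L', in_dim)
        return self.linear(nbrs.reshape(B, V, -1))         # (B, V, out_dim)

def upsample(x, up_transform):
    """Mesh upsampling: a precomputed sparse barycentric matrix (V_dense x V_coarse)
    maps coarse-level vertex features onto the denser mesh level."""
    return torch.stack([torch.sparse.mm(up_transform, sample) for sample in x])
```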
Fusion Module
The fusion module integrates MANO hands into the SMPL-X body model through coordinate transformation, scale adjustment, translation, and rotation. Wrist joints guide hand positioning, and singular value decomposition (SVD) computes the rotation matrix for alignment.
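A minimal sketch of the rigid alignment step, using the classic SVD-based Kabsch/Procrustes solution; the specific correspondence points (wrist-area joints shared by MANO and SMPL-X) are an assumption:

```python
import numpy as np

def align_hand(mano_verts, mano_anchors, body_anchors):
    """Rigidly align the MANO hand to the SMPL-X body via scale, SVD rotation, translation.
    Anchor points are assumed to be corresponding wrist-area joints on both models."""
    src = mano_anchors - mano_anchors.mean(0)                   # centre both point sets
    dst = body_anchors - body_anchors.mean(0)

    scale = np.linalg.norm(dst) / (np.linalg.norm(src) + 1e-8)  # isotropic scale adjustment

    U, _, Vt = np.linalg.svd(src.T @ dst)                       # cross-covariance SVD
    d = np.sign(np.linalg.det(Vt.T @ U.T))                      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T                     # optimal rotation (Kabsch)

    t = body_anchors.mean(0) - scale * (R @ mano_anchors.mean(0))
    return scale * (R @ mano_verts.T).T + t                     # hand vertices in body space
```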
Training Objectives
The loss function combines L1 norms for 3D vertex, 2D joint, and 3D pose errors. Additional terms include normal loss for smooth surfaces, edge length loss for mesh regularity, and silhouette loss for contour alignment. The total loss balances these components to optimize reconstruction accuracy.
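A minimal sketch of how such a composite loss could be assembled in PyTorch; the dictionary keys, loss weights, and the omission of the normal and silhouette terms are assumptions made for brevity:

```python
import torch
import torch.nn.functional as F

def total_loss(pred, gt, faces, w):
    """Weighted sum of the main loss terms; keys and weights are illustrative.
    Normal and silhouette terms are omitted here for brevity."""
    l_vert = F.l1_loss(pred["verts3d"], gt["verts3d"])      # 3D vertex L1
    l_j2d  = F.l1_loss(pred["joints2d"], gt["joints2d"])    # 2D joint L1 (projected)
    l_j3d  = F.l1_loss(pred["joints3d"], gt["joints3d"])    # 3D pose L1

    # Edge-length regularity: edge lengths of the predicted mesh should match ground truth.
    edges = torch.cat([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]], dim=0)
    e_pred = (pred["verts3d"][:, edges[:, 0]] - pred["verts3d"][:, edges[:, 1]]).norm(dim=-1)
    e_gt   = (gt["verts3d"][:, edges[:, 0]] - gt["verts3d"][:, edges[:, 1]]).norm(dim=-1)
    l_edge = F.l1_loss(e_pred, e_gt)

    return (w["vert"] * l_vert + w["j2d"] * l_j2d +
            w["j3d"] * l_j3d + w["edge"] * l_edge)
```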
Experiments
Datasets and Augmentation
H-ECON is trained on FreiHAND and HanCo datasets, with 80% for training and 20% for validation. Data augmentation includes random scaling, translation, color jittering, rotation, and motion blur to enhance generalization.
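A hedged sketch of what such an augmentation pipeline could look like with torchvision; all parameter ranges are assumptions, and Gaussian blur stands in for motion blur, which torchvision does not provide directly:

```python
import torchvision.transforms as T

# Illustrative augmentation pipeline; parameter values are not specified in the text.
augment = T.Compose([
    T.RandomAffine(degrees=30, translate=(0.1, 0.1), scale=(0.8, 1.2)),  # rotation, translation, scaling
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),         # colour jittering
    T.GaussianBlur(kernel_size=5),                                       # blur (motion-blur stand-in)
    T.ToTensor(),
])
```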
Implementation
Experiments run on an RTX 3090 GPU using PyTorch. The Adam optimizer is used with an initial learning rate of 1e-4, decayed by a factor of 0.1 every 30 epochs. Training lasts 50-100 epochs, with input images normalized using ImageNet statistics.
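A sketch of this training setup in PyTorch; `build_hand_network` and `train_one_epoch` are hypothetical placeholders for the model constructor and the per-epoch training step:

```python
import torch
import torchvision.transforms as T

model = build_hand_network()       # hypothetical constructor for the SE-ResNet + spiral decoder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)  # x0.1 every 30 epochs

normalize = T.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                        std=[0.229, 0.224, 0.225])

for epoch in range(100):                               # the paper trains for 50-100 epochs
    train_one_epoch(model, optimizer, normalize)       # hypothetical single-epoch training step
    scheduler.step()
```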
Evaluation
Metrics
• MPJPE/MPVPE: Mean Per-Joint / Per-Vertex Position Error (mm).
• PA-MPJPE/PA-MPVPE: the same errors after Procrustes alignment, which removes global rotation, translation, and scale (a computation sketch follows this list).
• F-score: Harmonic mean of precision and recall at 5 mm / 15 mm thresholds (F@5, F@15).
• AUC: Area under the PCK curve for error thresholds up to 50 mm.
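A minimal sketch of how these metrics are commonly computed for hand joints/vertices (inputs in millimetres); this is not the paper's exact evaluation code:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint (or per-vertex) Euclidean error, in the units of the inputs (mm)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """MPJPE after similarity (Procrustes) alignment of the prediction to the ground truth."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(p.T @ g)
    d = np.sign(np.linalg.det(Vt.T @ U.T))                 # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T                # optimal rotation
    scale = (S * [1.0, 1.0, d]).sum() / (p ** 2).sum()     # optimal isotropic scale
    aligned = scale * (R @ p.T).T + mu_g
    return mpjpe(aligned, gt)

def f_score(pred, gt, thresh):
    """Harmonic mean of precision and recall at a distance threshold (e.g. 5 mm or 15 mm)."""
    d = np.linalg.norm(pred[:, None] - gt[None], axis=-1)  # pairwise point distances
    precision = (d.min(axis=1) < thresh).mean()
    recall = (d.min(axis=0) < thresh).mean()
    return 2 * precision * recall / (precision + recall + 1e-8)
```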
Hand Reconstruction
On FreiHAND, H-ECON with SE-ResNet-50 outperforms ResNet-18 baselines by 33.1% in PA-MPJPE and 40.0% in F@5. With additional HanCo data, H-ECON surpasses HIU-DMTL by 2.7% in PV and 2.14% in F@5, demonstrating superior hand pose and mesh accuracy.
Qualitative results show H-ECON’s robustness to occlusions, gloves, and hand crossings, unlike ECON’s distorted outputs. The fusion module ensures seamless integration, even in unseen viewpoints.
Full-Body Reconstruction
H-ECON enhances ECON’s hand reconstruction without increasing processing time (1.12s per image). The refined hand module eliminates unrealistic artifacts, improving hand positioning and articulation.
Limitations
H-ECON depends on SMPL-X wrist rotation accuracy; errors may cause fusion artifacts. Expanding training data with wrist-specific features or adopting more precise hand models (e.g., HTML, DART) could mitigate this. Additionally, H-ECON reconstructs geometry but not underlying skeletons or skinning weights. Future work may incorporate SSDR for animation-ready avatars.
Conclusion
H-ECON reconstructs clothed humans with detailed hands from single RGB images, addressing hand distortions in complex poses. Its hand module outperforms ECON by 41.7% in PA-MPJPE and 52.6% in F@5, achieving state-of-the-art results with minimal data. The modular design allows separation of hair, clothing, and accessories for downstream tasks. Future directions include integrating deep skinning and 3D Gaussian splatting for animatable avatars.
doi.org/10.19734/j.issn.1001-3695.2024.02.0112