AttentionHand:
Text-driven Controllable Hand Image Generation
for 3D Hand Reconstruction in the Wild

ECCV 2024

(Oral Presentation)

1Sogang University, 2LG Electronics, 3Pusan National University
(* : Equal contribution, † : Corresponding author)
pipeline

We propose a novel method, AttentionHand, for text-driven controllable hand image generation. (1) In the data preparation phase, we prepare global and local RGB images, global and local hand mesh images, a bounding box, and a text prompt. (2) In the encoding phase, we obtain global and local latent image embeddings from VQ-GAN and a text embedding from CLIP. (3) In the conditioning phase, we refine the image embeddings through the text attention stage and obtain the diffusion feature through the visual attention stage. (4) In the decoding phase, we generate a new hand image from the diffusion feature.
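The four phases can be summarized in the conceptual sketch below. It is only a high-level outline: the interfaces (vqgan.encode, clip_text_encoder, text_attention_stage, visual_attention_stage, decoder) are hypothetical names standing in for the components described above, not the released implementation.

```python
# Conceptual sketch of the AttentionHand pipeline (hypothetical interfaces).
def generate_hand_image(global_rgb, local_rgb, global_mesh, local_mesh, bbox, prompt,
                        vqgan, clip_text_encoder, text_attention_stage,
                        visual_attention_stage, decoder):
    # (2) Encoding phase: latent image embeddings and text embedding.
    z_global = vqgan.encode(global_rgb)      # global latent image embedding
    z_local = vqgan.encode(local_rgb)        # local latent image embedding
    text_emb = clip_text_encoder(prompt)     # CLIP text embedding

    # (3) Conditioning phase: refine latents with hand-related text attention,
    #     then condition on the global and local hand mesh images.
    z_global, z_local = text_attention_stage(z_global, z_local, text_emb)
    feature = visual_attention_stage(z_global, z_local, global_mesh, local_mesh, text_emb)

    # (4) Decoding phase: decode the diffusion feature into a new hand image.
    return decoder(feature)
```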

Abstract

Recently, a significant amount of research has been conducted on 3D hand reconstruction to enable various forms of human-computer interaction. However, 3D hand reconstruction in the wild is challenging due to the extreme lack of in-the-wild 3D hand datasets. In particular, when hands are in complex poses such as interacting hands, problems like appearance similarity, self-handed occlusion, and depth ambiguity make reconstruction even more difficult. To overcome these issues, we propose AttentionHand, a novel method for text-driven controllable hand image generation. Since AttentionHand can generate diverse and numerous in-the-wild hand images well-aligned with 3D hand labels, we can acquire a new 3D hand dataset and relieve the domain gap between indoor and outdoor scenes. Our method needs four easy-to-use modalities (i.e., an RGB image, a hand mesh image from a 3D label, a bounding box, and a text prompt). These modalities are embedded into the latent space in the encoding phase. Then, through the text attention stage, hand-related tokens from the given text prompt are attended to highlight hand-related regions of the latent embedding. After the highlighted embedding is fed to the visual attention stage, hand-related regions in the embedding are attended by conditioning on global and local hand mesh images with the diffusion-based pipeline. In the decoding phase, the final feature is decoded into new hand images, which are well-aligned with the given hand mesh image and text prompt. As a result, AttentionHand achieves state-of-the-art performance among text-to-hand image generation models, and the performance of 3D hand mesh reconstruction is improved by additionally training with hand images generated by AttentionHand.

Results of Text-to-Hand Image Generation

Results of 3D Hand Mesh Reconstruction on MSCOCO

Results of 3D Hand Mesh Reconstruction on Re:InterHand

Motivation

intro

3D hand mesh reconstruction becomes difficult when hands are in the wild, due to the insufficiency of in-the-wild 3D hand datasets. Compared to in-the-lab datasets, acquiring in-the-wild datasets is challenging because of unpredictable conditions such as weather, lighting, sensor cost, and safety issues on crowded roads and in public places. Even if an in-the-wild dataset is collected, its diversity would be poor due to these severe constraints. Although arbitrary labels can be obtained through pseudo annotation, their precision and accuracy are still poor compared to in-the-lab datasets, as shown in figure (a). To tackle this problem, several synthetic datasets have been introduced. However, since the hand and background images are composited without harmony, these datasets consist of unnatural and unrealistic hand images, as shown in figure (b). Hence, it is difficult to overcome the domain gap between indoor and outdoor scenes with synthetic datasets. To address these issues, we propose AttentionHand, a new method for text-driven controllable hand image generation. AttentionHand is designed to create accurate, natural, realistic, and harmonious in-the-wild hand images easily and in virtually unlimited quantities, as shown in figure (c).

Method

TAS

Overall process of the text attention stage (TAS). TAS attends to hand-related tokens from the given text prompt by leveraging attention maps. Specifically, TAS extracts hand-related attention maps (i.e., "holding" and "hand"), and these attention maps are refined to highlight hand-related regions via a softmax operation and a Gaussian filter. With TAS, we obtain more hand-focused images than without it.
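A minimal sketch of this refinement idea is shown below. The tensor shapes, temperature, kernel size, and the peak-based objective are illustrative assumptions, not the exact implementation of the paper.

```python
# Hedged sketch of the TAS refinement: select cross-attention maps of hand-related
# tokens, sharpen them with a softmax, and smooth them with a Gaussian filter.
import torch
import torch.nn.functional as F

def gaussian_kernel(size: int = 3, sigma: float = 0.5) -> torch.Tensor:
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    kernel = torch.outer(g, g)
    return (kernel / kernel.sum()).view(1, 1, size, size)

def refine_hand_attention(attn_maps: torch.Tensor, hand_token_ids: list,
                          temperature: float = 100.0) -> torch.Tensor:
    # attn_maps: (tokens, H, W) cross-attention maps from the denoising U-Net.
    hand_maps = attn_maps[hand_token_ids]                       # e.g. "holding", "hand"
    hand_maps = F.softmax(temperature * hand_maps.flatten(1), dim=-1)
    hand_maps = hand_maps.view(len(hand_token_ids), *attn_maps.shape[1:])
    kernel = gaussian_kernel().to(hand_maps)
    smoothed = F.conv2d(hand_maps.unsqueeze(1), kernel, padding=1).squeeze(1)
    return smoothed                                              # highlighted hand regions

def tas_objective(attn_maps: torch.Tensor, hand_token_ids: list) -> torch.Tensor:
    # Encourage each hand-related token to have a strong peak response.
    smoothed = refine_hand_attention(attn_maps, hand_token_ids)
    return (1.0 - smoothed.flatten(1).max(dim=-1).values).mean()
```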

VAS

Overall process of the visual attention stage (VAS). VAS attends to hand-related regions by conditioning on global and local hand mesh images with the Stable Diffusion-based pipeline. With global and local information, AttentionHand can be jointly optimized to reflect the global context (i.e., in-the-wild background) and the local context (i.e., hand-focused foreground). At the end of the conditioning phase, we obtain the diffusion feature, which is decoded into new hand images in the decoding phase. Hence, AttentionHand can generate hand images well-aligned with the given mesh image and text prompt for 3D hand mesh reconstruction in the wild.
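The sketch below illustrates the global/local conditioning idea with a toy module: a single shared conditioning branch encodes both the global and the local hand mesh image, and its features would guide the denoising network. The MeshConditioner module and its layers are assumptions for illustration only.

```python
# Toy illustration of conditioning on global and local hand mesh images
# with one shared branch (hypothetical MeshConditioner, not the paper's network).
import torch
import torch.nn as nn

class MeshConditioner(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.SiLU(),
        )

    def forward(self, mesh_image: torch.Tensor) -> torch.Tensor:
        return self.encoder(mesh_image)   # conditioning feature for the denoiser

conditioner = MeshConditioner()            # weights shared by both branches
global_feat = conditioner(torch.randn(1, 3, 512, 512))   # in-the-wild background context
local_feat = conditioner(torch.randn(1, 3, 256, 256))    # hand-focused foreground context
```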

optimization

Overall process of the optimization of AttentionHand. By globally and locally denoising the updated noisy embeddings over t diffusion steps, we obtain global and local predicted noises, which are optimized with an L2 loss against the global and local residual noises. Note that the global and local denoising networks share weights.
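A hedged sketch of this objective follows: one shared denoiser predicts the noise for both branches, and each prediction is matched to its residual (target) noise with an L2 loss. The function signature and argument names are assumptions for illustration.

```python
# Sketch of the global + local denoising objective with a shared-weight denoiser.
import torch
import torch.nn.functional as F

def attentionhand_loss(denoiser, z_global_t, z_local_t, t,
                       eps_global, eps_local, cond_global, cond_local, text_emb):
    # The same denoiser (shared weights) is applied to both branches.
    pred_global = denoiser(z_global_t, t, cond_global, text_emb)
    pred_local = denoiser(z_local_t, t, cond_local, text_emb)
    # L2 between predicted and residual noise, summed over global and local branches.
    return F.mse_loss(pred_global, eps_global) + F.mse_loss(pred_local, eps_local)
```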

Exploration of Text Attention Stage

exp_TAS

With TAS, the attention maps describe their corresponding tokens well. This implies that, with TAS, AttentionHand can sufficiently reflect hand-related tokens compared to the case without TAS.

exp_Gaussian

(1) Ablation study on the Gaussian filter. With no Gaussian filter or a random one, the hand disappeared or its shape became strange. However, with a fixed Gaussian filter, the generated images are well-aligned with the given hand mesh images and look natural. Hence, we concluded that a fixed Gaussian filter makes the generated image plausible regardless of the diffusion timestep.
(2) Ablation study on our loss. While the load balancing loss flattens the 2D attention map into a 1D representation and thus distorts spatial knowledge, our loss updates the image embedding based on the spatial information of the attention map. Therefore, images generated with our loss fit the given hand mesh images well.
(3) Ablation study on the regularization of the updated noise. If the regularization term is randomly set, the updated noise tends to be out of distribution: generated images are not aligned with the given mesh images, or some hands are missing. Therefore, it is necessary to regularize the updated noise for faithful hand image generation (a minimal sketch of one such regularization follows below).
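The snippet below shows one plausible form of such a regularization, penalizing the updated noisy embedding for drifting away from its original counterpart; the exact term used in the paper may differ, and the function name and weight are assumptions.

```python
# Illustrative regularizer keeping the updated noisy embedding near the original one,
# so it stays within the distribution the diffusion model was trained on.
import torch

def regularize_updated_noise(z_updated: torch.Tensor, z_original: torch.Tensor,
                             weight: float = 0.1) -> torch.Tensor:
    return weight * (z_updated - z_original).pow(2).mean()
```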

Model Design Justification

table_comparison

To justify our model's advantages, we compared its characteristics with those of prior works. As shown in the table, the distinctive features of our model compared to prior works are (1) harmonious preservation of locality (i.e., the hand) together with globality (i.e., the in-the-wild scene), and (2) selective attention to hand-related tokens via cross attention.

table_VAS

Specifically, to harmonize globality and locality, we developed global and local designs for the visual attention stage (VAS). Moreover, since the global and local branches are structurally identical, we let them share their weights to reduce the number of training parameters and improve generalizability. We experimentally verified the effectiveness of this design, as shown in the table.
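As a toy check of this design choice, binding both branches to the same module instance keeps a single parameter set, so the conditioning parameters are counted once rather than twice; the stand-in layer below is purely illustrative.

```python
# Toy check of the shared-weight design (the Conv2d stands in for one VAS branch).
import torch.nn as nn

branch = nn.Conv2d(3, 64, 3, padding=1)
global_branch, local_branch = branch, branch      # same instance => shared weights
assert global_branch is local_branch
print(sum(p.numel() for p in branch.parameters()))  # counted once, not twice
```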

Robustness of Our Generated Dataset

exp_multi

To verify the robustness of our generated dataset, we generated multiple hand images from the same modalities, as shown in the figure. As a result, all generated images are well-aligned with the given hand mesh images.

exp_tsne

Moreover, we found that the t-SNE distribution of AttentionHand is broader than that of MSCOCO, as shown in the figure. We therefore believe that AttentionHand can contribute to downstream tasks through its extensive in-the-wild hand images, helping to alleviate the domain gap between indoor and outdoor scenes.

BibTeX

@inproceedings{park2024attentionhand,
  author    = {Park, Junho and Kong, Kyeongbo and Kang, Suk-Ju},
  title     = {AttentionHand: Text-driven Controllable Hand Image Generation for 3D Hand Reconstruction in the Wild},
  booktitle = {European Conference on Computer Vision},
  year      = {2024},
}