ICLR 2026

EgoWorld: Translating Exocentric View to Egocentric View using Rich Exocentric Observations

Junho Park, Andrew Sangwoo Ye, Taein Kwon

¹AI Lab, LG Electronics  ²KAIST  ³Visual Geometry Group, University of Oxford

† Corresponding author

TL;DR. EgoWorld reconstructs an egocentric view from an exocentric view by predicting point clouds, 3D hand poses, and textual descriptions, then conditioning a diffusion model on those observations.

EgoWorld teaser figure showing exocentric-to-egocentric translation examples

Abstract

Egocentric vision is central to understanding fine-grained hand-object interaction, but translating an exocentric image into an egocentric view remains difficult when methods rely only on 2D cues, synchronized multi-view capture, or assumptions such as an initial egocentric frame and known relative camera poses at inference time. EgoWorld addresses that setting directly from a single exocentric image.

The framework first extracts rich exocentric observations, including an egocentric sparse RGB map derived from point clouds, an egocentric 3D hand pose, and a textual description of the scene and interaction. It then conditions a diffusion-based reconstruction model on those observations to generate dense, semantically coherent egocentric images. Across H2O, TACO, Assembly101, and Ego-Exo4D, EgoWorld achieves state-of-the-art performance and retains strong generalization on unseen objects, actions, scenes, subjects, and in-the-wild scenarios.

Method

EgoWorld method overview diagram

Stage 1

Exocentric View Observation \( \Phi_{exo} \)

Given a single exocentric image \( I_{exo} \), EgoWorld estimates depth, reconstructs a metrically scaled point cloud, projects it into the egocentric frame as a sparse RGB map \( S_{ego} \), predicts a 3D egocentric hand pose \( P_{ego} \), and extracts a textual description \( T_{exo} \).
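As a rough sketch of the projection step, the snippet below backprojects a metric depth map to a point cloud and splats it into the egocentric image plane to form \( S_{ego} \). This is an illustration under simplifying assumptions, not the paper's implementation: the egocentric intrinsics `K_ego` and the relative transform `T_exo2ego` are taken as given here, whereas EgoWorld derives the egocentric frame from the exocentric observation alone.

```python
import numpy as np

def backproject(depth, K):
    """Lift a metric depth map to a point cloud in camera coordinates.

    depth: (H, W) depth in meters; K: (3, 3) intrinsics.
    Returns (H*W, 3) points.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T        # normalized camera rays
    return rays * depth.reshape(-1, 1)     # scale rays by metric depth

def project_sparse_rgb(points, colors, T_exo2ego, K_ego, hw):
    """Splat colored exocentric points into the egocentric image plane.

    T_exo2ego: (4, 4) relative pose, assumed known in this sketch.
    Returns the sparse RGB map S_ego, zero where nothing projects.
    """
    H, W = hw
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    pts_ego = (pts_h @ T_exo2ego.T)[:, :3]
    front = pts_ego[:, 2] > 1e-6                      # keep points in front of the camera
    pix = pts_ego[front] @ K_ego.T
    pix = np.round(pix[:, :2] / pix[:, 2:3]).astype(int)
    inb = (pix[:, 0] >= 0) & (pix[:, 0] < W) & (pix[:, 1] >= 0) & (pix[:, 1] < H)
    S_ego = np.zeros((H, W, 3), dtype=np.float32)
    S_ego[pix[inb, 1], pix[inb, 0]] = colors[front][inb]
    return S_ego
```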

Stage 2

Egocentric View Reconstruction \( \Phi_{ego} \)

The sparse RGB map embedding, hand pose embedding, and text embedding are fused in a latent diffusion model to generate the final egocentric image \( \hat{I}_{ego} \), recovering both local hand-object interaction and global scene context.
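The toy PyTorch module below shows one plausible fusion scheme; the layer shapes and fusion points are assumptions for illustration, not the paper's architecture. The encoded sparse map is concatenated with the noisy latent channel-wise, the hand pose is embedded and added as a global bias (like a timestep embedding), and the text tokens steer cross-attention.

```python
import torch
import torch.nn as nn

class EgoCondDenoiser(nn.Module):
    """Toy denoiser illustrating how the three conditions could enter a
    latent diffusion UNet. All sizes are illustrative, not the paper's."""

    def __init__(self, latent_ch=4, sparse_ch=4, pose_dim=21 * 3, d=256):
        super().__init__()
        self.in_conv = nn.Conv2d(latent_ch + sparse_ch, d, 3, padding=1)
        self.pose_mlp = nn.Sequential(nn.Linear(pose_dim, d), nn.SiLU(), nn.Linear(d, d))
        self.attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self.out_conv = nn.Conv2d(d, latent_ch, 3, padding=1)

    def forward(self, z_t, s_ego, p_ego, t_exo):
        # z_t: noisy latent (B, 4, h, w); s_ego: encoded sparse map (B, 4, h, w)
        # p_ego: flattened 3D hand pose (B, 63); t_exo: text tokens (B, L, d)
        h = self.in_conv(torch.cat([z_t, s_ego], dim=1))   # concat sparse map with latent
        h = h + self.pose_mlp(p_ego)[:, :, None, None]     # pose as a global bias
        B, d, H, W = h.shape
        seq = h.flatten(2).transpose(1, 2)                 # (B, H*W, d)
        seq = seq + self.attn(seq, t_exo, t_exo)[0]        # text cross-attention
        h = seq.transpose(1, 2).reshape(B, d, H, W)
        return self.out_conv(h)                            # predicted noise
```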

Formulation

\[ S_{ego}, P_{ego}, T_{exo} = \Phi_{exo}(I_{exo}), \] \[ \hat{I}_{ego} = \Phi_{ego}(S_{ego}, P_{ego}, T_{exo}). \]
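In code, inference is simply the composition of the two maps above (a sketch, with `phi_exo` and `phi_ego` standing in for the trained stages):

```python
def egoworld_inference(I_exo, phi_exo, phi_ego):
    """Two-stage inference matching the formulation above."""
    S_ego, P_ego, T_exo = phi_exo(I_exo)   # Stage 1: rich exocentric observation
    return phi_ego(S_ego, P_ego, T_exo)    # Stage 2: egocentric reconstruction
```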

Results

We evaluate EgoWorld across four benchmark datasets, six quantitative metrics, and real-world scenarios.

Comparisons on H2O Dataset

Qualitative comparisons on the H2O dataset

Table 1. On unseen object, action, scene, and subject scenarios of H2O, EgoWorld outperforms prior methods across image quality, hand accuracy, and semantic alignment metrics: FID, PSNR, SSIM, LPIPS, PA-MPJPE, and CLIPScore.

| Scenario | Method | FID↓ | PSNR↑ | SSIM↑ | LPIPS↓ | PA-MPJPE↓ | CLIPScore↑ |
|---|---|---|---|---|---|---|---|
| Unseen Objects | pix2pixHD | 436.25 | 25.012 | 0.2993 | 0.6057 | 18.007 | 0.2302 |
| Unseen Objects | pixelNeRF | 498.23 | 26.557 | 0.3887 | 0.5372 | 15.746 | 0.2270 |
| Unseen Objects | CFLD | 59.615 | 25.922 | 0.4307 | 0.4539 | 7.9971 | 0.2656 |
| Unseen Objects | EgoWorld (Ours) | 41.334 | 31.171 | 0.4814 | 0.3476 | 7.3178 | 0.2731 |
| Unseen Actions | pix2pixHD | 211.10 | 24.420 | 0.2854 | 0.6127 | 17.754 | 0.2450 |
| Unseen Actions | pixelNeRF | 251.76 | 27.061 | 0.3950 | 0.8159 | 14.636 | 0.2315 |
| Unseen Actions | CFLD | 50.953 | 28.529 | 0.4324 | 0.4593 | 8.1199 | 0.2699 |
| Unseen Actions | EgoWorld (Ours) | 33.284 | 31.620 | 0.4566 | 0.3780 | 7.2602 | 0.2824 |
| Unseen Scenes | pix2pixHD | 490.32 | 18.567 | 0.2425 | 0.7290 | 20.229 | 0.2159 |
| Unseen Scenes | pixelNeRF | 489.13 | 26.537 | 0.2574 | 0.7143 | 17.085 | 0.2097 |
| Unseen Scenes | CFLD | 118.10 | 29.030 | 0.3696 | 0.6841 | 7.8766 | 0.2506 |
| Unseen Scenes | EgoWorld (Ours) | 90.893 | 31.004 | 0.4096 | 0.6519 | 7.4087 | 0.2585 |
| Unseen Subjects | pix2pixHD | 452.13 | 18.172 | 0.3310 | 0.7234 | 21.357 | 0.2311 |
| Unseen Subjects | pixelNeRF | 493.13 | 22.636 | 0.4135 | 0.6838 | 18.131 | 0.2263 |
| Unseen Subjects | CFLD | 129.30 | 21.050 | 0.4001 | 0.6269 | 9.5606 | 0.2461 |
| Unseen Subjects | EgoWorld (Ours) | 96.429 | 24.851 | 0.4605 | 0.6188 | 8.1031 | 0.2582 |

On H2O, EgoWorld improves over baselines in every unseen setting while also lifting pose accuracy and text-image alignment. The strongest gains come from combining geometric cues from sparse projection with semantic cues from language, which helps reconstruction beyond the hand region.
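For reference, PA-MPJPE, reported in the tables here, is the mean per-joint position error after a similarity (Procrustes) alignment, so it scores hand articulation rather than global placement. A minimal NumPy version, assuming `(J, 3)` joint arrays:

```python
import numpy as np

def pa_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE in the units of `gt` (e.g. mm).

    pred, gt: (J, 3) joint sets. Similarity-aligns pred to gt
    (rotation, scale, translation) before the per-joint error.
    """
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    Xp, Xg = pred - mu_p, gt - mu_g
    # Optimal rotation via SVD of the cross-covariance (Kabsch/Procrustes).
    U, S, Vt = np.linalg.svd(Xp.T @ Xg)
    D = np.eye(3)
    D[2, 2] = np.sign(np.linalg.det(U @ Vt))   # guard against reflections
    R = U @ D @ Vt
    scale = (S * np.diag(D)).sum() / (Xp ** 2).sum()
    aligned = scale * Xp @ R + mu_g
    return np.linalg.norm(aligned - gt, axis=1).mean()
```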

Comparisons on TACO, Assembly101, and Ego-Exo4D Datasets

Comparisons on TACO, Assembly101, and Ego-Exo4D

Table 2. EgoWorld generalizes to unseen action scenarios on TACO, Assembly101, and Ego-Exo4D, consistently outperforming prior methods across image quality, pose, and semantic metrics.

| Dataset | Method | FID↓ | PSNR↑ | SSIM↑ | LPIPS↓ | PA-MPJPE↓ | CLIPScore↑ |
|---|---|---|---|---|---|---|---|
| TACO | pix2pixHD | 227.87 | 25.875 | 0.2806 | 0.7037 | 19.054 | 0.2309 |
| TACO | pixelNeRF | 302.19 | 26.661 | 0.3888 | 0.8543 | 16.137 | 0.2251 |
| TACO | CFLD | 61.357 | 28.769 | 0.4009 | 0.5033 | 7.9078 | 0.2715 |
| TACO | EgoWorld (Ours) | 37.191 | 30.155 | 0.4237 | 0.4025 | 7.3590 | 0.2828 |
| Assembly101 | pix2pixHD | 350.97 | 17.107 | 0.3587 | 0.6578 | 21.967 | 0.2114 |
| Assembly101 | pixelNeRF | 356.44 | 19.037 | 0.3761 | 0.6019 | 19.658 | 0.2070 |
| Assembly101 | CFLD | 53.931 | 20.998 | 0.3988 | 0.5566 | 11.108 | 0.2458 |
| Assembly101 | EgoWorld (Ours) | 50.232 | 25.365 | 0.4101 | 0.5142 | 10.561 | 0.2558 |
| Ego-Exo4D | pix2pixHD | 401.48 | 14.792 | 0.3065 | 0.6899 | 25.082 | 0.2203 |
| Ego-Exo4D | pixelNeRF | 367.39 | 17.347 | 0.3618 | 0.7134 | 23.793 | 0.2149 |
| Ego-Exo4D | CFLD | 70.476 | 21.578 | 0.3614 | 0.5975 | 15.010 | 0.2670 |
| Ego-Exo4D | EgoWorld (Ours) | 61.231 | 24.985 | 0.3986 | 0.5482 | 13.992 | 0.2862 |

The results on the broader benchmarks show the same pattern as H2O: EgoWorld keeps improving local interaction fidelity while also recovering object appearance and scene structure. The gains remain consistent even as the datasets become more diverse and less controlled.

Comparisons on Real-World Scenarios

Real-world generalization examples

On in-the-wild smartphone captures, EgoWorld still produces coherent egocentric images. Compared with the baseline, its outputs remain better aligned with the actual interaction and are less biased toward training-set appearance patterns.

Analysis

We cover multimodal conditioning, reconstruction backbones, pose modeling, generation stability, text controllability, and representative failure cases.


BibTeX

@inproceedings{park2026egoworld,
  author    = {Park, Junho and Ye, Andrew Sangwoo and Kwon, Taein},
  title     = {EgoWorld: Translating Exocentric View to Egocentric View using Rich Exocentric Observations},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
}