ICLR 2026

EgoWorld: Translating Exocentric View to Egocentric View using Rich Exocentric Observations

¹AI Lab, LG Electronics  ²KAIST  ³Visual Geometry Group, University of Oxford

† Corresponding author

TL;DR. EgoWorld reconstructs a first-person view from a single exocentric image by predicting projected point-cloud observations, 3D egocentric hand pose, and textual scene context, then conditioning a diffusion model on those signals.

EgoWorld teaser figure showing exocentric-to-egocentric translation examples
EgoWorld translates a single exocentric view into an egocentric one by leveraging complementary exocentric observations, including point clouds, 3D hand poses, and textual descriptions.

Abstract

Egocentric vision is central to understanding fine-grained hand-object interaction, but translating an exocentric image into an egocentric view remains difficult when methods rely only on 2D cues, synchronized multi-view capture, or assumptions such as an initial egocentric frame and known relative camera poses at inference time. EgoWorld addresses that setting directly from a single exocentric image.

The framework first extracts rich exocentric observations, including a projected sparse egocentric RGB map derived from point clouds, a predicted 3D egocentric hand pose, and a textual description of the scene and interaction. It then conditions a diffusion-based reconstruction model on those observations to produce dense, semantically coherent egocentric images. Across H2O, TACO, Assembly101, and Ego-Exo4D, EgoWorld achieves state-of-the-art performance and retains strong generalization on unseen objects, actions, scenes, subjects, and in-the-wild captures.

Method

EgoWorld separates exocentric observation extraction from egocentric reconstruction so geometry, pose, and semantics all remain explicit conditioning signals.

EgoWorld method overview diagram
EgoWorld first extracts projected point-cloud cues, 3D hand pose, and a textual description from the exocentric image, then reconstructs the final egocentric view from those observations.

Stage 1

Exocentric View Observation \( \Phi_{exo} \)

Given a single exocentric image \( I_{exo} \), EgoWorld estimates depth, reconstructs a metrically scaled point cloud, projects it into the egocentric frame as a sparse RGB map \( S_{ego} \), predicts a 3D egocentric hand pose \( P_{ego} \), and extracts a textual description \( T_{exo} \).
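The projection step in Stage 1 can be sketched as a standard pinhole projection of colored 3D points into the egocentric camera, leaving unhit pixels empty. This is an illustrative sketch, not the paper's implementation; the function name, the rounding-based splatting, and the intrinsics/extrinsics conventions (`K`, `R`, `t` mapping world to camera coordinates) are assumptions.

```python
import numpy as np

def project_points_to_sparse_map(points, colors, K, R, t, hw):
    """Project world-space colored points into a camera to form a
    sparse RGB map S_ego; pixels with no projected point stay zero."""
    H, W = hw
    cam = points @ R.T + t               # world -> camera coordinates
    front = cam[:, 2] > 1e-6             # keep points in front of the camera
    cam, colors = cam[front], colors[front]
    uv = cam @ K.T                       # pinhole projection
    uv = uv[:, :2] / uv[:, 2:3]          # perspective divide
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    sparse = np.zeros((H, W, 3), dtype=colors.dtype)
    sparse[v[ok], u[ok]] = colors[ok]    # later points overwrite earlier ones
    return sparse
```

A real pipeline would also resolve occlusions with a z-buffer rather than letting later points overwrite earlier ones.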

Stage 2

Egocentric View Reconstruction \( \Phi_{ego} \)

The sparse map, pose embedding, and text embedding are fused in a latent diffusion model to generate the final egocentric image \( \hat{I}_{ego} \), recovering both local hand-object interaction and broader scene context.
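One simple way such a fusion could look is to flatten the sparse-map latent into spatial tokens and append pose and text embeddings as extra conditioning tokens in a shared width. This is purely an illustrative sketch; the shapes, the random linear projections, and the token-concatenation design are assumptions, not the paper's actual architecture.

```python
import numpy as np

def build_conditioning(sparse_latent, pose_embed, text_embed, d_model, rng):
    """Fuse the three observation signals into one conditioning sequence
    for a (hypothetical) latent diffusion model."""
    h, w, c = sparse_latent.shape
    tokens = sparse_latent.reshape(h * w, c)  # spatial tokens from the sparse map
    # hypothetical linear projections into a shared conditioning width
    W_img = rng.standard_normal((c, d_model)) / np.sqrt(c)
    W_pose = rng.standard_normal((pose_embed.size, d_model)) / np.sqrt(pose_embed.size)
    W_text = rng.standard_normal((text_embed.size, d_model)) / np.sqrt(text_embed.size)
    cond = np.concatenate([
        tokens @ W_img,                        # (h*w, d_model)
        (pose_embed.ravel() @ W_pose)[None],   # one pose token
        (text_embed.ravel() @ W_text)[None],   # one text token
    ], axis=0)
    return cond  # shape: (h*w + 2, d_model)
```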

Formulation

\[ (S_{ego}, P_{ego}, T_{exo}) = \Phi_{exo}(I_{exo}), \qquad \hat{I}_{ego} = \Phi_{ego}(S_{ego}, P_{ego}, T_{exo}). \]
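In code, the two-stage formulation is a plain function composition; the stub signatures below are placeholders standing in for \( \Phi_{exo} \) and \( \Phi_{ego} \), not actual model interfaces.

```python
def egoworld(I_exo, phi_exo, phi_ego):
    """Two-stage pipeline: observation extraction, then reconstruction."""
    S_ego, P_ego, T_exo = phi_exo(I_exo)      # Stage 1: exocentric observations
    return phi_ego(S_ego, P_ego, T_exo)       # Stage 2: egocentric reconstruction
```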

Results

EgoWorld is evaluated across four benchmark datasets, six quantitative metrics, and real-world captures.

Comparisons on H2O Dataset

Qualitative comparisons on the H2O dataset

Table 1. On H2O unseen objects, actions, scenes, and subjects, EgoWorld outperforms prior methods across image quality, hand accuracy, and semantic alignment metrics: FID, PSNR, SSIM, LPIPS, PA-MPJPE, and CLIPScore.

Methods | Unseen Objects | Unseen Actions | Unseen Scenes | Unseen Subjects
Each setting reports six metrics in order: FID↓, PSNR↑, SSIM↑, LPIPS↓, PA-MPJPE↓, CLIPScore↑.
pix2pixHD 436.25 25.012 0.2993 0.6057 18.007 0.2302 211.10 24.420 0.2854 0.6127 17.754 0.2450 490.32 18.567 0.2425 0.7290 20.229 0.2159 452.13 18.172 0.3310 0.7234 21.357 0.2311
pixelNeRF 498.23 26.557 0.3887 0.5372 15.746 0.2270 251.76 27.061 0.3950 0.8159 14.636 0.2315 489.13 26.537 0.2574 0.7143 17.085 0.2097 493.13 22.636 0.4135 0.6838 18.131 0.2263
CFLD 59.615 25.922 0.4307 0.4539 7.9971 0.2656 50.953 28.529 0.4324 0.4593 8.1199 0.2699 118.10 29.030 0.3696 0.6841 7.8766 0.2506 129.30 21.050 0.4001 0.6269 9.5606 0.2461
EgoWorld (Ours) 41.334 31.171 0.4814 0.3476 7.3178 0.2731 33.284 31.620 0.4566 0.3780 7.2602 0.2824 90.893 31.004 0.4096 0.6519 7.4087 0.2585 96.429 24.851 0.4605 0.6188 8.1031 0.2582

On H2O, EgoWorld improves over CFLD in every unseen setting while also lifting pose accuracy and text-image alignment. The strongest gains come from combining geometric cues from sparse projection with semantic cues from language, which helps reconstruction beyond the hand region.
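As a concrete reference for one of the image-quality metrics reported above, PSNR can be computed as below. This is the textbook definition; the paper's exact evaluation protocol (pixel range, per-image vs. per-dataset averaging) may differ.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```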

Comparisons on TACO, Assembly101, and Ego-Exo4D

Comparisons on TACO, Assembly101, and Ego-Exo4D

Table 2. EgoWorld also generalizes to unseen-action settings on TACO, Assembly101, and Ego-Exo4D, consistently outperforming prior methods across image quality, pose, and semantic metrics.

Methods | TACO | Assembly101 | Ego-Exo4D
Each dataset reports six metrics in order: FID↓, PSNR↑, SSIM↑, LPIPS↓, PA-MPJPE↓, CLIPScore↑.
pix2pixHD 227.87 25.875 0.2806 0.7037 19.054 0.2309 350.97 17.107 0.3587 0.6578 21.967 0.2114 401.48 14.792 0.3065 0.6899 25.082 0.2203
pixelNeRF 302.19 26.661 0.3888 0.8543 16.137 0.2251 356.44 19.037 0.3761 0.6019 19.658 0.2070 367.39 17.347 0.3618 0.7134 23.793 0.2149
CFLD 61.357 28.769 0.4009 0.5033 7.9078 0.2715 53.931 20.998 0.3988 0.5566 11.108 0.2458 70.476 21.578 0.3614 0.5975 15.010 0.2670
EgoWorld (Ours) 37.191 30.155 0.4237 0.4025 7.3590 0.2828 50.232 25.365 0.4101 0.5142 10.561 0.2558 61.231 24.985 0.3986 0.5482 13.992 0.2862

The broader benchmark suite shows the same pattern as H2O: EgoWorld keeps improving local interaction fidelity while also recovering object appearance and scene structure. The gains remain consistent even as the datasets become more diverse and less controlled.
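For completeness, the pose metric used throughout the tables, PA-MPJPE, removes a similarity transform (scale, rotation, translation) via Procrustes alignment before averaging joint errors. The sketch below follows the standard Kabsch/Umeyama solution; the paper's exact joint set and units may differ.

```python
import numpy as np

def pa_mpjpe(pred, gt):
    """Procrustes-aligned mean per-joint position error: rigidly align
    pred (J, 3) to gt (J, 3) up to similarity, then average distances."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g            # center both joint sets
    U, s, Vt = np.linalg.svd(P.T @ G)        # cross-covariance SVD
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    D = np.diag([1.0] * (P.shape[1] - 1) + [d])
    R = Vt.T @ D @ U.T                       # optimal rotation
    scale = (s * np.diag(D)).sum() / (P ** 2).sum()
    aligned = scale * P @ R.T + mu_g
    return np.linalg.norm(aligned - gt, axis=1).mean()
```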

Real-World Generalization

Real-world generalization examples

On in-the-wild smartphone captures, EgoWorld applies the full pipeline from a single RGB image and still produces coherent egocentric views. Compared with CFLD, the outputs remain better aligned with the actual interaction and less biased toward training-set appearance patterns.

Analysis

Additional analyses cover multimodal conditioning, reconstruction backbones, pose modeling, generation stability, text controllability, and representative failure cases.


BibTeX

@inproceedings{park2026egoworld,
  author    = {Park, Junho and Ye, Andrew Sangwoo and Kwon, Taein},
  title     = {EgoWorld: Translating Exocentric View to Egocentric View using Rich Exocentric Observations},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
}