Abstract
Egocentric vision is central to understanding fine-grained hand-object interaction, yet translating an exocentric image into an egocentric view remains difficult for methods that rely only on 2D cues, require synchronized multi-view capture, or assume access to an initial egocentric frame and known relative camera poses at inference time. EgoWorld tackles this task directly from a single exocentric image.
The framework first extracts rich exocentric observations, including a projected sparse egocentric RGB map derived from point clouds, a predicted 3D egocentric hand pose, and a textual description of the scene and interaction. It then conditions a diffusion-based reconstruction model on these observations to produce dense, semantically coherent egocentric images. Across H2O, TACO, Assembly101, and Ego-Exo4D, EgoWorld achieves state-of-the-art performance and generalizes well to unseen objects, actions, scenes, subjects, and in-the-wild captures.