GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction

1Tsinghua University   2Li Auto
Work done during internship. Project leader. § Corresponding author.
GeoDiff4D Teaser

Method Overview

GeoDiff4D Architecture

Overall architecture. Our system takes a reference image, driving expressions, and head poses as input. The reference image is encoded into hierarchical identity embeddings by a pretrained VAE and a UNet-based reference network, while driving expressions are compressed into low-dimensional latents by a pose-free expression encoder. Both embeddings are injected into the diffusion model through cross-attention, and head pose maps, concatenated with the noise latents, serve as the denoising input. The model then jointly predicts portrait images and surface normals. For 3D reconstruction, a UNet refines FLAME meshes conditioned on the expression latents through cross-attention, and an MLP captures Gaussian dynamics. Finally, the generated surface normals provide additional geometric supervision that further enhances reconstruction fidelity.
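The data flow described above can be sketched at the shape level. The following NumPy toy is a minimal sketch, not the paper's implementation: all dimensions, encoders, and projection matrices are illustrative stand-ins for the pretrained VAE/reference network, the pose-free expression encoder, and the diffusion UNet, showing only how the conditioning signals are routed (pose concatenated with noise channel-wise; identity and expression injected via cross-attention; RGB and normals predicted jointly).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only -- not the paper's actual dimensions.
H, W = 8, 8        # latent spatial resolution
C_LAT = 4          # noise-latent channels
C_POSE = 3         # head-pose map channels
D_ID = 32          # identity-embedding dimension
D_EXPR = 16        # pose-free expression-latent dimension

def encode_reference(ref_img):
    """Stand-in for the pretrained VAE + UNet reference network:
    produces one identity token per latent position."""
    tokens = ref_img.reshape(-1, ref_img.shape[-1])              # (H*W, 3)
    return tokens @ rng.standard_normal((ref_img.shape[-1], D_ID))

def encode_expression(expr_params):
    """Stand-in for the pose-free expression encoder: compresses
    driving expression parameters into a low-dimensional latent."""
    return expr_params @ rng.standard_normal((expr_params.shape[-1], D_EXPR))

def cross_attend(x, ctx):
    """Single-head cross-attention: spatial features x attend to context
    tokens ctx (identity or expression), added back residually."""
    d = x.shape[-1]
    Wq = rng.standard_normal((d, d))
    Wk = rng.standard_normal((ctx.shape[-1], d))
    Wv = rng.standard_normal((ctx.shape[-1], d))
    q, k, v = x @ Wq, ctx @ Wk, ctx @ Wv
    a = q @ k.T / np.sqrt(d)
    a = np.exp(a - a.max(axis=-1, keepdims=True))                # stable softmax
    a /= a.sum(axis=-1, keepdims=True)
    return x + a @ v

def denoise_step(noise_lat, pose_map, id_tokens, expr_lat):
    """One denoising pass: the pose map is concatenated with the noise
    channel-wise, identity and expression are injected via cross-attention,
    and RGB plus surface-normal outputs are predicted jointly."""
    x = np.concatenate([noise_lat, pose_map], axis=-1)           # (H*W, C_LAT+C_POSE)
    x = cross_attend(x, id_tokens)                               # inject identity
    x = cross_attend(x, expr_lat)                                # inject expression
    rgb = (x @ rng.standard_normal((x.shape[-1], 3))).reshape(H, W, 3)
    normal = (x @ rng.standard_normal((x.shape[-1], 3))).reshape(H, W, 3)
    normal /= np.linalg.norm(normal, axis=-1, keepdims=True) + 1e-8
    return rgb, normal

# Toy inputs: a reference image, a FLAME-like expression vector, noise, and pose maps.
ref_img = rng.standard_normal((H, W, 3))
expr_params = rng.standard_normal((1, 50))
noise_lat = rng.standard_normal((H * W, C_LAT))
pose_map = rng.standard_normal((H * W, C_POSE))

rgb, normal = denoise_step(noise_lat, pose_map,
                           encode_reference(ref_img),
                           encode_expression(expr_params))
```

The point of the sketch is the routing, not the math: pose conditioning enters as extra input channels (dense, spatially aligned), while identity and expression enter as attention context (global, token-based), matching the caption's description of the two injection paths.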

Gallery

Geometry-Aware Diffusion Model

Our geometry-aware video generation model can simultaneously generate high-fidelity portrait videos and detailed surface normals, supporting reference images with large head poses and driving images with exaggerated expressions.

Reference Image
Driving Image
Generated RGB
Generated Normal

4D Head Avatar

GeoDiff4D generates vivid, high-quality 4D head avatars from a single reference image, even when the reference exhibits large head-pose variation or exaggerated facial expressions, and covers a wide range of styles, from real humans to cartoon characters.

Comparisons

Self-Reenactment

We compare our method with state-of-the-art approaches on self-reenactment tasks. Please click the arrows to view different comparison examples.
In addition to GeoDiff4D, we also present results from our video generation models (Ours VGM-RGB and Ours VGM-Normal) below.

Cross-Identity Reenactment

For cross-identity reenactment, our geometry-aware approach maintains identity consistency while accurately capturing expression dynamics.
In addition to GeoDiff4D, we also present results from our video generation models (Ours VGM-RGB and Ours VGM-Normal) below.

Novel View Synthesis

Our method can synthesize novel views of 4D head avatars with high fidelity, demonstrating strong multi-view consistency and realistic geometry.

Extensions

Pose-Free Expression Encoder

Our Pose-Free Expression Encoder demonstrates excellent view consistency: expression features extracted from different viewpoints drive the reference image to highly consistent results.

Acknowledgements

We thank the authors of the following open-source projects for making their code or datasets publicly available: GaussianAvatars, CAP4D, Pixel3DMM, VHAP, NeRSemble, RenderMe-360, and DAViD. We are also grateful to the corresponding authors and participants for their valuable contributions.

Citation

@misc{xu2026geodiff4dgeometryawarediffusion4d,
      title={GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction},
      author={Chao Xu and Xiaochen Zhao and Xiang Deng and Jingxiang Sun and Zhuo Su and Donglin Di and Yebin Liu},
      year={2026},
      eprint={2602.24161},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.24161},
}