Method Overview
Overall architecture. Our system takes a reference image, driving expressions, and head poses as input. The reference image is encoded into hierarchical identity embeddings by a pretrained VAE and a UNet-based reference network, while the driving expressions are compressed into low-dimensional latents by a pose-free expression encoder. Both embeddings are injected into the diffusion model through cross-attention, and head pose maps concatenated with the noise serve as additional inputs. The model then jointly predicts portrait images and surface normals. For 3D reconstruction, a UNet refines FLAME meshes conditioned on the expression latents via cross-attention, and an MLP captures Gaussian dynamics. Finally, the generated surface normals provide additional geometric supervision that further enhances reconstruction fidelity.
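The data flow described above can be sketched as follows. This is only an illustrative NumPy stand-in under placeholder assumptions: the dimensions, the linear "encoders", and the single-head attention are all simplifications standing in for the pretrained VAE, UNet reference network, and diffusion model; none of the names come from the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only (not the paper's actual dimensions).
N_TOKENS, D_MODEL, D_EXPR, K_ID = 8, 64, 16, 4

def encode_reference(image):
    """Stand-in for the pretrained VAE + UNet reference network:
    maps the reference image to K_ID hierarchical identity tokens."""
    W = rng.standard_normal((image.size, K_ID * D_MODEL)) * 0.01
    return (image.reshape(-1) @ W).reshape(K_ID, D_MODEL)

def encode_expression(driving):
    """Stand-in for the pose-free expression encoder:
    compresses the driving expression to a low-dimensional latent."""
    W = rng.standard_normal((driving.size, D_EXPR)) * 0.01
    return driving.reshape(-1) @ W

def cross_attention(query, context):
    """Single-head cross-attention that injects context tokens into queries."""
    scores = query @ context.T / np.sqrt(context.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ context

def denoise_step(noise, pose_map, id_tokens, expr_latent):
    """One denoising step: the head pose map is concatenated with the noise,
    identity and expression embeddings are injected via cross-attention,
    and the model jointly predicts an RGB portrait and surface normals."""
    x = np.concatenate([noise, pose_map], axis=-1)            # (N_TOKENS, D_MODEL)
    W_e = rng.standard_normal((expr_latent.size, D_MODEL)) * 0.01
    ctx = np.vstack([id_tokens, (expr_latent @ W_e)[None]])   # (K_ID + 1, D_MODEL)
    h = x + cross_attention(x, ctx)
    return h[:, :3], h[:, 3:6]                                # (rgb, normal) heads

id_tokens = encode_reference(rng.standard_normal((16, 16, 3)))
expr_latent = encode_expression(rng.standard_normal(10))
rgb, normal = denoise_step(
    rng.standard_normal((N_TOKENS, D_MODEL // 2)),   # noise
    rng.standard_normal((N_TOKENS, D_MODEL // 2)),   # head pose map
    id_tokens, expr_latent)
print(rgb.shape, normal.shape)  # (8, 3) (8, 3)
```

The key structural point the sketch preserves is that identity and expression enter only through the cross-attention context, while the head pose enters through concatenation with the noise, so pose and expression remain disentangled.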
Gallery
Geometry-Aware Diffusion Model
Our geometry-aware video generation model simultaneously generates high-fidelity portrait videos and detailed surface normals, and supports reference images with large head poses as well as driving images with exaggerated expressions.
4D Head Avatar
GeoDiff4D can generate vivid, high-quality 4D head avatars from a single reference image, even when the reference exhibits large head pose variations or exaggerated facial expressions, and covers a wide range of styles, from real humans to cartoon characters.
Comparisons
Self-Reenactment
We compare our method with state-of-the-art approaches on self-reenactment tasks. Please click the arrows to view different comparison examples.
In addition to GeoDiff4D, we also present the results of our video generation models (Ours VGM-RGB and Ours VGM-Normal) below.
Cross-Identity Reenactment
For cross-identity reenactment, our geometry-aware approach maintains identity consistency while accurately capturing expression dynamics.
In addition to GeoDiff4D, we also present the results of our video generation models (Ours VGM-RGB and Ours VGM-Normal) below.
Novel View Synthesis
Our method can synthesize novel views of 4D head avatars with high fidelity, demonstrating strong multi-view consistency and realistic geometry.
Extensions
Pose-Free Expression Encoder
Our Pose-Free Expression Encoder demonstrates strong view consistency: expression features extracted from different viewpoints drive the reference image to highly consistent results.
Acknowledgements
We thank the authors of the following open-source projects for making their code or datasets publicly available: GaussianAvatars, CAP4D, Pixel3DMM, VHAP, NeRSemble, RenderMe-360, and DAViD. We are also grateful to the corresponding authors and participants for their valuable contributions.
Citation
@misc{xu2026geodiff4dgeometryawarediffusion4d,
title={GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction},
author={Chao Xu and Xiaochen Zhao and Xiang Deng and Jingxiang Sun and Zhuo Su and Donglin Di and Yebin Liu},
year={2026},
eprint={2602.24161},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.24161},
}