
SynShot - Synthetic Prior for Few-Shot Drivable Head Avatar Inversion

Max Planck Institute for Intelligent Systems, Tübingen, Germany¹
Technical University of Darmstadt², Google³


We present SynShot, a novel method for the few-shot inversion of a drivable head avatar based on a synthetic prior. We tackle two major challenges. First, training a controllable 3D generative network requires a large number of diverse sequences, for which pairs of images and high-quality tracked meshes are not always available. Second, state-of-the-art monocular avatar models struggle to generalize to new views and expressions, lacking a strong prior and often overfitting to a specific viewpoint distribution. Inspired by machine learning models trained solely on synthetic data, we propose a method that learns a prior model from a large dataset of synthetic heads with diverse identities, expressions, and viewpoints. With few input images, SynShot fine-tunes the pretrained synthetic prior to bridge the domain gap, modeling a photorealistic head avatar that generalizes to novel expressions and viewpoints. We model the head avatar using 3D Gaussian splatting and a convolutional encoder-decoder that outputs Gaussian parameters in UV texture space. To account for the different modeling complexities over parts of the head (e.g., skin vs. hair), we embed the prior with explicit control for upsampling the number of per-part primitives. Compared to state-of-the-art monocular methods that require thousands of real training images, SynShot significantly improves novel view and expression synthesis.
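To make the few-shot inversion step concrete, below is a minimal, hypothetical sketch (the module and variable names are illustrative assumptions, not the released implementation): a pretrained prior decoder is adapted to a new subject by jointly optimizing per-subject identity and expression latents and gently fine-tuning the decoder weights against a photometric loss on the few input images.

# Hypothetical few-shot inversion loop (illustrative, not the authors' code).
# A pretrained prior decoder is adapted to a new subject from a few images by
# optimizing latent codes plus a light fine-tune of the decoder weights.
import torch
import torch.nn as nn

class PriorDecoder(nn.Module):
    """Stand-in for the pretrained synthetic prior (latents -> image)."""
    def __init__(self, z_dim=256, img_size=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * z_dim, 3 * img_size * img_size), nn.Sigmoid())
        self.img_size = img_size

    def forward(self, z_id, z_expr):
        x = self.net(torch.cat([z_id, z_expr], dim=-1))
        return x.view(-1, 3, self.img_size, self.img_size)

decoder = PriorDecoder()                    # would be loaded from a checkpoint
few_shot_images = torch.rand(3, 3, 64, 64)  # e.g. 3 posed input views

# Per-subject latents are free parameters; the decoder gets a small learning
# rate so the synthetic prior is only gently adapted to bridge the domain gap.
z_id = nn.Parameter(torch.zeros(1, 256))
z_expr = nn.Parameter(torch.zeros(3, 256))  # one expression code per view
opt = torch.optim.Adam([
    {"params": [z_id, z_expr], "lr": 1e-2},
    {"params": decoder.parameters(), "lr": 1e-4},
])

for step in range(200):
    opt.zero_grad()
    pred = decoder(z_id.expand(3, -1), z_expr)
    loss = (pred - few_shot_images).abs().mean()  # photometric L1
    loss.backward()
    opt.step()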



Given an extracted texture in a neutral pose, a rasterized position map, and PCA expression deltas, our network regresses Gaussian parameter maps that are then rendered with 3D Gaussian splatting (3DGS).
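As a rough illustration of this pipeline (channel counts, layer choices, and the plain convolutional backbone are assumptions, not the exact architecture), the regressor can be sketched as mapping the stacked UV-space inputs to per-texel Gaussian parameters:

# Illustrative sketch of a UV-space Gaussian regressor: a convolutional network
# maps (neutral texture, rasterized position map, expression deltas) to
# per-texel 3DGS parameters. Channel sizes and layers are assumptions.
import torch
import torch.nn as nn

class UVGaussianRegressor(nn.Module):
    def __init__(self, expr_dim=50, hidden=64):
        super().__init__()
        # Inputs per texel: RGB texture (3) + 3D position map (3) + expression.
        self.conv = nn.Sequential(
            nn.Conv2d(3 + 3 + expr_dim, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            # Per-texel Gaussian parameters:
            # 3 position offset + 4 rotation quat + 3 scale + 1 opacity + 3 color
            nn.Conv2d(hidden, 14, 3, padding=1),
        )

    def forward(self, texture, pos_map, expr_deltas):
        # Broadcast the PCA expression coefficients over the UV grid.
        b, _, h, w = texture.shape
        expr = expr_deltas.view(b, -1, 1, 1).expand(-1, -1, h, w)
        out = self.conv(torch.cat([texture, pos_map, expr], dim=1))
        offsets, rot, scale, opacity, color = out.split([3, 4, 3, 1, 3], dim=1)
        return {
            "xyz": pos_map + offsets,                       # Gaussian centers
            "rotation": nn.functional.normalize(rot, dim=1),
            "scale": scale.exp(),                           # keep scales positive
            "opacity": opacity.sigmoid(),
            "color": color.sigmoid(),
        }

net = UVGaussianRegressor()
maps = net(torch.rand(1, 3, 256, 256),  # neutral texture
           torch.rand(1, 3, 256, 256),  # rasterized position map
           torch.rand(1, 50))           # PCA expression deltas

In this sketch every texel of the output maps parameterizes one Gaussian, so the number of primitives is tied to the UV resolution; the per-part upsampling control mentioned in the abstract would then allocate more primitives to complex regions such as hair.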


Linearly interpolating $z_{\mathbf{q}_{id}}$ and $z_{\mathbf{q}_{expr}}$ between the leftmost and rightmost avatars demonstrates that our latent manifold exhibits smooth transitions in both expression and identity.
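A minimal sketch of that interpolation (the decoder call is a placeholder for the pretrained prior): the identity and expression codes are blended independently, so the two factors can be morphed separately or together.

# Illustrative latent interpolation between two avatars. The decoder here is a
# placeholder; in practice it would be the pretrained SynShot prior.
import torch

def lerp(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
    """Linear interpolation between two latent codes."""
    return (1.0 - t) * a + t * b

z_id_left,  z_expr_left  = torch.randn(256), torch.randn(256)
z_id_right, z_expr_right = torch.randn(256), torch.randn(256)

# Sweep t from 0 to 1 to morph smoothly from the left to the right avatar.
for t in torch.linspace(0.0, 1.0, steps=5):
    z_id   = lerp(z_id_left,   z_id_right,   t.item())
    z_expr = lerp(z_expr_left, z_expr_right, t.item())
    # frame = decode(z_id, z_expr)  # render with the (placeholder) prior decoder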
Cross-reenactment comparison of SynShot inversion
Columns, left to right: Source, Ours, INSTA, FlashAvatar, SplattingAvatar.

Cross-reenactment comparison of SynShot inversion from only 3 views against state-of-the-art (SOTA) methods INSTA, FlashAvatar, and SplattingAvatar, each trained on an average of 3000 frames. Without a strong prior, these methods fail to generalize to novel expressions and views.

Video

BibTeX


@article{zielonka2025synshot,
    title={Synthetic Prior for Few-Shot Drivable Head Avatar Inversion},
    author={Wojciech Zielonka and Stephan J. Garbin and Alexandros Lattas 
                    and George Kopanas and Paulo Gotardo and Thabo Beeler 
                    and Justus Thies and Timo Bolkart},
    journal={arXiv:2501.06903},
    year={2025},
    eprint={2501.06903},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2501.06903}, 
}
            
*Work done while Wojciech Zielonka was interning at Google Zurich, Switzerland.