
SynShot - Synthetic Prior for Few-Shot Drivable Head Avatar Inversion

Max Planck Institute for Intelligent Systems, Tübingen, Germany¹
Technical University of Darmstadt², Google³


We present SynShot, a novel method for the few-shot inversion of a drivable head avatar based on a synthetic prior. We tackle two major challenges. First, training a controllable 3D generative network requires a large number of diverse sequences, for which pairs of images and high-quality tracked meshes are not always available. Second, state-of-the-art monocular avatar models struggle to generalize to new views and expressions, lacking a strong prior and often overfitting to a specific viewpoint distribution. Inspired by machine learning models trained solely on synthetic data, we propose a method that learns a prior model from a large dataset of synthetic heads with diverse identities, expressions, and viewpoints. With few input images, SynShot fine-tunes the pretrained synthetic prior to bridge the domain gap, modeling a photorealistic head avatar that generalizes to novel expressions and viewpoints. We model the head avatar using 3D Gaussian splatting and a convolutional encoder-decoder that outputs Gaussian parameters in UV texture space. To account for the different modeling complexities over parts of the head (e.g., skin vs. hair), we embed the prior with explicit control for upsampling the number of per-part primitives. Compared to state-of-the-art monocular methods that require thousands of real training images, SynShot significantly improves novel view and expression synthesis.
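To make the few-shot inversion step concrete, below is a minimal, hypothetical sketch (the module and variable names are illustrative assumptions, not the released implementation): a pretrained prior decoder is adapted to a new subject by jointly optimizing per-subject identity and expression latents and gently fine-tuning the decoder weights against a photometric loss on the few input images.

# Hypothetical few-shot inversion loop (illustrative, not the authors' code).
# A pretrained prior decoder is adapted to a new subject from a few images by
# optimizing latent codes plus a light fine-tune of the decoder weights.
import torch
import torch.nn as nn

class PriorDecoder(nn.Module):
    """Stand-in for the pretrained synthetic prior (latents -> image)."""
    def __init__(self, z_dim=256, img_size=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * z_dim, 3 * img_size * img_size), nn.Sigmoid())
        self.img_size = img_size

    def forward(self, z_id, z_expr):
        x = self.net(torch.cat([z_id, z_expr], dim=-1))
        return x.view(-1, 3, self.img_size, self.img_size)

decoder = PriorDecoder()                    # would be loaded from a checkpoint
few_shot_images = torch.rand(3, 3, 64, 64)  # e.g. 3 posed input views

# Per-subject latents are free parameters; the decoder gets a small learning
# rate so the synthetic prior is only gently adapted to bridge the domain gap.
z_id = nn.Parameter(torch.zeros(1, 256))
z_expr = nn.Parameter(torch.zeros(3, 256))  # one expression code per view
opt = torch.optim.Adam([
    {"params": [z_id, z_expr], "lr": 1e-2},
    {"params": decoder.parameters(), "lr": 1e-4},
])

for step in range(200):
    opt.zero_grad()
    pred = decoder(z_id.expand(3, -1), z_expr)
    loss = (pred - few_shot_images).abs().mean()  # photometric L1
    loss.backward()
    opt.step()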



Given an extracted texture in a neutral pose, a rasterized position map, and PCA expression deltas, our network regresses Gaussian parameter maps that are then rendered with 3D Gaussian splatting (3DGS).
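As a rough illustration of this pipeline (channel counts, layer choices, and the plain convolutional backbone are assumptions, not the exact architecture), the regressor can be sketched as mapping the stacked UV-space inputs to per-texel Gaussian parameters:

# Illustrative sketch of a UV-space Gaussian regressor: a convolutional network
# maps (neutral texture, rasterized position map, expression deltas) to
# per-texel 3DGS parameters. Channel sizes and layers are assumptions.
import torch
import torch.nn as nn

class UVGaussianRegressor(nn.Module):
    def __init__(self, expr_dim=50, hidden=64):
        super().__init__()
        # Inputs per texel: RGB texture (3) + 3D position map (3) + expression.
        self.conv = nn.Sequential(
            nn.Conv2d(3 + 3 + expr_dim, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            # Per-texel Gaussian parameters:
            # 3 position offset + 4 rotation quat + 3 scale + 1 opacity + 3 color
            nn.Conv2d(hidden, 14, 3, padding=1),
        )

    def forward(self, texture, pos_map, expr_deltas):
        # Broadcast the PCA expression coefficients over the UV grid.
        b, _, h, w = texture.shape
        expr = expr_deltas.view(b, -1, 1, 1).expand(-1, -1, h, w)
        out = self.conv(torch.cat([texture, pos_map, expr], dim=1))
        offsets, rot, scale, opacity, color = out.split([3, 4, 3, 1, 3], dim=1)
        return {
            "xyz": pos_map + offsets,                       # Gaussian centers
            "rotation": nn.functional.normalize(rot, dim=1),
            "scale": scale.exp(),                           # keep scales positive
            "opacity": opacity.sigmoid(),
            "color": color.sigmoid(),
        }

net = UVGaussianRegressor()
maps = net(torch.rand(1, 3, 256, 256),  # neutral texture
           torch.rand(1, 3, 256, 256),  # rasterized position map
           torch.rand(1, 50))           # PCA expression deltas

In this sketch every texel of the output maps parameterizes one Gaussian, so the number of primitives is tied to the UV resolution; the per-part upsampling control mentioned in the abstract would then allocate more primitives to complex regions such as hair.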


Linearly interpolating $z_{\mathbf{q}_{id}}$ and $z_{\mathbf{q}_{expr}}$ between the leftmost and rightmost avatars demonstrates that our latent manifold exhibits smooth transitions in both expression and identity.
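A minimal sketch of that interpolation (the decoder call is a placeholder for the pretrained prior): the identity and expression codes are blended independently, so the two factors can be morphed separately or together.

# Illustrative latent interpolation between two avatars. The decoder here is a
# placeholder; in practice it would be the pretrained SynShot prior.
import torch

def lerp(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
    """Linear interpolation between two latent codes."""
    return (1.0 - t) * a + t * b

z_id_left,  z_expr_left  = torch.randn(256), torch.randn(256)
z_id_right, z_expr_right = torch.randn(256), torch.randn(256)

# Sweep t from 0 to 1 to morph smoothly from the left to the right avatar.
for t in torch.linspace(0.0, 1.0, steps=5):
    z_id   = lerp(z_id_left,   z_id_right,   t.item())
    z_expr = lerp(z_expr_left, z_expr_right, t.item())
    # frame = decode(z_id, z_expr)  # render with the (placeholder) prior decoder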
Cross-reenactment comparison of SynShot inversion
Columns, left to right: Source, Ours, INSTA, FlashAvatar, SplattingAvatar.

Cross-reenactment comparison of SynShot inversion from only 3 views against state-of-the-art (SOTA) methods INSTA, FlashAvatar, and SplattingAvatar, each trained on an average of 3000 frames. Without a strong prior, these methods fail to generalize to novel expressions and views.

Video

BibTeX


@article{zielonka2025synshot,
    title={Synthetic Prior for Few-Shot Drivable Head Avatar Inversion},
    author={Wojciech Zielonka and Stephan J. Garbin and Alexandros Lattas 
                    and George Kopanas and Paulo Gotardo and Thabo Beeler 
                    and Justus Thies and Timo Bolkart},
    journal={arXiv:2501.06903},
    year={2025},
    eprint={2501.06903},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2501.06903}, 
}
            
*Work done while Wojciech Zielonka was interning at Google Zurich, Switzerland.