We introduce a diffusion model for Gaussian Splats, SplatDiffusion, to enable generation of
three-dimensional structures from single images, addressing the ill-posed nature of lifting 2D
inputs to 3D. Existing methods rely on deterministic, feed-forward predictions, which limit
their ability to handle the inherent ambiguity of 3D inference from 2D data. Diffusion models
have recently shown promise as powerful generative models for 3D data, including Gaussian
splats; however, standard diffusion frameworks typically require the supervision target and the denoised
signal to share a modality, a requirement that is hard to meet given the scarcity of 3D data. To
overcome this, we propose a novel training strategy that decouples the denoised modality from
the supervision modality. We use a deterministic model as a noisy teacher to create the noised
signal and transition from single-step to multi-step denoising supervised by an image
rendering loss; the resulting model significantly outperforms the deterministic
teacher. Additionally, our method is flexible, as it can learn from various 3D Gaussian Splat
(3DGS) teachers with minimal adaptation; we demonstrate this by training with two different
deterministic teachers and surpassing the performance of each, highlighting the potential generalizability of
our framework. Our approach further incorporates a guidance mechanism to aggregate information
from multiple views, enhancing reconstruction quality when more than one view is available.
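
The decoupling of denoised modality and supervision modality can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the SplatDiffusion implementation: the 1D `render` function stands in for a 3DGS rasterizer, `teacher_splats` stands in for a deterministic teacher's prediction, `denoiser` is an identity placeholder for the learned network, and all names and values are hypothetical. The point it shows is that noise is added in splat space to the teacher's output, while the loss is computed in image space against a rendered target.

```python
import numpy as np

def render(splats, grid):
    # Toy 1D "renderer" (assumption): each row of `splats` is
    # (center, log_scale, opacity); the image is a sum of Gaussians.
    centers = splats[:, 0:1]
    scales = np.exp(splats[:, 1:2])
    opacities = splats[:, 2:3]
    return (opacities * np.exp(-0.5 * ((grid[None, :] - centers) / scales) ** 2)).sum(axis=0)

def add_noise(x0, alpha_bar, rng):
    # Standard forward diffusion step: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps.
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

def rendering_loss(pred_splats, target_image, grid):
    # Supervision happens in image space, not in splat space.
    return float(np.mean((render(pred_splats, grid) - target_image) ** 2))

rng = np.random.default_rng(0)
grid = np.linspace(-1.0, 1.0, 64)

# Hypothetical scene, used here only to produce the supervising image.
true_splats = np.array([[-0.3, -2.0, 0.8], [0.4, -1.5, 0.5]])
target_image = render(true_splats, grid)

# Stand-in for a deterministic teacher: a biased estimate of the splats.
teacher_splats = true_splats + np.array([[0.1, 0.0, -0.1], [-0.05, 0.1, 0.0]])

# The noised signal is built from the teacher's output, not from 3D ground truth.
noisy_splats = add_noise(teacher_splats, alpha_bar=0.9, rng=rng)

def denoiser(x_t):
    # Placeholder for the learned network (assumption): identity mapping.
    return x_t

loss = rendering_loss(denoiser(noisy_splats), target_image, grid)
teacher_loss = rendering_loss(teacher_splats, target_image, grid)
print(f"denoiser loss: {loss:.4f}, teacher loss: {teacher_loss:.4f}")
```

In an actual training loop the identity placeholder would be replaced by a network whose parameters are updated through a differentiable renderer; the sketch only fixes where the noise and the loss live.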