SnapFusion

SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

Abstract
Text-to-image diffusion models can create stunning images from natural language descriptions that
rival the work of professional artists and photographers. However, these models are large, with
complex network architectures and tens of denoising iterations, making them computationally
expensive and slow to run. As a result, high-end GPUs and cloud-based inference are required to run
diffusion models at scale. This is costly and has privacy implications, especially when user data is
sent to a third party. To overcome these challenges, we present a generic approach that, for the
first time, unlocks running text-to-image diffusion models on mobile devices in less than 2
seconds.
We achieve this by introducing an efficient network architecture and improving step distillation.
Specifically, we propose an efficient UNet by identifying the redundancy of the original
model and reducing the computation of the image decoder via data distillation.
Further, we enhance the step distillation by exploring training strategies and introducing
regularization from classifier-free guidance. Our extensive experiments on MS-COCO show that
our model with 8 denoising steps achieves better FID and CLIP scores than Stable Diffusion
v1.5 with 50 steps. Our work democratizes content creation by bringing powerful text-to-image
diffusion models to the hands of users.
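To illustrate the idea of regularizing step distillation with classifier-free guidance, the following is a minimal PyTorch sketch, not the paper's exact objective: it assumes hypothetical `teacher` and `student` denoisers, a guidance scale `w`, and text/empty-prompt embeddings, and trains the student to match the teacher's guided noise prediction in a single forward pass.

```python
import torch
import torch.nn.functional as F

def cfg_distillation_loss(teacher, student, x_t, t, text_emb, null_emb, w=7.5):
    """Sketch of classifier-free-guidance-aware step distillation.

    teacher/student: UNet-style denoisers mapping (latents, timestep, cond) -> predicted noise.
    x_t: noisy latents at timestep t; text_emb / null_emb: text and empty-prompt embeddings.
    w: guidance scale (assumed hyperparameter, not from the source).
    """
    with torch.no_grad():
        eps_cond = teacher(x_t, t, text_emb)     # conditional teacher prediction
        eps_uncond = teacher(x_t, t, null_emb)   # unconditional teacher prediction
        # Classifier-free guidance: push the prediction toward the text condition.
        eps_teacher = eps_uncond + w * (eps_cond - eps_uncond)

    # The few-step student mimics the guided teacher with one forward pass,
    # so the guidance strength is baked into the distilled model.
    eps_student = student(x_t, t, text_emb)
    return F.mse_loss(eps_student, eps_teacher)
```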
On-Device Demo
Comparison with Stable Diffusion v1.5 on MS-COCO 2014 validation set (30K samples)
More Example Generated Images