Text-To-Image Generation via Masked Generative Transformers

2023-01-15 04:51:31

A birthday cake with the word “Muse” written on it.

A fire where the flames spell “Muse”.

Muse: Text-To-Image Generation via Masked Generative Transformers

Huiwen Chang*, Han Zhang*, Jarred Barber, AJ Maschinot, José Lezama, Lu Jiang,
Ming-Hsuan Yang, Kevin Murphy, William T. Freeman, Michael Rubinstein, Yuanzhen Li, Dilip Krishnan

*Equal contribution. Core contribution.

Google Research

We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to the use of discrete tokens and requiring fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient due to the use of parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, cardinality, etc. Our 900M parameter model achieves a new SOTA on CC3M, with an FID score of 6.06. The Muse 3B parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32. Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing.
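The abstract names parallel decoding over discrete image tokens but does not describe the procedure. The following is a minimal sketch of what such a loop could look like, assuming confidence-based unmasking with a cosine schedule; that schedule is an assumption, not something stated on this page. The names `MASK`, `toy_predict`, and `parallel_decode`, and all parameter values, are hypothetical, and `toy_predict` is a random stand-in for the Transformer, which in a real system would condition on the text embedding and the already-decoded tokens.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a still-masked token position


def toy_predict(tokens, vocab_size, rng):
    """Random stand-in for the model (assumption, not the real network).

    Returns one (token_id, confidence) guess per position.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def parallel_decode(num_tokens, vocab_size, steps, seed=0):
    """Fill all masked positions in `steps` passes instead of one per token."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    for step in range(1, steps + 1):
        # One forward pass predicts every position at once.
        preds = toy_predict(tokens, vocab_size, rng)
        # Cosine schedule (assumed): how many positions stay masked afterwards.
        keep_masked = int(num_tokens * math.cos(math.pi / 2 * step / steps))
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Commit the most confident predictions first.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        num_to_fill = max(0, len(masked) - keep_masked)
        for i in masked[:num_to_fill]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    print(parallel_decode(num_tokens=16, vocab_size=8, steps=4))
```

In this sketch, 16 tokens are produced in 4 passes, whereas a token-by-token autoregressive decoder would need 16 sequential steps; that difference in the number of model calls is the kind of efficiency gain the abstract attributes to parallel decoding.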

We thank William Chan, Chitwan Saharia, and Mohammad Norouzi for providing us training datasets, various evaluation codes, website templates and generous suggestions. Jay Yagnik, Rahul Sukthankar, Tom Duerig and David Salesin provided enthusiastic support of this project, for which we are grateful. We thank Victor Gomes and Erica Moreira for infrastructure support, Jing Yu Koh and Jason Baldridge for dataset, model and evaluation discussions and feedback on the paper, Mike Krainin for model speedup discussions, JD Velasquez for discussions and insights, Sarah Laszlo, Kathy Meier-Hellstern, and Rachel Stigler for assisting us with the publication process, Andrew Bunner, Jordi Pont-Tuset, and Shai Noy for help on internal demos, and David Fleet, Saurabh Saxena, Jiahui Yu, and Jason Baldridge for sharing Imagen and Parti speed metrics.