High-Resolution Video Synthesis with Latent Diffusion Models


2023-04-19 00:22:50

We present Video Latent Diffusion Models (Video LDMs) for computationally efficient high-resolution video generation. To alleviate the intensive compute and memory demands
of high-resolution video synthesis, we leverage the LDM paradigm and extend it to video generation. Our Video LDMs map videos into a compressed latent space and
model sequences of latent variables corresponding to the video frames (see animation above). We initialize the models from image LDMs and insert temporal layers into the
LDMs' denoising neural networks to temporally model encoded video frame sequences. The temporal layers are based on temporal attention as well as
3D convolutions. We also fine-tune the model's decoder for video generation (see figure below).


Latent diffusion model framework and video fine-tuning of the decoder. Top: During temporal decoder fine-tuning, we process video sequences with a frozen per-frame encoder and enforce temporally coherent reconstructions across frames. We additionally employ a video-aware discriminator. Bottom: In LDMs, a diffusion model is trained in latent space. It synthesizes latents, which are then transformed by the decoder into images. Note that in practice we model entire videos and video fine-tune the latent diffusion model to generate temporally consistent frame sequences.
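The key architectural idea above, turning a pretrained image denoiser into a video denoiser by interleaving new trainable temporal layers with frozen spatial layers, can be sketched in a few lines. This is a minimal structural illustration, not the authors' code; all class and attribute names are hypothetical, and no real tensor computation is performed.

```python
# Sketch: interleave trainable temporal blocks (temporal attention + 3D conv)
# with the frozen spatial blocks of a pretrained image LDM denoiser.
# Hypothetical names; structure only, no actual network math.

class Layer:
    def __init__(self, name, kind, trainable):
        self.name = name          # layer identifier
        self.kind = kind          # "spatial" (pretrained) or "temporal" (new)
        self.trainable = trainable

def build_video_denoiser(num_spatial_blocks=4):
    """Insert one temporal block after each pretrained spatial block;
    only the newly inserted temporal blocks are optimized."""
    layers = []
    for i in range(num_spatial_blocks):
        # pretrained image-LDM block, kept frozen during video training
        layers.append(Layer(f"spatial_{i}", "spatial", trainable=False))
        # newly inserted temporal mixing block
        layers.append(Layer(f"temporal_{i}", "temporal", trainable=True))
    return layers

layers = build_video_denoiser()
trainable = [l.name for l in layers if l.trainable]
```

Freezing the spatial blocks is what later makes the learned temporal layers portable across image backbones, as the post notes below.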

Our Video LDM initially generates sparse key frames at low frame rates, which are then temporally upsampled twice by another interpolation latent diffusion model.
Moreover, optionally training Video LDMs for video prediction by conditioning on starting frames allows us to generate long videos in an autoregressive manner.
To achieve high-resolution generation, we further leverage spatial diffusion model upsamplers and temporally align them for video upsampling.
The entire generation stack is shown below.


Video LDM stack. We first generate sparse key frames. Then we temporally interpolate in two steps with the same interpolation model to achieve high frame rates. These operations use latent diffusion models (LDMs) that share the same image backbone. Finally, the latent video is decoded to pixel space, and optionally a video upsampler diffusion model is applied.
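The stack above can be sketched as frame-count bookkeeping: sparse key frames, two passes of the same interpolation model, latent decoding, and an optional upsampler. This is a hypothetical illustration assuming each interpolation pass inserts three new frames between neighbors; frames are just string labels, not real model outputs.

```python
# Sketch of the generation stack: key frames -> two interpolation passes
# (same model) -> decode to pixel space -> optional video upsampler.
# Frame-count arithmetic only; all names are illustrative.

def interpolate(frames, inserted=3):
    """One temporal upsampling pass: insert `inserted` frames between
    each pair of neighboring frames."""
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)
        out.extend(f"interp({a},{b})#{k}" for k in range(inserted))
    out.append(frames[-1])
    return out

def generate_video(num_keyframes=4):
    frames = [f"key_{i}" for i in range(num_keyframes)]  # sparse key frames
    frames = interpolate(frames)   # first temporal upsampling pass
    frames = interpolate(frames)   # second pass, same interpolation model
    decoded = [f"decode({f})" for f in frames]           # latent -> pixel
    return [f"upsample({f})" for f in decoded]           # optional upscaler

video = generate_video()
```

With 4 key frames and three inserted frames per pair, the two passes yield 13 and then 49 frames, showing how two applications of one interpolation model reach high frame rates.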

Applications.
We validate our approach on two relevant but distinct applications: generation of in-the-wild driving scene videos and creative content creation with text-to-video modeling. For
driving video synthesis, our Video LDM enables generation of temporally coherent videos that are multiple minutes long at resolution 512 x 1024, achieving state-of-the-art performance. For text-to-video, we demonstrate synthesis
of short videos of several seconds in length with resolution up to 1280 x 2048, leveraging Stable Diffusion as the backbone image LDM as well as the Stable Diffusion upscaler. We also explore the convolutional-in-time application of our models as an alternative approach to extend the length of videos. Our main keyframe models only train the newly inserted temporal layers,
but do not touch the layers of the backbone image LDM. Because of that, the learned temporal layers can be transferred to other image LDM backbones, for instance to ones that
have been fine-tuned with DreamBooth. Leveraging this property, we additionally show initial results for personalized text-to-video generation.
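The transfer property described above follows from training only the temporal layers: since the spatial backbone is untouched, its weights can be swapped for another image LDM (e.g. a DreamBooth-personalized one) while the video-trained temporal layers are reused. A minimal sketch, with weights as plain dicts and all key names hypothetical:

```python
# Sketch: reuse video-trained temporal layers with a different image-LDM
# backbone. Real models would operate on state dicts; this is structural only.

def split_weights(video_model):
    """Separate temporal-layer weights from spatial (image backbone) ones."""
    temporal = {k: v for k, v in video_model.items() if k.startswith("temporal")}
    spatial = {k: v for k, v in video_model.items() if k.startswith("spatial")}
    return spatial, temporal

def transfer(temporal_weights, new_backbone):
    """Combine trained temporal layers with a new spatial backbone,
    e.g. a DreamBooth-personalized image LDM."""
    merged = dict(new_backbone)      # start from the new spatial weights
    merged.update(temporal_weights)  # reuse the video-trained temporal layers
    return merged

video_model = {"spatial_0": "sd", "temporal_0": "t0",
               "spatial_1": "sd", "temporal_1": "t1"}
dreambooth_backbone = {"spatial_0": "db", "spatial_1": "db"}

_, temporal = split_weights(video_model)
personalized_video_model = transfer(temporal, dreambooth_backbone)
```

This is the mechanism behind the personalized text-to-video results: no retraining of the temporal layers is needed for the new backbone.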


