The geometry of diffusion steerage – Sander Dieleman
Steerage is a strong methodology that can be utilized to boost diffusion mannequin sampling. As I’ve mentioned in an earlier blog post, it’s virtually like a cheat code: it could possibly enhance pattern high quality a lot that it’s as if the mannequin had ten instances the variety of parameters – an order of magnitude enchancment, principally at no cost! This followup publish gives a geometrical interpretation and visualisation of the diffusion sampling process, which I’ve discovered significantly helpful to elucidate how steerage works.
Sampling algorithms for diffusion fashions sometimes begin by initialising a canvas with random noise, after which repeatedly updating this canvas primarily based on mannequin predictions, till a pattern from the mannequin distribution ultimately emerges.
We’ll characterize this canvas by a vector (mathbf{x}_t), the place (t) represents the present time step within the sampling process. By conference, the diffusion course of which step by step corrupts inputs into random noise strikes ahead in time from (t=0) to (t=T), so the sampling process goes backward in time, from (t=T) to (t=0). Due to this fact (mathbf{x}_T) corresponds to random noise, and (mathbf{x}_0) corresponds to a pattern from the info distribution.
(mathbf{x}_t) is a highdimensional vector: for instance, if a diffusion mannequin produces photos of dimension 64×64, there are 12,288 completely different scalar depth values (3 color channels per pixel). The sampling process then traces a path by way of a 12,288dimensional Euclidean house.
It’s fairly troublesome for the human mind to understand what that really appears to be like like in apply. As a result of our instinct is firmly rooted in our 3D environment, it truly tends to fail us in shocking methods in highdimensional areas. Some time again, I wrote a blog post about among the implications for highdimensional likelihood distributions particularly. This note about why highdimensional spheres are “spikey” can also be price a learn, when you shortly need to get a really feel for a way bizarre issues can get. A extra thorough remedy of highdimensional geometry could be present in chapter 2 of ‘Foundations of Information Science’^{ by Blum, Hopcroft and Kannan, which is available to download in PDF format.}
However, on this weblog publish, I’ll use diagrams that characterize (mathbf{x}_t) in two dimensions, as a result of sadly that’s all of the spatial dimensions obtainable in your display. That is harmful: following our instinct in 2D may lead us to the fallacious conclusions. However I’m going to do it anyway, as a result of despite this, I’ve discovered these diagrams fairly useful to elucidate how manipulations equivalent to steerage have an effect on diffusion sampling in apply.
Right here’s some recommendation from Geoff Hinton on coping with highdimensional areas which will or might not assist:
I am laughing so arduous at this slide a buddy despatched me from one among Geoff Hinton’s programs;
“To take care of hyperplanes in a 14dimensional house, visualize a 3D house and say ‘fourteen’ to your self very loudly. Everybody does it.” pic.twitter.com/nTakZArbsD
— Robbie Barrat (@videodrome) June 10, 2018
… anyway, you’ve been warned!
Visualising diffusion sampling
To begin off, let’s visualise what a step of diffusion sampling sometimes appears to be like like. I’ll use an actual {photograph} to which I’ve added various quantities of noise to face in for intermediate samples within the diffusion sampling course of:
Throughout diffusion mannequin coaching, examples of noisy photos are produced by taking examples of fresh photos from the info distribution, and including various quantities of noise to them. That is what I’ve achieved above. Throughout sampling, we begin from a canvas that’s pure noise, after which the mannequin step by step removes random noise and replaces it with significant construction in accordance with the info distribution. Word that I can be utilizing this set of photos to characterize intermediate samples from a mannequin, regardless that that’s not how they have been constructed. If the mannequin is nice sufficient, you shouldn’t have the ability to inform the distinction anyway!
Within the diagram under, we have now an intermediate noisy pattern (mathbf{x}_t), someplace in the midst of the sampling course of, in addition to the ultimate output of that course of (mathbf{x}_0), which is noisefree:
Think about the 2 spatial dimensions of your display representing simply two of many hundreds of pixel color intensities (crimson, inexperienced or blue). Totally different spatial positions within the diagram correspond to completely different photos. A single step within the sampling process is taken through the use of the mannequin to predict the place the ultimate pattern will find yourself. We’ll name this prediction (hat{mathbf{x}}_0):
Word how this prediction is roughly within the course of (mathbf{x}_0), however we’re not capable of predict (mathbf{x}_0) precisely from the present level within the sampling course of, (mathbf{x}_t), as a result of the noise obscures lots of info (particularly finegrained particulars), which we aren’t capable of fill in multi functional go. Certainly, if we have been, there can be no level to this iterative sampling process: we might simply go instantly from pure noise (mathbf{x}_T) to a clear picture (mathbf{x}_0) in a single step. (As an apart, this is kind of what Consistency Fashions^{ attempt to obtain.)}
Diffusion fashions estimate the expectation of (mathbf{x}_0), given the present noisy enter (mathbf{x}_t): (hat{mathbf{x}}_0 = mathbb{E}[mathbf{x}_0 mid mathbf{x}_t]). On the highest noise ranges, this expectation principally corresponds to the imply of all the dataset, as a result of very noisy inputs will not be very informative. In consequence, the prediction (hat{mathbf{x}}_0) will seem like a really blurry picture when visualised. At decrease noise ranges, this prediction will turn into sharper and sharper, and it’ll ultimately resemble a pattern from the info distribution. In a previous blog post, I am going into a bit of bit extra element about why diffusion fashions find yourself estimating expectations.
In apply, diffusion fashions are sometimes parameterised to foretell noise, somewhat than clear enter, which I additionally mentioned in the same blog post. Some fashions additionally predict timedependent linear mixtures of the 2. Lengthy story brief, all of those parameterisations are equal as soon as the mannequin has been skilled, as a result of a prediction of one among these portions could be became a prediction of one other by way of a linear mixture of the prediction itself and the noisy enter (mathbf{x}_t). That’s why we are able to all the time get a prediction (hat{mathbf{x}}_0) out of any diffusion mannequin, no matter the way it was parameterised or skilled: for instance, if the mannequin predicts the noise, merely take the noisy enter and subtract the anticipated noise.
Diffusion mannequin predictions additionally correspond to an estimate of the socalled rating perform, (nabla_{mathbf{x}_t} log p(mathbf{x}_t)). This may be interpreted because the course in enter house alongside which the loglikelihood of the enter will increase maximally. In different phrases, it’s the reply to the query: “how ought to I alter the enter to make it extra probably?”
y s
Diffusion sampling now proceeds by taking a small step within the course of this prediction:
This could look acquainted to any machine studying practitioner, because it’s similar to neural community coaching through gradient descent: backpropagation provides us the course of steepest descent on the present level in parameter house, and at every optimisation step, we take a small step in that course. Taking a really giant step wouldn’t get us wherever attentiongrabbing, as a result of the estimated course is barely legitimate regionally. The identical is true for diffusion sampling, besides we’re now working within the enter house, somewhat than within the house of mannequin parameters.
What occurs subsequent is dependent upon the particular sampling algorithm we’ve chosen to make use of. There are numerous to select from: DDPM^{ (additionally referred to as ancestral sampling), DDIM, DPM++ and ODEbased sampling (with many subvariants utilizing completely different ODE solvers) are just some examples. A few of these algorithms are deterministic, which suggests the one supply of randomness within the sampling process is the preliminary noise on the canvas. Others are stochastic, which signifies that additional noise is injected at every step of the sampling process.}
We’ll use DDPM for example, as a result of it is among the oldest and mostly used sampling algorithms for diffusion fashions. This can be a stochastic algorithm, so some random noise is added after taking a step within the course of the mannequin prediction:
Word that I’m deliberately glossing over among the particulars of the sampling algorithm right here (for instance, the precise variance of the noise (varepsilon) that’s added at every step). The diagrams are schematic and the main target is on constructing instinct, so I feel I can get away with that, however clearly it’s fairly essential to get this proper while you truly need to implement this algorithm.
For deterministic sampling algorithms, we are able to merely skip this step (i.e. set (varepsilon = 0)). After this, we find yourself in (mathbf{x}_{t1}), which is the following iterate within the sampling process, and will correspond to a barely much less noisy pattern. To proceed, we rinse and repeat. We are able to once more make a prediction (hat{mathbf{x}}_0):
As a result of we’re in a distinct level in enter house, this prediction may even be completely different. Concretely, because the enter to the mannequin is now barely much less noisy, the prediction can be barely much less blurry. We now take a small step within the course of this new prediction, and add noise to finish up in (mathbf{x}_{t2}):
We are able to preserve doing this till we ultimately attain (mathbf{x}_0), and we may have drawn a pattern from the diffusion mannequin. To summarise, under is an animated model of the above set of diagrams, displaying the sequence of steps:
Classifier steerage
Classifier steerage^{ gives a solution to steer diffusion sampling within the course that maximises the likelihood of the ultimate pattern being categorized as a selected class. Extra broadly, this can be utilized to make the pattern mirror any kind of conditioning sign that wasn’t supplied to the diffusion mannequin throughout coaching. In different phrases, it permits posthoc conditioning.}
For classifier steerage, we want an auxiliary mannequin that predicts (p(y mid mathbf{x})), the place (y) represents an arbitrary enter function, which could possibly be a category label, a textual description of the enter, or perhaps a extra structured object like a segmentation map or a depth map. We’ll name this mannequin a classifier, however remember that we are able to use many alternative sorts of fashions for this objective, not simply classifiers within the slender sense of the phrase. What’s good about this setup, is that such fashions are often smaller and simpler to coach than diffusion fashions.
One essential caveat is that we’ll be making use of this auxiliary mannequin to noisy inputs (mathbf{x}_t), at various ranges of noise, so it must be sturdy in opposition to this explicit sort of enter distortion. This appears to preclude using offtheshelf classifiers, and implies that we have to practice a customized noiserobust classifier, or on the very least, finetune an offtheshelf classifier to be noiserobust. We are able to additionally explicitly situation the classifier on the time step (t), so the extent of noise doesn’t need to be inferred from the enter (mathbf{x}_t) alone.
Nevertheless, it seems that we are able to assemble an inexpensive noiserobust classifier by combining an offtheshelf classifier (which expects noisefree inputs) with our diffusion mannequin. Slightly than making use of the classifier to (mathbf{x}_t), we first predict (hat{mathbf{x}}_0) with the diffusion mannequin, and use that as enter to the classifier as a substitute. (hat{mathbf{x}}_0) remains to be distorted, however by blurring somewhat than by Gaussian noise. Offtheshelf classifiers are usually way more sturdy to this sort of distortion out of the field. Bansal et al.^{ named this trick “ahead common steerage”, although it has been recognized for a while. In addition they counsel some extra superior approaches for posthoc steerage.}
Utilizing the classifier, we are able to now decide the course in enter house that maximises the loglikelihood of the conditioning sign, just by computing the gradient with respect to (mathbf{x}_t): (nabla_{mathbf{x}_t} log p(y mid mathbf{x}_t)). (Word: if we used the above trick to assemble a noiserobust classifier from an offtheshelf one, this implies we’ll have to backpropagate by way of the diffusion mannequin as properly.)
To use classifier steerage, we mix the instructions obtained from the diffusion mannequin and from the classifier by including them collectively, after which we take a step on this mixed course as a substitute:
In consequence, the sampling process will hint a distinct trajectory by way of the enter house. To regulate the affect of the conditioning sign on the sampling process, we are able to scale the contribution of the classifier gradient by an element (gamma), which is known as the steerage scale:
The mixed replace course will then be influenced extra strongly by the course obtained from the classifier (supplied that (gamma > 1), which is often the case):
This scale issue (gamma) is a crucial sampling hyperparameter: if it’s too low, the impact is negligible. If it’s too excessive, the samples can be distorted and lowquality. It’s because gradients obtained from classifiers don’t essentially level in instructions that lie on the picture manifold – if we’re not cautious, we may very well find yourself in adversarial examples, which maximise the likelihood of the category label however don’t truly seem like an instance of the category in any respect!
In my previous blog post on diffusion guidance, I made the connection between these operations on vectors within the enter house, and the underlying manipulations of distributions they correspond to. It’s price briefly revisiting this connection to make it extra obvious:

We’ve taken the replace course obtained from the diffusion mannequin, which corresponds to (nabla_{mathbf{x}_t} log p_t(mathbf{x}_t)) (i.e. the rating perform), and the (scaled) replace course obtained from the classifier, (gamma cdot nabla_{mathbf{x}_t} log p(y mid mathbf{x}_t)), and mixed them just by including them collectively: (nabla_{mathbf{x}_t} log p_t(mathbf{x}_t) + gamma cdot nabla_{mathbf{x}_t} log p(y mid mathbf{x}_t)).

This expression corresponds to the gradient of the logarithm of (p_t(mathbf{x}_t) cdot p(y mid mathbf{x}_t)^gamma).

In different phrases, we have now successfully reweighted the mannequin distribution, altering the likelihood of every enter in accordance with the likelihood the classifier assigns to the specified class label.

The steerage scale (gamma) corresponds to the temperature of the classifier distribution. A excessive temperature implies that inputs to which the classifier assigns excessive chances are upweighted extra aggressively, relative to different inputs.

The result’s a brand new mannequin that’s more likely to supply samples that align with the specified class label.
An animated diagram of a single step of sampling with classifier steerage is proven under:
Classifierfree steerage
Classifierfree steerage^{ is a variant of steerage that doesn’t require an auxiliary classifier mannequin. As a substitute, a Bayesian classifier is constructed by combining a conditional and an unconditional generative mannequin.}
Concretely, when coaching a conditional generative mannequin (p(mathbf{x}mid y)), we are able to drop out the conditioning (y) some proportion of the time (often 1020%) in order that the identical mannequin may also act as an unconditional generative mannequin, (p(mathbf{x})). It seems that this doesn’t have a detrimental impact on conditional modelling efficiency. Utilizing Bayes’ rule, we discover that (p(y mid mathbf{x}) propto frac{p(mathbf{x}mid y)}{p(mathbf{x})}), which supplies us a solution to flip our generative mannequin right into a classifier.
In diffusion fashions, we have a tendency to precise this when it comes to rating features, somewhat than when it comes to likelihood distributions. Taking the logarithm after which the gradient w.r.t. (mathbf{x}), we get (nabla_mathbf{x} log p(y mid mathbf{x}) = nabla_mathbf{x} log p(mathbf{x} mid y) – nabla_mathbf{x} log p(mathbf{x})). In different phrases, to acquire the gradient of the classifier loglikelihood with respect to the enter, all we have now to do is subtract the unconditional rating perform from the conditional rating perform.
Substituting this expression into the system for the replace course of classifier steerage, we receive the next:
[nabla_{mathbf{x}_t} log p_t(mathbf{x}_t) + gamma cdot nabla_{mathbf{x}_t} log p(y mid mathbf{x}_t)] [= nabla_{mathbf{x}_t} log p_t(mathbf{x}_t) + gamma cdot left( nabla_{mathbf{x}_t} log p(mathbf{x}_t mid y) – nabla_{mathbf{x}_t} log p(mathbf{x}_t) right)] [= (1 – gamma) cdot nabla_{mathbf{x}_t} log p_t(mathbf{x}_t) + gamma cdot nabla_{mathbf{x}_t} log p(mathbf{x}_t mid y) .]The replace course is now a linear mixture of the unconditional and conditional rating features. It might be a convex mixture if it have been the case that (gamma in [0, 1]), however in apply (gamma > 1) tends to be have been the magic occurs, so that is merely a barycentric mixture. Word that (gamma = 0) reduces to the unconditional case, and (gamma = 1) reduces to the conditional (unguided) case.
How can we make sense of this geometrically? With our hybrid conditional/unconditional mannequin, we are able to make two predictions (hat{mathbf{x}}_0). These can be completely different, as a result of the conditioning info might permit us to make a extra correct prediction:
Subsequent, we decide the distinction vector between these two predictions. As we confirmed earlier, this corresponds to the gradient course supplied by the implied Bayesian classifier:
We now scale this vector by (gamma):
Ranging from the unconditional prediction for (hat{mathbf{x}}_0), this vector factors in the direction of a brand new implicit prediction, which corresponds to a stronger affect of the conditioning sign. That is the prediction we’ll now take a small step in the direction of:
Classifierfree steerage tends to work lots higher than classifier steerage, as a result of the Bayesian classifier is way more sturdy than a individually skilled one, and the ensuing replace instructions are a lot much less more likely to be adversarial. On high of that, it doesn’t require an auxiliary mannequin, and generative fashions could be made suitable with classifierfree steerage merely by way of conditioning dropout throughout coaching. On the flip aspect, meaning we are able to’t use this for posthoc conditioning – all conditioning alerts need to be obtainable throughout coaching of the generative mannequin itself. My previous blog post on guidance covers the variations in additional element.
An animated diagram of a single step of sampling with classifierfree steerage is proven under:
Closing ideas
What’s shocking about steerage, in my view, is how highly effective it’s in apply, regardless of its relative simplicity. The modifications to the sampling process required to use steerage are all linear operations on vectors within the enter house. That is what makes it doable to interpret the process geometrically.
How can a set of linear operations have an effect on the end result of the sampling process so profoundly? The hot button is iterative refinement: these easy modifications are utilized repeatedly, and crucially, they’re interleaved with a really nonlinear operation, which is the applying of the diffusion mannequin itself, to foretell the following replace course. In consequence, any linear modification of the replace course has a nonlinear impact on the following replace course. Throughout many sampling steps, the ensuing impact is extremely nonlinear and highly effective: small variations in every step accumulate, and lead to trajectories with very completely different endpoints.
I hope the visualisations on this publish are a helpful complement to my previous writing on the topic of guidance. Be happy to let me know your ideas within the feedback, or on Twitter/X (@sedielem) or Threads (@sanderdieleman).
If you need to quote this publish in a tutorial context, you should use this BibTeX snippet:
@misc{dieleman2023geometry,
writer = {Dieleman, Sander},
title = {The geometry of diffusion steerage},
url = {https://sander.ai/2023/08/28/geometry.html},
yr = {2023}
}
Acknowledgements
Due to Bundle for modelling and to kipply for permission to make use of this photograph. Due to my colleagues at Google DeepMind for numerous discussions, which proceed to form my ideas on this subject!