Explaining the SDXL latent space

November 20, 2023
TL;DR
or check out the interactive demonstration
A brief background story
The 4 channels of the SDXL latents
The 8-bit pixel space has 3 channels
The SDXL latent representation of an image has 4 channels
Direct conversion of SDXL latents to RGB with a linear approximation
A probable reason why the SDXL color range is biased towards yellow
What needs correcting?
Let’s take an example output from SDXL
A complete demonstration
Increasing color range / removing color bias
Long prompts at high guidance scales becoming possible
A brief background story
Special thanks to: Ollin Boer Bohan, Haoming, Cristina Segalin and Birchlabs for helping with data, dialogue and knowledge!
I was creating correction filters for the SDXL inference process for a UI I am building for diffusion models.
After many years of experience with image correction, I wanted the fundamental capability to improve the actual output from SDXL.
There were many techniques which I wanted available in the UX, which I set out to fix myself.
I noticed that SDXL output is almost always either noisy in regular patterns or overly smooth.
The color space always needed white balancing, with a biased and limited color range, simply because of how SD models work.
Making corrections in a post process, after the image is generated and converted to 8-bit RGB, made very little sense if it was possible to improve the information and color range before the actual output.
The most important thing in order to create filters and correction tools is to understand the data you are working with.
This led me to an experimental exploration of the SDXL latents with the intention of understanding them.
The tensor which diffusion models based on the SDXL architecture work with looks like this:
[batch_size, 4 channels, height (y), width (x)]
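As a quick standalone illustration (a dummy tensor, not part of any pipeline), the latents for a single 1024×1024 generation can be mocked like this:

```python
import torch

# Dummy stand-in for SDXL latents: batch of 1, 4 channels, 128x128 spatial grid
latents = torch.randn(1, 4, 128, 128)
print(latents.shape)  # torch.Size([1, 4, 128, 128])
```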
My first question was simply "What exactly are these 4 channels?".
To which most answers I received were along the lines of "It's not something that a human can understand."
But it is most definitely understandable. It is even quite easy to understand and useful to know.
The 4 channels of the SDXL latents
For a 1024×1024px image generated by SDXL, the latents tensor is 128×128px, where every pixel in the latent space represents 64 (8×8) pixels in the pixel space. If we generate and decode the latents into a standard 8-bit jpg image, then…
The 8-bit pixel space has 3 channels
Red (R), Green (G) and Blue (B), each with 256 possible values ranging between 0-255.
So, to store the full information of 64 pixels, we need to be able to store 64×256 = 16,384 values, per channel, in every latent pixel.
The SDXL latent representation of an image has 4 channels
Click the heading for an interactive demo!
0: Luminance
1: Cyan/Red => equivalent to rgb(0, 255, 255)/rgb(255, 0, 0)
2: Lime/Medium Purple => equivalent to rgb(127, 255, 0)/rgb(127, 0, 255)
3: Pattern/structure.
If every value can range between -4 and 4 at the point of decoding, then in a 16-bit floating point format with half precision, every latent pixel can contain 16,384 distinct values for each of the 4 channels.
Direct conversion of SDXL latents to RGB with a linear approximation
With this understanding, we can create an approximation function which directly converts the latents to RGB:
import torch
from PIL import Image

def latents_to_rgb(latents):
    # Linear approximation: map the 4 latent channels to R, G and B
    weights = (
        (60, -60, 25, -70),
        (60, -5, 15, -50),
        (60, 10, -5, -35)
    )
    weights_tensor = torch.t(torch.tensor(weights, dtype=latents.dtype).to(latents.device))
    biases_tensor = torch.tensor((150, 140, 130), dtype=latents.dtype).to(latents.device)
    rgb_tensor = torch.einsum("...lxy,lr -> ...rxy", latents, weights_tensor) + biases_tensor.unsqueeze(-1).unsqueeze(-1)
    image_array = rgb_tensor.clamp(0, 255)[0].byte().cpu().numpy()
    image_array = image_array.transpose(1, 2, 0)  # CHW -> HWC
    return Image.fromarray(image_array)
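As a sanity check of the linear map (a standalone sketch with the weights inlined): with an all-zero latent, the weighted sum contributes nothing and only the biases remain, so every pixel comes out as the neutral bias color.

```python
import torch

# Same linear map as above, applied to an all-zero dummy latent
weights = torch.tensor([(60., -60., 25., -70.),
                        (60., -5., 15., -50.),
                        (60., 10., -5., -35.)])
biases = torch.tensor((150., 140., 130.))
latents = torch.zeros(1, 4, 128, 128)
rgb = torch.einsum("blxy,rl->brxy", latents, weights) + biases[:, None, None]
print(rgb[0, :, 0, 0].tolist())  # [150.0, 140.0, 130.0]
```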
Here we have the latents_to_rgb result and a regular decoded output, resized for comparison:


A probable reason why the SDXL color range is biased towards yellow
Relatively few things in nature are blue, or white. These colors are most prominent in the sky, during pleasant conditions.
So, the model, knowing reality through images, thinks in luminance (channel 0), cyan/red (channel 1) and lime/medium purple (channel 2), where red and green are primary and blue is secondary. This is why, quite often, SDXL generations are biased towards yellow (red + green).
During inference, the values in the tensor will begin at min < -30 and max > 30, while the min/max boundary at the time of decoding is around -4 to 4. At a higher guidance_scale, the values will have a higher difference between min and max.
One key to understanding the boundary is to look at what happens in the decoding process:
decoded = vae.decode(latents / vae.scaling_factor).sample
decoded = decoded.div(2).add(0.5).clamp(0, 1)
If the values at this point are outside of the range 0 to 1, some information will be lost in the clamp.
So if we can make corrections during denoising to serve the VAE what it expects, we may get better results.
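A small numeric sketch of that clamp (standalone, with made-up decoded values) shows exactly where the information disappears:

```python
import torch

# Hypothetical decoded values; anything outside [-1, 1] is lost after rescaling
decoded = torch.tensor([-1.5, -1.0, 0.0, 1.0, 1.5])
rescaled = decoded.div(2).add(0.5)   # maps [-1, 1] onto [0, 1]
clamped = rescaled.clamp(0, 1)
print(rescaled.tolist())  # [-0.25, 0.0, 0.5, 1.0, 1.25]
print(clamped.tolist())   # [0.0, 0.0, 0.5, 1.0, 1.0] -- the extremes are gone
```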
What needs correcting?
How do you sharpen a blurry image, white balance, improve detail, increase contrast or increase the color range?
The best way is to begin with a sharp image, which is correctly white balanced with great contrast, crisp details and a high range.
It is easier to blur a sharp image, shift the color balance, reduce contrast, get nonsensical details and limit the color range than to improve them.
SDXL has a very prominent tendency to color bias and to put values outside of the actual boundaries (left image). This is easily solved by centering the values and getting them within the boundaries (right image):


Original output outside boundaries
Exaggerated correction for illustrative purposes


def center_tensor(input_tensor, per_channel_shift=1, full_tensor_shift=1, channels=[0, 1, 2, 3]):
    for channel in channels:
        input_tensor[0, channel] -= input_tensor[0, channel].mean() * per_channel_shift
    return input_tensor - input_tensor.mean() * full_tensor_shift
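To see the centering in action, here is a standalone run on a deliberately biased dummy tensor (the helper is repeated so the snippet runs on its own):

```python
import torch

def center_tensor(input_tensor, per_channel_shift=1, full_tensor_shift=1, channels=[0, 1, 2, 3]):
    for channel in channels:
        input_tensor[0, channel] -= input_tensor[0, channel].mean() * per_channel_shift
    return input_tensor - input_tensor.mean() * full_tensor_shift

# Every channel pushed towards positive values, as in a color-biased generation
latents = torch.randn(1, 4, 16, 16) + 2.0
centered = center_tensor(latents.clone())
print(centered.mean().item())  # close to 0 after centering
```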
Let's take an example output from SDXL
seed: 77777777
guidance_scale: 20 # A high guidance scale can be fixed too
steps with base: 23
steps with refiner: 10
prompt: Cinematic. Beautiful smile action woman in detailed white mecha gundam armor with red details, green details, blue details, colorful, star wars universe, lush garden, flowers, volumetric lighting, perfect eyes, perfect teeth, blue sky, vibrant, intricate details, high detail of environment, infinite focus, well lit, interesting clothes, radial gradient fade, directional particle lighting, wow
negative_prompt: helmet, bokeh, painting, artwork, blocky, blur, ugly, old, boring, photoshopped, tired, wrinkles, scar, gray hair, big forehead, crosseyed, dumb, stupid, cockeyed, disfigured, crooked, blurry, unrealistic, grayscale, bad anatomy, unnatural irises, no pupils, blurry eyes, dark eyes, extra limbs, deformed, disfigured eyes, out of frame, no irises, assymetrical face, broken fingers, extra fingers, disfigured hands
Notice that I have purposely chosen a high guidance scale.
How can we fix this image? It is half painting, half photograph. The color range is biased towards yellow. To the right is a fixed generation with the exact same settings.


But even with a sensible guidance_scale
set to 7.5, we can still conclude that the fixed output is better, without nonsensical details and with correct white balance.


There are many things we can do in the latent space to generally improve a generation, and there are some very simple things we can do to target specific errors in a generation:
Outlier removal
This will control the amount of nonsensical details, by pruning values that are the farthest from the mean of the distribution. It also helps when generating at a higher guidance_scale.
def soft_clamp_tensor(input_tensor, threshold=3.5, boundary=4):
    if max(abs(input_tensor.max()), abs(input_tensor.min())) < 4:
        return input_tensor
    channel_dim = 1

    max_vals = input_tensor.max(channel_dim, keepdim=True)[0]
    max_replace = ((input_tensor - threshold) / (max_vals - threshold)) * (boundary - threshold) + threshold
    over_mask = (input_tensor > threshold)

    min_vals = input_tensor.min(channel_dim, keepdim=True)[0]
    min_replace = ((input_tensor + threshold) / (min_vals + threshold)) * (-boundary + threshold) - threshold
    under_mask = (input_tensor < -threshold)

    return torch.where(over_mask, max_replace, torch.where(under_mask, min_replace, input_tensor))
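A standalone check of the soft clamp (helper repeated so the snippet runs on its own): a value far outside the boundary is compressed back to it, while in-range values pass through untouched.

```python
import torch

def soft_clamp_tensor(input_tensor, threshold=3.5, boundary=4):
    if max(abs(input_tensor.max()), abs(input_tensor.min())) < 4:
        return input_tensor
    channel_dim = 1
    max_vals = input_tensor.max(channel_dim, keepdim=True)[0]
    max_replace = ((input_tensor - threshold) / (max_vals - threshold)) * (boundary - threshold) + threshold
    over_mask = (input_tensor > threshold)
    min_vals = input_tensor.min(channel_dim, keepdim=True)[0]
    min_replace = ((input_tensor + threshold) / (min_vals + threshold)) * (-boundary + threshold) - threshold
    under_mask = (input_tensor < -threshold)
    return torch.where(over_mask, max_replace, torch.where(under_mask, min_replace, input_tensor))

latents = torch.tensor([[[[0.0, 2.0, 8.0]]]])  # one outlier at 8
out = soft_clamp_tensor(latents)
print(out.tolist())  # [[[[0.0, 2.0, 4.0]]]]
```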
Color balancing and increased range
I have two main methods of achieving this. The first one is to shrink towards the mean while normalizing the values (which will also remove outliers) and the second is to correct when the values get biased towards some color. This also helps when generating at a higher guidance_scale.
def center_tensor(input_tensor, channel_shift=1, full_shift=1, channels=[0, 1, 2, 3]):
    for channel in channels:
        input_tensor[0, channel] -= input_tensor[0, channel].mean() * channel_shift
    return input_tensor - input_tensor.mean() * full_shift
Tensor maximizing
This is basically done by multiplying the tensors by a very small amount like 1e-5
for a number of steps, and making sure that the final tensor is using the full possible range (closer to -4/4) before converting to RGB. Remember, in the pixel space it is easier to reduce contrast, saturation and sharpness with intact dynamics than to increase them.
def maximize_tensor(input_tensor, boundary=4, channels=[0, 1, 2]):
    min_val = input_tensor.min()
    max_val = input_tensor.max()
    normalization_factor = boundary / max(abs(min_val), abs(max_val))
    input_tensor[0, channels] *= normalization_factor
    return input_tensor
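And a standalone check of the maximizing step (helper repeated so the snippet runs on its own): a dummy tensor that only reaches 2 gets its first three channels rescaled towards the full boundary of 4, while the pattern channel (3) is left alone.

```python
import torch

def maximize_tensor(input_tensor, boundary=4, channels=[0, 1, 2]):
    min_val = input_tensor.min()
    max_val = input_tensor.max()
    normalization_factor = boundary / max(abs(min_val), abs(max_val))
    input_tensor[0, channels] *= normalization_factor
    return input_tensor

latents = torch.full((1, 4, 2, 2), 2.0)  # only uses half the available range
out = maximize_tensor(latents)
print(out[0, 0, 0, 0].item(), out[0, 3, 0, 0].item())  # 4.0 2.0
```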
Callback implementation example
def callback(pipe, step_index, timestep, cbk):
    if timestep > 950:
        threshold = max(cbk["latents"].max(), abs(cbk["latents"].min())) * 0.998
        cbk["latents"] = soft_clamp_tensor(cbk["latents"], threshold * 0.998, threshold)
    if timestep > 700:
        cbk["latents"] = center_tensor(cbk["latents"], 0.8, 0.8)
    if timestep > 1 and timestep < 100:
        cbk["latents"] = center_tensor(cbk["latents"], 0.6, 1.0)
        cbk["latents"] = maximize_tensor(cbk["latents"])
    return cbk
image = base(
    prompt,
    guidance_scale=guidance_scale,
    callback_on_step_end=callback,
    callback_on_step_end_tensor_inputs=["latents"]
).images[0]
This simple implementation of the three techniques is used in the last set of images, with the women in the garden.
Click the heading or this link for an interactive demo!
This demonstration uses a more advanced implementation of the techniques: detecting outliers using Z-score, shifting towards the mean dynamically and applying a strength to each technique.
Original SDXL (too yellow) and slight modification (white balanced)


Medium modification and hard modification (both with all 3 techniques applied)


Increasing color range / removing color bias
For the image below, SDXL has limited the color range to red and green in the regular output, because there is nothing in the prompt suggesting that there is such a thing as blue. It is a rather good generation, but the color range has become limited.
If you give someone a palette of black, red, green and yellow and then tell them to paint a clear blue sky, the natural response is to ask you to provide blue and white.
To include blue in the generation, we can simply realign the color space when it gets limited, and SDXL will correctly include the full color spectrum in the generation.


Long prompts at high guidance scales becoming possible
Here is a typical scenario, where the increased color range makes the whole prompt possible.
This example applies the simple, hard modification shown earlier, to illustrate the difference more clearly.
prompt: Photograph of woman in red dress in a luxurious garden surrounded with blue, yellow, purple and flowers in many colors, high class, award-winning photography, Portra 400, full format. blue sky, intricate details even to the smallest particle, high detail of the environment, sharp portrait, well lit, interesting outfit, beautiful shadows, vibrant, photoquality, highly realistic, masterpiece


Here are some more comparisons on the same concept
Remember that these all simply use the same static modifications.









