# Implementing Mario’s Stack Blur 15 times in C++ (with tests and benchmarks) · Melatonin

*by* Phil Tadros

This post walks through implementing Mario Klingemann’s Stack Blur algorithm, a CPU image blurring algorithm. It’s the post I wish I had while building my vectorized C++ blur for the JUCE framework.

Jump straight to the implementation section, or check out the benchmarks.

## Why CPU blurring and what’s Stack Blur?

There are still times and places in 2023 where access to the GPU isn’t available or is convoluted. In those cases, Stack Blur is used (quite broadly) for efficiently blurring images on the CPU.

In my case, I’m building cross-platform audio plugins with the JUCE framework. OpenGL is available, but deprecated and crusty on macOS. And it’s a can of worms I’d prefer not to open when all I need are some drop shadows and inner shadows (so I don’t have to use image strips like it’s the 1990s).

Here’s an example of a slider control in my synth:

It’s vector-only and “brought to life” via 10 drop shadows and inner shadows that provide depth and lighting. There are up to 10 of these on a page, animating at 60FPS, with plenty of other UI. So they need to be *fast*.

A true Gaussian image blur is quite expensive: CPU time scales up with both image size and radius. That makes the Stack Blur algorithm a great fit for places where GPU access is limited or unavailable.

Gaussian is a fancy synonym for a “normal distribution.” That’s just statistics jargon for “bell curve-like.” In our context, to get a nice blur, we want each pixel to be affected by all the closest pixels surrounding it. The “influence” from those neighboring pixels tapers off like a bell curve.

Instead of describing what the algorithm does, I’m going to start from the bottom up, walking through some fundamental concepts.

## Single Channel Stack Blur

Forget fancy things like color for now. We’ll start by handling an image format that’s a single color, or **single channel**. Single channel images are what’s used under the hood to make drop shadows. Each pixel is just one value: the amount of brightness (later composited with a specified color).

We’ll make the assumption that our brightness pixel values go from `0` to `255`. So each pixel is stored as 8 bits (in C++ this would be a `uint8_t` or `unsigned char`).

Here’s what our image might look like with some arbitrary values:

## The Stack Blur Queue concept

In image convolution (such as Gaussian blurring), there’s a 2D **matrix** that slides over the image. It uses fancy dot product magic (insert hand waving) to calculate the result for the center pixel.

GPUs are great at parallelizing this kind of convolution. CPUs, not so much.

Stack Blur is different. It operates one row (or column) at a time (simulating 1D convolution) using weighted sums. Instead of living the fancy matrix life, we’re chilling with a humble array of values that Mario called the **queue** (in image convolution it’s known as a **kernel**).

We’ll use the queue/kernel surrounding the pixel to perform some magic 🪄 to calculate the blurred value for the center pixel. In the illustration above, we’re calculating the blur value for `255` using the 5 pixels in the queue. We’ll get into the *exactly how* a bit later.

Because we always need a pixel at the center of the queue, our queue size is always odd-numbered. That means it’s easy to think of a blurred pixel as having a **radius**.

Our radius is `2`, since we have 2 pixels on either side of the center pixel. Specifying a radius is also how people typically specify blur when designing for the web and in design programs such as Figma:

The queue size is double the radius plus the center pixel:

`queueSize = radius*2 + 1`

or in our example: `5`.

When we want to process the next pixel, we slide our queue across the row of pixel values in the image. The queue gains a pixel from the right hand side and loses its leftmost pixel:

You can also imagine this process from the queue’s point of view, as Mario illustrates:

## Stack Blur Queue Implementation

Let’s keep looking at things from the queue’s point of view.

With a radius of `2`, we have a queue size of `5`. That means technically we only need 5 pieces of memory for our queue. We’ll give each piece of memory a 0-based index, just to be clear going forward.

When a new value comes onto the queue, the leftmost value is removed. The other values then all move left one spot to make room for the incoming value.

This is nice and easy to implement with something like `std::deque` in C++:

```
queue.pop_front(); // remove the leftmost pixel
queue.push_back (incomingValue); // add the new pixel
```

However, it’s also inefficient. Each time the queue slides, *every* value in the queue changes position in memory. That’s a lot of work to do per-pixel, and it scales poorly for larger radii.

## Circular Buffer Implementation

The efficient thing to do is implement a **circular buffer**, commonly called a **ring buffer**. It’s often referred to as a **FIFO** (First In, First Out) by audio programmers.

The trick is to abstract our queue from its physical memory by using another variable, such as `queueIndex`. This index specifies where our fancy new *virtual queue* begins.

When we move to the next pixel, instead of all the values shifting left, we’ll just increment the index, moving it one spot over to the right.

Conveniently, the incoming pixel can replace the pixel at the old queue index, which is now the end of our virtual queue.

Now, when we want to read our full queue, we can start at `queueIndex` and read the next 5 pixels, wrapping back to the beginning when we hit the end of the memory block. The wrapping back to 0 is why it’s called a circular/ring buffer.
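In code, the virtual queue might be sketched like this (a minimal illustration with my own naming, not taken from any particular Stack Blur implementation). Sliding is just one write and one index increment; no values move in memory:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// A sketch of the "virtual queue" idea from above.
struct VirtualQueue
{
    static constexpr size_t queueSize = 5;
    std::array<uint8_t, queueSize> memory { 10, 20, 30, 40, 50 };
    size_t queueIndex = 0; // where the virtual queue currently starts

    void slide (uint8_t incomingValue)
    {
        memory[queueIndex] = incomingValue;        // overwrite the oldest (leftmost) pixel
        queueIndex = (queueIndex + 1) % queueSize; // queue now starts one spot over
    }

    // read position i of the virtual queue, wrapping back to 0 at the end
    uint8_t operator[] (size_t i) const
    {
        return memory[(queueIndex + i) % queueSize];
    }
};
```

After `slide (60)`, reading positions 0 through 4 yields `20 30 40 50 60`, exactly as if everything had shifted left, but without touching four of the five bytes.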

## Stacking the Bricks

Now that we understand the queue implementation, let’s get into the magic. ✨

We want to emulate a smooth Gaussian blur (in which a pixel is more affected by closer neighbors).

Mario’s algorithm does this with a concept he calls a stack. He notes **the stack isn’t a real structure** (in terms of memory or even implementation). The goal is to give a heavier weight to the center pixel in the queue, tapering off as we move away from the center.

To illustrate, let’s assign some distinct pixel values to our simple queue:

To calculate the blur for the center pixel, we could just take the average of all pixels in the queue: `(1+2+3+4+5) / 5` (which happens to equal `3` again, lulz). That’s exactly how a box blur is implemented. Unfortunately, box blur is both inefficient *and* artifact-y.

So how do we give more emphasis to the pixels around the center to make it smoother and more bell-curve-y? Well, one idea is to make a copy of the queue, reduce its diameter, and include that new layer in our calculations. And keep stacking layers until we’re left with a single center pixel:

Welcome to Stack Blur. It “stacks” smaller and smaller virtual queues on top of one another, giving a heavier weight to pixels closer to the center of the queue.

Ironically, in this example `(1+2+2+3+3+3+4+4+5) / 9` actually ends up equaling `3` again! All that hard work for nothing :). Don’t rage quit yet, here’s a better example:

The average of the **queue** is `(0+0+9+0+0)/5` = `1.8`. But the average of the **stack** is `(0+0+0+9+9+9+0+0+0)/9` = `3`. So the result is skewed towards our higher center pixel value.
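A quick arithmetic check of that example. The stack weights each queue position by how many layers it appears in (1, 2, 3, 2, 1 for radius 2); helper names here are mine, for illustration:

```cpp
#include <cstdlib>

// Plain box sum: every pixel in the queue counts once.
inline int boxSum (const int* queue, int size)
{
    int sum = 0;
    for (int i = 0; i < size; ++i)
        sum += queue[i];
    return sum;
}

// Stack sum: each pixel counts (radius + 1 - distance from center) times,
// i.e. the triangular weights 1, 2, 3, 2, 1 for radius 2.
inline int stackSum (const int* queue, int radius)
{
    int sum = 0;
    for (int i = 0; i <= radius * 2; ++i)
        sum += queue[i] * (radius + 1 - std::abs (i - radius));
    return sum;
}
```

For the queue `{0, 0, 9, 0, 0}`, `boxSum` gives 9 (an average of 9/5 = 1.8), while `stackSum` gives 27 (an average of 27/9 = 3), skewed toward the bright center pixel.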

## Stack Blur Implementation

Remember: the stack isn’t real. Keeping all those extra pixel values around would be too much bookkeeping and number crunching. Instead, we implement a moving average, a sum. We’ll use a variable called `stackSum` to represent it.

So how do we adjust the `stackSum` when the queue moves? When we look closely, the stack seems to gain and lose a *bunch* of different values.

It’s easiest to consider the state both *before* and *after* the queue moves.

The answer is to use our queue to calculate two intermediate sums.

`sumOut` is everything we’ll remove from the stack each iteration. In our example, it’s `1+2+3`.

`sumIn` is what we need to add to the stack. In our example, it’s `4+5+6`.

Of course, the stack is all in our head, maaaaan. So how exactly do we get these values? Let’s take a closer look at the queue:

That looks like what we want! But again, we can’t constantly be summing individual pixels. So `sumIn` and `sumOut` are *also* running sums. That way, all we have to do is add and subtract a number from them as the queue moves.

There are different ways of implementing this, but if we’re optimizing for tracking and storing the fewest number of pixels, here’s a good workflow:

- Before moving, `sumOut` loses the leftmost pixel (the pixel at `queueIndex`): `1`
- We move the queue, adding the incoming `6` and removing the old and tired `1`
- After moving, `sumOut` gains the new center pixel `4` (at `queueIndex + radius`)
- `sumIn` loses that same new center pixel `4`
- `sumIn` gains the incoming new pixel `6`

You can see an implementation here. That certainly **feels** like more work than “adding and subtracting a value from the queue.” But it’s also surprisingly few instructions for a lot of heavy lifting. And most importantly, the number of instructions doesn’t change, no matter how large the radius gets.
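Putting `stackSum`, `sumIn`, and `sumOut` together, a single-row pass can be sketched as follows. This is my own condensed sketch (not Mario’s reference code): the queue reads are folded into clamped lookups on the source row, and edges are handled by reusing the edge pixel, as discussed below.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <vector>

// One horizontal Stack Blur pass over a single-channel row.
std::vector<uint8_t> stackBlurRow (const std::vector<uint8_t>& row, int radius)
{
    const int w = (int) row.size();
    const int stackSize = (radius + 1) * (radius + 1); // sum of weights 1..r+1..1

    // Edge handling: reuse the edge pixel beyond the image bounds
    auto clamped = [&] (int x) { return (int) row[std::clamp (x, 0, w - 1)]; };

    long stackSum = 0, sumIn = 0, sumOut = 0;

    // Initialize the sums as if the queue were centered on pixel 0
    for (int i = -radius; i <= radius; ++i)
    {
        int value = clamped (i);
        stackSum += value * (radius + 1 - std::abs (i));
        if (i <= 0) sumOut += value; // left half including center
        else        sumIn  += value; // right half
    }

    std::vector<uint8_t> out (row.size());
    for (int x = 0; x < w; ++x)
    {
        out[x] = (uint8_t) (stackSum / stackSize);

        stackSum -= sumOut;               // drop one copy of each left-side pixel
        sumOut   -= clamped (x - radius); // leftmost pixel exits sumOut

        int incoming = clamped (x + radius + 1);
        sumIn    += incoming;             // incoming pixel enters sumIn
        stackSum += sumIn;                // add one copy of each right-side pixel

        int newCenter = clamped (x + 1);
        sumOut += newCenter;              // new center pixel moves into sumOut...
        sumIn  -= newCenter;              // ...and out of sumIn
    }
    return out;
}
```

Note the loop body is a fixed handful of adds and subtracts regardless of the radius, which is the whole point of the algorithm.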

## Literal Edge Cases

What do we do when the pixel we want to calculate is at the edge of an image? This is pretty important, as the edge is also where we start out! We’ll have to invent some numbers for the leftmost pixels and for `sumIn` and `sumOut`, won’t we?

Most implementations pre-populate the left side of the queue with the leftmost pixel. Our imaginary example stack would initialize like so:

`sumOut` starts with the sum of the 6 highlighted pixels (`6` here).

`sumIn` can be initialized with the remaining pixels from the stack (a value of `7`).

We do the exact same thing when approaching the right edge, reusing that last pixel to fill the queue as needed.

This works fine, but does create a bias towards those edge pixels. The “influence” of the edge pixels creates what’s called edge bleed.

For a more accurate blur, some implementations vary the denominator (that is, vary the stack size) for the edge pixels.

Varying the denominator with our radius of `5`, the stacks for the first two pixels would look like this:

## Horizontal and Vertical Stack Blur passes

Like the box blur algorithm, we’re aiming to efficiently approximate a 2D Gaussian kernel blur. We do that by running 2 passes over the image: one horizontal and one vertical. The vertical pass operates on the *result* of the horizontal pass, compounding the blur.
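The two-pass structure can be sketched like this. The scaffolding and names are mine; `blur1D` stands in for any 1D blur over a row of pixels (Stack Blur’s row pass in the real thing; here, a trivial radius-1 box blur, just to keep the sketch self-contained):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Image = std::vector<std::vector<uint8_t>>; // [row][column]

// Stand-in 1D blur: radius-1 box blur with edge clamping.
static void blur1D (std::vector<uint8_t>& row)
{
    auto src = row;
    for (size_t x = 0; x < src.size(); ++x)
    {
        int left  = src[x == 0 ? 0 : x - 1];
        int right = src[x == src.size() - 1 ? x : x + 1];
        row[x] = (uint8_t) ((left + src[x] + right) / 3);
    }
}

void blurImage (Image& image)
{
    if (image.empty())
        return;

    // Horizontal pass: blur each row in place
    for (auto& row : image)
        blur1D (row);

    // Vertical pass: blur each column of the *already blurred* result
    for (size_t x = 0; x < image[0].size(); ++x)
    {
        std::vector<uint8_t> column (image.size());
        for (size_t y = 0; y < image.size(); ++y)
            column[y] = image[y][x];
        blur1D (column);
        for (size_t y = 0; y < image.size(); ++y)
            image[y][x] = column[y];
    }
}
```

Running this on a 3×3 image with a single bright pixel (value 90) in the center spreads that value evenly: every pixel ends up at 10, which is the 2D-kernel-like behavior the two passes simulate.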

Job van der Zwan has a post on Stack Blur with a great interactive (!) visualization of how sliding the stack in two passes ends up simulating a Gaussian-esque 2D kernel.

## Multi-Channel Images

Luckily, dealing with multi-channel images is conceptually identical to single channel images.

Most images in compositing programs are stored in a native interleaved pixel format. That just means each of the 8-bit values for Red, Green, Blue, Alpha (transparency) for one pixel are stored together in memory (in a 32-bit space).

There are important implementation details around scary sounding things like “little endian” storage of pixel values and “premultiplied alpha.” But essentially, the pixels are packed together in `BGRA` order on Windows and macOS.
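Concretely, “interleaved BGRA” means the four 8-bit channels of one pixel sit next to each other in memory, blue byte first. A small sketch (struct and helper names are mine, for illustration only):

```cpp
#include <cstddef>
#include <cstdint>

// One pixel's channels, in memory order for BGRA images.
struct BGRA { uint8_t blue, green, red, alpha; };

// Pull one pixel's channels out of an interleaved BGRA buffer.
inline BGRA unpackPixel (const uint8_t* data, size_t pixelIndex)
{
    const uint8_t* p = data + pixelIndex * 4; // 4 bytes per pixel, interleaved
    return { p[0], p[1], p[2], p[3] };        // B, G, R, A
}
```

Blurring a multi-channel image is then just the single-channel process applied to each channel (implementations differ on whether they process channels separately or packed together).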

## Why does Stack Blur work, though?

If we look at our stack and pay attention to the columns, we might start to understand how exactly the stack weights the pixel values.

Instead of doing all of this summing, what if we just tried to do something with these weights? Here are the weights for the stack above:

We could normalize the weights by dividing by the `stackSize`. That would let us apply them directly to pixel values.

Interesting! Notice that the full value (`9/9`) of the pixel is distributed across all 5 places where the pixel is relevant! We can finally sorta understand how Stack Blur *blurs*.

This kernel of abstract pixel weights looks… interesting. Can we just use that? Yes, but we’d have to multiply these against the queue one by one, and a `queueSize` number of multiplications per move isn’t ideal for larger radii.

So that’s Stack Blur’s main optimization (having a fixed number of operations per queue move).

Plus, maybe it gets complicated with our efficient circular buffer? And isn’t this starting to smell a bit like the expensive convolution kernel we’re trying to avoid? 🙈

## The part where I test, implement and benchmark 15 Stack Blur implementations

I wanted to make sure I was getting the fastest performance possible. I don’t want to worry about hundreds of drop shadows suddenly making my synth perform worse (especially on Windows).

As I keep saying, Stack Blur’s main optimization is ensuring there’s the same number of operations per pixel as the radius scales. But in real life (using drop and inner shadows to build UI), radii are usually in the 4-36px range. So perhaps the algorithm is over-optimized for radius size?

Also, the original Stack Blur algorithm was made in 2004, for an environment without access to SIMD or Intel/Apple vendor libraries. Vector processing and SIMD options have come a long way since then.

These days there are some fancy and fast libraries for convolution in C++, such as Apple’s Accelerate and Intel’s IPP. Surely these vendor libraries have some tricks up their own sleeve? Sure, maybe a full blown vector accelerated Gaussian blur will be too expensive (spoiler: it was), but what if we do two passes of `vector x matrix` convolution?

Well, fast forward a couple weeks and 20+ implementations later (and counting) and yes: both Apple and Intel’s vector libraries are quite a bit faster for single channel images. Full benchmarks here.

The long story: for most image and radii sizes, vendor libraries (employing SIMD, etc. under the hood) outperform Stack Blur, especially as the image dimensions scale up.

Apple has vImageSepConvolve for separated passes, which basically beats Stack Blur in a couple lines of code.

Intel’s offerings were less exciting. I have a graveyard of failed Intel attempts to use things like `ippiFilterSeparable`, their separable kernel offering. Pretty sure it’s doing 2D convolution under the hood, because performance is killed by larger radii.

I did write a fast single channel vector IPP Stack Blur which performs decently:

It “rotates” the queue so that you have `n` vectors, where `n` is your queue size.

However, it turns out that Stack Blur is actually already *very* optimized.

There are cases where Stack Blur outperforms everything I tried (I still haven’t found an algorithm faster than it for ARGB images on Windows).

So, how can I use hundreds of CPU-made drop shadows to build my UI? The real trick is to render the shadow once and cache it for future repaints. That brings shadow repainting time down to almost raw image compositing levels. Getting fast initial paints is still worth it though! Especially for anything animating. At least that’s what I tell myself so I sleep well at night 🙂

## Other Stack Blur optimizations

One obvious optimization for drop shadows specifically: the larger the shadow gets, the less of the image is actually relevant to the end result (content will be drawn on top of the drop shadow, so only the edges need to be blurred). Someone in the comments can probably provide a formula for this.

Another Stack Blur optimization is what I call the “Martin optimization,” named after my father-in-law. I sat down with him and my wife (both extremely logical thinkers who helped me figure out the algorithm and who also play code optimizing video games!) and asked them how they’d optimize the algorithm. One of his insights was that if the values in the queue are all the same *and* the incoming pixel is that same value, it’s a no-op: the algorithm can be skipped for that incoming pixel.
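The no-op condition can be sketched like this (my own phrasing of the idea, not code from any shipping implementation). If every value in the queue equals the incoming pixel, the output is that same value and none of the running sums change, so the whole per-pixel update can be skipped:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// True when sliding the queue would change nothing: a flat region of pixels.
bool canSkip (const std::vector<uint8_t>& queue, uint8_t incomingValue)
{
    return std::all_of (queue.begin(), queue.end(),
                        [incomingValue] (uint8_t v) { return v == incomingValue; });
}
```

In practice you’d want to track this incrementally (e.g. a counter of consecutive identical values) rather than scanning the whole queue per pixel, since the scan itself costs `queueSize` comparisons. Flat regions are common in drop shadow masks, which is what makes this worthwhile.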

When compositing images to a graphics context, the inside rectangle of a single color filled path can also be clipped out.

## Further exploring the convolutional roots

Another interesting tangent that my wife and I went on:

If we focus on the **relative** weight of a pixel as it moves through the queue, we can see that the first time the pixel enters the queue, its weight is `1/9`. When it moves, the weight doubles to `2/9`, then it’s multiplied by `3/2` for a weight of `3/9`. As it moves into the right side of the kernel, it shrinks, first by `2/3` and then by `1/2`.

So we could look at the multipliers that describe a single pixel’s “journey” through the queue:

We can then normalize for any queue size by scaling the rightmost pixel upon entry to the queue. For our `stackSize` of 9, let’s have it scale the incoming pixel by `1/9`:

This looks promising! But what can we do with it? Using it against the queue would save us four add/subtracts per queue move, plus the reads and writes to `sumIn` and `sumOut`. But we gain 5 multiplies (and more as the radius scales up).

Well, the answer is I don’t know how this could be useful. Please let me know in the comments if you have ideas 🙂

## Stack Blur Implementations and Sources
