
Porting WaterLily.jl into a backend-agnostic solver

2023-05-08 08:44:16

WaterLily.jl is a simple and fast fluid simulator written in pure Julia. It solves the unsteady incompressible 2D or 3D Navier-Stokes equations on a Cartesian grid.
The pressure Poisson equation is solved with a geometric multigrid method.
Solid boundaries are modelled using the Boundary Data Immersion Method.

v1.0 ports the solver from serial CPU execution to a backend-agnostic execution, including multi-threaded CPU and GPUs from different vendors (NVIDIA and AMD), thanks to KernelAbstractions.jl (KA).
In this post, we will review our approach to porting the code, together with the challenges we have faced.

Introducing KernelAbstractions.jl

The main ingredient of this port is the @kernel macro from KA.
This macro takes a function definition and converts it into a kernel specialised for a given backend. KA can target the CUDA, ROCm, oneAPI, and Metal backends.

As an example, consider the divergence operator for a 2D vector field $\vec{u}=(u, v)$.
In the finite-volume method (FVM), using a Cartesian (uniform) grid with unit cells, this is defined as

\[\begin{align}
\sigma=\unicode{x2230}(\nabla\cdot\vec{u})\,\mathrm{d}V = \unicode{x222F}\vec{u}\cdot\hat{n}\,\mathrm{d}S \rightarrow \sigma_{i,j} = (u_{i+1,j} - u_{i,j}) + (v_{i,j+1} - v_{i,j}),
\end{align}\]

where $i$ and $j$ are the indices of the discretised staggered grid:

[Figure: staggered grid]

In WaterLily, we define loops based on the CartesianIndex such that I=(i,j,...), which lets us write an n-dimensional solver in a very straightforward manner.
With this, to compute the divergence of a 2D vector field we can use

δ(d,::CartesianIndex{D}) where {D} = CartesianIndex(ntuple(j -> j==d ? 1 : 0, D))
@inline ∂(a,I::CartesianIndex{D},u::AbstractArray{T,n}) where {D,T,n} = u[I+δ(a,I),a] - u[I,a]
inside(a) = CartesianIndices(ntuple(i-> 2:size(a)[i]-1, ndims(a)))

N = (10, 10) # domain size
σ = zeros(N) # scalar field
u = rand(N..., length(N)) # 2D vector field with ghost cells

for d ∈ 1:ndims(σ), I ∈ inside(σ)
    σ[I] += ∂(d, I, u)
end

where a loop over each dimension d and each Cartesian index I is used.
The function δ provides a unit Cartesian step in the direction d; for example, δ(1, I) with a 2D index I returns CartesianIndex(1, 0), and δ(2, I) with a 3D index returns CartesianIndex(0, 1, 0).
This is used in the derivative function ∂ to implement the divergence equation as described above.
inside(σ) provides the CartesianIndices of σ excluding the ghost elements, i.e. a range of Cartesian indices starting at (2, 2) and ending at (9, 9) when size(σ) == (10, 10).
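
As a quick sanity check of these helpers (a sketch using only the definitions above, with the expected results noted as comments):

δ(1, CartesianIndex(0, 0))    # CartesianIndex(1, 0): unit step in direction 1 of a 2D index
δ(2, CartesianIndex(0, 0, 0)) # CartesianIndex(0, 1, 0): unit step in direction 2 of a 3D index
inside(zeros(10, 10))         # CartesianIndices((2:9, 2:9)): the ghost layer is excluded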

Note that the divergence operation is independent for each I, which is great because it means we can parallelize it!
This is where KA comes into play.
To generate the divergence operator using KA, we need to write the divergence kernel, which is just the divergence operator written in KA style.

We define the divergence kernel _divergence! and a wrapper divergence! as follows

using KernelAbstractions: get_backend, @index, @kernel

@kernel function _divergence!(σ, u, @Const(I0))
    I = @index(Global, Cartesian)
    I += I0
    σ_sum = zero(eltype(σ))
    for d ∈ 1:ndims(σ)
        σ_sum += ∂(d, I, u)
    end
    σ[I] = σ_sum
end
function divergence!(σ, u)
    R = inside(σ)
    _divergence!(get_backend(σ), 64)(σ, u, R[1]-oneunit(R[1]), ndrange=size(R))
end

Note that in the _divergence! kernel we again operate using Cartesian indices by calling @index(Global, Cartesian) from the KA @index macro.
The range of Cartesian indices is given by the ndrange argument in the wrapper function, where we pass the inside(σ) Cartesian index range, and the backend is inferred with the get_backend method.
Also note that we pass an additional (constant) argument I0 which provides the offset to apply to the indices given by @index, which naturally start at (1,1,...).
Using a workgroup size of 64 (the size of the group of threads acting in parallel, see terminology), KA will parallelize over I by multi-threading on the CPU or GPU.
In this regard, we just need to change the array type of σ and u from Array (CPU backend) to CuArray (NVIDIA GPU backend) or ROCArray (AMD GPU backend), and KA will specialise the kernel for the desired backend

using CUDA: CuArray

N = (10, 10)
σ = zeros(N) |> CuArray
u = rand(N..., length(N)) |> CuArray

divergence!(σ, u)
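
The same wrapper also runs on an AMD GPU just by swapping the array type once more (a sketch assuming AMDGPU.jl and a ROCm-capable device are available):

using AMDGPU: ROCArray

σ = zeros(N) |> ROCArray
u = rand(N..., length(N)) |> ROCArray

divergence!(σ, u)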

Automatic loop and kernel generation

As a stencil-based CFD solver, WaterLily heavily relies on for loops to iterate over the n-dimensional arrays.
To automate the generation of such loops, the macro @loop is defined

macro loop(args...)
    ex,_,itr = args
    op,I,R = itr.args
    @assert op ∈ (:(∈),:(in))
    return quote
        for $I ∈ $R
            $ex
        end
    end |> esc
end

This macro takes an expression such as @loop <expr> over I ∈ R, where R is a CartesianIndices range, and produces the loop for I ∈ R <expr> end.
For example, the serial divergence operator could now simply be defined using

for d ∈ 1:ndims(σ)
    @loop σ[I] += ∂(d, I, u) over I ∈ inside(σ)
end

which generates the I ∈ inside(σ) loop automatically.
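
For instance, expanding one of these calls with @macroexpand shows (schematically, omitting line-number nodes) nothing more than the plain loop we would have written by hand:

@macroexpand @loop σ[I] += ∂(1, I, u) over I ∈ inside(σ)
# quote
#     for I ∈ inside(σ)
#         σ[I] += ∂(1, I, u)
#     end
# end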

Even though this could be seen as a small improvement (if any), the nice thing about writing loops this way is that the computationally-demanding part of the code can be abstracted out of the main workflow.
For example, it is straightforward to add performance macros such as @inbounds and/or @fastmath to every loop by changing the quote block in the @loop macro

macro loop(args...)
    ex,_,itr = args
    op,I,R = itr.args
    @assert op ∈ (:(∈),:(in))
    return quote
        @inbounds for $I ∈ $R
            @fastmath $ex
        end
    end |> esc
end

And, even nicer, we can also use this approach to automatically generate KA kernels for every loop in the code!
To do so, we modify @loop to generate the KA kernel using @kernel, together with the wrapper function that sets the backend and the workgroup size

macro loop(args...)
    ex,_,itr = args
    _,I,R = itr.args; sym = []
    grab!(sym,ex)     # get arguments and replace composites in `ex`
    setdiff!(sym,[I]) # don't want to pass I as an argument
    @gensym kern      # generate unique kernel function name
    return quote
        @kernel function $kern($(rep.(sym)...),@Const(I0)) # replace composite arguments
            $I = @index(Global,Cartesian)
            $I += I0
            $ex
        end
        $kern(get_backend($(sym[1])),64)($(sym...),$R[1]-oneunit($R[1]),ndrange=size($R))
    end |> esc
end
function grab!(sym,ex::Expr)
    ex.head == :. && return union!(sym,[ex])      # grab composite name and return
    start = ex.head==:(call) ? 2 : 1              # don't grab function names
    foreach(a->grab!(sym,a),ex.args[start:end])   # recurse into args
    ex.args[start:end] = rep.(ex.args[start:end]) # replace composites in args
end
grab!(sym,ex::Symbol) = union!(sym,[ex])        # grab symbol name
grab!(sym,ex) = nothing
rep(ex) = ex
rep(ex::Expr) = ex.head == :. ? Symbol(ex.args[2].value) : ex

The helper functions grab! and rep allow us to extract the arguments required by the expression ex and the Cartesian index range that will be passed to the kernel.
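
For intuition, here is what these helpers do to a hypothetical expression containing a composite argument a.u (a sketch; the results are noted as comments):

ex = :(a.u[I] += ∂(d, I, a.u))
sym = []
grab!(sym, ex); setdiff!(sym, [:I])
# sym       == Any[:(a.u), :d]   -> spliced into the kernel call via $(sym...)
# rep.(sym) == Any[:u, :d]       -> used as the kernel arguments via $(rep.(sym)...)
# ex now reads :(u[I] += ∂(d, I, u)), i.e. the composite a.u is replaced by the plain symbol u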

The code generated by @loop and @kernel can be explored using @macroexpand. For example, for d=1

@macroexpand @loop σ[I] += ∂(1, I, u) over I ∈ inside(σ)

we can observe that code for both the CPU and GPU kernels is produced:

Generated code
@macroexpand @loopKA σ[I] += ∂(1, I, u) over I ∈ inside(σ)
quote
    begin
        function var"cpu_##kern#339"(__ctx__, σ, u, I0; )
            let I0 = (KernelAbstractions.constify)(I0)
                $(Expr(:aliasscope))
                begin
                    var"##N#341" = length((KernelAbstractions.__workitems_iterspace)(__ctx__))
                    begin
                        for var"##I#340" = (KernelAbstractions.__workitems_iterspace)(__ctx__)
                            (KernelAbstractions.__validindex)(__ctx__, var"##I#340") || continue
                            I = KernelAbstractions.__index_Global_Cartesian(__ctx__, var"##I#340")
                            begin
                                I += I0
                                σ[I] += ∂(1, I, u)
                            end
                        end
                    end
                end
                $(Expr(:popaliasscope))
                return nothing
            end
        end
        function var"gpu_##kern#339"(__ctx__, σ, u, I0; )
            let I0 = (KernelAbstractions.constify)(I0)
                if (KernelAbstractions.__validindex)(__ctx__)
                    begin
                        I = KernelAbstractions.__index_Global_Cartesian(__ctx__)
                        I += I0
                        σ[I] += ∂(1, I, u)
                    end
                end
                return nothing
            end
        end
        begin
            if !($(Expr(:isdefined, Symbol("##kern#339"))))
                begin
                    $(Expr(:meta, :doc))
                    var"##kern#339"(dev) = begin
                            var"##kern#339"(dev, (KernelAbstractions.NDIteration.DynamicSize)(), (KernelAbstractions.NDIteration.DynamicSize)())
                        end
                end
                var"##kern#339"(dev, size) = begin
                        var"##kern#339"(dev, (KernelAbstractions.NDIteration.StaticSize)(size), (KernelAbstractions.NDIteration.DynamicSize)())
                    end
                var"##kern#339"(dev, size, range) = begin
                        var"##kern#339"(dev, (KernelAbstractions.NDIteration.StaticSize)(size), (KernelAbstractions.NDIteration.StaticSize)(range))
                    end
                function var"##kern#339"(dev::Dev, sz::S, range::NDRange) where {Dev, S <: KernelAbstractions.NDIteration._Size, NDRange <: KernelAbstractions.NDIteration._Size}
                    if (KernelAbstractions.isgpu)(dev)
                        return (KernelAbstractions.construct)(dev, sz, range, var"gpu_##kern#339")
                    else
                        return (KernelAbstractions.construct)(dev, sz, range, var"cpu_##kern#339")
                    end
                end
            end
        end
    end
    (var"##kern#339"(get_backend(σ), 64))(σ, u, (inside(σ))[1] - oneunit((inside(σ))[1]), ndrange = size(inside(σ)))
end

The best feature we obtain by modifying @loop to produce KA kernels is that the divergence operator remains exactly the same as before using KA

for d ∈ 1:ndims(σ)
    @loop σ[I] += ∂(d, I, u) over I ∈ inside(σ)
end

This exact approach is what has allowed WaterLily to keep the same LOC count as before using KA, just around 800!

Benchmarking

Now that we have all the pieces in place, we can benchmark the speedup achieved by KA compared to the serial execution using BenchmarkTools.jl.
Let's now gather all the code we have used and create a small benchmarking MWE (see below or download it here).
In this code showcase, we will refer to the serial CPU execution as “serial”, the multi-threaded CPU execution as “CPU”, and the GPU execution as “GPU”:

using KernelAbstractions: get_backend, synchronize, @index, @kernel, @groupsize
using CUDA: CuArray
using BenchmarkTools

δ(d,::CartesianIndex{D}) where {D} = CartesianIndex(ntuple(j -> j==d ? 1 : 0, D))
@inline ∂(a,I::CartesianIndex{D},u::AbstractArray{T,n}) where {D,T,n} = u[I+δ(a,I),a]-u[I,a]
inside(a) = CartesianIndices(ntuple(i-> 2:size(a)[i]-1,ndims(a)))

# serial loop macro
macro loop(args...)
    ex,_,itr = args
    op,I,R = itr.args
    @assert op ∈ (:(∈),:(in))
    return quote
        for $I ∈ $R
            $ex
        end
    end |> esc
end
# KA-adapted loop macro
macro loopKA(args...)
    ex,_,itr = args
    _,I,R = itr.args; sym = []
    grab!(sym,ex)     # get arguments and replace composites in `ex`
    setdiff!(sym,[I]) # don't want to pass I as an argument
    @gensym kern      # generate unique kernel function name
    return quote
        @kernel function $kern($(rep.(sym)...),@Const(I0)) # replace composite arguments
            $I = @index(Global,Cartesian)
            $I += I0
            $ex
        end
        $kern(get_backend($(sym[1])),64)($(sym...),$R[1]-oneunit($R[1]),ndrange=size($R))
    end |> esc
end
function grab!(sym,ex::Expr)
    ex.head == :. && return union!(sym,[ex])      # grab composite name and return
    start = ex.head==:(call) ? 2 : 1              # don't grab function names
    foreach(a->grab!(sym,a),ex.args[start:end])   # recurse into args
    ex.args[start:end] = rep.(ex.args[start:end]) # replace composites in args
end
grab!(sym,ex::Symbol) = union!(sym,[ex])        # grab symbol name
grab!(sym,ex) = nothing
rep(ex) = ex
rep(ex::Expr) = ex.head == :. ? Symbol(ex.args[2].value) : ex

function divergence!(σ, u)
    for d ∈ 1:ndims(σ)
        @loop σ[I] += ∂(d, I, u) over I ∈ inside(σ)
    end
end
function divergenceKA!(σ, u)
    for d ∈ 1:ndims(σ)
        @loopKA σ[I] += ∂(d, I, u) over I ∈ inside(σ)
    end
end

N = (2^8, 2^8, 2^8)
# CPU serial arrays
σ_serial = zeros(N)
u_serial = rand(N..., length(N))
# CPU multi-threading arrays
σ_CPU = zeros(N)
u_CPU = copy(u_serial)
# GPU arrays
σ_GPU = zeros(N) |> CuArray
u_GPU = copy(u_serial) |> CuArray

# Benchmark warmup (force compilation) and validation
divergence!(σ_serial, u_serial)
divergenceKA!(σ_CPU, u_CPU)
divergenceKA!(σ_GPU, u_GPU)
@assert σ_serial ≈ σ_CPU ≈ σ_GPU |> Array

# Create and run benchmarks
suite = BenchmarkGroup()
suite["serial"] = @benchmarkable divergence!($σ_serial, $u_serial)
suite["CPU"] = @benchmarkable start
    divergenceKA!($σ_CPU, $u_CPU)
    synchronize(get_backend($σ_CPU))
finish
suite["GPU"] = @benchmarkable start
    divergenceKA!($σ_GPU, $u_GPU)
    synchronize(get_backend($σ_GPU))
finish
outcomes = run(suite, verbose=true)

In this benchmark we have used a 3D array σ (scalar field) instead of the 2D array used before, hence demonstrating the n-dimensional capabilities of the current methodology.
For N=(2^8,2^8,2^8), the following benchmark results are obtained on a 6-core laptop equipped with an NVIDIA GeForce GTX 1650 Ti GPU card

"CPU" => Trial(52.651 ms)
"GPU" => Trial(7.589 ms)
"serial" => Trial(234.347 ms)

The GPU execution yields a 30x speed-up compared to the serial execution and 7x compared to the multi-threaded CPU execution. The multi-threaded CPU execution yields a 4.5x speed-up compared to the serial execution (ideally this should be 6x on the 6-core machine).
As a final note for this section, observe that synchronize is used when running the KA benchmarks. If it were not used, we would only be measuring the time it takes to launch a kernel, but not to actually run it.
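
For instance, timing the GPU version directly with and without synchronization makes the difference obvious (a sketch using BenchmarkTools' @btime on the arrays defined above):

@btime divergenceKA!($σ_GPU, $u_GPU)  # only times the asynchronous kernel launch
@btime begin                          # times the actual computation
    divergenceKA!($σ_GPU, $u_GPU)
    synchronize(get_backend($σ_GPU))
end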

Challenges

Porting the whole solver to GPU has been largely a learning exercise.
With no previous experience in software development for GPUs, KA smoothens the learning curve, so it is a great way to get started.
Of course, some things do not just work out of the box, and we have faced some challenges while doing the port. Here are a few of them.

Offset indices in KA kernels

Offset indices are important for boundary-value problems where arrays may contain both the solution and the boundary conditions of a problem.
In the stencil-based finite-volume and finite-difference methods, the boundary elements are only accessed to compute the stencil, but they are not directly modified when looping through the solution elements of an array.
It is in this scenario where offset indices are important, for example.
The KA @index macro only provides natural indices in Julia (starting at 1), and this minor missing feature initially derailed us into using OffsetArrays.jl.
Of course this added complexity to the code, and we even observed degraded performance in some kernels.
Some time after this (more than we would like to admit), the idea of manually passing the offset index into the KA kernel took shape and quickly yielded a much cleaner solution.
Fortunately, this feature will be natively supported in KA in the future (see KA issue #384).
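
As a concrete illustration of the offset trick (a sketch reusing the definitions from the MWE above, with results as comments):

σ = zeros(10, 10)
R = inside(σ)              # CartesianIndices((2:9, 2:9))
I0 = R[1] - oneunit(R[1])  # CartesianIndex(1, 1)
# @index(Global, Cartesian) starts at (1, 1), so inside the kernel I = @index(...) + I0
# starts at (2, 2), i.e. the first solution cell, and the ghost layer is never written to.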

Inlining functions can be important in GPU kernels

In KA, GPU kernels are of course more delicate than CPU kernels when it comes to the functions that may be called inside them.
We have observed this sensitivity both at compilation time and at runtime.
For example, the δ function was originally implemented with multiple dispatch as

@inline δ(i,N::Int) = CartesianIndex(ntuple(j -> j==i ? 1 : 0, N))
δ(d,I::CartesianIndex{N}) where {N} = δ(d, N)

The main problem here is that this implementation is type-unstable, and without @inline the GPU kernel was complaining about a dynamic function invocation (see KA issue #392).
Another inline-related problem can be observed with the derivative function ∂.
When removing the @inline macro from its definition, the GPU performance decays significantly, and the GPU benchmark becomes even with the CPU one.
This demonstrates that the compiler can apply performance tricks when the information on the required instructions is not nested in functions external to the kernel.

Often we use functions such as norm2 from LinearAlgebra.jl to compute the norm of an array.
A surprise is that some of these do not work inside a kernel since the GPU compiler is not equipped to handle them. Hence, these need to be manually written in a suitable form.
In this case, we use norm2(x) = √sum(abs2,x).
Another example is the sum function using generator syntax such as

@kernel function _divergence(σ, u)
    I = @index(Global, Cartesian)
    σ[I] = sum(u[I+δ(d),d]-u[I,d] for d ∈ 1:ndims(σ))
end

which errors during compilation for a GPU kernel.
Here a solution can be to use a different form of sum

@kernel function _divergence(σ, u)
    I = @index(Global, Cartesian)
    σ[I] = sum(j -> u[I+δ(j),j]-u[I,j], 1:ndims(σ), init=zero(eltype(σ)))
end

although we have observed reduced performance with this latter version (more info in Discourse post #96658).
There are efforts in KA directed towards providing a reduction interface for kernels (see KA issue #234).

Limitations of the automatic kernel generation on loops

While the @loop macro that generates KA kernels is fairly general, it also has some limitations.
For example, it may have been noticed that we have not nested the loop over the dimensions d ∈ 1:ndims(σ) inside the kernel.
The reason behind this is that even though turning

for d ∈ 1:ndims(σ)
    @loop σ[I] += ∂(d, I, u) over I ∈ inside(σ)
end

into

@loop σ[I] = sum(d->∂(d, I, u), 1:ndims(σ)) over I ∈ inside(σ)

would reduce the number of kernel evaluations, the limitation of the sum function mentioned before makes this approach not as performant as writing a kernel for each dimension.
Also related to this issue is the fact that passing more than one expression per kernel would reduce the overall number of kernel evaluations, but gluing expressions together is not straightforward with the current implementation of @loop.

Beware of race conditions!

When moving from serial to parallel computations, race conditions are a recurring issue.
For WaterLily, this issue popped up in the linear solver used for the pressure Poisson equation.
Prior to the port, WaterLily relied on the Successive Over-Relaxation (SOR) method (a Gauss-Seidel-type solver), which uses (ordered) backsubstitution and is hence not suitable for parallel execution.
The solution here was simply to switch to a better-suited solver such as the Conjugate-Gradient method.
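
To see where the race comes from, a schematic 1D Gauss-Seidel/SOR sweep (an illustrative sketch, not WaterLily's actual solver) looks like

# ω is the relaxation factor; each update reads the already-updated x[i-1],
# so the iterations depend on the loop order and cannot be parallelized over i as-is
for i ∈ 2:n-1
    x[i] = (1-ω)*x[i] + ω*(b[i] + x[i-1] + x[i+1])/2
end

whereas Jacobi-style or Conjugate-Gradient iterations only involve whole-array operations without such an ordered dependency.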

Acknowledgements

Special thanks to Valentin Churavy for creating KernelAbstractions.jl and revising this article. And, of course, to Gabriel D. Weymouth for creating WaterLily.jl and for helping with the revision of this article too! 🙂
