Porting WaterLily.jl to a backend-agnostic solver
WaterLily.jl is a simple and fast fluid simulator written in pure Julia. It solves the unsteady incompressible 2D or 3D Navier-Stokes equations on a Cartesian grid.
The pressure Poisson equation is solved with a geometric multigrid method.
Solid boundaries are modelled using the Boundary Data Immersion Method.
v1.0 ports the solver from serial CPU execution to backend-agnostic execution, including multi-threaded CPU and GPUs from different vendors (NVIDIA and AMD), thanks to KernelAbstractions.jl (KA).
In this post, we review our approach to porting the code, together with the challenges we have faced.
Introducing KernelAbstractions.jl
The main ingredient of this port is the @kernel macro from KA.
This macro takes a function definition and converts it into a kernel specialised for a given backend. KA can work with CUDA, ROCm, oneAPI, and Metal backends.
As an example, consider the divergence operator for a 2D vector field $\vec{u}=(u, v)$.
In the finite-volume method (FVM), using a Cartesian (uniform) grid with unit cells, this is defined as
\[\begin{align}
\sigma = \iiint (\nabla\cdot\vec{u})\,\mathrm{d}V = \oiint \vec{u}\cdot\hat{n}\,\mathrm{d}S \rightarrow \sigma_{i,j} = (u_{i+1,j} - u_{i,j}) + (v_{i,j+1} - v_{i,j}),
\end{align}\]
where $i$ and $j$ are the indices of the discretised staggered grid.
In WaterLily, we define loops based on the CartesianIndex type such that I=(i,j,...), which allows writing an n-dimensional solver in a very straightforward manner.
With this, to compute the divergence of a 2D vector field we can use
δ(d,::CartesianIndex{D}) where {D} = CartesianIndex(ntuple(j -> j==d ? 1 : 0, D))
@inline ∂(a,I::CartesianIndex{D},u::AbstractArray{T,n}) where {D,T,n} = u[I+δ(a,I),a] - u[I,a]
inside(a) = CartesianIndices(ntuple(i-> 2:size(a)[i]-1, ndims(a)))

N = (10, 10) # domain size
σ = zeros(N) # scalar field
u = rand(N..., length(N)) # 2D vector field with ghost cells

for d ∈ 1:ndims(σ), I ∈ inside(σ)
    σ[I] += ∂(d, I, u)
end
where a loop over each dimension d and each Cartesian index I is used.
The function δ provides a unit Cartesian step in direction d; for a 2D index it returns, for example, CartesianIndex(1, 0) for d == 1, and for a 3D index it returns CartesianIndex(0, 1, 0) for d == 2.
This is used in the derivative function ∂ to implement the divergence equation as described above.
inside(σ) provides the CartesianIndices of σ excluding the ghost elements, i.e. a range of Cartesian indices starting at (2, 2) and ending at (9, 9) when size(σ) == (10, 10).
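As a quick sanity check of these helpers (a REPL sketch assuming the definitions above have been evaluated; the ∂ value shown is just its defining expression):
I = CartesianIndex(2, 2)
δ(1, I)                            # CartesianIndex(1, 0): unit step in direction 1
δ(2, I)                            # CartesianIndex(0, 1): unit step in direction 2
∂(1, I, u)                         # u[3, 2, 1] - u[2, 2, 1]
first(inside(σ)), last(inside(σ))  # (CartesianIndex(2, 2), CartesianIndex(9, 9))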
Note that the divergence operation is independent for each I, which is great because it means we can parallelize it!
This is where KA comes into play.
To generate the divergence operator using KA, we need to write the divergence kernel, which is just the divergence operator written in KA style.
We define the divergence kernel _divergence! and a wrapper divergence! as follows
using KernelAbstractions: get_backend, @index, @kernel

@kernel function _divergence!(σ, u, @Const(I0))
    I = @index(Global, Cartesian)
    I += I0
    σ_sum = zero(eltype(σ))
    for d ∈ 1:ndims(σ)
        σ_sum += ∂(d, I, u)
    end
    σ[I] = σ_sum
end

function divergence!(σ, u)
    R = inside(σ)
    _divergence!(get_backend(σ), 64)(σ, u, R[1]-oneunit(R[1]), ndrange=size(R))
end
Note that in the _divergence! kernel we again operate using Cartesian indices by calling @index(Global, Cartesian) from the KA @index macro.
The range of Cartesian indices is given by the ndrange argument in the wrapper function, where we pass the size of the inside(σ) Cartesian index range, and the backend is inferred with the get_backend method.
Also note that we pass an additional (constant) argument I0, which provides the offset to apply to the indices given by @index, since these naturally start at (1,1,...).
Using a workgroup size of 64 (the size of the group of threads acting in parallel, see the KA terminology), KA will parallelize over I by multi-threading on the CPU or the GPU.
In this regard, we just need to change the array type of σ and u from Array (CPU backend) to CuArray (NVIDIA GPU backend) or ROCArray (AMD GPU backend), and KA will specialise the kernel for the desired backend
using CUDA: CuArray

N = (10, 10)
σ = zeros(N) |> CuArray
u = rand(N..., length(N)) |> CuArray
divergence!(σ, u)
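Running on an AMD GPU is analogous; here is a minimal sketch, assuming AMDGPU.jl is installed and a ROCm-capable device is available:
using AMDGPU: ROCArray

N = (10, 10)
σ = zeros(N) |> ROCArray
u = rand(N..., length(N)) |> ROCArray
divergence!(σ, u)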
Automatic loop and kernel generation
As a stencil-based CFD solver, WaterLily heavily relies on for loops to iterate over the n-dimensional arrays.
To automate the generation of such loops, the macro @loop is defined
macro loop(args...)
    ex,_,itr = args
    op,I,R = itr.args
    @assert op ∈ (:(∈),:(in))
    return quote
        for $I ∈ $R
            $ex
        end
    end |> esc
end
This macro takes an expression such as @loop <expr> over I ∈ R, where R is a CartesianIndices range, and produces the loop for I ∈ R <expr> end.
For example, the serial divergence operator can now be simply defined using
for d ∈ 1:ndims(σ)
    @loop σ[I] += ∂(d, I, u) over I ∈ inside(σ)
end
which generates the I ∈ inside(σ) loop automatically.
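We can check this with @macroexpand; the output below is paraphrased (line-number comments and hygiene details omitted):
@macroexpand @loop σ[I] += ∂(1, I, u) over I ∈ inside(σ)
# ≈ quote
#       for I ∈ inside(σ)
#           σ[I] += ∂(1, I, u)
#       end
#   end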
Even though this could be seen as a small improvement (if any), the nice thing about writing loops with this approach is that the computationally-demanding part of the code can be abstracted out of the main workflow.
For example, it is easy to add performance macros such as @inbounds and/or @fastmath to every loop by changing the quote block in the @loop macro
macro loop(args...)
    ex,_,itr = args
    op,I,R = itr.args
    @assert op ∈ (:(∈),:(in))
    return quote
        @inbounds for $I ∈ $R
            @fastmath $ex
        end
    end |> esc
end
And, even nicer, we can also use this approach to automatically generate KA kernels for every loop in the code!
To do so, we modify @loop to generate the KA kernel using @kernel, together with the wrapper call that sets the backend and the workgroup size
macro loop(args...)
    ex,_,itr = args
    _,I,R = itr.args; sym = []
    grab!(sym,ex)     # get arguments and replace composites in `ex`
    setdiff!(sym,[I]) # don't want to pass I as an argument
    @gensym kern      # generate unique kernel function name
    return quote
        @kernel function $kern($(rep.(sym)...),@Const(I0)) # replace composite arguments
            $I = @index(Global,Cartesian)
            $I += I0
            $ex
        end
        $kern(get_backend($(sym[1])),64)($(sym...),$R[1]-oneunit($R[1]),ndrange=size($R))
    end |> esc
end
function grab!(sym,ex::Expr)
    ex.head == :. && return union!(sym,[ex])      # grab composite name and return
    start = ex.head==:(call) ? 2 : 1              # don't grab function names
    foreach(a->grab!(sym,a),ex.args[start:end])   # recurse into args
    ex.args[start:end] = rep.(ex.args[start:end]) # replace composites in args
end
grab!(sym,ex::Symbol) = union!(sym,[ex]) # grab symbol name
grab!(sym,ex) = nothing
rep(ex) = ex
rep(ex::Expr) = ex.head == :. ? Symbol(ex.args[2].value) : ex
The helper functions grab! and rep extract the arguments required by the expression ex; these, together with the Cartesian index range, are passed to the kernel.
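As an illustration (a REPL sketch assuming the helpers above are defined), applying grab! to the divergence expression collects the array and index symbols, and removing the loop index leaves the kernel arguments:
sym = []; ex = :(σ[I] += ∂(1, I, u))
grab!(sym, ex)      # sym == Any[:σ, :I, :u]
setdiff!(sym, [:I]) # sym == Any[:σ, :u], the arguments passed to the generated kernel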
The code generated by @loop and @kernel can be explored using @macroexpand. For example, for d=1,
@macroexpand @loop σ[I] += ∂(1, I, u) over I ∈ inside(σ)
we can observe that code for both CPU and GPU kernels is produced:
Generated code
@macroexpand @loopKA σ[I] += ∂(1, I, u) over I ∈ inside(σ)
quote
    begin
        function var"cpu_##kern#339"(__ctx__, σ, u, I0; )
            let I0 = (KernelAbstractions.constify)(I0)
                $(Expr(:aliasscope))
                begin
                    var"##N#341" = length((KernelAbstractions.__workitems_iterspace)(__ctx__))
                    begin
                        for var"##I#340" = (KernelAbstractions.__workitems_iterspace)(__ctx__)
                            (KernelAbstractions.__validindex)(__ctx__, var"##I#340") || continue
                            I = KernelAbstractions.__index_Global_Cartesian(__ctx__, var"##I#340")
                            begin
                                I += I0
                                σ[I] += ∂(1, I, u)
                            end
                        end
                    end
                end
                $(Expr(:popaliasscope))
                return nothing
            end
        end
        function var"gpu_##kern#339"(__ctx__, σ, u, I0; )
            let I0 = (KernelAbstractions.constify)(I0)
                if (KernelAbstractions.__validindex)(__ctx__)
                    begin
                        I = KernelAbstractions.__index_Global_Cartesian(__ctx__)
                        I += I0
                        σ[I] += ∂(1, I, u)
                    end
                end
                return nothing
            end
        end
        begin
            if !($(Expr(:isdefined, Symbol("##kern#339"))))
                begin
                    $(Expr(:meta, :doc))
                    var"##kern#339"(dev) = begin
                        var"##kern#339"(dev, (KernelAbstractions.NDIteration.DynamicSize)(), (KernelAbstractions.NDIteration.DynamicSize)())
                    end
                end
                var"##kern#339"(dev, size) = begin
                    var"##kern#339"(dev, (KernelAbstractions.NDIteration.StaticSize)(size), (KernelAbstractions.NDIteration.DynamicSize)())
                end
                var"##kern#339"(dev, size, range) = begin
                    var"##kern#339"(dev, (KernelAbstractions.NDIteration.StaticSize)(size), (KernelAbstractions.NDIteration.StaticSize)(range))
                end
                function var"##kern#339"(dev::Dev, sz::S, range::NDRange) where {Dev, S <: KernelAbstractions.NDIteration._Size, NDRange <: KernelAbstractions.NDIteration._Size}
                    if (KernelAbstractions.isgpu)(dev)
                        return (KernelAbstractions.construct)(dev, sz, range, var"gpu_##kern#339")
                    else
                        return (KernelAbstractions.construct)(dev, sz, range, var"cpu_##kern#339")
                    end
                end
            end
        end
    end
    (var"##kern#339"(get_backend(σ), 64))(σ, u, (inside(σ))[1] - oneunit((inside(σ))[1]), ndrange = size(inside(σ)))
end
The best feature we obtain when modifying @loop to produce KA kernels is that the divergence operator remains exactly the same as before using KA
for d ∈ 1:ndims(σ)
    @loop σ[I] += ∂(d, I, u) over I ∈ inside(σ)
end
This exact approach is what has allowed WaterLily to keep the same LOC as before using KA, just around 800!
Benchmarking
Now that we have all the pieces in place, we can benchmark the speedup achieved by KA compared to the serial execution using BenchmarkTools.jl.
Let's now gather all the code we have used and create a small benchmarking MWE (see below or download it here).
In this code showcase, we will refer to the serial CPU execution as "serial", the multi-threaded CPU execution as "CPU", and the GPU execution as "GPU":
using KernelAbstractions: get_backend, synchronize, @index, @kernel, @groupsize
using CUDA: CuArray
using BenchmarkTools

δ(d,::CartesianIndex{D}) where {D} = CartesianIndex(ntuple(j -> j==d ? 1 : 0, D))
@inline ∂(a,I::CartesianIndex{D},u::AbstractArray{T,n}) where {D,T,n} = u[I+δ(a,I),a]-u[I,a]
inside(a) = CartesianIndices(ntuple(i-> 2:size(a)[i]-1,ndims(a)))

# serial loop macro
macro loop(args...)
    ex,_,itr = args
    op,I,R = itr.args
    @assert op ∈ (:(∈),:(in))
    return quote
        for $I ∈ $R
            $ex
        end
    end |> esc
end

# KA-adapted loop macro
macro loopKA(args...)
    ex,_,itr = args
    _,I,R = itr.args; sym = []
    grab!(sym,ex)     # get arguments and replace composites in `ex`
    setdiff!(sym,[I]) # don't want to pass I as an argument
    @gensym kern      # generate unique kernel function name
    return quote
        @kernel function $kern($(rep.(sym)...),@Const(I0)) # replace composite arguments
            $I = @index(Global,Cartesian)
            $I += I0
            $ex
        end
        $kern(get_backend($(sym[1])),64)($(sym...),$R[1]-oneunit($R[1]),ndrange=size($R))
    end |> esc
end

function grab!(sym,ex::Expr)
    ex.head == :. && return union!(sym,[ex])      # grab composite name and return
    start = ex.head==:(call) ? 2 : 1              # don't grab function names
    foreach(a->grab!(sym,a),ex.args[start:end])   # recurse into args
    ex.args[start:end] = rep.(ex.args[start:end]) # replace composites in args
end
grab!(sym,ex::Symbol) = union!(sym,[ex]) # grab symbol name
grab!(sym,ex) = nothing
rep(ex) = ex
rep(ex::Expr) = ex.head == :. ? Symbol(ex.args[2].value) : ex

function divergence!(σ, u)
    for d ∈ 1:ndims(σ)
        @loop σ[I] += ∂(d, I, u) over I ∈ inside(σ)
    end
end
function divergenceKA!(σ, u)
    for d ∈ 1:ndims(σ)
        @loopKA σ[I] += ∂(d, I, u) over I ∈ inside(σ)
    end
end

N = (2^8, 2^8, 2^8)
# CPU serial arrays
σ_serial = zeros(N)
u_serial = rand(N..., length(N))
# CPU multi-threading arrays
σ_CPU = zeros(N)
u_CPU = copy(u_serial)
# GPU arrays
σ_GPU = zeros(N) |> CuArray
u_GPU = copy(u_serial) |> CuArray

# Benchmark warmup (force compilation) and validation
divergence!(σ_serial, u_serial)
divergenceKA!(σ_CPU, u_CPU)
divergenceKA!(σ_GPU, u_GPU)
@assert σ_serial ≈ σ_CPU ≈ σ_GPU |> Array

# Create and run benchmarks
suite = BenchmarkGroup()
suite["serial"] = @benchmarkable divergence!($σ_serial, $u_serial)
suite["CPU"] = @benchmarkable begin
    divergenceKA!($σ_CPU, $u_CPU)
    synchronize(get_backend($σ_CPU))
end
suite["GPU"] = @benchmarkable begin
    divergenceKA!($σ_GPU, $u_GPU)
    synchronize(get_backend($σ_GPU))
end
results = run(suite, verbose=true)
In this benchmark we have used a 3D array σ (scalar field) instead of the 2D array used before, hence demonstrating the n-dimensional capabilities of the present approach.
For N=(2^8,2^8,2^8), the following benchmark results are obtained on a 6-core laptop equipped with an NVIDIA GeForce GTX 1650 Ti GPU card
"CPU" => Trial(52.651 ms)
"GPU" => Trial(7.589 ms)
"serial" => Trial(234.347 ms)
The GPU execution yields a 30x speed-up compared to the serial execution, and a 7x speed-up compared to the multi-threaded CPU execution. The multi-threaded CPU execution yields a 4.5x speed-up compared to the serial execution (ideally this would be 6x on the 6-core machine).
As a final note for this section, see that synchronize is used when running the KA benchmarks. If it were not used, we would only be measuring the time it takes to launch a kernel, but not to actually run it.
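To illustrate the difference, a hypothetical timing sketch (not part of the MWE; t_launch and t_run are illustrative names and the numbers will vary):
# Without synchronize we mostly time the (asynchronous) kernel launch.
t_launch = @elapsed divergenceKA!(σ_GPU, u_GPU)
# With synchronize the host waits until the kernel has actually finished.
t_run = @elapsed begin
    divergenceKA!(σ_GPU, u_GPU)
    synchronize(get_backend(σ_GPU))
end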
Challenges
Porting the whole solver to GPU has been largely a learning exercise.
With no previous experience in software development for GPUs, KA smoothens the learning curve, so it is a great way to get started.
Of course, plenty of things do not just work out of the box, and we have faced some challenges while doing the port. Here are some of them.
Offset indices in KA kernels
Offset indices are important for boundary-value problems where arrays may contain both the solution and the boundary conditions of a problem.
In stencil-based finite-volume and finite-difference methods, the boundary elements are only accessed to compute the stencil, but not directly modified when looping through the solution elements of an array.
It is precisely in this scenario that offset indices are needed.
The KA @index macro only provides natural Julia indices (starting at 1), and this minor missing feature initially derailed us into using OffsetArrays.jl.
Of course this added complexity to the code, and we even observed degraded performance in some kernels.
Some time after this (more than we would like to admit), the idea of manually passing the offset index into the KA kernel took shape and quickly yielded a much cleaner solution.
Luckily, this feature will be natively supported in KA in the future (see KA issue #384).
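In short, the workaround is the pattern already used in _divergence! above: shift the natural index by a constant offset inside the kernel (minimal sketch):
R  = inside(σ)             # interior Cartesian indices, starting at (2, 2, ...)
I0 = R[1] - oneunit(R[1])  # offset, e.g. CartesianIndex(1, 1, ...) for one ghost layer
# inside the kernel: I = @index(Global, Cartesian) starts at (1, 1, ...),
# so I += I0 maps it onto the interior of σ.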
Inlining functions can be important in GPU kernels
In KA, GPU kernels are of course more delicate than CPU kernels when it comes to the functions that may be called inside them.
We have observed this sensitivity both at compilation time and at runtime.
For example, the δ function was originally implemented with multiple dispatch as
@inline δ(i,N::Int) = CartesianIndex(ntuple(j -> j==i ? 1 : 0, N))
δ(d,I::CartesianIndex{N}) where {N} = δ(d, N)
The main problem here is that this implementation is type-unstable, and without @inline the GPU kernel complained about a dynamic function call (see KA issue #392).
Another inline-related problem can be observed with the derivative function ∂.
When removing the @inline macro from its definition, the GPU performance decays significantly, and the GPU benchmark evens out with the CPU one.
This shows that the compiler can only apply its performance tricks when the information about the required instructions is not nested in functions external to the kernel.
Popular functions may not work inside kernels
Often we use functions such as norm from LinearAlgebra.jl to compute the norm of an array.
A surprise is that some of these do not work inside a kernel, since the GPU compiler is not equipped to handle them. Hence, they need to be manually written in a suitable form.
In this case, we use norm2(x) = √sum(abs2,x).
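For example (a hypothetical kernel, not taken from WaterLily), the hand-written norm2 compiles fine inside a kernel where LinearAlgebra.norm may not:
norm2(x) = √sum(abs2, x)

@kernel function _speed!(s, u)
    I = @index(Global, Cartesian)
    s[I] = norm2((u[I, 1], u[I, 2])) # 2D velocity magnitude at cell I
end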
Another example is the sum function using generator syntax, such as
@kernel function _divergence(σ, u)
    I = @index(Global, Cartesian)
    σ[I] = sum(u[I+δ(d,I),d]-u[I,d] for d ∈ 1:ndims(σ))
end
which errors during compilation of the GPU kernel.
Here, a solution can be to use a different form of sum
@kernel function _divergence(σ, u)
    I = @index(Global, Cartesian)
    σ[I] = sum(j -> u[I+δ(j,I),j]-u[I,j], 1:ndims(σ), init=zero(eltype(σ)))
end
although we have observed reduced performance with this latter version (more information in Discourse post #96658).
There are efforts in KA directed towards providing a reduction interface for kernels (see KA issue #234).
Limitations of the automatic kernel generation on loops
While the @loop macro that generates KA kernels is fairly general, it also has some limitations.
For example, it may have been noticed that we have not nested the loop over the dimensions d ∈ 1:ndims(σ) inside the kernel.
The reason behind this is that even though turning
for d ∈ 1:ndims(σ)
    @loop σ[I] += ∂(d, I, u) over I ∈ inside(σ)
end
into
@loop σ[I] = sum(d->∂(d, I, u), 1:ndims(σ)) over I ∈ inside(σ)
would reduce the number of kernel evaluations, the limitation of the sum function mentioned before makes this approach less performant than launching a kernel for each dimension.
Also related to this issue is the fact that passing more than one expression per kernel would reduce the overall number of kernel evaluations, but gluing expressions together is not straightforward with the current implementation of @loop.
Beware of race conditions!
When moving from serial to parallel computations, race conditions are a recurring issue.
For WaterLily, this issue popped up in the linear solver used for the pressure Poisson equation.
Prior to the port, WaterLily relied on the Successive Over-Relaxation (SOR) method (a Gauss-Seidel-type solver), which uses (ordered) backsubstitution and is hence not suitable for parallel execution.
The solution here was simply to switch to a better-suited solver, such as the Conjugate-Gradient method.
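To make the race condition concrete, here is a hypothetical 2D Poisson sketch (not WaterLily's actual solver; ω, x, xold, and b are illustrative names):
# Gauss-Seidel/SOR sweep: x[I] reads neighbours that may already have been updated
# in this same sweep, so the result depends on the iteration order. That is a race
# condition if the loop over I runs in parallel.
for I ∈ inside(x)
    x[I] = (1-ω)*x[I] + ω*(x[I+δ(1,I)] + x[I-δ(1,I)] + x[I+δ(2,I)] + x[I-δ(2,I)] - b[I])/4
end

# Jacobi-style update (or a Conjugate-Gradient iteration): only the previous iterate
# xold is read, so every I is independent and safe to turn into a KA kernel.
@loop x[I] = (xold[I+δ(1,I)] + xold[I-δ(1,I)] + xold[I+δ(2,I)] + xold[I-δ(2,I)] - b[I])/4 over I ∈ inside(x)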
Acknowledgements
Special thanks to Valentin Churavy for creating KernelAbstractions.jl and for revising this article. And, of course, to Gabriel D. Weymouth for creating WaterLily.jl and for helping with the revision of this article too! 🙂