Lucas Pluvinage – Implementing value speculation in OCaml


2023-05-06 13:36:26

CPUs are remarkably good at doing things in parallel, even in a single-core context. Indeed, speculative execution and instruction reordering help the CPU keep its pipeline full at all times. However, data dependencies in the instruction stream can force the CPU to wait for data, whether it comes from the L1 cache or from the much slower RAM. Francesco Mazzoli shows in his blog post, Beating the L1 cache with value speculation, that optimizing the critical path can in some cases yield huge performance gains.

This article demonstrates that this optimization can be implemented in the OCaml programming language. Understanding how OCaml values are represented in memory is helpful; here is a chapter of Real World OCaml on the topic: Memory Representation of Values.

# Fast list iterations using one simple trick™

Let's start by instantiating a linked list of 10000 numbers.

let l = List.init 10000 (fun _ -> 1024)

Now, let's sum the numbers 100k times and see how long that takes. To obtain these statistics, perf stat was used with the default settings on programs compiled with OCaml 5.0.

let rec sum (accumulator : int) (cell : int list) =
  match cell with
  | head :: tail -> sum (accumulator + head) tail
  | [] -> accumulator

let () =
  for _ = 1 to 100000 do
    ignore (sum 0 l)
  done
Performance counter stats for './a.out sum':

      1 368,91 msec task-clock:u        # 0,999 CPUs utilized
 4 835 597 233      cycles:u            # 3,532 GHz
 9 005 641 989      instructions:u      # 1,86 insn per cycle
 3 001 229 709      branches:u          # 2,192 G/sec
       195 819      branch-misses:u     # 0,01% of all branches

Conveniently, we're doing one billion additions (10000 list items, repeated 100k times). So each iteration takes:

  • 1.36 nanoseconds
  • 4.8 cycles
  • 9 instructions
  • 3 branches
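These per-iteration numbers are just the perf totals divided by the one billion additions. As a quick sanity check of that arithmetic (pure bookkeeping, nothing assumed beyond the figures above):

```ocaml
(* divide the perf stat totals by the number of additions performed *)
let iterations = float_of_int (10_000 * 100_000)

let ns_per_iter     = 1_368.91e6 /. iterations  (* 1 368,91 msec, in ns *)
let cycles_per_iter = 4_835_597_233. /. iterations
let insns_per_iter  = 9_005_641_989. /. iterations

let () =
  Printf.printf "%.2f ns, %.1f cycles, %.1f instructions per iteration\n"
    ns_per_iter cycles_per_iter insns_per_iter
```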

We can already see that the CPU is cramming multiple instructions into each cycle. Using the Compiler Explorer, the following assembly is obtained:


	subq	$8, %rsp
.L101:
	cmpq	(%r14), %r15         # check if the GC is waiting for us
	jbe	.L102                # branch #1
.L103:
	testb	$1, %bl              # check if at the end of the list
	je	.L100                # branch #2
	addq	$8, %rsp
	ret
.L100:
	movq	8(%rbx), %rdi        # load tail of list cell
	movq	(%rbx), %rbx         # load head of list cell
	leaq	-1(%rbx,%rax), %rax  # add it to the accumulator
	movq	%rdi, %rbx           # move tail for the next iteration
	jmp	.L101
.L102:
	call	caml_call_gc@PLT
	jmp	.L103

I tried to think like a CPU in order to explain these numbers from the perf results, but with so many moving parts involved I figured it would be hard to get everything exactly right. Instead, we'll try to build a good intuition by thinking in terms of data dependencies. The first assumption is that things that don't depend on each other can run in parallel. The second assumption is that, thanks to the branch predictor, branch instructions are speculated to have zero cost and do not introduce data dependencies: the CPU predicts whether the program will take the branch and keeps executing. In case of a misprediction, the CPU rolls back the computations for us.

So here is the data dependency chart for this program, showing how two iterations are expected to run, with the critical path in red:

[figure: data dependency chart]

Even when the tail pointer is served from cache, we still have to pay a 4-cycle cost to fetch it from the L1 cache.

The Check GC part of the loop is independent from the rest; it is useful in a multicore context, as every domain of the program has to synchronize when performing garbage collection.

# Value speculation

It's time to open the rune book and perform some dark magic. We know that we won't get past that 4-cycles-per-iteration bottleneck unless there is a way to bypass the pointer chasing. To do that, we'll do what was done in the value speculation article, but in pure OCaml.

The principle is the same: cell is converted to a "pointer" using the forbidden Obj.magic function. Using the value of the previous list cell pointer, we can compute delta, the number of bytes between the two last cells. This is used to estimate where the next cell will be. If the prediction is correct, the memory load of the next cell address is effectively bypassed, and the CPU can immediately start working on the next iteration. If not, the CPU rolls back the speculated computation.

let current : int = Obj.magic cell in
let delta : int = current - previous in
let prediction : int list = Obj.magic (current + delta) in

[figure: illustration of the delta computation mechanism]
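To make the arithmetic concrete, here is a toy version with made-up addresses (0x5000 and 0x5018 are purely illustrative; real cell addresses depend on the allocator):

```ocaml
(* two hypothetical consecutive cell addresses, 24 bytes apart *)
let previous = 0x5000
let current  = 0x5018

(* guess that the next cell sits at the same distance again *)
let delta = current - previous
let prediction = current + delta

let () = assert (prediction = 0x5030)
```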

Here is the full program. Note that the delta calculation was inlined for performance reasons.

(current + delta = current + (current - previous) = 2 * current - previous)

let rec seum (previous : int) (accu : int) (cell : int list) =
  match cell with
  | [] -> accu
  | head :: tail ->
    let current : int = Obj.magic cell in
    let prediction : int list =
      Obj.magic (2 * current - previous)
    in
    seum current
      (accu + head)
      (if prediction == tail then prediction else tail)
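Before benchmarking, it is worth convincing ourselves that the result is still exact: the == guard falls back to the real tail whenever the guess misses, so speculation can only affect timing, never the sum. A self-contained check (the function is reproduced so the sketch runs standalone):

```ocaml
let rec seum (previous : int) (accu : int) (cell : int list) =
  match cell with
  | [] -> accu
  | head :: tail ->
    let current : int = Obj.magic cell in
    let prediction : int list = Obj.magic (2 * current - previous) in
    seum current
      (accu + head)
      (* on a misprediction, fall back to the genuine tail pointer *)
      (if prediction == tail then prediction else tail)

let () =
  let l = List.init 10_000 (fun _ -> 1024) in
  assert (seum 0 0 l = 10_000 * 1024)
```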

The program is executed with the same input and…

Performance counter stats for './a.out seum':

        944,67 msec task-clock:u        # 1,000 CPUs utilized
 3 422 107 884      cycles:u            # 3,623 GHz
16 006 647 403      instructions:u      # 4,68 insn per cycle
 5 001 230 197      branches:u          # 5,294 G/sec
       225 781      branch-misses:u     # 0,00% of all branches

Per iteration:

  • 0.94 nanoseconds
  • 3.4 cycles
  • 16 instructions
  • 5 branches

We got past the bottleneck! What we're effectively doing is transforming the list iteration into an array iteration. And this works because there are situations where OCaml lists are allocated in a linear fashion, making the cell addresses predictable. Note in particular the large number of instructions per cycle.

If you are not convinced, here is an updated dependency diagram for this program.

[figure: seum dependency chart]

By moving the tail pointer load out of the critical path, the CPU is able to do more things at once. The following chart shows how you can expect the CPU to execute instructions for three iterations in parallel.

[figure: seum dependency chart, three iterations tiled]

# The crux of the hack

The astute reader will wonder how it is possible to use Obj.magic to transmute the int list into an int. Indeed, these OCaml types have different representations.

  • An int list is a pointer to an OCaml block allocated on the heap. Due to alignment, it always ends with a zero bit.
  • An int is a primitive OCaml value, the value n being represented as the tagged integer 2 * n + 1.

This means that the usual operations on int values will not work on raw pointers. For example, addition is implemented as Int.add a b = a + b - 1: adding two raw words with the OCaml addition subtracts one from the result to account for the expected tags.

For subtraction, it is similar: Int.sub a b = a - b + 1.

The magic happens when we combine an addition and a subtraction, so that the whole operation is correct on both pointers and integers:

Int.add current (Int.sub current previous)

= current + (current - previous + 1) - 1

= current + (current - previous)
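These identities can be checked with a small model of tagged-word arithmetic. The helpers tag, untag, raw_add and raw_sub below are illustrative, not part of the article's code; they mimic what the compiler emits on raw machine words:

```ocaml
(* an OCaml int n is stored as the machine word 2 * n + 1 *)
let tag n = 2 * n + 1
let untag w = (w - 1) / 2

(* the word-level operations the compiler emits for + and - *)
let raw_add a b = a + b - 1
let raw_sub a b = a - b + 1

let () =
  let a = tag 21 and b = tag 10 in
  assert (untag (raw_add a b) = 31);
  assert (untag (raw_sub a b) = 11);
  (* the combined add/sub is tag-correct, matching 2 * current - previous *)
  assert (untag (raw_add a (raw_sub a b)) = 2 * 21 - 10)
```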

Here is the full annotated assembly for the curious:


	subq	$8, %rsp
.L103:
	movq	%rax, %rsi
	movq	%rdi, %rax
	cmpq	(%r14), %r15         # check if the GC is waiting for us
	jbe	.L104                # branch #1
.L105:
	testb	$1, %al              # check if at the end of the list
	je	.L102                # branch #2
	movq	%rbx, %rax
	addq	$8, %rsp
	ret
.L102:
	movq	8(%rax), %rdx        # load tail of list cell
	movq	%rax, %rdi           # move cell pointer to %rdi
	salq	$1, %rdi             # multiply pointer by two
	subq	%rsi, %rdi           # subtract previous cell pointer to obtain the prediction
	cmpq	%rdx, %rdi           # compare tail with prediction
	jne	.L101
	jmp	.L100
.L101:
	movq	%rdx, %rdi           # prediction was incorrect, move the tail into %rdi
.L100:
	movq	(%rax), %rsi         # load head of list cell
	leaq	-1(%rbx,%rsi), %rbx  # add to the accumulator
	jmp	.L103                # loop
.L104:
	call	caml_call_gc@PLT
	jmp	.L105

# Optimized

I was not satisfied with this mere 45% improvement. What if the loop was unrolled, so that the CPU has more freedom to move instructions around?

let rec seum_unroll (previous : int) (accu : int) (cell : int list) =
  match cell with
  | [] -> accu
  | [a] -> accu + a
  | [a; b] -> accu + a + b
  | [a; b; c] -> accu + a + b + c
  | a :: b :: c :: d :: tail ->
    let current = Obj.magic cell in
    let prediction : int list =
      Obj.magic (2 * current - previous)
    in
    seum_unroll current
      (accu + a + b + c + d)
      (if prediction == tail then prediction else tail)

The idea is to iterate over the list in chunks of four items. There is still the same amount of work to do, but the value prediction trick is performed once every four elements.
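As with the non-unrolled version, correctness does not depend on the guess. A quick check against a plain fold, including a reversed list whose cell addresses are unlikely to follow a linear pattern (the function is reproduced so the sketch runs standalone):

```ocaml
let rec seum_unroll (previous : int) (accu : int) (cell : int list) =
  match cell with
  | [] -> accu
  | [a] -> accu + a
  | [a; b] -> accu + a + b
  | [a; b; c] -> accu + a + b + c
  | a :: b :: c :: d :: tail ->
    let current : int = Obj.magic cell in
    let prediction : int list = Obj.magic (2 * current - previous) in
    seum_unroll current
      (accu + a + b + c + d)
      (if prediction == tail then prediction else tail)

let () =
  (* 10_007 items also exercises the leftover [a; b; c] case *)
  let l = List.init 10_007 (fun i -> i) in
  let reference = List.fold_left ( + ) 0 l in
  assert (seum_unroll 0 0 l = reference);
  (* reversed allocation order: predictions should mostly miss, sum unchanged *)
  assert (seum_unroll 0 0 (List.rev l) = reference)
```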

Performance counter stats for './a.out seum_unroll':

        715,58 msec task-clock:u        # 0,999 CPUs utilized
 2 410 830 333      cycles:u            # 3,369 GHz
 7 756 543 257      instructions:u      # 3,22 insn per cycle
 2 001 229 820      branches:u          # 2,797 G/sec
       112 956      branch-misses:u     # 0,01% of all branches

Per iteration:

  • 0.72 nanoseconds
  • 2.4 cycles
  • 7.7 instructions
  • 2 branches

91.3% faster. Nice.

According to the original article, we're now faster than the naive C implementation, but 2x slower than the hand-optimized one. This is all very synthetic, but I found it quite interesting that OCaml can benefit from the same hacks as C. Of course this won't be useful in many situations, but maybe it gives some insight into how modern CPUs make programs go fast.

If you like this article, I would be glad to have your feedback.
Public discussion on Twitter
