SIMD in Pure Python
By David Buchanan, 4th January 2024
First of all, this article is an exercise in recreational "because I can" programming. If you just want to make your Python code go fast, this is perhaps not the article for you. And perhaps Python isn't the language you want, either!

By the end, I'll explain how I implemented Game of Life in pure Python (plus pysdl2 for graphics output) running at 4K resolution at 180fps, which represents a ~3800x speedup over a naive implementation.
If you're already familiar with SIMD and vectorization, you might want to skip this next section.
A Brief Introduction to SIMD
If you want to get the most performance out of a modern CPU, you're probably going to be using SIMD instructions – Single Instruction, Multiple Data.

For example, if you have two arrays of length 4, each containing 32-bit integers, most modern CPUs will let you add the pairwise elements together in a single machine instruction (assuming you've already loaded the data into registers). Hopefully it's obvious why this is more efficient than looping over the individual array elements. Intel CPUs have had this specific capability since SSE2 was introduced in the year 2000, but SIMD as a concept predates it by a lot.

Newer CPU cores have been expanding on these capabilities ever since, meaning SIMD instructions are more relevant than ever for maximising CPU throughput.

If you're programming in a language like C, an optimising compiler will recognise code that can be accelerated using SIMD instructions, and automatically emit appropriate machine code. Compilers can't optimise everything perfectly, though, so anyone who wants to squeeze out the maximum performance might end up using Intrinsics to explicitly tell the compiler which instructions to use. And if that still isn't enough, you might end up programming directly in assembly.

There are many more ways to express SIMD programs, but the rest of them are out of scope for this article!

"Vectorization" is the process of transforming a conventional program into one that operates over whole arrays of data (i.e. vectors) at once (for example, using SIMD). The work done by the optimising compiler described above, or by the human writing Intrinsic operations, is vectorization.
A Brief Introduction to CPython
CPython is the reference implementation of the Python language, written mostly in C, hence the name. Other implementations exist, but when people say "Python" they're usually implicitly referring to CPython. I'll try to only say Python when I'm referring to the language as a whole, and CPython when I'm talking about a CPython implementation detail (which we'll be getting into later).

The TL;DR is that CPython compiles your code into a bytecode format, and then interprets that bytecode at run-time. I'll be referring to that bytecode interpreter as the "VM".
SIMD in Python
Python doesn't natively have a concept of SIMD. However, libraries like NumPy exist, allowing for relatively efficient vectorized code. NumPy lets you define vectors, and even n-dimensional arrays, and perform operations on them in a single API call.
    import numpy as np

    a = np.array([1, 2, 3, 4])
    b = np.array([2, 4, 6, 8])
    print(a + b)  # [ 3  6  9 12]
Without NumPy, the above example would require a loop over the array elements (or a list comprehension, which is fancy syntax for the same thing, more or less).
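For comparison, the loop-based equivalent might look like this (my own sketch):

    a = [1, 2, 3, 4]
    b = [2, 4, 6, 8]

    # element-wise addition without NumPy: the VM executes one iteration per element
    print([x + y for x, y in zip(a, b)])  # [3, 6, 9, 12]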
Internally, NumPy is implemented using native C extensions, which in turn use Intrinsics to express SIMD operations. I'm not an expert on NumPy implementation details, but you can peruse their SIMD code here. Note that the code has been customised for various CPU architectures.

CPython itself, being an interpreted Python implementation, is slow. But if you can structure your program so that all the "real work" gets done inside a library like NumPy, it can be surprisingly efficient overall.

NumPy is great and widely used for getting real work done. However, NumPy is not "pure" Python!
SIMD in Pure Python
By "pure," I mean using only functionality built into the Python language itself, or the Python standard library.

This is a completely arbitrary and self-imposed constraint, but I think it's a fun one to work within. It's also vaguely useful, since libraries like NumPy aren't available in certain environments.
Earlier, I said Python doesn't natively have a concept of SIMD. This isn't entirely true; otherwise the article would end here. Python supports bitwise operations over pairs of integers: AND (&), OR (|), XOR (^). If you think of these as operations over vectors of booleans, each bit being one bool, it's SIMD!
Unlike many other programming languages, Python integers have unlimited precision. That is, they can exactly represent integers containing arbitrarily many digits (at least, until you run out of memory). This means we can evaluate an unlimited number of conceptually-parallel boolean operations with a single Python operator.
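To make that concrete, here's a tiny demo of my own: a single Python-level operator acting on a million bits at once.

    import os

    a = (1 << 1_000_000) - 1                           # one million 1-bits
    b = int.from_bytes(os.urandom(125_000), "little")  # one million random bits

    # one ^ operator, one million conceptually-parallel boolean operations
    c = a ^ b  # flips every one of b's million bits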
SIMD over booleans might sound esoteric, but it's an idea we can immediately put to work. A common operation in cryptography is to XOR two byte buffers together. An idiomatic implementation might look like this:
    def xor_bytes(a: bytes, b: bytes) -> bytes:
        assert(len(a) == len(b))
        return bytes(x ^ y for x, y in zip(a, b))
This takes each individual byte (as an integer between 0 and 255) in a and b, applies the xor operator to each pair, and constructs a new bytes object from the results. Arguably, we're already using the boolean-SIMD concept here; each of the 8 bits in a byte is getting XORed in parallel. But we can do better than that:
    def xor_bytes_simd(a: bytes, b: bytes) -> bytes:
        assert(len(a) == len(b))
        return (
            int.from_bytes(a, "little") ^ int.from_bytes(b, "little")
        ).to_bytes(len(a), "little")
This might look a bit ridiculous, and that's because it is. We convert each bytes object into an arbitrary-precision integer, XOR those two integers together, and then convert the resulting integer back to bytes. The Python ^ operator only gets executed once, processing all the data in a single step (or more explicitly, one CPython VM operation). I'll call this approach "pseudo-SIMD". But with all this conversion between bytes and integers, surely it must be slower overall? Let's benchmark it.
Here are my results. The number is the time taken to execute 1 million iterations. I tested on CPython 3.11 on an M1 Pro MacBook. Try it on your own machine!
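If you want to reproduce these numbers, a minimal harness along these lines should work (my own sketch, not necessarily the exact benchmark used here):

    import timeit

    for n in (1, 16, 128, 1024):
        a = bytes(n)        # n zero bytes
        b = b"\x55" * n
        t = timeit.timeit(lambda: xor_bytes(a, b), number=1_000_000)
        print(f"naive, n={n}", t)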
    naive, n=1     0.3557752799242735
    simd,  n=1     0.21655898913741112
    numpy, n=1     0.798536550020799
    naive, n=16    0.8749550790525973
    simd,  n=16    0.23561427788808942
    numpy, n=16    0.7937424059491605
    naive, n=128   4.5441425608005375
    simd,  n=128   0.5077524171210825
    numpy, n=128   0.8012108108960092
    naive, n=1024  34.96425646613352
    simd,  n=1024  2.811028849100694
    numpy, n=1024  0.9388492209836841
Even for the trivial case of n=1, somehow our wacky pseudo-SIMD function wins, by about a third!

The difference becomes even more pronounced as the buffer size increases, with the pseudo-SIMD function being 12x faster for n=1024.

I also threw a numpy implementation into the mix. This isn't really fair on numpy, because converting the bytes to and from numpy arrays appears to have quite a high constant overhead. In more realistic numpy code, you'd end up keeping your data in numpy format throughout. Because of this, numpy ended up slower all the way until n=1024, where it became 3x faster. Evidently, those constant overheads become less relevant as n grows.
But what if we eliminate the conversion overheads, allowing our functions to input and output data in their "native" formats, rather than bytes? For pseudo-SIMD, that format is integers, and for numpy it's an np.array object.
    simd,  n=1       0.03484031208790839
    numpy, n=1       0.2326297229155898
    simd,  n=16      0.042713511968031526
    numpy, n=16      0.23679199093021452
    simd,  n=128     0.046673570992425084
    numpy, n=128     0.23861141502857208
    simd,  n=1024    0.08742194902151823
    numpy, n=1024    0.28949279501102865
    simd,  n=32768   0.9535991169977933
    numpy, n=32768   0.9617231499869376
    simd,  n=131072  4.984845655970275
    numpy, n=131072  4.609246583888307
Pseudo-SIMD has the edge for small inputs (perhaps because it doesn't have to do any FFI), but for large buffers, numpy edges ahead. But only barely! How is our pure-python XOR function (which at this point is just the XOR operator itself) able to keep up with NumPy's optimised SIMD code?
CPython Internals
Let's take a closer look. Here's the inner loop of the code for XORing together two arbitrary-precision integers in CPython 3.11.
    for (i = 0; i < size_b; ++i)
        z->ob_digit[i] = a->ob_digit[i] ^ b->ob_digit[i];
It's a simple loop over arrays. No SIMD instructions? Well, there are no explicit ones, but what does the C compiler do with it? Let's take an even closer look.
I loaded libpython into Ghidra and had a look around. The library on my system didn't have full symbols, so I searched for cross-references to the exported symbol _PyLong_New. There were 82 matches, but by an extremely weird stroke of luck it was the first function I clicked on.

On my system, the (aarch64) assembly corresponding to the above loop is as follows:
         00299ec0 01 03 80 d2     mov        x1,#0x18
         00299ec4 00 00 80 d2     mov        x0,#0x0
                      LAB_00299ec8                     XREF[1]: 00299ee4(j)
    ,->  00299ec8 80 6a e1 3c     ldr        q0,[x20,x1]
    |    00299ecc 00 04 00 91     add        x0,x0,#0x1
    |    00299ed0 a1 6a e1 3c     ldr        q1,[x21,x1]
    |    00299ed4 00 1c 21 6e     eor        v0.16B,v0.16B,v1.16B
    |    00299ed8 e0 6a a1 3c     str        q0,[x23,x1]
    |    00299edc 21 40 00 91     add        x1,x1,#0x10
    |    00299ee0 5f 00 00 eb     cmp        x2,x0
    `--  00299ee4 21 ff ff 54     b.ne       LAB_00299ec8
         00299ee8 61 f6 7e 92     and        x1,x19,#-0x4
         00299eec 7f 06 40 f2     tst        x19,#0x3
         00299ef0 e0 02 00 54     b.eq       LAB_00299f4c
                      LAB_00299ef4                     XREF[1]: 0029a2a0(j)
         00299ef4 20 f4 7e d3     lsl        x0,x1,#0x2
         00299ef8 22 04 00 91     add        x2,x1,#0x1
         00299efc 85 02 00 8b     add        x5,x20,x0
         00299f00 a4 02 00 8b     add        x4,x21,x0
         00299f04 e0 02 00 8b     add        x0,x23,x0
         00299f08 86 18 40 b9     ldr        w6,[x4, #0x18]
         00299f0c a3 18 40 b9     ldr        w3,[x5, #0x18]
         00299f10 63 00 06 4a     eor        w3,w3,w6
         00299f14 03 18 00 b9     str        w3,[x0, #0x18]
         00299f18 7f 02 02 eb     cmp        x19,x2
         00299f1c 8d 01 00 54     b.le       LAB_00299f4c
         00299f20 83 1c 40 b9     ldr        w3,[x4, #0x1c]
         00299f24 21 08 00 91     add        x1,x1,#0x2
         00299f28 a2 1c 40 b9     ldr        w2,[x5, #0x1c]
         00299f2c 42 00 03 4a     eor        w2,w2,w3
         00299f30 02 1c 00 b9     str        w2,[x0, #0x1c]
         00299f34 7f 02 01 eb     cmp        x19,x1
         00299f38 ad 00 00 54     b.le       LAB_00299f4c
         00299f3c 82 20 40 b9     ldr        w2,[x4, #0x20]
         00299f40 a1 20 40 b9     ldr        w1,[x5, #0x20]
         00299f44 21 00 02 4a     eor        w1,w1,w2
         00299f48 01 20 00 b9     str        w1,[x0, #0x20]
                      LAB_00299f4c
If you're not a reverse engineer, or even if you are, you're probably thinking "WTF is going on here?" This is a fairly typical result of compiler auto-vectorization. Ghidra does a terrible job of converting it to meaningful pseudocode, so I'll provide my own version (note, this is not a 1:1 mapping, but it should convey the general idea)
    i = 0
    words_left_to_xor = size_b

    while words_left_to_xor > 3:
        # xor 4 words (16 bytes) at once, using the q0, q1 registers
        z[i:i+4] = a[i:i+4] ^ b[i:i+4]
        i += 4
        words_left_to_xor -= 4

    # deal with any remaining 32-bit words individually
    if words_left_to_xor > 0:
        z[i] = a[i] ^ b[i]
    if words_left_to_xor > 1:
        z[i+1] = a[i+1] ^ b[i+1]
    if words_left_to_xor > 2:
        z[i+2] = a[i+2] ^ b[i+2]
The main loop operates on the q0 and q1 registers, which according to the ARM docs are 128-bit wide NEON registers. As far as I can tell, NEON doesn't stand for anything in particular, but it's what ARM calls its SIMD features (by the way, their "next-gen" SIMD instruction set is called SVE).

After the main loop, it XORs the remaining 32-bit words, one at a time.
The key observation here is that our "pseudo-SIMD" implementation is using real SIMD instructions under the hood! At least, it is on my system; this may depend on your platform, compiler, configuration, and so on.

If we limit ourselves to bitwise operators, and our numbers are big enough that the interpreter overhead is small (in relative terms), we can get real SIMD performance speedups in pure Python code. Well, kinda. If you were implementing a specific algorithm in assembly, you could write much more tightly optimised SIMD routines, keeping data in registers where possible, avoiding unnecessary loads and stores from memory, and so on.
Another big caveat with this approach is that it involves creating a whole new integer to hold the result. This wastes memory, puts pressure on the memory allocator/gc, and perhaps most importantly, it wastes memory bandwidth. Although you can write a ^= b in Python to signify an in-place XOR operation, it still ends up internally allocating a new object to store the result.
If you're wondering why the NumPy implementation was still slightly faster, I believe the answer lies in the way CPython represents its integers. Each entry in the ob_digit array only holds 30 bits of the overall number. I'm guessing this makes carry propagation simpler to handle during arithmetic operations. It also means the in-memory representation has a ~7% overhead compared to optimal packing (30 bits of payload per 32-bit word). While I haven't checked NumPy's implementation details, I imagine they pack array elements tightly.
Doing Useful Work
Now that we know we can do efficient-ish bitwise SIMD operations, can we build something useful out of them?

One use case is bitsliced cryptography. Here's my implementation of bitsliced AES-128-ECB in pure Python. It's over 20x faster than the next fastest pure-python AES implementation I could find, and in theory it's safer too, due to not having any data-dependent array indexing (but I still wouldn't trust it as a secure implementation; use a proper cryptography library!)

For a more detailed introduction to bitslicing, check out this article. The idea is to express your whole algorithm as a circuit of logic gates, or in other words, a bunch of boolean expressions. You can do this for any computation, but AES is especially amenable to it. Once you have a boolean expression, you can use bit-parallel operations (i.e., bitwise SIMD operations) to compute multiple instances of your algorithm in parallel. Since AES is a block-based cipher, you can use this idea to compute multiple AES cipher blocks simultaneously.
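As a toy illustration of the idea (my own sketch, vastly simpler than AES): a bitsliced 1-bit full adder, evaluating one independent instance per bit position of its inputs.

    def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
        # each argument packs many parallel instances, one per bit position
        s = a ^ b ^ cin
        cout = (a & b) | (cin & (a ^ b))
        return s, cout

    # four instances at once: instance i lives in bit i of each integer
    s, cout = full_adder(0b1100, 0b1010, 0b1111)
    print(bin(s), bin(cout))  # 0b1001 0b1110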
SWAR
We can use Python integers for more than just parallel bitwise operations. We can use them for parallel additions, too!
    a = 0x4567  # this represents 4 4-bit unsigned integers, [4, 5, 6, 7]
    b = 0x6789  # as does this: [6, 7, 8, 9]

    print(hex(a + b))  # 0xacf0 => [0xa, 0xc, 0xf, 0x0] == [10, 12, 15, 0]

    # oh no, that's the wrong answer... 6+8 should be 14, not 15.
    # it's wrong because the result of 9+7 was 16 (0x10), causing carry propagation
    # into the adjacent "lane".

    # solution: padding and masking:

    a = 0x04050607
    b = 0x06070809
    m = 0x0f0f0f0f
    print(hex((a + b) & m))  # 0xa0c0e00 => [0xa, 0xc, 0xe, 0x0] == [10, 12, 14, 0]
As shown here, we can pack multiple fixed-width integers into a single Python integer, and add them all together at once. However, if they're tightly packed and an integer overflow occurs, it causes unwanted carry propagation between lanes. The solution is simple: space them out a bit. In this example I use a generous 4 bits of padding to make things more obvious, but in principle you only need a single bit of padding. Finally, we use the bitwise AND operator to mask off any overflow bits. If we didn't do this, the overflowed bits could accumulate over the course of multiple additions and start causing overflows between lanes again. The more padding bits you use between lanes, the more chained additions you can survive before masking is required.

You can do similar things for subtraction, multiplication by a small constant, and bit shifts/rotations, as long as you have enough padding bits to prevent overflows in each scenario.
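Here's the same packing/masking pattern wrapped up in helpers (the names and lane widths are my own choices):

    def pack(values, lane_bits=8):
        # pack small unsigned ints into one bigint, one value per lane
        acc = 0
        for i, v in enumerate(values):
            acc |= v << (i * lane_bits)
        return acc

    def unpack(acc, count, value_bits=4, lane_bits=8):
        mask = (1 << value_bits) - 1
        return [(acc >> (i * lane_bits)) & mask for i in range(count)]

    a = pack([4, 5, 6, 7])
    b = pack([6, 7, 8, 9])
    m = pack([0xf] * 4)            # mask keeping the low 4 bits of each lane
    print(unpack((a + b) & m, 4))  # [10, 12, 14, 0] - the 7+9 lane overflowed and was masked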
The general term for this concept is SWAR, which stands for SIMD Within A Register. But here, rather than using a machine register, we're using an arbitrarily long Python integer. I'm calling this variant SWAB: SIMD Within A Bigint.

SWAB is a useful idea in Python because it maximises the amount of work done per VM instruction, reducing the relative interpreter overhead; the CPU gets to spend the majority of its time in the fast native code that implements the integer operations.
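You can see what "per VM instruction" means using the dis module; on CPython 3.11, the whole XOR is a single BINARY_OP (output abbreviated, and it may differ between versions):

    import dis

    dis.dis("a ^ b")
    # LOAD_NAME     0 (a)
    # LOAD_NAME     1 (b)
    # BINARY_OP    12 (^)
    # RETURN_VALUE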
Doing ~~Useful Work~~ Something Fun
That's enough theory; now it's time to make Game of Life go fast. First, let me describe the "problem" I'm trying to solve. I'll assume you're already broadly familiar with Game of Life, but if not, go read the Wikipedia article.

There are some very clever algorithms for making GoL go fast, the most famous being Hashlife, which works by recognising repeating patterns in both space and time. However, my favourite GoL pattern to simulate is "soup", i.e., a large random starting grid. These chaotic patterns aren't well suited to the Hashlife algorithm, so we need to go back to basics.

When you simulate soup under the classic GoL ruleset, it typically dies out after a few thousand generations, leaving behind a rather boring arrangement of oscillating "ash" (which is back to something Hashlife can simulate quickly). I prefer my simulations to live on in perpetual chaos, and there's a variant of the classic ruleset that more or less guarantees this, called DryLife (where a dead cell also comes alive if it has exactly 7 neighbours). I like DryLife because it still exhibits most of the familiar GoL behaviours (for example, gliders) and yet the soup lives on indefinitely, creating a pleasing screensaver-like animation.

The "obvious" implementation of the inner loop of the GoL algorithm looks something like this:
    for y in range(1, height + 1):
        for x in range(width):
            neighbor_count = sum(
                get_cell(state, (x + dx) % width, y + dy)
                for dx, dy in [
                    (-1, -1), (0, -1), (1, -1),
                    (-1,  0),          (1,  0),
                    (-1,  1), (0,  1), (1,  1)
                ]
            )
            this_cell = get_cell(state, x, y)
            next_value = neighbor_count == 3 or (this_cell and neighbor_count == 2)
            if cfg.drylife:  # another chance for dead cells to come alive
                next_value |= (not this_cell) and neighbor_count == 7
            set_cell(next_state, x, y, next_value)
It iterates over every cell in the grid, counts up its 8 immediate neighbours, and applies the rules to decide whether the cell should be alive or not in the next generation. In big-O terms, this is an O(n) algorithm, where n is the number of cells in the grid (i.e., width × height).
Although it's not made explicit in the snippet above, the state of the cells is stored in a big array, accessed through the get_cell and set_cell helper functions. What if, instead of using an array, we stored the whole state in a single very long integer, and used SWAB arithmetic to process the whole thing at once? The trickiest part is counting up the neighbours, which can sum to at most 8 (or 9 if we also count the initial cell value). That's a 4-bit value, and we can be sure it'll never overflow into a 5-bit value, so we can store each cell as a 4-bit wide "SWAB lane". Without further ado, here's the equivalent inner-loop code:
    # count neighbors
    summed = state
    summed += (summed >> 4) + (summed << 4)
    summed += (summed >> COLSHIFT) + (summed << COLSHIFT)

    # check if there are exactly 3 neighbors
    has_3_neighbors = summed ^ MASK_NOT_3  # at this point, a value of all 1s means it was originally 3
    has_3_neighbors &= has_3_neighbors >> 2  # fold in half
    has_3_neighbors &= has_3_neighbors >> 1  # fold in half again

    # check if there are exactly 4 neighbors
    has_4_neighbors = summed ^ MASK_NOT_4  # at this point, a value of all 1s means it was originally 4
    has_4_neighbors &= has_4_neighbors >> 2  # fold in half
    has_4_neighbors &= has_4_neighbors >> 1  # fold in half again

    if cfg.drylife:
        # check if there are exactly 7 neighbors
        has_7_neighbors = summed ^ MASK_NOT_7  # at this point, a value of all 1s means it was originally 7
        has_7_neighbors &= has_7_neighbors >> 2  # fold in half
        has_7_neighbors &= has_7_neighbors >> 1  # fold in half again

        # variable name here is misleading...
        has_3_neighbors |= (~state) & has_7_neighbors

    # apply game-of-life rules
    state = (has_3_neighbors | (state & has_4_neighbors)) & MASK_CANVAS
(I've omitted some details like wraparound handling; you can see the full code, including the definitions of those magic constants, here)

Aside from the different state representation, this achieves exactly the same thing as the previous snippet. The loop over every cell has been completely eliminated! Well, almost. There are only O(1) CPython VM operations, but there is still O(n) work being done, hidden away inside the SIMD-accelerated native code of CPython's bigint arithmetic routines. This is a huge performance win, which I'll quantify later. But first, how do we turn that state integer into pixels on the screen?
Blitting
To get pixels on the screen, we need to convert the data into a more standard format. The very first step is easy: Python integers have a to_bytes method that serialises them into bytes (as I used in the XOR function example earlier in this article). What to do with those bytes next is less obvious.
In the spirit of only using "pure" Python, I came up with a ridiculous hack: craft a compressed gzip stream that, when decompressed, converts the weird 4-bits-per-pixel buffer into a more standard 8-bits-per-pixel grayscale buffer, and surrounds it in the necessary framing data to be a YUV4MPEG video stream. The output of the Python script can be piped into a gzip decompressor, and subsequently, a video player. That code is here, and probably deserves an article of its own, but I'm not going to go into it today.

While this was a great hack, gzip is not an especially efficient blitter. I was able to get about 24fps at full-screen resolution on my 2021 MacBook, but I really wanted at least 60fps, and that approach wasn't going to cut it.

The ultimate approach would probably be to send the 4bpp data off to the GPU as a texture, as-is, and write a fragment shader capable of unpacking it onto the screen pixels. The only reason I didn't do this is because it feels silly. It feels silly because if we're doing GPU programming, we might as well just implement the whole GoL algorithm on the GPU. It would be way faster, but it wouldn't be within the spirit of the completely arbitrary constraints I'd set for myself.
My compromise was to use SDL2, via the pysdl2 bindings. Sure, it isn't "pure Python," but it is very standard and widely available. It feels "right" because I can use it to its full extent without completely defeating the point (like running GoL entirely on the GPU would).
SDL2 supports a pixel format called SDL_PIXELFORMAT_INDEX4LSB, which is a 4-bit paletted mode. If we set up an appropriate palette (specifying the colours for "dead" and "alive" cells, respectively), then we can pass our 4bpp buffer to SDL as-is, and it'll know how to convert it into the correct format for sending to the GPU (in this case, SDL_PIXELFORMAT_ARGB8888). This process isn't terribly efficient, because the conversion still happens on the CPU, and the amount of data sent to the GPU is much larger than necessary. Despite that, it's a whole lot faster than the gzip method, reaching around 48fps.
Parallelization
We still haven't hit the 60fps target, and to get there I added some parallelization. I summarised my approach in a code comment:
""" ┌───────────┐ Graphics Course of │ ┌────┐ │ ┌───────────────────────────────┐ │ │ ┌─▼────┴─┐ │ ┌─────────┐ ┌────────────┐ │ │ │ │ Life ├─────► Blitter ├──► │ │ │ │ └─┬────▲─┘ │ └─────────┘ │ │ │ ▼ │ ┌─▼────┴─┐ │ ┌─────────┐ │ │ │ │ ▲ │ Life ├─────► Blitter ├──► │ │ │ │ └─┬────▲─┘ │ └─────────┘ │ GUI │ │ │ │ ┌─▼────┴─┐ │ ┌─────────┐ │ Renderer │ │ │ │ │ Life ├─────► Blitter ├──► │ │ │ │ └─┬────▲─┘ │ └─────────┘ │ │ │ ▼ │ ┌─▼────┴─┐ │ ┌─────────┐ │ │ │ │ ▲ │ Life ├─────► Blitter ├──► │ │ │ │ └─┬────▲─┘ │ └─────────┘ └────────────┘ │ │ └────┘ │ └───────────────────────────────┘ └───────────┘ "Life" threads implement the SWAR life algorithm, for a horizontal strip of the general canvas. Blitter threads use SDL2 capabilities to unpack the SWAR buffers into RGBA8888 surfaces, that are handed to the primary Renderer thread. The renderer thread is chargeable for importing the surfaces to a GPU texture, and making it present up on the display screen. It is also chargeable for coping with SDL occasions. Every Life thread lives in its personal course of, to keep away from the GIL. They discuss to one another (for overlap/wraparound), and to the Blitter threads (to report their outcomes), utilizing Pipes. Every part else occurs in the primary course of, so the Blitters can discuss to the most important thread utilizing commonplace Queues - the laborious work right here is finished inside SDL2, which doesn't maintain the GIL (that means we are able to use multithreading, as opposed to multiprocessing). """ |
That's enough to smash past the 60fps target, reaching ~250fps at my full-screen resolution of 3024×1890, using 8 parallel Life processes. At 4K resolution (3840×2160), it can reach 180fps.

There are plenty of remaining inefficiencies that could be improved on, but 250fps is already way faster than I care about, so I'm not going to optimise any further. I think 60fps is the most visually interesting speed to watch, anyway (which can be achieved by turning on vsync).

Edit: After having some people test on non-Apple-silicon systems, many are struggling to hit 4K60fps. I haven't done any profiling, but my guess is that the bottleneck is CPU->GPU bandwidth, or maybe just memory bandwidth in general. I might revisit this in the future, perhaps implementing the buffer-unpacking fragment shader idea I mentioned.
Earlier, I said that upgrading from the naive implementation to SWAB gave "huge" performance wins, so just how huge are they? If I go back to the naive approach, I get an incredible 0.4fps at 4K resolution (and that's with 8x parallelism), representing a ~450x performance difference. If I get extra mean and force it to run on a single thread, it's a 3800x difference relative to the fully optimised version. Damn!

As a reminder, both approaches are O(n), but the faster version gets to spend more of its time inside efficient native code.

If I rewrote this whole thing in C using SIMD intrinsics (combined with SWAR or other tricks), I predict that I could get somewhere between 10x and 100x further speedup, due to more efficient use of SIMD registers and memory accesses. A GPU implementation could be faster still. Those speedups would be nice to have, but I think it's interesting how far I was able to get just by optimising the Python version.
The source code is available here.