Explaining my fast 6502 code generator
To learn how optimizing compilers are made, I built one targeting the 6502 architecture.
In a strange twist, my compiler generates faster code than GCC, LLVM, and every other compiler I compared it to.
I don't think my compiler is doing more when it comes to high-level optimizations,
so the gains must come from the code generation side.
This makes sense, as most compilers are multi-target, with backends
designed for modern RISC-like systems, not the ancient 6502.
It doesn't matter how good GCC or LLVM's high-level optimizations are
if they falter on the final leg of the race.
Still, my compiler also beats those designed for retro and embedded systems, like VBCC, SDCC, and KickC.
For that reason, it seemed like a good idea to write about my approach.
Cheating Disclaimer
Before I get into it, I want to cover three areas where my compiler has an advantage, ones that aren't tied to algorithmic design.
I call these my "cheats", as other compilers could do these things too, but there are trade-offs involved.
First, my compiler generates what are known as
"illegal" instructions.
An illegal instruction is one that isn't officially documented by the manufacturer,
but still exists in the hardware.
On most 6502 chips, several illegal instructions exist which combine the behavior of two "legal" instructions.
Their use can save a few cycles, but not every backend generates them.
Second, some of my compiler's gains can be explained by its computational effort.
My compiler spends the majority of its CPU budget (a few milliseconds) doing instruction selection,
but not all compilers do.
Generally, the more time spent searching for a solution, the better the results.
Lastly, there's loop unrolling and other optimizations that trade space efficiency for speed,
which are hard to compare fairly when benchmarking different compilers.
I tried to pick tests that weren't affected by this,
but obviously I can't be perfect.
With my conscience clear, on to the algorithm!
A Quick, Imprecise Overview
I've been calling my own attempt "outsider art",
as I didn't know much about code generation when writing it.
In fact, the technique didn't even originate in my compiler;
I based it on an old animation system I wrote years prior.
My algorithm is interesting because it combines instruction selection
with register allocation, but also because it's written using continuations.
In all my years of programming, it's the only practical use I've found for such black magic.
To briefly explain how it works, I'll say that each basic block in my
IR is represented as a
DAG in
SSA form.
The first step of code generation is to break both of these properties,
converting basic blocks into totally-ordered lists that don't contain phi nodes.
The ordered basic blocks are then processed from beginning to end,
producing multiple assembly code combinations per operation.
The list of combinations would grow at an exponential rate,
but most can be pruned using ideas from dynamic programming
and branch-and-bound.
To improve the generated code, the output states of each basic block are fed into their successor blocks as inputs.
This process repeats until a fixed point is reached.
A cooling schedule a-la simulated annealing
is applied to find results faster.
After this, each basic block will have multiple code sequences to choose from.
The most expensive ones can be heuristically pruned,
then a final selection can be made by solving a partitioned boolean quadratic problem (PBQP).
Lastly, extra instructions are inserted as transitions between the basic blocks,
and a few optimization passes are run on the resulting assembly code for good measure.
A More Detailed Explanation
Basic Block IR
As stated, the
IR
of each basic block is a
DAG in
SSA form,
which sounds complicated until you see a picture of it.
Given the code below:

foo = fn() ^ 5
return foo - (foo & 3)

The IR DAG would look like this:
Each node in the graph represents a value, with edges denoting the flow of data.
To compute a value, the nodes pointing to it must be computed first;
e.g. to compute the node labeled (&)
, the nodes (^)
and (3)
must be computed first.
It's a dependency graph.
Ordering
The first step of code generation is to take the IR and create a
total order
out of its nodes, positioning each node to occur after its dependencies.
This is akin to placing each node on a number line such that every edge points in the same direction (downwards in the diagram).
The generated assembly code will follow this order.
How does one find this order?
It's easy to do using a topological sort,
but there's a problem: DAGs can have multiple valid orderings, and some will generate better code than others.
To find a good ordering (one that generates efficient code),
the topological sorting algorithm is still used, but it can be modified with a few heuristics.
First, "fake" edges can be added to the graph to constrain nodes to occur after other nodes.
Second, a greedy algorithm can be used to influence the traversal order of the topological sort.
In my compiler, I prioritize nodes which form the
longest path through the graph,
which produces an ordering with small
live ranges,
resulting in less register pressure and fewer stores.
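To make the heuristic concrete, here's a sketch of a greedy, longest-path-first topological sort. The node and edge representation is my own invention for illustration, not my compiler's actual data structures: each node's "height" (longest path to a sink) is computed first, then the tallest ready node is always emitted next.

```cpp
#include <algorithm>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical DAG: nodes 0..n-1, edges[i] lists the nodes that depend on i.
// Returns a topological order that prefers nodes on the longest path,
// which tends to shorten live ranges.
std::vector<int> order_nodes(int n, const std::vector<std::vector<int>>& edges)
{
    std::vector<int> indegree(n, 0), height(n, 0);
    for (int i = 0; i < n; ++i)
        for (int succ : edges[i])
            ++indegree[succ];

    // Compute heights with a DP over a plain topological order (O(V+E)).
    std::vector<int> topo;
    {
        std::vector<int> indeg = indegree;
        std::queue<int> q;
        for (int i = 0; i < n; ++i) if (indeg[i] == 0) q.push(i);
        while (!q.empty()) {
            int v = q.front(); q.pop();
            topo.push_back(v);
            for (int s : edges[v]) if (--indeg[s] == 0) q.push(s);
        }
        for (auto it = topo.rbegin(); it != topo.rend(); ++it)
            for (int s : edges[*it])
                height[*it] = std::max(height[*it], height[s] + 1);
    }

    // Greedy topological sort: among ready nodes, pick the tallest.
    std::priority_queue<std::pair<int, int>> ready; // (height, node)
    for (int i = 0; i < n; ++i) if (indegree[i] == 0) ready.push({height[i], i});
    std::vector<int> order;
    while (!ready.empty()) {
        int v = ready.top().second; ready.pop();
        order.push_back(v);
        for (int s : edges[v]) if (--indegree[s] == 0) ready.push({height[s], s});
    }
    return order;
}
```

Any tie-breaking rule works here; the important part is that deep chains get scheduled before shallow leaves, so values are consumed soon after they're produced.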
Here's an actual DAG from my compiler.
Basic Block Instruction Selection – Part 1: Animation
With the IR prepared, it's now time to generate some assembly code.
I mentioned earlier that my algorithm was based on an animation system,
and although that sounds weird and unrelated, it will be easier for me
to explain the animation system first and then extend it to work like my compiler.
The animation I'm talking about was for a NES game.
On that system, animation can be done by loading a new tileset into video memory each frame,
overwriting the previous frame's tileset.
Most games accomplish this using special hardware on the cartridge called a mapper,
which lets the system page a tileset into video memory quickly.
But in my case, I was making a game that wasn't going to use that piece of hardware.
Instead, I had to copy the bytes using the CPU as a middle-man.
The obvious way to copy bytes is with a loop (think "memcpy" from C),
but this turned out to be too slow.
In each iteration, too much time was being spent incrementing, then comparing,
then branching back to the loop.
The only way I could see it working was if I unrolled the loop: not once, but fully.
For example, if I wanted to copy the byte sequence (10, 20, 30) to video memory, I could write assembly code like this:
lda #10     ; Load constant: 10
sta PPUDATA ; Upload to video RAM
lda #20     ; Load constant: 20
sta PPUDATA ; Upload to video RAM
lda #30     ; Load constant: 30
sta PPUDATA ; Upload to video RAM
In other words, I was writing code like a noob who quit programming class before learning what a for-loop was.
But it was fast, and that's what mattered.
Rather than writing these sequences by hand, it was clear I should automate it.
A naive approach would be to write a script which generates one load and one store per uploaded byte,
but it's possible to do better.
As registers preserve their values, a sequence like (1, 8, 1, 8, 1, 8) requires only two loads:
ldx #1      ; Load register X
lda #8      ; Load register A
stx PPUDATA ; Upload register X to video RAM
sta PPUDATA ; Upload register A to video RAM
stx PPUDATA
sta PPUDATA
stx PPUDATA
sta PPUDATA
How does one determine these loads optimally?
To keep things simple, I'll first explain how to solve this for a CPU with one register,
then expand to a CPU with three.
For a CPU with one register, the algorithm boils down to finding redundant loads and eliding them.
By redundant load, I mean a load which has no effect, like the second load below:

lda #10
sta PPUDATA
lda #10 ; redundant
sta PPUDATA
Finding and eliding redundant loads is easy; just keep track of the previously loaded value.
The pseudocode below does just that, optimally:
prev = -1
for each byte to upload:
    if prev != byte:
        emit_load(byte)
        prev = byte
    emit_store()
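The pseudocode above translates almost directly into real code. Here's a minimal runnable sketch, assuming the output is just a list of instruction strings (the function name and representation are mine, not my compiler's actual code):

```cpp
#include <string>
#include <vector>

// Emit a load/store sequence that uploads 'bytes', eliding redundant loads.
// Assumes a CPU with a single register (A).
std::vector<std::string> gen_uploads(const std::vector<int>& bytes)
{
    std::vector<std::string> code;
    int prev = -1; // nothing loaded yet
    for (int byte : bytes) {
        if (prev != byte) { // elide the load when it would be redundant
            code.push_back("lda #" + std::to_string(byte));
            prev = byte;
        }
        code.push_back("sta PPUDATA");
    }
    return code;
}
```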
This idea can almost be extended to three registers
by using three "prev" variables instead of one.
You might approach it using a struct or record type to keep things clean:

struct cpu_state
{
    int a
    int x
    int y
}
Unfortunately, there's a problem.
Although we can keep track of each register this way and elide redundant loads,
it's not clear how we should handle new loads.
Given a value not present in any register, which register should load it?
We can make these decisions optimally by generating assembly code for every possible combination of loads (brute force)
and keeping only the best one, but this quickly becomes infeasible.
For each byte that needs loading, there are three possible choices.
The number of combinations grows at an exponential rate.
Diagram of the brute-force search tree.
Luckily, this brute-force approach is still usable if a trick is used.
It's possible to prune most combinations, shrinking the size of the search space.
If two pieces of assembly code have been generated, and both have the same observable side effects,
the higher-costing one can be pruned.
By side effects, I mean that both upload the same byte sequence
and end with identical register states.
For example, these two pieces of code are equivalent:
lda #1
sta PPUDATA
ldx #2
stx PPUDATA
ldy #3
sty PPUDATA
sta PPUDATA
ldy #1
sty PPUDATA
ldx #2
stx PPUDATA
ldy #3
sty PPUDATA
lda #1
sta PPUDATA
Both upload the sequence (1, 2, 3, 1), and both leave the registers at (A = 1, X = 2, Y = 3).
The second requires an extra load though, so it can be pruned.
Keep in mind that this works for partial snippets of code too.
Instead of generating whole code sequences and comparing them,
it's possible to generate the code instruction-by-instruction and compare at each step.
To implement this, the algorithm handles the combinations in a breadth-first fashion.
It first generates all the combinations for the first byte and prunes some,
then it generates all the combinations for the second byte and prunes some, and so on.
The pruning is done by storing the results in a hash table, indexed by each combination's "cpu_state"
(the struct introduced earlier).
For better performance, an additional method of pruning can be used.
If we keep track of the current best combination at each iteration step,
we can prune other combinations that are slower by three or more loads.
This preserves the optimality of the final result, as three loads are enough to convert any "cpu_state" into another.
Putting all this together, the algorithm to elide loads on a 3-register CPU becomes:
prev_hash_map.clear()
for each byte to upload:
    next_lowest_cost = INFINITY
    next_hash_map.clear()
    for each combination in prev_hash_map:
        if combination.cost >= prev_lowest_cost + 3: // Second method of pruning
            continue
        for each register (A, X, Y):
            new_combination = combination
            if new_combination.cpu_state.register != byte:
                new_combination.code.append_load()
                new_combination.cost += 1
                new_combination.cpu_state.register = byte
            if new_combination.cost < next_lowest_cost:
                next_lowest_cost = new_combination.cost
            next_hash_map.insert(new_combination)
            if next_hash_map already had this cpu_state: // First method of pruning
                keep the lowest-costing one
    swap(next_hash_map, prev_hash_map)
    prev_lowest_cost = next_lowest_cost
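Here's how the core of that search might look in real code. This sketch is my own simplification: it tracks only the load count per "cpu_state" (the real thing also carries the generated code for each combination), which is enough to demonstrate both methods of pruning:

```cpp
#include <array>
#include <climits>
#include <map>
#include <utility>
#include <vector>

// Minimum number of loads needed to upload 'bytes' on a 3-register CPU.
// Breadth-first over bytes; states deduplicated per register file.
int min_loads(const std::vector<int>& bytes)
{
    using State = std::array<int, 3>;             // registers A, X, Y
    std::map<State, int> prev{{{-1, -1, -1}, 0}}; // start: nothing loaded
    int prev_lowest = 0;

    for (int byte : bytes) {
        std::map<State, int> next;
        int next_lowest = INT_MAX;
        for (const auto& [state, cost] : prev) {
            if (cost >= prev_lowest + 3) // second method: branch-and-bound
                continue;
            for (int r = 0; r < 3; ++r) { // choose the register to store from
                State s = state;
                int c = cost;
                if (s[r] != byte) { s[r] = byte; ++c; } // load only if needed
                auto it = next.find(s);
                if (it == next.end() || c < it->second) // first method:
                    next[s] = c; // keep the cheapest combination per state
                if (c < next_lowest) next_lowest = c;
            }
        }
        prev = std::move(next);
        prev_lowest = next_lowest;
    }
    return prev_lowest;
}
```

A `std::map` stands in for the hash table here purely for brevity; an actual implementation would use an unordered container keyed on the packed register file.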
(If you want to see actual code, I've put a C++ implementation
here
and here.)
I converted the Cowboy Bebop intro into a NES ROM using this animation process.
The GIF runs at a lower framerate.
Basic Block Instruction Selection – Part 2: The Compiler
Alright, alright... we can generate code for animations. So what?
This is an article about compilers, right?
Well, consider this: instead of processing a sequence of bytes with this algorithm,
let's process the sequence of IR operations created earlier.
Here's the ordered sequence of IR operations, from earlier.
I've added variable names (Q, R, S, T) as placeholders for each node:
Q = call fn
R = Q ^ 5
S = R & 3
T = R - S
return T
Like we did for the sequence of bytes, we can generate combinations per operation and prune the worst ones,
but there are two big differences:
- Different IR operations generate different assembly instructions (not just loads).
- The "cpu_state" struct will also track variable names.
Putting this into practice, I'll show how the algorithm handles the ordered IR.
The algorithm begins by generating combinations for the first operation, "call fn",
of which there's only one:
; call fn:
jsr fn ; Call the function
sta Q  ; Store the result into variable Q
; cpu_state is now (A = Q, X = ?, Y = ?)
After that, it generates the combinations for the XOR operation (^).
There's a single assembly instruction which does this, "EOR",
but since XOR is commutative, two code combinations exist:
; call fn:
jsr fn ; Call the function
sta Q  ; Store the result into variable Q
; cpu_state is now (A = Q, X = ?, Y = ?)

; xor:
lda #5 ; Load 5 into register A
eor Q  ; XOR register A with variable Q
sta R  ; Store the result into variable R
; cpu_state is (A = R, X = ?, Y = ?)
; call fn:
jsr fn ; Call the function
sta Q  ; Store the result into variable Q
; cpu_state is now (A = Q, X = ?, Y = ?)

; xor:
eor #5 ; XOR register A with constant 5
sta R  ; Store the result into variable R
; cpu_state is (A = R, X = ?, Y = ?)
The second combination doesn't need a load instruction for the Q variable, since it's already loaded in the register.
Since both of these code sequences have the same side effects, the first combination will be pruned.
Now, for the AND operation (&).
While there is an AND instruction, there's also an illegal instruction called SAX,
which performs a bitwise AND of register A with register X and stores the result.
This results in four combinations, so I'll only show the two that won't get pruned:
; call fn:
jsr fn ; Call the function
sta Q  ; Store the result into variable Q
; cpu_state is now (A = Q, X = ?, Y = ?)

; xor:
eor #5 ; XOR register A with constant 5
sta R  ; Store the result into variable R
; cpu_state is (A = R, X = ?, Y = ?)

; and:
and #3 ; AND register A with constant 3
sta S  ; Store the result into variable S
; cpu_state is (A = S, X = ?, Y = ?)
; call fn:
jsr fn ; Call the function
sta Q  ; Store the result into variable Q
; cpu_state is now (A = Q, X = ?, Y = ?)

; xor:
eor #5 ; XOR register A with constant 5
sta R  ; Store the result into variable R
; cpu_state is (A = R, X = ?, Y = ?)

; and:
ldx #3 ; Load 3 into register X
sax S  ; Store the AND of registers A and X into variable S
; cpu_state is (A = R, X = 3, Y = ?)
For the subtraction operation (-), the SBC instruction can be used, following a SEC instruction to set the carry flag.
As subtraction isn't commutative, there's only one way to do this, but since the previous operation generated two combinations,
this operation will result in two combinations too.
; call fn:
jsr fn ; Call the function
sta Q  ; Store the result into variable Q
; cpu_state is now (A = Q, X = ?, Y = ?)

; xor:
eor #5 ; XOR register A with constant 5
sta R  ; Store the result into variable R
; cpu_state is (A = R, X = ?, Y = ?)

; and:
and #3 ; AND register A with constant 3
sta S  ; Store the result into variable S
; cpu_state is (A = S, X = ?, Y = ?)

; subtract:
lda R  ; Load R into register A
sec
sbc S  ; Subtract S from register A
sta T  ; Store the result into variable T
; cpu_state is (A = T, X = ?, Y = ?)
; call fn:
jsr fn ; Call the function
sta Q  ; Store the result into variable Q
; cpu_state is now (A = Q, X = ?, Y = ?)

; xor:
eor #5 ; XOR register A with constant 5
sta R  ; Store the result into variable R
; cpu_state is (A = R, X = ?, Y = ?)

; and:
ldx #3 ; Load 3 into register X
sax S  ; Store the AND of registers A and X into variable S
; cpu_state is (A = R, X = 3, Y = ?)

; subtract:
sec
sbc S  ; Subtract S from register A
sta T  ; Store the result into variable T
; cpu_state is (A = T, X = 3, Y = ?)
Finally, the return operation, which is implemented with a single RTS instruction.
This one is so simple, I'm not going to illustrate it with a code example.
Simply imagine an "RTS" appended onto the end of the previous two combinations.
Finishing Up
Once the combinations have been run, the algorithm selects the lowest-costing one.
In this case, it's the one that uses a SAX instruction instead of AND,
as it requires one less instruction.
The resulting selection looks like this, but we're not done yet:
jsr fn
sta Q
eor #5
sta R
ldx #3
sax S
sec
sbc S
sta T
rts
The issue is, most of these store instructions are unnecessary, as nothing will ever load them.
To fix this, an extra pass is run which identifies stores that are never loaded.
This results in the final code:
jsr fn
eor #5
ldx #3
sax S
sec
sbc S
rts
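A simplified version of such a pass might look like the sketch below. It's my own illustration under a strong assumption: a store is dead if its target is never read by any later non-store instruction. The real pass has to respect control flow and variables that are live across basic blocks:

```cpp
#include <set>
#include <string>
#include <vector>

struct Insn {
    std::string op, operand;
    bool operator==(const Insn& o) const
    { return op == o.op && operand == o.operand; }
};

static bool is_store(const Insn& i)
{
    return i.op == "sta" || i.op == "stx" || i.op == "sty" || i.op == "sax";
}

// Remove stores whose target variable nothing ever reads.
std::vector<Insn> remove_dead_stores(const std::vector<Insn>& code)
{
    // Collect every operand that appears in a non-store instruction.
    std::set<std::string> read;
    for (const Insn& i : code)
        if (!is_store(i))
            read.insert(i.operand);

    std::vector<Insn> out;
    for (const Insn& i : code)
        if (!is_store(i) || read.count(i.operand)) // keep loads + live stores
            out.push_back(i);
    return out;
}
```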
In the actual compiler, the code generator is a bit smarter about stores,
and can usually estimate whether they're needed while generating the combinations.
This is important, as the presence of a store influences a combination's cost.
Basic Block Instruction Selection – Part 3: Continuations
As shown, compiling an IR operation boils down to generating a bunch of combinations.
However, if you were to write code translating IR operations into assembly combinations,
you'd find it to be a repetitive, menial coding task.
The issue pertains to the number of combinations.
For each IR operation, there can be several valid instructions that implement it,
and each of those instructions can have its own variants, resulting in more combinations.
Multiply these together (and double it for commutative operations),
and you'll end up with lots of combinations per IR operation.
Our combinations have combinations!
It would be nice to abstract these details away and create building blocks for
implementing operations. Luckily, I found a solution by using continuations.
Each step (or "building block") can be implemented as a function written in
continuation-passing style.
Instead of returning, these functions call their continuations with each instruction they want to generate a combination for.
For example, to generate three combinations, the continuation will be called three times with three different instructions.
Here's how a function for loading a value into register A might look, in pseudocode:
load_A(desired_value, cpu_state, continuation)
    if cpu_state.A == desired_value
        continuation(cpu_state, NOP)
    else if cpu_state.X == desired_value
        continuation(cpu_state, TXA)
    else if cpu_state.Y == desired_value
        continuation(cpu_state, TYA)
    else
        continuation(cpu_state, LDA)
        continuation(cpu_state, LAX)
If the value is already present in one of the registers, either nothing happens (NOP), or a transfer between registers occurs (TXA, TYA).
Otherwise, two combinations will occur: one using the LDA instruction,
and another using the LAX instruction.
The point of writing functions like this is that they can be composed.
With a bit of plumbing, it's possible to end up with a
domain-specific language that resembles:
compose(load_A(L), load_X(R), instruction(SAX))
A single line like this can cover 50 different assembly sequences, all handling different combinations of instructions and addressing modes.
It's much easier to write than trying to enumerate all the possibilities by hand.
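To make this concrete, here's a toy version of the technique in C++. The types, the two-argument `compose`, and the string instruction names are my own sketch (the real implementation is more elaborate), but it shows the essential trick: each building block calls its continuation once per combination it wants to produce.

```cpp
#include <functional>
#include <string>
#include <vector>

struct CpuState { int a = -1, x = -1, y = -1; }; // -1 means "unknown"
using Cont = std::function<void(CpuState, std::vector<std::string>)>;
using Step = std::function<void(CpuState, std::vector<std::string>, Cont)>;

// Building block: get 'value' into register A, forking one continuation
// call per viable combination.
Step load_A(int value)
{
    return [value](CpuState s, std::vector<std::string> code, Cont k) {
        if (s.a == value) { k(s, code); return; } // already there: NOP
        if (s.x == value) { s.a = value; code.push_back("txa"); k(s, code); return; }
        if (s.y == value) { s.a = value; code.push_back("tya"); k(s, code); return; }
        // Two combinations: plain LDA, or the illegal LAX (loads A and X).
        auto c1 = code; c1.push_back("lda"); CpuState s1 = s; s1.a = value;
        k(s1, c1);
        auto c2 = code; c2.push_back("lax"); CpuState s2 = s; s2.a = s2.x = value;
        k(s2, c2);
    };
}

// compose(a, b): run a, feeding each of its results into b.
Step compose(Step a, Step b)
{
    return [a, b](CpuState s, std::vector<std::string> code, Cont k) {
        a(s, std::move(code), [b, k](CpuState s2, std::vector<std::string> c2) {
            b(s2, std::move(c2), k);
        });
    };
}
```

With a variadic `compose` and more building blocks (`load_X`, `instruction`, ...), lines like `compose(load_A(L), load_X(R), instruction(SAX))` fall out naturally.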
Control Flow – Part 1: Basics
Although the algorithm I've described is tied to basic blocks, it can handle control flow (branches, loops, etc.) too.
Branch operations are not special; they generate assembly combinations like any other.
For example, take a look at this control flow graph (the rectangles are basic blocks):
Each basic block in this graph can be assigned a unique label name (L1, L2, L3, etc.).
Now we'll generate the code for each basic block separately, implementing control flow as instructions involving these labels.
Once every basic block has its code generated, we can concatenate the results:
L1:
    ; (More code would go here)
    beq L1 ; Branch instruction
    bne L2
L2:
    ; (More code would go here)
    jmp L4 ; Jump instruction
L3:
    ; (More code would go here)
    beq L1
    bne L4
L4:
    ; (More code would go here)
    rts
With this implemented, the compiler now works.
It can generate executables!
Control Flow – Part 2: Optimizations
The approach above works, but doesn't produce good code across basic block boundaries.
The issue stems from compiling basic blocks separately, without sharing knowledge of the register states between them.
If one basic block needs a value from a previous block, it will emit a load instruction
even when it's redundant.
For example, this is how a loop would look:
L0:
    ldx #0 ; Load register X with 0
    stx I  ; Store variable I
L1:
    ldx I  ; Load variable I
    inx    ; Increment register X
    stx I  ; Store variable I
    bne L1 ; Branch if register X is not zero
But this is how a loop should look:
    ldx #0 ; Load register X with 0
L1:
    inx    ; Increment register X
    bne L1 ; Branch if register X is not zero
The second loop lacks redundant loads.
To solve this, the final "cpu_state" of each basic block needs to be shared.
This can be done by passing each basic block's result into its successors as input.
Old method: generate code for each basic block separately.
New method: use the output states from generating one basic block as the input states to its successors.
This is an iterative process where each node can be processed multiple times.
When a node is processed, it triggers its successors to be processed next, passing outputs along as inputs. This repeats until a fixed point is reached.
Going back to the loop example, initial iterations will produce inefficient code,
but later iterations will propagate the register states through the loop from (L3) to (L1).
Once variable "I" appears pre-loaded in register X, optimal code will follow.
To expedite this process, it's helpful to pass only the lowest-costing results along,
getting stricter each iteration.
This resembles simulated annealing.
Control Flow – Part 3: Some Loads Required
Although the method above generates improved code,
it's not possible to get a working program by concatenating the lowest-costing combinations.
The issue is, each code combination now expects specific register values as inputs,
but the output of one combination may not correspond to the input of another.
You can get nonsensical results, like initializing a loop variable using the Y register
but incrementing using the X register:
; This code is incorrectly compiled!
L0:
    ldy #0 ; Output Y as the iteration count
    ; NOTE: A sty instruction was removed by the "remove unused stores" pass.
L1:
    ; Input X as the iteration count
    inx
    bne L1
To fix this, the input states of each basic block must match the output states of its predecessors.
If there's a discrepancy, a load can be inserted to correct it.
By inserting loads, the code becomes:
L0:
    ldy #0
    sty I ; This store is not removed.
L1_entry:
    ldx I ; Inserted load
L1:
    inx
    bne L1
Which works, but isn't very good.
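The state-matching step itself is simple to sketch: diff a predecessor's output state against the input state the successor expects, and emit a load for each mismatch. This is my own illustration (the real compiler tracks variables rather than plain integer value IDs, and can also use transfers); the -1 "don't care" convention is mine:

```cpp
#include <string>
#include <vector>

struct RegState { int a = -1, x = -1, y = -1; }; // -1 means "don't care"

// Loads needed so that a block entered with state 'out' satisfies the
// required input state 'in'. Operands name value IDs, not real addresses.
std::vector<std::string> transition(const RegState& out, const RegState& in)
{
    std::vector<std::string> fixes;
    if (in.a != -1 && in.a != out.a)
        fixes.push_back("lda " + std::to_string(in.a));
    if (in.x != -1 && in.x != out.x)
        fixes.push_back("ldx " + std::to_string(in.x));
    if (in.y != -1 && in.y != out.y)
        fixes.push_back("ldy " + std::to_string(in.y));
    return fixes;
}
```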
Control Flow – Part 4: PBQP
The woes above are caused by selecting only the lowest-costing combinations per basic block,
without taking into account the compatibility between input/output states.
Ideally, we'd also consider the cost of inserted loads.
Doing so is an optimization problem, and one that closely maps to the
Partitioned Boolean Quadratic Problem (PBQP).
If we can solve a PBQP problem, we can select optimal combinations for our basic blocks.
Unfortunately, not much has been written on PBQP, and what does exist is hard for most people to read.
My advice: instead of struggling over equations, it's easier to learn PBQP by thinking visually, in terms of graphs.
Imagine a graph where each node has a finite list of choices,
and each of those choices has a cost.
The goal of PBQP is to pick one choice for each node, minimizing the total cost.
The catch is, an additional cost is incurred for every edge in the graph,
based on the choices made at the edge's vertices.
The picture below illustrates such a graph.
Each node has a choice of two colors, and those colors have a cost next to them.
When picking two colors, the cost of the edge must be paid as well.
This is done by looking up the cost in a table unique to each edge, using the two chosen colors as keys.
Below, I've made some choices, and crossed out everything I didn't choose.
The total cost is the sum of the visible numbers, which is 3 + 1 + 7 + 3 + 6 + 7 = 27.
The goal of PBQP is to find the choices which minimize this sum.
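For tiny graphs, this cost model can be made concrete with an exhaustive solver; the sketch below (my own, purely for illustration) simply tries every assignment of choices, while a practical solver uses the reduction algorithm mentioned next:

```cpp
#include <climits>
#include <vector>

// node_costs[n][c] = cost of picking choice c at node n.
// An Edge's table[cu][cv] is the extra cost paid when node u picks cu
// and node v picks cv.
struct Edge { int u, v; std::vector<std::vector<int>> table; };

// Brute-force PBQP: returns the minimum achievable total cost.
int pbqp_brute_force(const std::vector<std::vector<int>>& node_costs,
                     const std::vector<Edge>& edges)
{
    int n = (int)node_costs.size();
    std::vector<int> choice(n, 0);
    int best = INT_MAX;
    while (true) {
        // Total cost = node costs + edge-table costs for this assignment.
        int total = 0;
        for (int i = 0; i < n; ++i) total += node_costs[i][choice[i]];
        for (const Edge& e : edges) total += e.table[choice[e.u]][choice[e.v]];
        if (total < best) best = total;
        // Advance to the next assignment, odometer-style.
        int i = 0;
        while (i < n && ++choice[i] == (int)node_costs[i].size())
            choice[i++] = 0;
        if (i == n) break;
    }
    return best;
}
```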
A fast algorithm
exists to solve PBQP problems near-optimally.
The algorithm works on the graph, reducing and combining nodes until none remain.
At the end, the algorithm works backwards, using the steps it took to decide which choices to make.
Going back to the compiler, the problem I was describing fits perfectly into this visualization:
- The graph is the control-flow graph.
- The nodes are the basic blocks.
- The node choices are the code combinations generated earlier.
- The edge costs are the number of extra loads that have to be inserted given two code combinations.
By implementing a PBQP solver and using it on our combinations, the end result will be near-optimal code.
The assembly loop which gave us trouble before becomes:
    ldx #0
L1:
    inx
    bne L1
Which is optimal.
Complexity and Performance
- Creating the order is O(N²), but has a negligible runtime cost.
  In practice, typical code is very easy to order.
- I don't know what the complexity of going out of SSA is, but I'm assuming it's O(N²).
  Regardless, the runtime cost is negligible.
- Code-generating basic blocks is O(N) with respect to the number of IR operations, with a high constant multiplier.
  This has an impactful runtime cost, with the bottleneck being the hash table.
  I use a very fast hash table of my own design, which alleviates this.
- The PBQP problem is solved in O(N) relative to the number of edges, and O(N³) relative to the number of choices.
  This can be a big runtime cost, but isn't an issue so long as the number of choices is kept small.
  I reckon the PBQP solver could be implemented using SIMD if necessary, but I haven't found the need.
- Repeatedly generating basic blocks has some awful worst-case complexity, but thanks to the simulated annealing it becomes O(N).
  A large cost at this step is cache misses caused by reading assembly combinations.
  With proper technique, many of these cache misses can be avoided.
Conclusion
This code generator has worked well enough for me.
Although parts of it are hairy, the continuations make it easy to add new operations,
and I don't see much technical debt in the code. The generated assembly code has been good, albeit not perfect.
As stated, it beat other compilers on the benchmarks I ran, but I'm sure others could beat mine as well.
My only goal was to produce a good compiler, not the best.
As I mentioned at the start, I don't have deep knowledge of other code generation techniques.
Most of the papers on code generation I've seen involve ILP / SAT solvers and take minutes, if not hours, to compile a long function.
Besides, everything nowadays is designed for RISC architectures, which are totally unlike the 6502.
You might think that papers from the 1970s would be more useful, but in the 70s computers were slow,
and the algorithms designed for them were O(N) and compensating.
You just don't see advice like "hash everything and run it a million times" back then.