Checking statistical properties of protocols utilizing TLA+
In my previous post, I discussed how excited I used to be about Jack Vanlightly and Markus Kuppe’s talk at TLA+ Convention. They confirmed a brand new mode (referred to as “generate”) added to the TLA+ checker to allow acquiring statistical properties from the fashions. Right here is how this works in a nutshell. Your TLA+ mannequin/protocol produces a tree of runs, and, within the generate mode the TLA+ checker chooses these runs in a uniform method. You additionally specify some state constraints, and when these are happy the TLA+ checker executes/evaluates your state constraint method: you add the statistics logging because the aspect impact of this method, and document details about the present run to a CSV file. This fashion, you may accumulate statistical hyper properties (abort charges, completion rounds, and so on.) concerning the runs, along with checking it for correctness invariants. Later you need to use Excel or Rscript to visualise the statistics collected and analyze the mannequin conduct.
Checking statistics is very useful, because ensuring correctness addresses only one part of our concerns with practical systems. Efficiency, availability, restoration pace are additionally necessary dimensions of sensible programs. We might after all write a Python simulator and test these properties there, however then we’d lose the ensures that the system is coded accurately (it’s simple to introduce bugs whereas translating the mannequin). Doing these statistical checks in TLA+ permits us to optimize and fine-tune the system all of the whereas preserving its safety-liveness properties by way of mannequin checking.
That is so cool, and I assumed I ought to study this, and what higher method is there to study one thing than making an attempt it fingers on utilizing a easy instance. I selected Dijkstra’s elegant self-stabilizing token ring problem (EWD 426) as the instance. Checking statistical properties of this went properly and simple. In fact, I botched it and obtained caught couple of occasions: Making errors is how we study. I obtained assist from Markus to get unstuck, so the ensuing mannequin is joint work with him. The model is available here under TLA+ examples.
I now have a good understanding of how this works, and I stay up for making use of it to extra advanced fashions. The fundamental scheme may be very hackable and it’s potential to increase statistics assortment for extra bold efforts. Jack already gave some examples of making use of statistics checking to some SWIM, RabbitMQ, and Kafka fashions.
EWD 426 is a straightforward protocol, however it’s also a profound one. This easy protocol, “Self-stabilizing programs despite distributed management”, began the entire field of stabilizing fault-tolerance.
I wrote about this protocol before and even gave a Pluscal model. Right here is how the protocol works.
The nodes/processes are organized in a hoop topology. The protocol maintains a novel token circulating this ring. You may consider the token as offering mutual exclusion to the processes; whichever course of has the token can entry the critical-section/shared-resource.
Within the good/invariant states there may be precisely one token. So within the defective states, there are two broad circumstances: 1) we find yourself with a hoop with no tokens, 2) we find yourself with a hoop with a number of tokens.
Let’s think about how Dijkstra’s token ring algorithm offers with case 1. Dijkstra’s token ring algorithm maps the idea of a bodily token to a relation between two neighboring processes. This mapping is completed skillfully to restrict the state house so the case 1 faults can’t presumably occur in that state house. Within the mapped state house, any arbitrary state is assured to comprise 1 or extra tokens. How does this work? Dijstra’s algorithm encodes a token with the relation: “the c variable of a course of is DIFFERENT than the c variable of its rapid predecessor within the ring”. So, if all c variables are the identical, there will not be any token within the system. To forestall this, the algorithm encodes the token in another way for course of 0: “There’s a token at course of 0, if its c variable is the SAME as that of its predecessor, course of N”. Encoded this manner, we can’t not have 0 tokens in any arbitrary state within the state house.
With a view to cope with case 2, Dijkstra’s algorithm makes use of a token overwriting motion to eat some tokens on a finest effort trend. The motion is to repeat the c worth of your predecessor, when your c worth is completely different than its c worth (therefore, you’ve the token). Within the regular/fault-free case, by taking an motion a course of consumes a token that exists between its predecessor and itself and passes this token to its successor. Within the defective states case, that is additionally the worst harm the motion can do, as a result of the method will all the time devour the token at itself, and since its state change impacts solely its successor, it might or could not cross the token to its successor. If the method execution doesn’t cross the token to its successor, that is truly good: two tokens are faraway from the system (one between it and its predecessor, and the opposite between it and its successor). If sufficient of those take away token conduct occurs, the system will stabilize to precisely one token case.
However what if the take away token conduct results in a system with no tokens? Recall that this can’t occur as a result of dialogue relating to Case 1.
The Pluscal model is ok, however I needed to present a clear and easier TLA+ model. I used to be impressed by the slick VSCode interface Markus demoed to me earlier within the TLA+ convention to jot down clear stable engineered code. If written cleanly, TLA+ fashions are simple to learn. (I’m channeling the Paxos Made Easy summary.)
Above is the mannequin. I make the nodes begin with arbitrary c values in line 23, since I wish to mannequin test stabilization to a novel token. Line 26 is the CreateToken motion, which solely course of 0 executes. It’s enabled if c[0] = c[N-1], and it units c[0] to C[N-1]+1 in modulo M.
Line 31 is the PassToken motion executed by processes that aren’t course of 0. That is enabled if the predecessor has a special c worth than the node c [i] # c [i-1], and this copies the worth of c[i-1] to c[i].
Within the invariant states, there is just one token within the system. The 2-line UniqueToken predicate captures this property. It says that distinctive token is true iff there exists a node i such that
- for all nodes from 0 to i, c values are equal to the c worth of 0, and
- for all nodes from i to N-1, c values are equal to c[0] -1 in modulo M.
So right here is the .cfg for mannequin checking EWD 426. I’ll change this barely for statistics assortment (there I’ll un-comment the state constraints line).
Since EWD426 is meant to point out case stabilization, I make the system begin in an arbitrary state (see line 23) and test that it stabilizes to invariant states. So I’m not checking UniqueToken because the invariant, clearly it will be violated firstly. As a substitute, I’m mannequin checking stabilization to UniqueToken. That is very simple to jot down. Stab == <>[] UniqueToken
Relation between M and N
If we select M
At M=N-1 (or above), a part transition occurs. The node 0 is now in a position to act as a gate. Think about the case the place node 0 by no means fires. On this case the opposite nodes fireplace, and finally all of them copy the c worth at node 0, and in that configuration the system reduces to at least one token (at node 0) and stabilization is achieved. For M=N-1 and above, the pigeon-hole principle applies to point out that the c worth of Node 0 finally reaches a special worth than within the interior nodes, and node 0 doesn’t fireplace till all different nodes copy its worth.
When you’re mannequin checking, deliver up the command palette by urgent option-X, and choose “TLA+ Verify Mannequin with TLC” and enter “-coverage 1”, in order that the checker will report on its mannequin checking progress again to you each minute.
When you’re checking statistics, you do the identical factor once more. Press option-X, and choose “TLA+ Verify Mannequin with TLC”. However this time you’ll enter a special immediate: “-generate -depth N”. The generate mode is just lately added: on this mode the TLA+ checker chooses enabled actions in a uniformly random method and executes pattern runs out of your mannequin one after the opposite.
N is an integer. N=-1 because of this the TLA+ checker executes every run till impasse, after which strikes to new run. N=50 signifies that the TLA+ checker will execute every run upto 50 steps deep, then transfer to a brand new run.
Now, I uncomment the state constraints line within the cfg file, and we’re able to test statistics for EWD 426.
The core of concept in logging/accumulating statistics is to State/Motion constraints as a hook to document statistics from every run to a CSV file. Recording each motion in a run wouldn’t be helpful. You need to document when one thing vital happens. TLA+ let’s you write state predicates and action properties to seize any fascinating stuff you may consider.
Initially, the TLA+ checker used these constraints to cease exploring a run when the execution isn’t fascinating anymore. Now we hijack the identical mechanism, and use the negative effects of the state/motion constraints to carry out statistics assortment. Right here is the state constraint I added to gather statistics.
This can be a hook that’s run for every state encountered. The left hand aspect of the implication evaluates the UniqueToken predicate on the present international state the TLA+ checker encountered on this specific run. If UniqueToken doesn’t maintain, the left hand aspect turns into FALSE, and makes the implication assertion TRUE, so the state constraint returns TRUE. The appropriate-hand aspect isn’t even evaluated as a result of the left-hand-side is FALSE. Returning TRUE means the run will proceed to the subsequent state. The TLA+ checker chooses one other enabled motion uniformly random and evaluates the StateConstraint on the brand new state discovered.
When UniqueToken holds for a state, the right-hand aspect of the implication is taken. Right here we use TLCGet to get “stage”, which signifies the present depth of steps in to this specific run, when UniqueToken is truthified. CSVWrite provides this line of statistics into the CSVFile we denoted earlier. For this to work, we imported CSV within the EXTENDS line.
When UniqueToken holds, we wish to cease exploring so we add conjunction FALSE to the best hand aspect, making the expression return FALSE (after having finished the CSVWrite aspect impact). FALSE tells the checker to cease exploring this specific run, as a result of we obtained to the place we needed. (I had beforehand model-checked and ensured that the system all the time stabilizes, I’m not all in favour of exploring this run to test UniqueToken retains holding.)
That’s how statistics assortment works. We invoke this generate mode writing “-generate -depth -1” and let it rip by means of sampling the runs and accumulating statistics. Once we cease the TLA checker after a minute, we could have tens of 1000’s of runs, and therefore that many entries in our CSV file. The remaining is as much as us to course of/visualize this information. Use excel should you like. However a extra programmatic method of doing that is to make use of Rscript and write visualizations to a picture file, like svg.
We are able to do that visualization offline, or utilizing a intelligent hack Markus devised on-line, whereas TLA+ checker is operating and exploring the runs. Simply append this to AtStabilization state constraint. TLCGet(“stats”) supplies info like what number of of every motion is executed in a run, what number of runs has been explored, and so on. This line invokes Rscript on the CSV file after each 250 runs explored. Every time it’s invoked, Rscript regenerates the svg picture, and utilizing the svg viewer in VSCode, we see the statistics altering in actual time.
There may be additionally TLCSet command, which may give you extra
flexibility in logging/accumulating statistics. TLCSet can be utilized to
retailer issues to a register, registers are numbered ranging from 1.
Later you may TLCGet that register to learn what you memorized earlier in
this specific run, and even in earlier runs. Markus confirmed an instance
of this in his Termination Detection model.
Let’s test the statistics we collected. Right here is the histogram for N=7 and M=6 configuration. There’s a good uniform distribution of variety of steps required for stabilization. 10 steps appear to be the common. There’s a bump at step 1, however this is because of some preliminary states being these for which UniqueToken already holds.
We all know M=2 that’s modulo binary does not work, we go to a loop to simply. Even within the generate mode with depth 50, the TLA+ checker complains, saying a loop is detected and stabilization doesn’t happen. “Error: Temporal properties have been violated. Error: The next conduct constitutes a counter-example.”
However for M=3, this works, as a result of probabilistically it’s onerous to hit the looping run right here. This makes the common stabilization at 8-9 steps and skews the distribution towards the left, and by some means makes the tail extra pronounced. That isn’t a lot of an enchancment. Though we capped the M area, the N area nonetheless throws us vital variety of state configurations. Since there are a lot of nodes whose states should be fastened and aligned globally.
The theoretical worst case for stabilization of EWD 426 is M*N. For N=7 and M=6 that makes 42, however we see that in apply stabilization happens a lot earlier. And for N=7 and M=3 (an unsafe configuration), we do not see vital enchancment in stabilization time over the protected configuration.
What blew my thoughts within the debuggin mode was that should you add a breakpoint within the Spec line and run TLC in Debug mode utilizing -simulate immediate, you need to use the VSCode debug menu to skip over an motion or step into that motion, backtrack and transfer ahead once more. This fashion you may management how the run goes, navigate it to specific tough conditions and see what goes there.
One other cool trick is so as to add an ALIAS within the cfg file, to enhance how debugging info is introduced to you. As a substitute of uncooked state debuggging, you can also make the definitions give you the results you want and supply you the predicates you care concerning the uncooked state. Right here is the alias predicate I used to get a excessive stage view of the state.