Solving durable execution’s immutability problem Restate
In the past few years we’ve seen an explosion of durable execution tools and platforms.
The general principle is this: computers are now so fast that they can write down the result of every non-trivial task to a persistent store.
And by doing this, they have the ability to completely recover from transient failure by replaying the journal of tasks already completed, fast forwarding to the point where they failed and continuing like it never happened.
With some care and attention, this can be achieved with minimal impact to the programming model or the performance characteristics, leading to an irresistible value proposition.
Right?
Well, there are a few hard problems remaining.
Probably the hardest problem in durable execution, as in many areas of infrastructure, is safe code updates.
Without durable execution, updating code is usually not such a big problem. Assuming some kind of retries happen, a request might start on an old version of code, get evicted partway through, and then retry on the updated version of the code.
In practice, this isn’t a problem if both handlers are idempotent (as they must be anyway for retries to be viable), and if they accept the same input parameters and have roughly compatible behaviour.
You might end up with some aspects of the business logic of the old version, and some aspects of the new version, but most of the time you wouldn’t be too concerned about this.
In durable execution land, however, even fairly minor changes to a handler while a request is in-flight can cause the request to start failing, requiring human intervention.
For example, if my handler is part of a checkout flow, I might want to add a step at the start that calls an external service to check if there’s a sale on.
Intuitively, I’d expect that in-flight requests that have already progressed past where the new step is inserted would be unaffected.
But that’s not true! Any in-flight request that started on the old version of code and is replayed on the new version will fail, or perhaps even have undefined behaviour, because it will replay through the point where the discount check should have been made, find that it doesn’t have anything in its journal for that, and will not know how to proceed.
The journal no longer matches the code.
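To make the failure concrete, here is a minimal sketch of a journal-based replay loop. It is a toy model, not the implementation of any platform discussed here: `Context`, `run`, `JournalMismatch`, and the step names are all hypothetical, and the journal is just an in-memory list.

```python
class JournalMismatch(Exception):
    """Raised when replayed code asks for a step the journal did not record."""


class Context:
    """Toy durable-execution context: records step results, replays them later."""

    def __init__(self, journal=None):
        self.journal = list(journal or [])  # entries are (step_name, result)
        self.position = 0

    def run(self, name, fn):
        if self.position < len(self.journal):
            # Replaying: reuse the recorded result instead of re-executing.
            recorded_name, result = self.journal[self.position]
            if recorded_name != name:
                raise JournalMismatch(
                    f"journal has {recorded_name!r} at entry {self.position}, "
                    f"but the code asked for {name!r}"
                )
        else:
            # Past the end of the journal: execute for real and record it.
            result = fn()
            self.journal.append((name, result))
        self.position += 1
        return result


def checkout_v1(ctx):
    ctx.run("reserve_stock", lambda: "reserved")
    return ctx.run("charge_card", lambda: "charged")


def checkout_v2(ctx):
    # New first step: ask an external service whether a sale is on.
    ctx.run("check_sale", lambda: "no sale")
    ctx.run("reserve_stock", lambda: "reserved")
    return ctx.run("charge_card", lambda: "charged")


# A request starts on v1 and is interrupted after its first step was journaled.
in_flight = Context()
in_flight.run("reserve_stock", lambda: "reserved")

# Replaying on v1 fast-forwards through the journal and completes.
replay_v1 = checkout_v1(Context(in_flight.journal))

# Replaying the same journal on v2 fails at the inserted step.
try:
    checkout_v2(Context(in_flight.journal))
    replay_v2 = "completed"
except JournalMismatch as err:
    replay_v2 = f"failed: {err}"
```

In this sketch the mismatch is detected and raised. A runtime that did not compare step names would instead hand the recorded result of one step to a different step, which is one way the undefined behaviour mentioned above could arise.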

This is the immutability problem: the code executing a given request must never change in its behaviour, despite the potential for requests to be replayed long after they started.
Every durable execution platform has an approach to solve this problem.
Let’s review a few of them:
Azure Durable Functions
Durable functions are effectively event consumers that are deployed to the Azure Functions platform.
They don’t usually execute each other, but instead apply durability to a set of methods within one deployed Azure function.
The code is mutable, and replays will always execute over whatever is the latest version of that function.
Their recommended update strategy is to deploy the new version of the code alongside the old one, so that in-flight requests will not see an update.
They suggest two methods for this:
- Copy and paste changed methods of the overall workflow as new methods in the same artifact (they would say ‘functions’ within a ‘function app’), and update callers to use the new methods for new requests. These updates also require deploying new versions, however, so this needs to be done recursively until the call chain leads to the entry point, and so this isn’t recommended.
- Don’t update the existing deployment at all, but make a completely new deployment of the whole package of code, and update callers of the durable functions API to use the new deployment. In-flight calls will keep executing against the old code.
Azure have the right solution; the ability to ensure that in-flight requests stay on the version they first executed on is awesome.
But it’s not quite a first-class citizen; ideally we’d want new calls to a given durable function to automatically use the latest version.
Needing to deploy a new durable function and update callers is cumbersome enough that in practice, people might just be tempted to ignore this problem for small changes and accept some failures.
Temporal
Temporal workers are event consumers deployed on reserved infrastructure, for example as containers in Kubernetes.
As a result, the code is inherently mutable; you can just deploy a new container image.
Over the years there have been a few different recommended ways to handle versioning, but the current best-in-class solution is called worker versioning.
In this model, a worker (which will likely include the code for many workflows) must be tagged with a build ID.
When subscribing for work to do, a worker will only ask for replays that have already started on that build ID.
One build ID is configured as the default, in which case it will also get new invocations that haven’t started anywhere.
For this strategy to work, a worker build needs to be kept running until any in-flight requests on it have completed.
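The routing rule described above can be sketched as a small in-memory model. This is an illustration only and does not use Temporal’s API: `TaskQueue`, `poll`, `can_retire`, and the build and workflow IDs are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TaskQueue:
    """Toy model of build-ID routing: each workflow is pinned to one build."""

    default_build_id: str
    pinned: dict = field(default_factory=dict)  # workflow ID -> build ID

    def start(self, workflow_id):
        # New invocations go to whichever build is currently the default.
        self.pinned[workflow_id] = self.default_build_id

    def complete(self, workflow_id):
        del self.pinned[workflow_id]

    def poll(self, worker_build_id):
        # A worker is only offered workflows that started on its own build.
        return sorted(w for w, b in self.pinned.items() if b == worker_build_id)

    def can_retire(self, build_id):
        # A build must keep running while it is the default or has work pinned to it.
        return (
            build_id != self.default_build_id
            and build_id not in self.pinned.values()
        )


queue = TaskQueue(default_build_id="build-1")
queue.start("order-1")

# Deploy a new worker build and make it the default for new invocations.
queue.default_build_id = "build-2"
queue.start("order-2")

old_worker_tasks = queue.poll("build-1")
new_worker_tasks = queue.poll("build-2")
retire_while_in_flight = queue.can_retire("build-1")

queue.complete("order-1")
retire_after_drain = queue.can_retire("build-1")
```

The `can_retire` check is the part that needs operational attention: the old build can only be shut down once nothing is pinned to it.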
For short running workflows, that’s not usually an issue, although it still requires attention when removing old worker deployments, and you do want to remove them eventually, because on reserved infrastructure they will cost you just for existing.
For long running workflows, however, this isn’t a solution.
Replays can happen in principle months or years after execution started; this is one of Temporal’s most interesting capabilities.
In these cases there are two problems with keeping old code around for so long.
Firstly, the cost; if my requests run for a month and I do 5 breaking changes in that month, I need to run 5 workers concurrently.
Secondly, there’s a security and reliability concern about having arbitrarily old code running in your infrastructure: changes might not affect business logic, but will update database connection parameters, or update dependencies with security vulnerabilities; changes that perhaps do need to be applied to in-flight requests.
So, there needs to be a cutoff point where we say that code is too old to run, and we’d need to occasionally backport changes to older worker builds that are still running.
Where worker versioning is insufficient, Temporal offers a patch API; you can insert or remove steps by surrounding them in if statements, ensuring that new steps only run on new invocations and never on replay, or that removed steps still run on replay.
This is really flexible and great to get you out of a jam, but these patches accumulate in your code and need to be removed with extreme care.
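The idea behind patching can be sketched with a marker entry in the journal. The code below is a toy model and not Temporal’s patch API: `Context`, `patched`, the `patch:` marker format, and the step names are assumptions made for illustration.

```python
class Context:
    """Toy replay context with a patch marker, for illustration only."""

    def __init__(self, journal=None):
        self.journal = list(journal or [])  # entries are (name, result)
        self.position = 0

    def run(self, name, fn):
        if self.position < len(self.journal):
            recorded_name, result = self.journal[self.position]
            if recorded_name != name:
                raise RuntimeError(f"journal mismatch: {recorded_name!r} != {name!r}")
        else:
            result = fn()
            self.journal.append((name, result))
        self.position += 1
        return result

    def patched(self, patch_id):
        marker = f"patch:{patch_id}"
        if self.position < len(self.journal):
            # Replaying: take the new branch only if the original run
            # recorded this marker at this point in the journal.
            if self.journal[self.position][0] == marker:
                self.position += 1
                return True
            return False
        # Executing for the first time: record the marker, take the new branch.
        self.journal.append((marker, None))
        self.position += 1
        return True


def checkout_v2(ctx):
    if ctx.patched("add-sale-check"):
        ctx.run("check_sale", lambda: "no sale")
    ctx.run("reserve_stock", lambda: "reserved")
    return ctx.run("charge_card", lambda: "charged")


# A journal written by the old code, with no marker: the new step is skipped.
old_ctx = Context([("reserve_stock", "reserved")])
old_result = checkout_v2(old_ctx)
old_steps = [name for name, _ in old_ctx.journal]

# A fresh invocation records the marker and runs the new step.
new_ctx = Context()
new_result = checkout_v2(new_ctx)
new_steps = [name for name, _ in new_ctx.journal]

# Replaying the fresh invocation's journal takes the same branch again.
replayed_result = checkout_v2(Context(new_ctx.journal))
```

The `if` can only be deleted once no journal written by the old code can still be replayed, which is why removing patches takes so much care.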
AWS Step Functions
AWS Step Functions are described in JSON using a workflow language called ASL.
While they may call out to code in the form of Lambdas, the durable execution only extends to steps within the workflow definition.
And the definition is completely immutable: updates create a new version, which will be used for new workflow runs, but in-flight runs always use the version they first executed on.
This completely solves the problem! Keeping around old versions costs nothing; it’s just storing a single file, after all.
By the nature of Step Functions workflows, there are rarely security patches or infrastructure changes to worry about when keeping old versions around; all the ‘meat’ is in the Lambdas that the workflow calls, which are not subject to durable execution and so have no versioning problem beyond standard request/response type versioning.
When building Restate we wanted to mix and match the best of all these approaches.
Step Functions have by far the best user experience: you simply don’t need to think about versioning, all thanks to immutable workflows, but you’re limited to writing workflows in ASL.
We really admire the workflow-as-code experience of Azure and Temporal, but code is inherently mutable and this can lead to headaches.
How can we combine the two?
Restate ‘workflows’ are much more like normal code than they are like workflows; they look like RPC handlers.
There are no event consumers; the runtime always makes outbound requests to your services, which can run as long-lived containers, or as Lambda functions.
You just need to register an HTTP or Lambda endpoint with Restate, which will figure out what services run there, create a new version for those services, and start using that version for new requests.

As a side effect of being able to run as Lambda functions, code immutability is easy!
Published Lambda function versions are immutable: any update to code or configuration leads to a new version being deployed.
Versions can be kept around indefinitely, and you can invoke old versions in exactly the same way as new ones, with no additional cost.
By integrating our version abstraction with Lambda’s, we can offer the same experience as Step Functions; in-flight requests will always execute on the code they started with, and new requests will go to the latest code.
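The registration and pinning behaviour described above might be modelled roughly as follows. This is a sketch under stated assumptions, not Restate’s implementation or API: `Registry` and its methods are hypothetical, the service list is passed in by hand rather than discovered, and the ARNs are made-up examples of version-qualified Lambda function ARNs.

```python
class Registry:
    """Toy model: each registration adds an immutable version per service."""

    def __init__(self):
        self.endpoints = {}  # service -> list of endpoints; index + 1 is the version
        self.pinned = {}     # invocation ID -> (service, version)

    def register(self, endpoint, services):
        # Every registration creates a new version; old ones are never modified.
        for service in services:
            self.endpoints.setdefault(service, []).append(endpoint)

    def start(self, invocation_id, service):
        # New requests are pinned to the latest version at the time they start.
        version = len(self.endpoints[service])
        self.pinned[invocation_id] = (service, version)
        return version

    def endpoint_for(self, invocation_id):
        # Retries and replays always go back to the pinned version's endpoint.
        service, version = self.pinned[invocation_id]
        return self.endpoints[service][version - 1]


registry = Registry()

# Hypothetical ARNs; the trailing number is the published Lambda version.
v1_arn = "arn:aws:lambda:eu-west-1:123456789012:function:checkout:1"
v2_arn = "arn:aws:lambda:eu-west-1:123456789012:function:checkout:2"

registry.register(v1_arn, ["Checkout"])
first_version = registry.start("inv-1", "Checkout")

registry.register(v2_arn, ["Checkout"])
second_version = registry.start("inv-2", "Checkout")

in_flight_target = registry.endpoint_for("inv-1")
new_target = registry.endpoint_for("inv-2")
```

Because published Lambda versions are immutable, the endpoint stored for version 1 keeps pointing at exactly the code that `inv-1` started on.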
However, very long running handlers are still a headache.
While Lambdas often have fairly few dependencies beyond the AWS SDK, security patches may still be necessary, and infrastructure may change in such a way that old Lambda versions become non-functional.
Furthermore, if we can bound our request durations to, say, an hour, then we have a tractable problem in other types of deployments, like containers in Kubernetes.
We’d need to be able to keep old versions of code around for an hour.
The easiest way to do this is to deploy both containers side by side in the same Kubernetes pod, serving on different ports or under different paths.
In time, we hope to provide operators and CI tools that make this very easy to manage.
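With a bound on request duration, deciding when an old version can be removed becomes simple date arithmetic. The sketch below assumes a one-hour bound and a hypothetical `can_remove` helper; it is not part of any Restate tooling.

```python
from datetime import datetime, timedelta

# Assumed upper bound on how long any single request may run.
MAX_REQUEST_DURATION = timedelta(hours=1)


def can_remove(superseded_at, now, max_duration=MAX_REQUEST_DURATION):
    """Return True once no request pinned to the old version can still be running.

    The old version stops receiving new requests at `superseded_at`, so every
    request it serves must have started before then and will have finished by
    `superseded_at + max_duration`.
    """
    return now - superseded_at >= max_duration


superseded = datetime(2024, 1, 1, 12, 0)
too_early = can_remove(superseded, datetime(2024, 1, 1, 12, 30))
safe_now = can_remove(superseded, datetime(2024, 1, 1, 13, 0))
```

The check only holds if the duration bound is actually enforced; without one, an old version has to be kept until its last in-flight request is known to have completed.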
Writing handlers that take weeks to complete is still a hard problem, though.
Perhaps we should be asking: why do people actually want to write code like that?
Well, we’ll cover that topic in Part Two!
And in the meantime, if you write code like that and want to tell us about it, join our Discord!