
Deterministic Simulation Testing for Our Entire SaaS – WarpStream

2024-03-12 10:21:23

Deterministic Simulation Testing

Deterministic simulation testing is quickly becoming the gold standard for how mission-critical software is tested. It was first popularized by the FoundationDB team, who spent 18 months building a deterministic simulation framework for their database before ever letting it write or read data from an actual physical disk. The results speak for themselves: FoundationDB is widely considered to be one of the most robust and well-tested distributed databases, so much so that Kyle Kingsbury (of Jepsen fame) declined to test it because their deterministic simulator already stress-tested FoundationDB more than the Jepsen framework ever could.

The WarpStream team used FoundationDB heavily at Datadog when we built Husky, Datadog’s columnar storage engine for event data. Over the course of our careers, our team has operated (and been on-call for) almost every database on the market: M3DB, etcd, ZooKeeper, Cassandra, Elasticsearch, Redis, MongoDB, MySQL, Postgres, Apache Kafka, and more. In our experience, FoundationDB stands in a league of its own in terms of correctness and reliability because of its early investment in deterministic simulation testing.

A more recent example of a database system that leverages this approach to testing is TigerBeetle, a financial transactions database that uses deterministic simulation testing to build one of the most robust financial OLTP databases available today.

When we were designing WarpStream, we knew it wouldn’t be enough to simply replace Apache Kafka with something cheaper and easier to operate. Kafka is the beating heart of many companies’ most critical infrastructure, and if we were to stand any chance of convincing those organizations to adopt WarpStream, we’d have to compress 12+ years of production hardening into a much shorter timeframe. We accelerated this process with our architectural decision to rely on object storage as the only storage in the system, bypassing many of the hard problems of ensuring data durability, availability, and replication at scale. However, the fact that WarpStream leverages object storage is only a small part of ensuring the correctness of the overall system.

Antithesis

When we first heard about Antithesis, we could hardly contain our excitement. Antithesis has created the holy grail for testing distributed systems: a bespoke hypervisor that deterministically simulates an entire set of Docker containers and injects faults, created by the same people who made FoundationDB. For a group of gray-haired distributed systems engineers, seeing Antithesis in action felt like a tribe of cavemen stumbling upon a post-industrial revolution society. As we spoke more with the Antithesis team, an idea began to crystallize: we could use Antithesis to deterministically simulate not only WarpStream, but our entire SaaS!

WarpStream was built differently than most traditional database products. It was designed from day one with a true data plane / control plane split. There are two major components to WarpStream: First, the Agents (data plane), which act as “thick proxies” and expose the Kafka protocol to clients. The Agents also handle all communication with object storage, layering in batching and caching to improve performance and keep costs low.


Second is the WarpStream control plane, which has two major components:

  1. The metadata store that tracks cluster metadata and performs remote consensus.
  2. Our SaaS software that manages different tenants’ metadata stores, API keys, users, accounts, etc.

The metadata store has only two dependencies:

  1. Any cloud KV store
  2. Object storage

The SaaS software adds one extra dependency: a traditional SQL database for managing users, organizations, API keys, etc. Looking at WarpStream’s minimal dependencies, we thought, why not test the entire customer experience, from initial signup to running Kafka workloads?

We created a docker-compose file that contains the following components (a rough sketch follows the list):

  1. Several WarpStream Agents
  2. Several WarpStream Control Plane nodes
  3. Several Apache Kafka clients
  4. A KV store
  5. Postgres
  6. An object store (localstack)
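To make the topology concrete, here is what such a compose file might roughly look like. The service names, image names, and environment variables below are placeholders of ours for illustration, not WarpStream’s actual configuration:

```yaml
# Hypothetical sketch of the test topology; images, env vars, and service
# names are placeholders, not WarpStream's real compose file.
version: "3.8"
services:
  warpstream-agent:
    image: example/warpstream-agent:latest          # data plane node
    environment:
      BUCKET_URL: "s3://warpstream-test"            # points at localstack below
    depends_on: [control-plane, localstack]
  control-plane:
    image: example/warpstream-control-plane:latest  # metadata store + SaaS layer
    depends_on: [kv-store, postgres, localstack]
  workload:
    image: example/kafka-test-workload:latest       # producers, consumers, assertions
    depends_on: [warpstream-agent]
  kv-store:
    image: example/cloud-kv-emulator:latest         # stand-in for a cloud KV store
  postgres:
    image: postgres:15                              # SaaS users, orgs, API keys
  localstack:
    image: localstack/localstack:latest             # emulated S3 object storage
```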

With the help of the Antithesis team, we wrote a test workload that started all of those services, signed up for a WarpStream account, created a virtual cluster, and then began producing and consuming data. The workload was carefully structured so that we could assert on a variety of important properties that WarpStream must maintain at all times.

The test workload consists of several producers that are each assigned a unique ID and write data to a small set of topics. These producers synchronously write several small JSON records that contain the producer’s ID, a counter (a monotonic sequence number for that producer), and a few other properties. We repeat the same fields in the record’s key, value, and a header to make sure we never shuffle them around accidentally. The consumer side of the workload polls all of the topics and all of the partitions and asserts that:

  1. The topic and partition the record was consumed from match the topic and partition the record was produced to.
  2. The offsets for each record in each partition are monotonic.
  3. The sequence numbers for each producer are monotonic, i.e. if we group the records by <Topic, Partition, ProducerID>, the sequence number encoded in the record is monotonically increasing.

The consumers store all of the records for each polling iteration and can assert that a record at offset X in a previous poll still exists in a future poll. This ensures that WarpStream doesn’t lose or reorder data as, for example, background compaction runs to reorganize the cluster’s data for more efficient access.
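To make those invariants concrete, here is a heavily simplified sketch, in Go, of the kind of bookkeeping the consumer side of such a workload might do. The type and function names are ours for illustration, not WarpStream’s actual workload code:

```go
package workload

import "fmt"

// Record mirrors the JSON payload each producer writes: the producer's
// unique ID, a monotonic sequence number, and the topic/partition it was
// produced to. The same fields are repeated in the record's key, value,
// and a header so that accidental shuffling is detectable.
type Record struct {
	Topic      string
	Partition  int32
	ProducerID string
	SeqNum     int64
}

// checker tracks the highest offset seen per partition and the highest
// sequence number seen per <Topic, Partition, ProducerID>.
type checker struct {
	lastOffset map[string]int64 // key: topic/partition
	lastSeq    map[string]int64 // key: topic/partition/producerID
}

func newChecker() *checker {
	return &checker{lastOffset: map[string]int64{}, lastSeq: map[string]int64{}}
}

// observe is called for every record returned by a poll. payload is the
// decoded JSON body written by the producer; topic, partition, and offset
// describe where the record was actually consumed from.
func (c *checker) observe(payload Record, topic string, partition int32, offset int64) error {
	// Assertion 1: the record came back from the topic/partition it was produced to.
	if payload.Topic != topic || payload.Partition != partition {
		return fmt.Errorf("record produced to %s/%d was consumed from %s/%d",
			payload.Topic, payload.Partition, topic, partition)
	}
	// Assertion 2: offsets within each partition are monotonically increasing.
	pk := fmt.Sprintf("%s/%d", topic, partition)
	if last, ok := c.lastOffset[pk]; ok && offset <= last {
		return fmt.Errorf("offset went backwards in %s: %d <= %d", pk, offset, last)
	}
	c.lastOffset[pk] = offset
	// Assertion 3: sequence numbers per <Topic, Partition, ProducerID> are monotonic,
	// so a duplicated or reordered record from a producer is caught here.
	sk := pk + "/" + payload.ProducerID
	if last, ok := c.lastSeq[sk]; ok && payload.SeqNum <= last {
		return fmt.Errorf("duplicate or reordered record for %s: seq %d <= %d", sk, payload.SeqNum, last)
	}
	c.lastSeq[sk] = payload.SeqNum
	return nil
}
```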

These assertions address many of the classes of bugs found in previous Jepsen tests of Apache Kafka and other Kafka-compatible systems. For example, prior Jepsen tests have caught bugs like:

  1. Loss of previously acknowledged writes. If a write was acknowledged but failed to appear in the output for that topic-partition at some point in the future, assertion 1 or 3 above would fire.
  2. Violation of producer idempotency (i.e. producing duplicate records without the producer itself crashing or restarting). Antithesis automatically tests our Idempotent Producer implementation by disrupting the network between the producer client and the Agent, or between the Agent and the control plane, leading to internal retries inside the Kafka client. A duplicate would cause the sequence number from a producer to stay the same or decrease, causing assertion 3 above to fire.
  3. Records appearing in different topic-partitions than the ones they were originally written to. This is addressed by assertions 1 or 3 above.

What’s the big deal?

At this point you might be scratching your head a little bit and wondering: “What’s the big deal here? Isn’t this just a really fancy integration test!?” Yes and no. Before we started using Antithesis, WarpStream already had a fairly robust set of stress tests we called the “correctness tests”.

These tests do essentially everything we just described, but in a regular CI environment. Our correctness tests even inject faults all over the WarpStream stack using a custom chaos injection library that we wrote. These tests are incredibly powerful, and they caught a lot of bugs. We would go so far as to say that investing deeply in these correctness tests is one of the main reasons we were able to develop WarpStream as effectively as we did.

Just like our existing correctness tests, the Antithesis hypervisor automatically injects faults, latency, thread hangs, and restarts into the workload. However, unlike our correctness tests, the Antithesis hypervisor is really smart and automatically fuzzes the system under test in an intelligent manner.

Antithesis automatically instruments your software to measure code coverage and build statistics about the execution frequency of each code path. This enables Antithesis to detect “interesting” behavior in the test (such as infrequent code paths getting exercised, or unusual log messages being emitted).

When Antithesis detects interesting or unusual behavior, it immediately snapshots the state of the entire system before exploring various different execution branches. This means Antithesis is much better at triggering rare or unlikely behavior in WarpStream than our existing correctness tests were.

Also, since Antithesis runs the entire software stack in a deterministic simulator, it can actually run the simulation faster than wall clock time. Similar to FoundationDB, WarpStream makes heavy use of timers and batching to improve performance. Anytime a WarpStream Goroutine does the equivalent of time.Sleep(), the Antithesis hypervisor doesn’t actually have to wait. On top of that, the Antithesis hypervisor explores code branches concurrently. All of this adds up in a meaningful way, such that Antithesis can cheaply compress years of stress testing into a much shorter timeframe.

It’s hard to over-emphasize just how transformative this technology is for building distributed systems. For all intents and purposes, it really does feel like a time-traveler arrived from 20 years in the future and gave us their state-of-the-art software testing technology. Of course, it’s not actually magic. Antithesis is the result of dozens of the smartest software engineers, statisticians, and machine learning experts pouring their hearts and souls into the problem of software testing for 5 years straight. But to us mere mortals, it does feel a lot like magic.

We found some bugs

Let’s look at a few example runs that Antithesis generated for us.

Antithesis ran the WarpStream workload for 6 wall clock hours, during which it simulated 280 hours of application time. The graph shows that it took about 160 “application hours” for Antithesis to “stall” and stop discovering new “behaviors” in the WarpStream workload. This means that running the tests for longer than 160 hours has diminishing returns, and we should instead invest in making the test itself more sophisticated if we want to exercise the codebase more. Great feedback for us!

But think about that for a second: even after 140 hours of injecting faults, randomizing thread execution, automatically detecting that something interesting / unusual had happened, and intentionally branching to investigate further, Antithesis was still “discovering” new behaviors in WarpStream. We could hire 100 distributed systems engineers and have them write integration tests for an entire year, and they probably wouldn’t be able to trigger all of the interesting states and behaviors that a single Antithesis run covered in 6 hours of wall clock time.

As just one example of how powerful this is, on the first day we started using Antithesis it caught a data race in our metrics instrumentation library that had been present since the first month of the project.


Our correctness tests had run in our regular CI workflows for literally tens of thousands of hours by then, with the Go race detector enabled, and never once caught this bug. Antithesis caught it in its first 233 seconds of execution.
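For context on the bug class: the Go race detector only reports a race it actually observes at runtime, so an unlucky interleaving can hide for years even under heavy CI. A contrived illustration of the kind of race it looks for (our own toy example, not WarpStream’s actual instrumentation code):

```go
package metrics

import "sync"

// counter is a toy metrics counter. Add is racy: concurrent goroutines
// read-modify-write count with no synchronization. The race detector only
// flags this if two goroutines actually interleave on it during the run.
type counter struct {
	count int64
}

func (c *counter) Add(delta int64) {
	c.count += delta // data race when called from multiple goroutines
}

// safeCounter is the fixed version: the same counter guarded by a mutex
// (sync/atomic would work just as well).
type safeCounter struct {
	mu    sync.Mutex
	count int64
}

func (c *safeCounter) Add(delta int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.count += delta
}
```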

A data race in the instrumentation library isn’t that exciting, though. What about an extremely rare data loss bug that’s the result of both a network failure and a race condition? That’s more exciting!

To minimize the number of S3 PUTs that WarpStream users have to pay for, the Agents buffer Kafka Produce requests from many different clients in memory for ~250ms before combining the batches of data into a single file and flushing it to object storage.

In some scenarios, like when write throughput is high, there will be multiple outstanding files being flushed to object storage concurrently. Once flushing the files succeeds, committing the metadata for the flushed files to the control plane can be batched to reduce networking overhead. This is accomplished using a background Goroutine that periodically scans the list of “flushed but not yet committed” files.
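A heavily simplified sketch of that pattern, where the type names and the exact polling interval are our own assumptions rather than the Agent’s real implementation, might look like this:

```go
package agent

import (
	"context"
	"sync"
	"time"
)

type fileState int

const (
	fileFlushing  fileState = iota // upload to object storage in progress
	fileFlushed                    // upload succeeded, metadata not yet committed
	fileCommitted                  // metadata committed to the control plane
)

type trackedFile struct {
	id    string
	state fileState
}

type committer struct {
	mu    sync.Mutex
	files []*trackedFile
}

// run is the background goroutine: every few milliseconds it scans for files
// that have been flushed to object storage but whose metadata has not yet
// been committed, and commits them to the control plane in a single batch.
func (c *committer) run(ctx context.Context, commitBatch func([]*trackedFile) error) {
	ticker := time.NewTicker(5 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			c.mu.Lock()
			var ready []*trackedFile
			for _, f := range c.files {
				if f.state == fileFlushed {
					ready = append(ready, f)
				}
			}
			c.mu.Unlock()
			if len(ready) == 0 {
				continue
			}
			// Batch the metadata commit to reduce networking overhead.
			if err := commitBatch(ready); err != nil {
				continue // retry on the next tick
			}
			c.mu.Lock()
			for _, f := range ready {
				f.state = fileCommitted
			}
			c.mu.Unlock()
		}
	}
}
```

The regression described next effectively let a file whose upload had failed pass through the equivalent of the fileFlushed state for a sliver of time, which is exactly the state this loop looks for.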

While refactoring the Agent to add speculative retries for flushing files to object storage, we subtly broke the error handling on this path so that, for a very brief window of time, a file that had failed to flush would be considered successful and ready to commit to the control plane metadata store. In program order (i.e. the linear flow of the code, ignoring concurrency), the window in which the background Goroutine that commits metadata could see the “successful” file would be nearly impossible to squeeze into. That background Goroutine only polls for successful files every 5 milliseconds, and the time between the two state transitions in the common case would be less than a microsecond!

This bug is the manifestation of two unlikely events: a file failing to flush, and a specific thread interleaving that should be extremely rare in practice. Regardless of how unlikely these events are to occur together, on a long enough time-scale this bug would have resulted in data loss at some point.

Instead, thanks to Antithesis’ powerful fuzzer and fault injector, this rare combination of events occurred roughly once per wall clock hour of testing. We had been running a build with this bug in our staging environment and obviously didn’t encounter the bug at all, let alone once per hour, as it would have immediately been noticed when a future background compaction failed due to the missing file in object storage. We’ve since fixed the regression in the code such that the invalid, temporary state transition can’t occur.

Why not Jepsen?

The obvious question you might be asking yourself at this point is: Why use Antithesis instead of a traditional Jepsen test? It’s a good question, and one we asked ourselves before embarking on our journey with Antithesis.

We’re big fans of Jepsen and have consumed almost every published report. However, after speaking with the Antithesis team and spending several months integrating with it, we feel strongly that deterministic simulation testing with tools like Antithesis is a much more robust and sustainable path forward for the industry. Specifically, we think Antithesis’ approach is better than Jepsen’s for several reasons:

  1. The Antithesis technology is more robust, and more likely to catch bugs, than the Jepsen harness. There is simply no other equivalent (that we’re aware of) to Antithesis’ custom hypervisor and its ability to automatically instrument distributed systems for code coverage and effectively “hunt” for bugs. Yes, the Jepsen framework will inject faults into a running environment in an effort to trigger bugs and edge-case behavior, but this approach is crude compared to what Antithesis does.
  2. Antithesis integrates natively into how our engineers are used to working. The entire test setup is expressed using standard docker-compose files and Docker images, and Antithesis tests are kicked off using GitHub Actions that push WarpStream images to Antithesis’ Docker registry. When we add new functionality, all our engineers have to do is modify the Antithesis workload and kick off an automated CI job. The entire experience and workflow lives right next to our existing codebase, CI, and workflows. Added bonus: none of our engineers had to learn Clojure.
  3. Antithesis testing is designed to be a continuous process, with accompanying professional services that help you grow and adapt the tests as the scope of your product increases. That means our users get the confidence that every WarpStream release is actively tested with Antithesis, unlike a traditional Jepsen test where the engagement is short-lived and usually only covers a “snapshot” of a system at a static point in time.
  4. Finally, it would not have been practical to continuously test our entire SaaS platform with Jepsen in the same way that we do with Antithesis. While that may seem like overkill, we think it’s a pretty important point. For example, consider the fact that almost every cloud infrastructure provider has a routing or proxy layer that’s responsible for routing customer requests to the correct set of infrastructure that hosts the customer’s resources. A small data race or caching bug in that routing layer could result in exposing one customer’s resources to a different customer. These multi-tenant SaaS layers are never tested in traditional Jepsen testing, but with Antithesis it was actually easier to include these layers in our testing than to specifically exclude them.

We’re just getting started with Antithesis! Over the coming months we plan to work with the Antithesis team to expand our testing footprint to cover more functionality, like:

  1. Multi-region deployments of our SaaS platform.
  2. Multi-role Agent Clusters.
  3. Injecting and detecting data corruption at the storage and file cache layers using checksums.
  4. And much more!

If you’d like to learn more about WarpStream, please contact us, or join our Slack!
