How to do distributed locking


Published by Martin Kleppmann on 08 Feb 2016.

As part of the research for my book, I came across an algorithm called Redlock on the
Redis website. The algorithm claims to implement fault-tolerant distributed locks (or rather,
leases [1]) on top of Redis, and the page asks for feedback from people who are into
distributed systems. The algorithm instinctively set off some alarm bells in the back of my mind,
so I spent some time thinking about it and writing up these notes.

Since there are already more than 10 independent implementations of Redlock and we don't know
who is already relying on this algorithm, I thought it would be worth sharing my notes publicly.
I won't go into other aspects of Redis, some of which have already been critiqued
elsewhere.

Before I go into the details of Redlock, let me say that I quite like Redis, and I have successfully
used it in production in the past. I think it's a good fit in situations where you want to share
some transient, approximate, fast-changing data between servers, and where it's not a big deal if
you occasionally lose that data for whatever reason. For example, a good use case is maintaining
request counters per IP address (for rate limiting purposes) and sets of distinct IP addresses per
user ID (for abuse detection).
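
For instance, a minimal sketch of that kind of best-effort bookkeeping (the redis client object and
the currentMinute() helper here are hypothetical; the underlying Redis commands are INCR, EXPIRE
and SADD):

// Sketch only: a hypothetical Redis client whose methods mirror INCR, EXPIRE and SADD.

// Rate limiting: count requests per IP address in the current minute.
// The key expires on its own, so occasionally losing a counter is harmless.
function recordRequest(ip) {
    var key = 'requests:' + ip + ':' + currentMinute();  // currentMinute() is a hypothetical helper
    redis.incr(key);
    redis.expire(key, 60);
}

// Abuse detection: remember the distinct IP addresses seen for each user.
function recordIpForUser(userId, ip) {
    redis.sadd('ips:' + userId, ip);
}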

However, Redis has been gradually making inroads into areas of data management where there are
stronger consistency and durability expectations – which worries me, because this is not what Redis
is designed for. Arguably, distributed locking is one of those areas. Let's examine it in some more
detail.

What are you using that lock for?

The purpose of a lock is to ensure that among several nodes that might try to do the same piece of
work, only one actually does it (at least only one at a time). That work might be to write some data
to a shared storage system, to perform some computation, to call some external API, or suchlike. At
a high level, there are two reasons why you might want a lock in a distributed application:
for efficiency or for correctness [2]. To distinguish these cases, you can ask what
would happen if the lock failed:

  • Efficiency: Taking a lock saves you from unnecessarily doing the same work twice (e.g. some
    expensive computation). If the lock fails and two nodes end up doing the same piece of work, the
    result is a minor increase in cost (you end up paying 5 cents more to AWS than you otherwise would
    have) or a minor inconvenience (e.g. a user ends up getting the same email notification twice).
  • Correctness: Taking a lock prevents concurrent processes from stepping on each others' toes
    and messing up the state of your system. If the lock fails and two nodes concurrently work on the
    same piece of data, the result is a corrupted file, data loss, permanent inconsistency, the wrong
    dose of a drug administered to a patient, or some other serious problem.

Both are valid cases for wanting a lock, but you need to be very clear about which one of the two
you are dealing with.

I will argue that if you are using locks merely for efficiency purposes, it is unnecessary to incur
the cost and complexity of Redlock, running 5 Redis servers and checking for a majority to acquire
your lock. You are better off just using a single Redis instance, perhaps with asynchronous
replication to a secondary instance in case the primary crashes.

If you use a single Redis instance, of course you will drop some locks if the power suddenly goes
out on your Redis node, or something else goes wrong. But if you're only using the locks as an
efficiency optimization, and the crashes don't happen too often, that's no big deal. This "no big
deal" scenario is where Redis shines. At least if you're relying on a single Redis instance, it is
clear to everyone who looks at the system that the locks are approximate, and only to be used for
non-critical purposes.

On the other hand, the Redlock algorithm, with its 5 replicas and majority voting, looks at first
glance as though it is suitable for situations in which your locking is important for correctness.
I will argue in the following sections that it is not suitable for that purpose. For the rest of
this article we will assume that your locks are important for correctness, and that it is a serious
bug if two different nodes concurrently believe that they are holding the same lock.

Protecting a resource with a lock

Let's leave the particulars of Redlock aside for a moment, and discuss how a distributed lock is
used in general (independent of the particular locking algorithm used). It's important to remember
that a lock in a distributed system is not like a mutex in a multi-threaded application. It's a more
complicated beast, due to the problem that different nodes and the network can all fail
independently in various ways.

For example, say you have an application in which a client needs to update a file in shared storage
(e.g. HDFS or S3). A client first acquires the lock, then reads the file, makes some changes, writes
the modified file back, and finally releases the lock. The lock prevents two clients from performing
this read-modify-write cycle concurrently, which would result in lost updates. The code might look
something like this:

// THIS CODE IS BROKEN
function writeData(filename, data) {
    var lock = lockService.acquireLock(filename);
    if (!lock) {
        throw 'Failed to acquire lock';
    }

    try {
        var file = storage.readFile(filename);
        var updated = updateContents(file, data);
        storage.writeFile(filename, updated);
    } finally {
        lock.release();
    }
}

Unfortunately, even if you have a perfect lock service, the code above is broken. The following
diagram shows how you can end up with corrupted data:

Unsafe access to a resource protected by a distributed lock

In this example, the client that acquired the lock is paused for an extended period of time while
holding the lock – for example because the garbage collector (GC) kicked in. The lock has a timeout
(i.e. it is a lease), which is always a good idea (otherwise a crashed client could end up holding
a lock forever and never releasing it). However, if the GC pause lasts longer than the lease expiry
period, and the client doesn't realise that it has expired, it may go ahead and make some unsafe
change.

This bug is not theoretical: HBase used to have this problem [3,4]. Normally,
GC pauses are quite short, but "stop-the-world" GC pauses have sometimes been known to last for
several minutes [5] – certainly long enough for a lease to expire. Even so-called
"concurrent" garbage collectors like the HotSpot JVM's CMS cannot fully run in parallel with the
application code – even they have to stop the world from time to time [6].

You cannot fix this problem by inserting a check on the lock expiry just before writing back to
storage. Remember that GC can pause a running thread at any point, including the point that is
maximally inconvenient for you (between the final check and the write operation).
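
For illustration, here is what such a check would look like (the isStillValid() method is
hypothetical); the comment marks the window in which a pause or delay still breaks it:

// STILL BROKEN: checking the lease just before the write does not help
function writeData(filename, data) {
    var lock = lockService.acquireLock(filename);
    if (!lock) {
        throw 'Failed to acquire lock';
    }

    try {
        var file = storage.readFile(filename);
        var updated = updateContents(file, data);
        if (lock.isStillValid()) {  // hypothetical expiry check
            // The process can be paused right here, after the check has passed
            // but before the write reaches the storage service, so the lease
            // may nevertheless have expired by the time the write is applied.
            storage.writeFile(filename, updated);
        }
    } finally {
        lock.release();
    }
}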

And if you're feeling smug because your programming language runtime doesn't have long GC pauses,
there are many other reasons why your process might get paused. Maybe your process tried to read an
address that is not yet loaded into memory, so it gets a page fault and is paused until the page is
loaded from disk. Maybe your disk is actually EBS, and so reading a variable unwittingly turned into
a synchronous network request over Amazon's congested network. Maybe there are many other processes
contending for CPU, and you hit a black node in your scheduler tree. Maybe someone
accidentally sent SIGSTOP to the process. Whatever. Your processes will get paused.

If you still don't believe me about process pauses, then consider instead that the file-writing
request may get delayed in the network before reaching the storage service. Packet networks such as
Ethernet and IP may delay packets arbitrarily, and they do [7]: in a famous
incident at GitHub, packets were delayed in the network for approximately 90
seconds [8]. This means that an application process may send a write request, and it may reach
the storage server a minute later, when the lease has already expired.

Even in well-managed networks, this kind of thing can happen. You simply cannot make any assumptions
about timing, which is why the code above is fundamentally unsafe, no matter what lock service you
use.

Making the lock safe with fencing

The fix for this problem is actually pretty simple: you need to include a fencing token with every
write request to the storage service. In this context, a fencing token is simply a number that
increases (e.g. incremented by the lock service) every time a client acquires the lock. This is
illustrated in the following diagram:

Using fencing tokens to make resource access safe

Client 1 acquires the lease and gets a token of 33, but then it goes into a long pause and the lease
expires. Client 2 acquires the lease, gets a token of 34 (the number always increases), and then
sends its write to the storage service, including the token of 34. Later, client 1 comes back to
life and sends its write to the storage service, including its token value 33. However, the storage
server remembers that it has already processed a write with a higher token number (34), and so it
rejects the request with token 33.

Note that this requires the storage server to take an active role in checking tokens, and rejecting
any writes on which the token has gone backwards. But this is not particularly hard, once you know
the trick. And provided that the lock service generates strictly monotonically increasing tokens,
this makes the lock safe. For example, if you are using ZooKeeper as lock service, you can use the
zxid or the znode version number as fencing token, and you're in good shape [3].
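
As a rough sketch of what the storage-side check might look like (the per-file token bookkeeping
and the extra token argument on writeFile are illustrative assumptions, not part of any particular
storage API):

// Storage service side: reject any write whose fencing token is not newer than the
// highest token already seen for that file. In a real service the check and the
// update would have to be applied atomically.
var highestTokenSeen = {};

function handleWrite(filename, contents, token) {
    if (highestTokenSeen[filename] !== undefined && token <= highestTokenSeen[filename]) {
        throw 'Rejected: fencing token ' + token + ' is stale';
    }
    highestTokenSeen[filename] = token;
    applyWrite(filename, contents);  // hypothetical underlying write
}

// Client side: the token handed out by the lock service accompanies every write.
var lock = lockService.acquireLock(filename);  // suppose it returns { token: 34, ... }
storage.writeFile(filename, updated, lock.token);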

However, this leads us to the first big problem with Redlock: it does not have any facility for
generating fencing tokens. The algorithm does not produce any number that is guaranteed to increase
every time a client acquires a lock. This means that even if the algorithm were otherwise perfect,
it would not be safe to use, because you cannot prevent the race condition between clients in the
case where one client is paused or its packets are delayed.

And it is not obvious to me how one would change the Redlock algorithm to start generating fencing
tokens. The unique random value it uses does not provide the required monotonicity. Simply keeping
a counter on one Redis node would not be sufficient, because that node may fail. Keeping counters on
several nodes would mean they would go out of sync. It is likely that you would need a consensus
algorithm just to generate the fencing tokens. (If only incrementing a counter was
simple.)

Using time to solve consensus

The fact that Redlock fails to generate fencing tokens should already be sufficient reason not to
use it in situations where correctness depends on the lock. But there are some further problems that
are worth discussing.

In the academic literature, the most practical system model for this kind of algorithm is the
asynchronous model with unreliable failure detectors [9]. In plain English,
this means that the algorithms make no assumptions about timing: processes may pause for arbitrary
lengths of time, packets may be arbitrarily delayed in the network, and clocks may be arbitrarily
wrong – and the algorithm is nevertheless expected to do the right thing. Given what we discussed
above, these are very reasonable assumptions.

The only purpose for which algorithms may use clocks is to generate timeouts, to avoid waiting
forever if a node is down. But timeouts do not have to be accurate: just because a request times
out, that doesn't mean that the other node is definitely down – it could just as well be that there
is a large delay in the network, or that your local clock is wrong. When used as a failure detector,
timeouts are just a guess that something is wrong. (If they could, distributed algorithms would do
without clocks entirely, but then consensus becomes impossible [10]. Acquiring a lock is
like a compare-and-set operation, which requires consensus [11].)
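
To see why, note that taking the lock boils down to "make me the owner, but only if nobody currently
holds it" – i.e. a compare-and-set on the lock's owner (a sketch, with a hypothetical compareAndSet
primitive):

// Acquiring a lock is essentially compare-and-set on the lock's owner field:
// succeed only if the owner is currently null, otherwise fail.
function tryAcquire(lockRecord, myId) {
    return compareAndSet(lockRecord, 'owner', null, myId);  // hypothetical CAS primitive
}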

Note that Redis uses gettimeofday, not a monotonic clock, to
determine the expiry of keys. The man page for gettimeofday explicitly
says that the time it returns is subject to discontinuous jumps in system time –
that is, it might suddenly jump forwards by a few minutes, or even jump back in time (e.g. if the
clock is stepped by NTP because it differs from an NTP server by too much, or if the
clock is manually adjusted by an administrator). Thus, if the system clock is doing weird things, it
could easily happen that the expiry of a key in Redis is much faster or much slower than expected.
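
To illustrate the difference between the two kinds of clock (Node.js-style APIs are used for
concreteness; the point is the contrast, not the specific runtime):

// Wall-clock time can jump if NTP steps the clock or an administrator changes it.
var deadline = Date.now() + 10000;  // "expires 10 seconds from now"
// If the clock is stepped forward by a minute, this deadline appears to have passed
// immediately; if it is stepped backwards, the deadline lasts far longer than intended.

// A monotonic clock is unaffected by such adjustments, so elapsed time measured with it
// is trustworthy (in Node.js, process.hrtime.bigint() is monotonic).
var start = process.hrtime.bigint();
// ... do some work ...
var elapsedNanos = process.hrtime.bigint() - start;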

For algorithms in the asynchronous model this is not a big problem: these algorithms generally
ensure that their safety properties always hold, without making any timing
assumptions [12]. Only liveness properties depend on timeouts or some other failure
detector. In plain English, this means that even if the timings in the system are all over the place
(processes pausing, networks delaying, clocks jumping forwards and backwards), the performance of an
algorithm might go to hell, but the algorithm will never make an incorrect decision.

However, Redlock is not like this. Its safety depends on a lot of timing assumptions: it assumes
that all Redis nodes hold keys for approximately the right length of time before expiring; that the
network delay is small compared to the expiry duration; and that process pauses are much shorter
than the expiry duration.

Breaking Redlock with bad timings

Let's look at some examples to demonstrate Redlock's reliance on timing assumptions. Say the system
has five Redis nodes (A, B, C, D and E), and two clients (1 and 2). What happens if a clock on one
of the Redis nodes jumps forward?

  1. Client 1 acquires lock on nodes A, B, C. Due to a network issue, D and E cannot be reached.
  2. The clock on node C jumps forward, causing the lock to expire.
  3. Client 2 acquires lock on nodes C, D, E. Due to a network issue, A and B cannot be reached.
  4. Clients 1 and 2 now both believe they hold the lock.

A similar issue could happen if C crashes before persisting the lock to disk, and immediately
restarts. For this reason, the Redlock documentation recommends delaying restarts of
crashed nodes for at least the time-to-live of the longest-lived lock. But this restart delay again
relies on a reasonably accurate measurement of time, and would fail if the clock jumps.

OK, so maybe you think that a clock jump is unrealistic, because you're very confident in having
correctly configured NTP to only ever slew the clock. In that case, let's look at an example of how
a process pause may cause the algorithm to fail:

  1. Client 1 requests lock on nodes A, B, C, D, E.
  2. While the responses to client 1 are in flight, client 1 goes into stop-the-world GC.
  3. Locks expire on all Redis nodes.
  4. Client 2 acquires lock on nodes A, B, C, D, E.
  5. Client 1 finishes GC, and receives the responses from the Redis nodes indicating that it
    successfully acquired the lock (they were held in client 1's kernel network buffers while the
    process was paused).
  6. Clients 1 and 2 now both believe they hold the lock.

Note that even though Redis is written in C, and thus doesn't have GC, that doesn't help us here:
any system in which the clients may experience a GC pause has this problem. You can only make this
safe by preventing client 1 from performing any operations under the lock after client 2 has
acquired the lock, for example using the fencing approach above.

A long network delay can produce the same effect as the process pause. It perhaps depends on your
TCP user timeout – if you make the timeout significantly shorter than the Redis TTL, perhaps the
delayed network packets would be ignored, but we'd have to look in detail at the TCP implementation
to be sure. Also, with the timeout we're back down to accuracy of time measurement again!

The synchrony assumptions of Redlock

These examples show that Redlock works correctly only if you assume a synchronous system model –
that is, a system with the following properties:

  • bounded network delay (you can guarantee that packets always arrive within some guaranteed
    maximum delay),
  • bounded process pauses (in other words, hard real-time constraints, which you typically only
    find in car airbag systems and suchlike), and
  • bounded clock error (cross your fingers that you don't get your time from a bad NTP
    server).

Note that a synchronous model does not mean exactly synchronised clocks: it means you are assuming
a known, fixed upper bound on network delay, pauses and clock drift [12]. Redlock
assumes that delays, pauses and drift are all small relative to the time-to-live of a lock; if the
timing issues become as large as the time-to-live, the algorithm fails.

In a reasonably well-behaved datacenter environment, the timing assumptions will be satisfied most
of the time – this is known as a partially synchronous system [12]. But is that good
enough? As soon as those timing assumptions are broken, Redlock may violate its safety properties,
e.g. granting a lease to one client before another client's lease has expired. If you're relying on
your lock for correctness, "most of the time" is not enough – you need it to always be correct.

There is plenty of evidence that it is not safe to assume a synchronous system model for most
practical system environments [7,8]. Keep reminding yourself of the GitHub incident with the
90-second packet delay. It is unlikely that Redlock would survive a Jepsen test.

On the other hand, a consensus algorithm designed for a partially synchronous system model (or an
asynchronous model with failure detector) actually has a chance of working. Raft, Viewstamped
Replication, Zab and Paxos all fall in this category. Such an algorithm must let go of all timing
assumptions. That's hard: it's so tempting to assume that networks, processes and clocks are more
reliable than they really are. But in the messy reality of distributed systems, you have to be very
careful with your assumptions.

Conclusion

I think the Redlock algorithm is a poor choice because it is "neither fish nor fowl": it is
unnecessarily heavyweight and expensive for efficiency-optimization locks, but it is not
sufficiently safe for situations in which correctness depends on the lock.

In particular, the algorithm makes dangerous assumptions about timing and system clocks (essentially
assuming a synchronous system with bounded network delay and bounded execution time for operations),
and it violates safety properties if those assumptions are not met. Moreover, it lacks a facility
for generating fencing tokens (which protect a system against long delays in the network or in
paused processes).

If you need locks only on a best-effort basis (as an efficiency optimization, not for correctness),
I would recommend sticking with the straightforward single-node locking algorithm for
Redis (conditional set-if-not-exists to obtain a lock, atomic delete-if-value-matches to release
a lock), and documenting very clearly in your code that the locks are only approximate and may
occasionally fail. Don't bother with setting up a cluster of five Redis nodes.
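
For reference, a sketch of that single-node pattern (the redis client object here is hypothetical,
but the underlying commands – SET with the NX and PX options, and an EVAL'd Lua script for the
release – are standard Redis; key name, token and TTL are illustrative):

// Acquire: set the key only if it does not already exist (NX), with an expiry (PX),
// storing a random value that identifies this particular lock holder.
var token = generateRandomToken();  // hypothetical helper
var acquired = redis.set('lock:' + resourceName, token, 'NX', 'PX', 30000);

// Release: delete the key only if it still holds our value, so that we never release
// a lock that has expired and been acquired by someone else in the meantime.
// The check-and-delete must be atomic, hence a Lua script executed inside Redis:
var releaseScript =
    'if redis.call("get", KEYS[1]) == ARGV[1] then ' +
    '    return redis.call("del", KEYS[1]) ' +
    'else ' +
    '    return 0 ' +
    'end';
redis.eval(releaseScript, 1, 'lock:' + resourceName, token);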

On the other hand, if you need locks for correctness, please don't use Redlock. Instead, please use
a proper consensus system such as ZooKeeper, probably via one of the Curator recipes
that implements a lock. (At the very least, use a database with reasonable transactional
guarantees.) And please enforce the use of fencing tokens on all resource accesses under the
lock.

As I said at the beginning, Redis is an excellent tool if you use it correctly. None of the above
diminishes the usefulness of Redis for its intended purposes. Salvatore has been very
dedicated to the project for years, and its success is well deserved. But every tool has
limitations, and it is important to know them and to plan accordingly.

If you want to learn more, I explain this topic in greater detail in chapters 8 and 9 of my
book, now available in Early Release from O'Reilly. (The diagrams above are taken from my
book.) For learning how to use ZooKeeper, I recommend Junqueira and Reed's book [3].
For a good introduction to the theory of distributed systems, I recommend Cachin, Guerraoui and
Rodrigues' textbook [13].

Thank you to Kyle Kingsbury, Camille Fournier, Flavio Junqueira, and
Salvatore Sanfilippo for reviewing a draft of this article. Any errors are mine, of
course.

Update 9 Feb 2016: Salvatore, the original author of Redlock, has
posted a rebuttal to this article (see also the
HN discussion). He makes some good points, but
I stand by my conclusions. I'll elaborate in a follow-up post if I have time, but please form your
own opinions – and please consult the references below, many of which have received rigorous
academic peer review (unlike either of our blog posts).

References

[1] Cary G Gray and David R Cheriton:
"Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency,"
at 12th ACM Symposium on Operating Systems Principles (SOSP), December 1989.
doi:10.1145/74850.74870

[2] Mike Burrows:
"The Chubby lock service for loosely-coupled distributed systems,"
at 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), November 2006.

[3] Flavio P Junqueira and Benjamin Reed:
ZooKeeper: Distributed Process Coordination. O’Reilly Media, November 2013.
ISBN: 978-1-4493-6130-3

[4] Enis Söztutar:
"HBase and HDFS: Understanding filesystem usage in HBase," at HBaseCon, June 2013.

[5] Todd Lipcon:
"Avoiding Full GCs in Apache HBase with MemStore-Local Allocation Buffers: Part 1,"
blog.cloudera.com, 24 February 2011.

[6] Martin Thompson: “Java Garbage Collection Distilled,”
mechanical-sympathy.blogspot.co.uk, 16 July 2013.

[7] Peter Bailis and Kyle Kingsbury: "The Network is Reliable,"
ACM Queue, volume 12, number 7, July 2014.
doi:10.1145/2639988.2639988

[8] Mark Imbriaco: “Downtime last Saturday,” github.com, 26 December 2012.

[9] Tushar Deepak Chandra and Sam Toueg:
"Unreliable Failure Detectors for Reliable Distributed Systems,"
Journal of the ACM, volume 43, number 2, pages 225–267, March 1996.
doi:10.1145/226643.226647

[10] Michael J Fischer, Nancy Lynch, and Michael S Paterson:
"Impossibility of Distributed Consensus with One Faulty Process,"
Journal of the ACM, volume 32, number 2, pages 374–382, April 1985.
doi:10.1145/3149.214121

[11] Maurice P Herlihy: "Wait-Free Synchronization,"
ACM Transactions on Programming Languages and Systems, volume 13, number 1, pages 124–149, January 1991.
doi:10.1145/114005.102808

[12] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer:
"Consensus in the Presence of Partial Synchrony,"
Journal of the ACM, volume 35, number 2, pages 288–323, April 1988.
doi:10.1145/42282.42283

[13] Christian Cachin, Rachid Guerraoui, and Luís Rodrigues:
Introduction to Reliable and Secure Distributed Programming,
Second Edition. Springer, February 2011. ISBN: 978-3-642-15259-7,
doi:10.1007/978-3-642-15260-3


