A Fail-Slow Detection Framework for Cloud Storage Systems

2023-04-30 13:12:48

The Perseus paper won a best paper award at FAST (File and Storage Technologies) and is one in a series I will be writing about from that conference. These paper reviews can be delivered weekly to your inbox, or you can subscribe to the Atom feed. As always, feel free to reach out on Twitter with feedback or suggestions!

Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems

What is the research?

This paper describes a system for detecting fail-slow instances in Alibaba storage clusters. Fail-slow is a failure mode in which hardware keeps running but performs worse than expected.

The Perseus paper focuses specifically on detecting fail-slow in storage devices, although the class of problem impacts a wide variety of hardware. If left unresolved, fail-slow instances can dramatically impact performance and tail latency.

The fail-slow phenomenon is difficult to detect for a variety of reasons. Typical SLO-based approaches that monitor a fixed performance threshold are insufficiently sensitive: varying load can cause periodic performance regressions, even when a drive is healthy.

What are the paper’s contributions?

The paper makes four main contributions:

  • A framework for detecting instances of fail-slow at scale, including takeaways about what didn’t work.
  • The design and implementation of Perseus, a system for detecting fail-slow instances in storage clusters.
  • An evaluation of the technique against a ground truth dataset from Alibaba.
  • Root-cause analysis of detected failures.

How does the system work?

In implementing their approach, the authors set out with several main design goals:

  • Non-intrusive: Alibaba runs cloud workloads, and doesn’t necessarily need (or want) to rely on deeper integration with customer applications.
  • Fine-grained: failures identified by the system should be specific about what hardware is failing and why.
  • Accurate: the system should correctly identify failures, limiting wasted time by oncall engineers.
  • General: the solution should be able to identify problems across different types of hardware within the same category (for example both SSDs and HDDs).

The team built up a dataset of drive performance using daemons deployed in the Alibaba cloud, recording time series of average latency and average throughput keyed by machine and drive (as a machine can have many drives).
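One way to picture such a dataset is a mapping from (machine, drive) to a list of samples. The sketch below is only an illustration: the `Sample` class, its field names, the units, and the `record` helper are all hypothetical and do not come from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One aggregated measurement for a drive (field names are hypothetical)."""
    timestamp: int              # seconds since epoch
    avg_latency_ms: float       # average latency over the interval
    avg_throughput_mbps: float  # average throughput over the interval


# Time series keyed by (machine, drive), since one machine hosts many drives.
series = defaultdict(list)


def record(machine, drive, sample):
    """Append a sample to the time series for one drive on one machine."""
    series[(machine, drive)].append(sample)


# Made-up measurements for two drives on the same machine.
record("machine-1", "drive-0", Sample(0, 1.2, 80.0))
record("machine-1", "drive-1", Sample(0, 1.1, 78.5))
record("machine-1", "drive-0", Sample(15, 1.3, 82.0))
```

Keying by the (machine, drive) pair keeps drives on the same node easy to group, which matters for the node-level comparisons described later.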

Initial Attempts

With these goals in mind, the authors evaluated three different techniques for detecting fail-slow instances: threshold filtering, peer evaluation, and an approach based on previous research named IASO.

Threshold filtering relied on identifying problematic drives by recording whether write latency rose above a fixed threshold. This approach didn’t work because disk latency would spike periodically under heavy load, with little correlation to drive failure.
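A minimal sketch of threshold filtering shows why it over-flags. The threshold value and the latency samples are made up for illustration, not taken from the paper: a healthy drive under a short load spike trips the same check a failing drive would.

```python
def threshold_filter(latencies_ms, threshold_ms):
    """Return indices of samples whose write latency exceeds a fixed threshold."""
    return [i for i, lat in enumerate(latencies_ms) if lat > threshold_ms]


# Hypothetical write latencies (ms) for a healthy drive during a load spike.
healthy_under_load = [1.0, 1.2, 9.5, 10.1, 1.1]

# The spike is flagged even though nothing is wrong with the drive.
flagged = threshold_filter(healthy_under_load, threshold_ms=5.0)
```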

Peer evaluation compared the performance of drives on the same machine against one another. In theory, drives attached to the same machine should receive fairly similar workloads if used by the same customer, so repeated deviations of a drive’s performance from its neighbors would flag the drive for further inspection. The main downside to this approach was a reliance on fine-tuning for accurate detection: the duration and frequency of deviations differed by cluster and workload, requiring significant engineering work for accurate detection of fail-slow events.
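The idea can be sketched as a comparison against the node median. This is an assumption-laden illustration rather than the authors' implementation: the `factor` and `min_deviations` parameters are hypothetical, and they are exactly the kind of knobs that would need per-cluster tuning.

```python
from statistics import median


def peer_evaluation(latency_by_drive, factor=2.0, min_deviations=3):
    """Flag drives whose latency repeatedly exceeds the node median.

    latency_by_drive maps a drive name to a list of latency samples; all
    lists are assumed to be time-aligned and of equal length.
    """
    drives = list(latency_by_drive)
    n_samples = len(latency_by_drive[drives[0]])
    deviations = {d: 0 for d in drives}
    for t in range(n_samples):
        node_median = median(latency_by_drive[d][t] for d in drives)
        for d in drives:
            # Count a deviation when a drive is much slower than its peers.
            if latency_by_drive[d][t] > factor * node_median:
                deviations[d] += 1
    return sorted(d for d, count in deviations.items() if count >= min_deviations)


# Hypothetical latencies (ms) for three drives on one machine.
node = {
    "sda": [1.0, 1.1, 1.0, 1.2],
    "sdb": [1.1, 1.0, 1.1, 1.0],
    "sdc": [1.0, 3.0, 3.5, 4.0],
}
suspects = peer_evaluation(node)
```

Changing `min_deviations` from 3 to 4 makes the same drive pass unnoticed, which hints at how sensitive the result is to tuning.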

The last attempted approach described by the authors was one based on previous research from IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services. This IASO-based model relies on timeouts: for example, counting the number of timeouts Cassandra has to a particular node, then using this as a proxy for a problematic set of devices. The IASO-based approach was not suitable for a number of reasons, including that it targets nodes (rather than specific devices) and relies on knowledge of the workload (which Alibaba’s cloud does not have). The authors still tried to adapt it to their needs by reusing the output of the peer evaluation approach described above.


The final approach that the authors implemented was given the code name Perseus. It relies on analysis of the distribution of latency vs throughput.

Using the data on latency vs throughput for a node, Perseus follows a four-step process: performing outlier detection on the raw data, building a regression model, identifying fail-slow events, and evaluating the risk of any detected events.

In the first step, Perseus uses two algorithms (DBSCAN and Principal Component Analysis) to detect outliers in the raw data.
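As a rough illustration of the density-based part of this step, the sketch below marks the points that DBSCAN would label as noise. It is a small pure-Python version written for this post, not the paper's code; it leaves out the PCA stage entirely, and the `eps` and `min_pts` values and the sample points are hypothetical.

```python
from math import dist


def dbscan_outliers(points, eps, min_pts):
    """Return indices of points that DBSCAN labels as noise.

    A point is a core point if at least min_pts points (itself included)
    lie within eps of it. Noise points are neither core points nor within
    eps of any core point.
    """
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nbrs in enumerate(neighbors) if len(nbrs) >= min_pts}
    reachable = set(core)
    for i in core:
        reachable.update(neighbors[i])
    return [i for i in range(len(points)) if i not in reachable]


# Hypothetical (throughput, latency) samples; the last one is far too slow.
samples = [(10, 1.0), (11, 1.1), (12, 1.2), (13, 1.3), (12, 9.0)]
outliers = dbscan_outliers(samples, eps=1.5, min_pts=3)
```

In practice the two axes have different units, so the data would likely need scaling before a single `eps` distance is meaningful.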

Next, the system excludes outliers and builds a regression model, producing a curve that fits the remaining data points.
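To make the regression step concrete, here is a deliberately simplified sketch that fits a straight line by least squares and derives an upper bound from the spread of the residuals. Both choices are assumptions for illustration: the paper's model and its way of computing an upper bound may differ, and the `k` multiplier and data points are hypothetical.

```python
from statistics import pstdev


def fit_line(points):
    """Least-squares fit of latency = slope * throughput + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def make_upper_bound(points, k=3.0):
    """Return a function mapping throughput to an upper bound on expected latency."""
    slope, intercept = fit_line(points)
    residuals = [y - (slope * x + intercept) for x, y in points]
    margin = k * pstdev(residuals)  # hypothetical choice of margin
    return lambda throughput: slope * throughput + intercept + margin


# Hypothetical (throughput, latency) points left after outlier removal.
inliers = [(10, 1.0), (20, 2.1), (30, 2.9), (40, 4.0)]
slope, intercept = fit_line(inliers)
upper = make_upper_bound(inliers)
```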

Perseus then runs this regression model over the time series for every drive in the node: for every drive, a given throughput should produce a given latency. Then the system measures the difference between the drive’s actual and expected latency for a given throughput using the slowdown ratio (actual drive latency divided by the upper bound of expected latency).
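A small sketch of the slowdown ratio follows. It assumes the ratio is taken as observed latency over the model's upper bound, so that values above 1 mean the drive is slower than the model expects; the numbers are made up.

```python
def slowdown_ratios(latencies, upper_bounds):
    """Divide each observed latency by the model's upper bound at that point."""
    return [actual / bound for actual, bound in zip(latencies, upper_bounds)]


# Hypothetical observed latencies (ms) and matching upper bounds (ms).
observed = [1.0, 2.0, 6.0]
bounds = [2.0, 2.0, 2.0]
ratios = slowdown_ratios(observed, bounds)
```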

Finally, the system scans the slowdown ratio time series for every drive, finding and categorizing slowdown events based on their duration and severity (represented by a drive’s difference from expected performance). Drives with repeated, severe slowdowns are flagged for further investigation by engineers onsite.
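The scan can be sketched as a search for runs of consecutive samples above a threshold, recording how long each run lasted and how bad it got. The `threshold` and `min_duration` values are hypothetical, and the paper's actual categorization rules are not reproduced here.

```python
def find_slowdown_events(ratios, threshold=1.0, min_duration=2):
    """Return (start, end, peak) for each run of samples above threshold.

    end is exclusive; runs shorter than min_duration samples are ignored.
    """
    events = []
    start = None
    # A trailing sentinel closes any run that reaches the end of the series.
    for i, ratio in enumerate(list(ratios) + [float("-inf")]):
        if ratio > threshold and start is None:
            start = i
        elif ratio <= threshold and start is not None:
            if i - start >= min_duration:
                events.append((start, i, max(ratios[start:i])))
            start = None
    return events


# Hypothetical slowdown ratios for one drive.
drive_ratios = [0.5, 1.5, 2.0, 0.8, 1.2, 0.9, 3.0, 3.5, 4.0]
events = find_slowdown_events(drive_ratios)
```

Duration is `end - start` and severity can be read from the peak, so a later step could rank drives by either.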


How is the research evaluated?

To evaluate the research, the authors compare the precision, recall, and Matthews Correlation Coefficient (MCC) of the different approaches.
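For reference, all three metrics can be computed from a confusion matrix. The counts in the example are made up and are not results from the paper.

```python
from math import sqrt


def precision_recall_mcc(tp, fp, fn, tn):
    """Compute precision, recall, and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc


# Hypothetical counts: 8 true positives, 2 false positives,
# 2 false negatives, 88 true negatives.
precision, recall, mcc = precision_recall_mcc(tp=8, fp=2, fn=2, tn=88)
```

MCC is useful here because fail-slow drives are rare: unlike accuracy, it stays low when a detector simply labels everything healthy.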

The paper also evaluates the different components of Perseus’ design. For example, it measures the impact of outlier removal, the combination of Principal Component Analysis with DBSCAN, and the thresholds used for what is considered the upper bound of expected latency for a given throughput (a factor in computing the slowdown ratio). The data from the paper supports their decision making.

Finally, the paper notes that removing fail-slow drives has a significant effect on tail latency:

The most direct benefit of deploying PERSEUS is reducing tail latency. By isolating the fail-slow, node-level 95th, 99th and 99.99th write latencies are reduced by 30.67% (±10.96%), 46.39% (±14.84%), and 48.05% (±15.53%), respectively.

Root Cause Analysis

The paper wraps up by diving into several root causes of fail-slow instances in Alibaba production clusters. Software problems caused a majority of the failures.

One example root cause occurred because an operating system bug introduced thread contention.


An interesting aspect of the paper was quantifying the impact on tail latency from a few fail-slow events in a single class of hardware (the paper uses a test dataset of 315 instances). I also appreciated the research identifying potential shortcomings of the approach. For example, Perseus assumes that only a single drive (or a few drives) on a node will fail-slow at the same time. If all drives attached to a machine fail-slow (which is possible), the system would likely not detect the problem.

For cloud providers with limited instrumentation of customer workloads, the approach seems quite promising, especially with a potential expansion to other fail-slow cases like memory and networking. At the same time, growing adoption of commoditized infrastructure
