All you need is Wide Events, not “Metrics, Logs and Traces”

This quote from Charity Majors might be the perfect summary of the present state of observability in the tech industry: total, mass confusion. Everyone is confused. What’s a trace? What’s a span? Is a log line a span? Do I need traces if I have logs? Why do I need traces if I have great metrics? The list of questions like these goes on. Charity, along with other great folks from the observability company Honeycomb, has been doing a wonderful job shedding light on these questions. Yet, in my own experience it’s still extremely hard to explain what Charity meant by “logs are trash”, let alone the fact that logs and traces are essentially the same thing. Why is everybody so confused?
At the risk of being a little spicy, I’m going to blame OpenTelemetry. Yes, it’s powering the modern observability stack, and yet I blame it for the mass confusion. Not because it’s a bad solution – it’s great! But the presentation and the whole approach to explaining what OpenTelemetry is and what it does make observability look difficult and confusing.
First, from the very beginning OpenTelemetry makes a clear distinction between traces, metrics and logs:
OpenTelemetry is a collection of APIs, SDKs, and tools. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior.
Then it goes deeper, explaining each of these three.
This is a screenshot from the part of the OpenTelemetry website introducing traces. Based on my experience talking to people who work with OpenTelemetry, this presentation has indeed become one of the main pictures associated with observability. For some, this IS observability. And it also sets traces apart from everything else. This is clearly not a log, is it? It also doesn’t look like a metric, right? It’s something special, probably a bit sacred, and it requires a learning investment. In my experience, once people learn about traces, they only think about them in the context of this picture and the whole set of terms like spans, root spans, nested spans and the rest. The OpenTelemetry website has a glossary page with more than 60 terms! This is all insanely complex!
But what’s more important – does this focus on “logs, metrics and traces” represent the true power of observability? It does cover some scenarios, that’s true, but when it comes to distributed systems at scale, what matters more is the ability to “dig” into the data: “slice and dice” it, build and analyse various views, correlate, search for anomalies… And systems that offer all of this do exist.
When I was working at Meta, I wasn’t aware that I was privileged to be using the best observability system ever created. This system is called Scuba, and it’s the number one thing, by a large margin, that people miss when they leave Meta.
The basic idea of Scuba is very simple and doesn’t require a glossary page for people to grasp. It operates on Wide Events. A Wide Event is just a collection of fields with names and values, pretty much like a JSON document. If you need to record some information – whether it’s the current state of the system, or an event caused by an API call, a background job or whatever – you can just write a Wide Event to Scuba. For instance, if a system serves ads, the natural thing to record would be Ad Impressions – the fact that a certain ad has been seen by a user. The corresponding Wide Event might look like this:
{
"Timestamp": "1707951423",
"AdId": "542508c92f6f47c2916691d6e8551279”,
"UserCountry": "US",
"Placement": "mobile_feed",
"CampaingType": "direct_ads",
"UserOS": "Android",
"OSVersion": "14",
"AppVersion": "798de3c28b074df9a24a479ce98302b6",
...
}
Such events are called wide because we’re encouraged to dump into them all the information we can think of. Anything that might be relevant in the context of a certain piece of data – just put it there, it might become useful later. This approach lays the groundwork for dealing with unknown unknowns – something you can’t think of now that may be revealed later during an incident investigation.
Dealing with unknown unknowns is best demonstrated with an example. Scuba has a nice intuitive interface which is easy to explore and play with. It has a section where you can pick a metric to look at, as well as sections for filters and groupings – and Scuba will draw a nice time series chart. A first look at the Ad Impressions dataset would simply draw a chart with the impressions count:
If we express what exactly is selected here in terms of SQL, it’s something like
SELECT COUNT(*) FROM AdImpressions
WHERE IsTest = False
Well, it’s actually not exactly like that. Scuba also has a concept of native sampling. When a certain event is written to Scuba, it must also carry a field called samplingRate – the rate this particular event is being sampled with. Scuba uses this information to properly “upscale” the results shown on the charts, so there is no need to do the upscaling in your head. This is a really great concept, because it allows dynamic sampling – e.g. some kinds of impressions may be sampled more aggressively than others – while preserving the “real” values in the UI. So the actual query under the hood is
SELECT SUM(samplingRate) FROM AdImpressions
WHERE IsTest = False
Note that this whole “upscaling” is done transparently by the UI, and users don’t think about it while querying.
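The same trick extends to other aggregations. As a rough sketch (the SQL here is my approximation, not Scuba’s actual query language, and AdLoadTimeMs is a hypothetical field I made up), an average has to be weighted by the sampling rate:
-- upscaled average: weight each sampled event by its sampling rate
SELECT SUM(AdLoadTimeMs * samplingRate) / SUM(samplingRate)
FROM AdImpressions
WHERE IsTest = False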
So assume some alert fired and indicated that our precious Ad Impressions chart looks weird:
The first instinct of everyone who uses Scuba for investigations is to “slice and dice”, i.e. filter or group by, to see if we can learn something. We don’t know what we’re looking for, but we believe that we’ll find it. So we might group by, say, impression type, or user country, or the placement, until we find something suspicious. Let’s assume it’s the CampaignType grouping (roughly the query sketched below):
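In approximate SQL terms (again, this is my sketch of what the UI does under the hood, using only fields from the example event above):
SELECT CampaignType, SUM(samplingRate) AS Impressions
FROM AdImpressions
WHERE IsTest = False
GROUP BY CampaignType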
We see that some campaign type called in_app_purchases (just in case: this type name is completely made up by me) looks different from the others. We don’t really know what that means – and we don’t need to! – to continue our digging. OK, now we can filter down to only these campaigns, and continue grouping by something else we can think of. For instance, User OS makes sense.
Hmmm, Android seems to be in trouble. iOS is fine, meaning the problem is on the client side – a broken app version maybe? Let’s group by App Version.
Weird. Some app versions are suffering, others are not. Check OS Version maybe?
Ha! It’s the latest version of the OS, and it looks like some of the app versions are not doing well on this OS version for this type of campaign. The dedicated teams can look deeper now, given this information. (The whole drill-down boils down to the query sketched below.)
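Expressed as approximate SQL, the narrowed-down view is just a few extra filters and groupings on top of the original query; nothing had to be declared in advance (the value 'in_app_purchases' is the made-up campaign type from above, everything else comes from the example event):
SELECT OSVersion, AppVersion, SUM(samplingRate) AS Impressions
FROM AdImpressions
WHERE IsTest = False
  AND CampaignType = 'in_app_purchases'
  AND UserOS = 'Android'
GROUP BY OSVersion, AppVersion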
What happened here? Without any knowledge of the system we’ve narrowed down the scope of the issue and identified the teams that should take the lead on further investigation. Could we have known in advance that this weird combination of OS, OS Version, Campaign Type and App Version might lead to an issue, so that we’d have a dedicated metric prepared? Of course not. This is an example of dealing with unknown unknowns. We just dropped all the relevant context into Wide Events and used it later when needed. Scuba made the exploration easy because it’s fast and has a really nice, easy-to-use UI. Also note that we never talked about cardinality. Because it doesn’t matter – any field can be of any cardinality. Scuba works with raw events and doesn’t pre-aggregate anything, so cardinality is not an issue.
Sometimes the UI / visualisation side doesn’t get enough attention, and observability systems offer only a query language – either a proprietary one (bad-bad-bad), or SQL (slightly better, but still bad). Such an interface makes it close to impossible to conduct investigations like this one. One important aspect of Scuba is that all the fields – function, filter, grouping, etc. – are explorable. That means there is an easy way to see what kind of values we can pick. When the owners of a certain field are not lazy, they even include a detailed description of the field with relevant links and so on. This is huge. I’ve successfully investigated several incidents myself without fully understanding either the whole system or the data available in the dataset. And boy did I learn a lot about the system during these investigations, simply by playing around with Scuba! This was amazing. This was observability paradise.
Now imagine my level of confusion and disbelief when I left Meta and got to know the state of observability systems outside.
Logs? Traces? Metrics? What the hell? Wide Events, anybody? Can I please not learn those 60 terms from the glossary and just… explore stuff?
It took me quite some time to map my Scuba-based mental model onto the OpenTelemetry mental model. I realised that OpenTelemetry’s Span is, in fact, the Wide Event. Actually, I’m still not quite sure I got it right:
If we take the Ad Impression example, the impression is not really an operation, it’s just some fact we want to record… To be fair, there is some notion of an Event in OpenTelemetry:
But if we follow the links to dig deeper, we again find out that an Event is in fact one of Traces, Metrics or Logs 🤷
But anyway, a Span is the closest concept to a Wide Event. The thing is – it’s extremely hard to advocate for this mental model when the one suggested by OpenTelemetry has already been learned. Which is really upsetting, because Traces, Metrics and Logs are all just special cases of Wide Events, really:
- Traces and Spans: these are just wide events that have SpanId, TraceId and ParentSpanId fields. So we can filter all spans with a given TraceId, topologically sort them using the SpanId → ParentSpanId relation, and draw that distributed tracing view everyone loves (see the SQL sketch after this list).
- Logs: honestly, I’m really confused about what OpenTelemetry means by Logs. It looks like a lot of things, and one of them is the Structured Log, which is pretty much the Wide Event. Great! The problem, however, is that “a log” is a fairly well-defined concept, and people usually mean by it whatever is produced by those logger.info(…) calls. Anyway, whatever is meant, logs can easily be mapped to wide events, of course. In the simplest case we can just take the log message, put it into a “log_message” field, add a bunch of metadata, and be happy. In a more sophisticated case, we can try to automatically extract a template from the log message by removing tokens that look like IDs, and compute a hash of this template. This makes it possible to quickly find the most frequent error, for instance, by grouping by this hash. Meta has such a system, and it’s pretty cool.
- Metrics: metrics can easily be mapped too. We just need to emit a Wide Event once per some interval containing the state of the system (system metrics like CPU, various counters, …). Prometheus, by the way, does exactly that with its scraping approach – it takes a snapshot of the system every now and then. Unlike with Prometheus, however, with the Wide Events approach we don’t need to worry about cardinality.
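To make the mapping a bit more concrete, here is a rough sketch of how all three could be queried from a single store of wide events. The table and field names (WideEvents, TemplateHash, CpuUtilization, the trace id, …) are made up for illustration; this is neither OpenTelemetry’s nor Scuba’s actual schema:
-- Traces: fetch all spans of one trace, then rebuild the tree from ParentSpanId
SELECT SpanId, ParentSpanId, Name, DurationMs
FROM WideEvents
WHERE TraceId = 'abc123'

-- Logs: most frequent error templates, grouped by the extracted template hash
SELECT TemplateHash, SUM(samplingRate) AS Occurrences
FROM WideEvents
WHERE Level = 'ERROR'
GROUP BY TemplateHash
ORDER BY Occurrences DESC

-- Metrics: a CPU chart built from periodic state snapshots
-- (in practice Timestamp would be bucketed, e.g. per minute)
SELECT Timestamp, AVG(CpuUtilization)
FROM WideEvents
WHERE Host = 'web-1234'
GROUP BY Timestamp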
But Wide Events can offer so much more than these “3 pillars”. The debugging session above is already a case that isn’t really covered by Traces, Logs and Metrics – not naturally, at least. There can be other use cases too – for instance, continuous profiling data can easily be represented as Wide Events and queried to build a flame graph. There is no need to have a separate system for this – a single system working with Wide Events can do it all. Imagine the possibilities for cross-correlation and root cause analysis when everything is in one place, stored together. Especially now, in the era of the rise of AI-based tools that are quite good at finding correlations in data.
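As a sketch of the profiling case (assuming each profiling sample is stored as a wide event with a Stack field and a sample Weight – again, made-up names), the data behind a flame graph is just an aggregation over stacks:
SELECT Stack, SUM(Weight * samplingRate) AS TotalWeight
FROM ProfilingSamples
GROUP BY Stack
ORDER BY TotalWeight DESC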
I don’t know… I just wanted to express my frustration with the level of confusion and this focus on the “3 pillars”.
I just wish observability vendors took a stand against the confusion and offered a simple and natural way to interact with their systems. Honeycomb seems to be doing that, as do some other systems like Axiom. This is great! And I hope the others will follow.