Migrating to OpenTelemetry | Airplane

2023-11-16 11:29:07

At Airplane, we collect observability data from our own systems as well as from remote "agents" that run in our customers' infrastructure. The associated outputs, which include the standard "three pillars of observability" (logs, metrics, and traces), are essential for us to monitor our infrastructure and also help customers debug problems in theirs.

Over the past year, we've made a concerted effort to migrate most of our observability data generation and collection to the OpenTelemetry (OTel) standard. This has allowed us to collect data more reliably from more places, connect more easily with different vendors, and better monitor and control our costs.

In the remainder of this post, we explain what OpenTelemetry is, how we managed the migration, and the lessons we learned along the way.

Aside: What's OTel?

OpenTelemetry (or OTel for short) is a set of standards and tools for processing observability data. It starts by defining the basic concepts (e.g., what a metric is) and the protocols for transmitting observability data from one system to another. It then provides a set of SDK libraries that allow for the instrumentation of applications in various programming languages (Python, Java, Go, etc.) and for pushing data via the aforementioned protocols. Finally, it provides a collector, a centralized component that receives data from various sources, applies arbitrary transformations to it, and exports the data to one or more downstream destinations.

There are a few things about OTel that make it superior to legacy observability frameworks.

First, it's vendor agnostic. In theory, you can instrument a system and export observability data without caring about whether it's ultimately going to Datadog, Splunk, Dynatrace, or some other vendor. And, even better, you can switch between these vendors without rewriting your applications.

Second, it unifies the different types of observability data, so you don't need separate libraries and implementations for logs, metrics, and traces.

And third, it has fairly solid library support for the most common programming languages and application frameworks.

Original architecture

[Image: OTel-Infrastructure-Before.png — our observability architecture before the migration]

Before describing our migration to OTel, it's helpful to summarize what kinds of observability data Airplane has and how we were collecting it pre-migration.

As mentioned in the introduction, Airplane has its own systems, plus "agents" that run in the infrastructure of our customers. We also have Airplane frontend code running in our customers' browsers. Thus, there are effectively nine different streams of observability data: {logs, metrics, traces} x {Airplane-hosted infra, customer-hosted infra, customer browsers}. The data collected from these streams is sent to multiple vendors including Datadog (for application logs and metrics), Honeycomb (for traces), and Google Cloud Logging (for infrastructure logs).

Originally, collection from customer-hosted infrastructure and browsers was minimal. When needed for debugging purposes, the state of these systems was pieced together via their calls to Airplane's APIs (e.g., to poll for new task runs) plus information that customers supplied to us manually on request (e.g., screenshots of logs).

On the Airplane side, we relied heavily on the Datadog agent for both logs and metrics. The latter is a component that runs on each machine in an organization's infrastructure, scraping logs and metrics for whatever is running on that machine and sending the data directly to Datadog's collection APIs.

For metrics, the Datadog agent can either scrape metrics from each process using the Prometheus protocol or receive metrics that are pushed to it by each application using a protocol like StatsD. We initially decided to go with the former approach, and thus each of our applications exposed its metrics via an HTTP /metrics endpoint that was periodically read by the agent.

Trace data from Airplane-hosted infrastructure, on the other hand, was sent directly from our API to Honeycomb, our tracing vendor. We were already using OTel libraries for collecting and sending this data (Honeycomb's API is fully OTel-compliant), so no significant migration was required here. However, we weren't collecting traces from any systems other than our API, and we wanted to expand that.

Making the switch

[Image: OTel-Infrastructure-After.png — our observability architecture after the migration]

Setting up the collector

The first step was to set up an OTel collector in each of our environments. As described previously, this is a component that receives observability data, processes it, and then exports it to vendor-specific collection endpoints for services like Datadog or Honeycomb.

The OTel collector was fairly easy to set up within our existing infrastructure: it's a Go server that's configured via a single YAML file. This config file specifies how data is received (e.g., via which protocols and on which ports), how the data should be transformed, and the downstream destinations that the data should be sent to.

The collector config documentation is generally well-written, but spread across a few places: the official OTel docs site, the collector GitHub repo, and the "contrib" GitHub repo, which is where most of the vendor-specific integrations are hosted.

After some trial and error, we managed to create a config for collecting logs, metrics, and traces and sending them to Datadog and Honeycomb. The following shows a simplified version with some annotation comments:
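A representative sketch of such a config (the ports, pipelines, and environment variable names here are placeholders, not Airplane's actual values) might look like:

```yaml
# Receive OTLP data from our own services and from customer-hosted agents.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

# Batch data in memory before exporting, to reduce outbound request volume.
processors:
  batch:
    timeout: 5s

# Downstream destinations; the Datadog exporter lives in the contrib repo.
exporters:
  datadog:
    api:
      key: ${env:DD_API_KEY}
  otlp/honeycomb:
    endpoint: api.honeycomb.io:443
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}

# Wire receivers, processors, and exporters into per-signal pipelines.
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog, otlp/honeycomb]
```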

Note that the collector has many other features that we decided not to use, notably around data aggregation and filtering. We may enable these in the future, however, to reduce the data volumes that we send to our downstream vendors.

Once our collector service was up, we put internal and external load balancers in front of it so that both internal services and external agents could send it data.

Instrumentation

The next step was to instrument our applications to gather and send useful data to the collector. This was done as a series of small, independent projects, including:


  1. Exporting traces from our UI to the collector
  2. Exporting observability data from customer-hosted agents to the collector
  3. Switching metrics in Airplane-hosted systems from Prometheus+Datadog to OTel

Our frontend code is written in TypeScript and our backend systems use Go, so this involved integrating the OTel JavaScript and Go SDKs, respectively. The language-specific instrumentation guides in the OTel docs were very helpful here.

Switching our internal metrics was the most involved project because it required refactoring the pipelines that provide data for our (heavily used) internal dashboards and alerts. Moreover, we encountered some rough edges in the metrics-related functionality of the Go SDK referenced above. Ultimately, we had to write a conversion layer on top of the OTel metrics API that allowed for simple, Prometheus-like counters, gauges, and histograms.
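We can't reproduce that conversion layer here, but as a hypothetical illustration of the shape such a shim can take, here is a minimal Prometheus-style labeled counter; in a real version, the record callback would forward each increment to an OTel Int64Counter with the labels converted to attributes:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// Counter mimics a Prometheus client's counter-with-labels API. The record
// callback is where a real shim would call into the OTel metrics API.
type Counter struct {
	mu     sync.Mutex
	name   string
	values map[string]int64 // label signature -> running total
	record func(name string, labels map[string]string, delta int64)
}

func NewCounter(name string, record func(string, map[string]string, int64)) *Counter {
	return &Counter{name: name, values: map[string]int64{}, record: record}
}

// Inc increments the counter for the given label set by 1.
func (c *Counter) Inc(labels map[string]string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.values[signature(labels)]++
	if c.record != nil {
		c.record(c.name, labels, 1)
	}
}

// signature builds a deterministic key from a label map by sorting its keys.
func signature(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, len(keys))
	for i, k := range keys {
		parts[i] = k + "=" + labels[k]
	}
	return strings.Join(parts, ",")
}

func main() {
	c := NewCounter("api_requests_total", nil)
	c.Inc(map[string]string{"route": "/runs", "status": "200"})
	c.Inc(map[string]string{"route": "/runs", "status": "200"})
	fmt.Println(c.values["route=/runs,status=200"]) // 2
}
```

A wrapper like this lets application code keep the familiar counter/gauge/histogram vocabulary while the OTel-specific plumbing stays in one place.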

In the end, we decided not to send certain flows through the collector. In the case of logs from our internal systems, we found that it was easiest to use our cloud provider, Google Cloud Platform (GCP), for collecting and distributing them. GCP's log tooling is quite powerful and easy to configure for applications in Google Kubernetes Engine (GKE), where we host most of our services. GCP is also the source for many of our non-application-based logs (e.g., DB errors), so we'd be relying on this service whether or not we used it for application logs.

For internal traces, we decided to keep sending these directly to Honeycomb, since Honeycomb already uses the OTel protocol (OTLP) and going through a collector wouldn't add much value here.
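Because Honeycomb speaks OTLP natively, pointing an OTel SDK straight at it typically requires no custom exporter code; the standard OTLP environment variables from the OTel specification are enough (the API key and service name below are placeholders):

```shell
# Send OTLP data directly to Honeycomb instead of through a collector.
export OTEL_EXPORTER_OTLP_ENDPOINT="https://api.honeycomb.io"
# Honeycomb authenticates via this header; substitute a real ingest key.
export OTEL_EXPORTER_OTLP_HEADERS="x-honeycomb-team=YOUR_API_KEY"
# Name the service so traces are grouped sensibly in the Honeycomb UI.
export OTEL_SERVICE_NAME="api"
```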

Parting thoughts

Overall, we've been very happy with our new, OTel-oriented observability stack.

First, the SDKs and collector have been rock solid: we've had no significant issues running them at scale in production. As mentioned previously, the Go SDK does have some rough edges (and doesn't support logs at all yet), but we were able to work around these without too much trouble by writing some custom code on top of the OTel-provided interfaces.

Second, having an OTel collector has made it much easier to collect and process observability data from systems outside of our infrastructure. These systems can push their data to our collector, so we don't need to worry about writing our own clients and servers for these flows or deal with scraping third-party destinations.

Third, OTel has allowed us to save a significant amount of money on our Datadog bill! Because we now use a centralized collector instead of per-node Datadog agents, the number of hosts that Datadog sees is significantly reduced. At $15 per host per month, we've been able to shave thousands off of our monthly Datadog charges.

Fourth, OTel allows us to more easily switch vendors in the future. We stuck with Datadog and Honeycomb through the migration, but it's nice to know that if we wanted to try out others down the road, we could make that switch fairly easily.

If your organization is processing a lot of observability data, it may be worth investigating OTel as a solution for increased control and flexibility. And if you're an Airplane user, you can also natively stream audit logs from your Airplane tasks to your OTel collector using our log drains feature.
