Tracing: structured logging, but better in every way

2023-09-18 16:52:28

It’s no secret that I’m not a fan of logs; I’ve baited (rapala in Finnish) discussion in our work chat with things like:

If you’re writing log statements, you’re doing it wrong.

This is a fairly incendiary statement, and while there was some good discussion after, I figured it was time to write down why I think logs are bad, why tracing should be used instead, and how we get from one to the other.

Hopefully, with less clickbait. Step 3 will shock you, though.

Logs vs Traces

First, let’s break down what I see as the key differences between logging and tracing code. If you want the practical example and want to skip this wall of text, skip ahead to Evolving logs. There’s also a short Questions & Answers section at the end.

Log Levels

Log levels are meaningless. Is a log line debug, info, warning, error, fatal, or some other shade in between?

The only time I’ve seen this well managed was when we had 3 descriptions:

Info (everything)
Create A Task for Later (e.g. this can wait for working hours)
Wake Someone Up (this is on fire).

However, this has issues; a timeout once is “for later” or just info, but many timeouts might be “wake up”. How do you encode that logic somewhere?

There is no equivalent of this in a trace, so there’s one less thing to worry about.

Messages

Something like found ${count} users in the cache looks like a good log message. It’s clear and has the information you care about. However, when you come to querying your log aggregation system, you need to do a free-text search on a substring of the message or use a wildcard search. Both of these operations are slow, and worse still, neither can help you with the negative version of the question: “find me operations where the cache wasn’t checked”.

Assuming structured logs are being used, we can at least add the count as a property to the log message so it can now be filtered on by the log aggregator. At which point, why do you need the log message at all? Why not just have log({ users_in_cache: count })?

The logging libraries also put more importance on the message than the properties by having it be the first (and only required) argument to the logging function. From a querying and analysis perspective, this is wrong: messages force you into free-text searching and reduce the likelihood of fast queries on attributes.
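As a rough illustration of that argument order, here is a minimal sketch using Go's standard log/slog package (my choice for the example, not something the rest of this post depends on). The message is the mandatory first argument, while the count that you would actually filter on is an optional trailing pair. The function name logUsersInCache is made up for the example.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
)

// logUsersInCache writes one structured log line into a buffer and returns
// it decoded, so the message and the property can be inspected side by side.
func logUsersInCache(count int) (map[string]any, error) {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))

	// The message comes first and is required; the filterable property is an
	// optional key/value pair tacked on afterwards.
	logger.Info("found users in the cache", "users_in_cache", count)

	var line map[string]any
	err := json.Unmarshal(buf.Bytes(), &line)
	return line, err
}

func main() {
	line, err := logUsersInCache(3)
	if err != nil {
		panic(err)
	}
	fmt.Println("msg:", line["msg"])
	fmt.Println("users_in_cache:", line["users_in_cache"])
}
```

Only the users_in_cache field can be filtered on cheaply; the msg field still needs a text search.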

Given this logline:

logger.info("found the user in the cache", userId);

You can reconstruct this as a statement that can be filtered on:

span.addAttributes({
  user_in_cache: true,
  user_id: userId,
})

Mixed Outputs / Semantics

Nowadays, people configure their HTTP server to write logs to stdout rather than to a log file. This makes a lot of sense, as now a separate piece of software can tail the logs and send them off to the aggregator. Your application’s output, however, will likely end up with mixed plaintext and structured logs, mostly down to libraries and frameworks doing their own thing with logging, which is probably not the same as what you are doing (or what other libraries are doing).

The second problem with writing logs to stdout is that it mixes the semantics of log lines with console output. While both log messages and console output are (probably) just text, they do have different purposes. A piece of feedback might be “Server listening on port 3000. Click here to open the browser”. This is useful to a local user running the app but isn’t valuable in your logs. Not to mention, it’s plaintext output on the console, rather than structured, so now your log tailer needs to figure out what to do with it.

With OpenTelemetry, you instead configure an exporter to handle all your traces (and logs and metrics, if you wish). The data is typically sent in OTLP format either directly to a vendor or to an OTEL Collector instance, which can add/remove/modify data and send it to one or multiple places.

Now you are free to write whatever feedback you want to stdout.

Relationships

Loglines don’t have causal relationships. At best, structured logs might have some kind of request identifier (such as requestID or correlationID) for all lines written during a request. This allows you to find all the log lines for a given ID, but the log lines themselves don’t have a fixed order. Ordering in structured logging relies on the timestamp field, but this is limited to the accuracy and resolution of the time source. It means that it’s possible to get lines appearing at the same time when they happened at different times.

Traces come with automatic parent-child relationships, allowing us to see not only all spans in a single request, but what caused each span, and (as we’ll get to in the next point) when they happened.

Tracing also takes these relationships a step further by having a standard set of formats to pass trace and span IDs along to other services, embedded in HTTP headers, message queue metadata, or other locations. Assuming all your other services send their traces to the same system, you can start to visualise who calls your service and what the effects are of you calling other services.
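One such format is the W3C Trace Context traceparent HTTP header, which carries a version, the trace ID, the calling span's ID, and a flags byte. The sketch below builds and parses that header by hand purely to show its shape. In practice your tracing library's propagator does this for you, and the type and function names here are made up. It only checks lengths and hex encoding, so it is looser than the full specification (for example, it does not reject all-zero IDs).

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// traceParent holds the fields of a version "00" traceparent header.
type traceParent struct {
	TraceID  string // 32 lowercase hex characters
	ParentID string // 16 lowercase hex characters: the calling span's ID
	Flags    string // 2 hex characters; "01" means the trace is sampled
}

// header renders the value to send to the downstream service.
func (tp traceParent) header() string {
	return "00-" + tp.TraceID + "-" + tp.ParentID + "-" + tp.Flags
}

// isHex reports whether s is exactly n lowercase hex characters.
func isHex(s string, n int) bool {
	if len(s) != n || strings.ToLower(s) != s {
		return false
	}
	_, err := hex.DecodeString(s)
	return err == nil
}

// parseTraceParent reads the header on the receiving side.
func parseTraceParent(h string) (traceParent, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return traceParent{}, errors.New("unsupported traceparent")
	}
	if !isHex(parts[1], 32) || !isHex(parts[2], 16) || !isHex(parts[3], 2) {
		return traceParent{}, errors.New("malformed traceparent")
	}
	return traceParent{TraceID: parts[1], ParentID: parts[2], Flags: parts[3]}, nil
}

func main() {
	tp, err := parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		panic(err)
	}
	fmt.Println("trace:", tp.TraceID, "parent span:", tp.ParentID)
}
```

The downstream service starts its own span with the same trace ID and uses the parent ID as that span's parent, which is how the two services end up in one trace.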

Imagine opening a trace from your service and discovering another team in a downstream service has started tracing their stuff, and you suddenly can see even more information about the request. You didn’t even do anything! You find that the way your service is structured causes a lot of load on the downstream, and start a conversation to see if there is a way to make it faster/better.

Timings

The only timing data you get on log lines automatically is the timestamp of when the log line was written. When you want to see how long an operation took, you must start a timer, then stop the timer, and log the elapsed duration yourself. As this is a manual process, it is often not done, and when it is done, it tends to have inconsistencies across the application, namely timing source (and thus resolution), property name, and format. Is it elapsed, elapsed_ms, duration, or length? Does it contain a number of seconds, milliseconds, nanoseconds, or a timestamp format?

By comparison, traces come with startTime, finishTime, and duration attributes, which are not only guaranteed to be there but are set from the same timing source and are always written in the same format.

Combine this with the relationship attributes, and you can now render timing graphs, allowing for easy visualisation of how long each part of a process takes, what can be parallelised, and what parts depend on other parts.

For example, this is a graph showing how long a CI job took to run, showing all the different steps, their durations, and child steps:

[Figure: trace timing graph of a CI job run]
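A text-only sketch of how such a graph falls out of the data: given each span's parent, start offset, and duration, indentation shows the tree and the numbers show the timings. The span struct and the CI job data are made up for the example; a real tracing backend stores equivalent fields and draws the bars for you.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// span is a stand-in for what a tracing backend stores per span.
type span struct {
	name     string
	parent   string        // empty for the root span
	start    time.Duration // offset from the start of the trace
	duration time.Duration
}

// ciJob is invented data for a small CI run.
var ciJob = []span{
	{name: "build", start: 0, duration: 90 * time.Second},
	{name: "compile", parent: "build", start: 0, duration: 60 * time.Second},
	{name: "test", parent: "build", start: 60 * time.Second, duration: 30 * time.Second},
}

// render returns one line per span, indented by its depth in the tree.
func render(spans []span) []string {
	byName := make(map[string]span, len(spans))
	for _, s := range spans {
		byName[s.name] = s
	}

	lines := make([]string, 0, len(spans))
	for _, s := range spans {
		// Walk up the parent chain to find how deep this span sits.
		depth := 0
		for p := s.parent; p != ""; p = byName[p].parent {
			depth++
		}
		lines = append(lines, fmt.Sprintf("%s%s start=%s took=%s",
			strings.Repeat("  ", depth), s.name, s.start, s.duration))
	}
	return lines
}

func main() {
	for _, line := range render(ciJob) {
		fmt.Println(line)
	}
}
```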

Querying

Querying can have wildly different performance characteristics depending on which log aggregation service you are using. What unifies all these systems is their slowness, which is mostly down to the vast quantities of data that need indexing so that they can be free-text searched. Filtering logs by their structured properties is, however, fairly quick (usually).

Where querying falls down is searching for trends in data, and searching for answers to negative queries. For example, how would you search for “all requests to x endpoint, where users were not found in the cache”? This requires you to group the logs by request ID, then find an entry with a specific URL path, then see if it is missing a log line. The same query in a tracing system would be where path = "/some/api" && !user_in_cache, as the tracing system is already aware of all the spans in a trace, and does the grouping automagically.

Finally, visualising missing data is hard. Take this small example; it’s 4 parallel requests to a system, and one of them is missing a log line. Which line is missing?

Timestamp	UserID	Message
12:51:27	3fcce385be9e	fetched 3rd party preferences
12:51:27	3fcce385be9e	found user in cache
12:51:27	915d273db25c	fetched 3rd party preferences
12:51:27	3fcce385be9e	saved successfully
12:51:27	8507d369d11c	fetched 3rd party preferences
12:51:27	c4e71b4a29f2	fetched 3rd party preferences
12:51:27	915d273db25c	saved successfully
12:51:27	c4e71b4a29f2	found user in cache
12:51:27	c4e71b4a29f2	saved successfully
12:51:27	8507d369d11c	found user in cache
12:51:27	8507d369d11c	saved successfully

Is it easier to see the one that is different now?

Timestamp	UserID	Fetched	In Cache	Saved
12:51:27	3fcce385be9e	true	true	true
12:51:27	915d273db25c	true	false	true
12:51:27	8507d369d11c	true	true	true
12:51:27	c4e71b4a29f2	true	true	true

Not only is it easy to see at a glance that 915d273db25c didn’t find the user in the cache, but also how much less space this takes up (both visually and in terms of storage).

We can also then use this to query further: show me all traces where in_cache != true, and see what’s different about them.
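For a sense of the work the tracing system saves you, here is a sketch of doing that pivot by hand. It takes the log lines from the first table, folds them into one row per user, and then runs the in_cache != true query. The type and function names are made up, and matching on the exact message text is precisely the brittle step that attributes avoid.

```go
package main

import "fmt"

type logLine struct{ userID, message string }

// request is one row of the second table: one record per request.
type request struct {
	userID                  string
	fetched, inCache, saved bool
}

// sampleLines are the eleven log lines from the first table.
var sampleLines = []logLine{
	{"3fcce385be9e", "fetched 3rd party preferences"},
	{"3fcce385be9e", "found user in cache"},
	{"915d273db25c", "fetched 3rd party preferences"},
	{"3fcce385be9e", "saved successfully"},
	{"8507d369d11c", "fetched 3rd party preferences"},
	{"c4e71b4a29f2", "fetched 3rd party preferences"},
	{"915d273db25c", "saved successfully"},
	{"c4e71b4a29f2", "found user in cache"},
	{"c4e71b4a29f2", "saved successfully"},
	{"8507d369d11c", "found user in cache"},
	{"8507d369d11c", "saved successfully"},
}

// pivot groups lines by user and turns each message into a boolean column.
func pivot(lines []logLine) []request {
	index := map[string]int{}
	var out []request
	for _, l := range lines {
		i, ok := index[l.userID]
		if !ok {
			i = len(out)
			index[l.userID] = i
			out = append(out, request{userID: l.userID})
		}
		switch l.message {
		case "fetched 3rd party preferences":
			out[i].fetched = true
		case "found user in cache":
			out[i].inCache = true
		case "saved successfully":
			out[i].saved = true
		}
	}
	return out
}

// notInCache is the negative query: requests where in_cache != true.
func notInCache(rows []request) []string {
	var ids []string
	for _, r := range rows {
		if !r.inCache {
			ids = append(ids, r.userID)
		}
	}
	return ids
}

func main() {
	fmt.Println(notInCache(pivot(sampleLines)))
}
```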

Evolving logs

So, with all that being said, let’s look at a practical example of how to go about tracing an existing system and what that looks like.

Ripping out all the log statements in an application in one go isn’t a feasible strategy, especially on a large codebase. However, we can use logs as a decent starting place and evolve our system to something better. Namely, to OpenTelemetry Tracing.

We’ll start off with a real pair of functions from one of the codebases I work on. Lots of information changed to protect the ~~guilty~~ innocent. This is run as part of an API call to publish a container, but this part has no web-specific code.

func PrepareContainer(ctx context.Context, container ContainerContext, locales []string, dryRun bool, allLocalesRequired bool) (*StatusResult, error) {

	logger.Info(`Filling home page template`)

	homePage, err := RenderPage(ctx, home, container, locales, allLocalesRequired)
	if err != nil {
		return nil, err
	}

	templateIds := []string{homePage.ID}

	if container.PageSlugs.FAQ != "" {
		faqPage, err := RenderPage(ctx, faq, container, locales, allLocalesRequired)
		if err != nil {
			return nil, err
		}

		templateIds = append(templateIds, faqPage.ID)
	}

	if dryRun {
		return &StatusResult{Status: StatusDryRun}, nil
	}

	logger.Info(`Marking page template(s) for usage`, "template_ids", templateIds)

	if err := MarkReadyForUsage(ctx, container, templateIds); err != nil {
		return nil, err
	}

	return &StatusResult{Status: StatusComplete}, nil
}
// Page is an assumed name for the rendered page type, which exposes ID and Locales.
func RenderPage(ctx context.Context, source Source, container ContainerContext, locales []string, allLocalesRequired bool) (*Page, error) {

	logger.Info(fmt.Sprintf(`Filling %s page template`, source.Name))

	template, err := FetchAndFillTemplate(ctx, source, container, locales)
	if err != nil {
		return nil, err
	}

	page, err := ConfigureFromTemplate(ctx, container, template, locales)
	if err != nil {
		return nil, err
	}

	if len(page.Locales) != len(locales) {
		message := fmt.Sprintf(`Failed to render %s page template for some locales`, source.Name)
		if allLocalesRequired {
			return nil, errors.New(message)
		} else {
			logger.Warn(message, "locales", locales, "pages", page.Locales)
		}
	}

	return page, nil
}

Step 1: Add a Tracer

The first step is to import a tracer and start a span for each method. As is somewhat common in Go, the methods already have a ctx parameter, so we just need to wrap it with the tr.Start call.

var tr = otel.Tracer("container_api")

func PrepareContainer(ctx context.Context, container ContainerContext, locales []string, dryRun bool, allLocalesRequired bool) (*StatusResult, error) {
	ctx, span := tr.Start(ctx, "prepare_container")
	defer span.End()

// Page is an assumed name for the rendered page type, which exposes ID and Locales.
func RenderPage(ctx context.Context, source Source, container ContainerContext, locales []string, allLocalesRequired bool) (*Page, error) {
	ctx, span := tr.Start(ctx, "render_page")
	defer span.End()

Just this step alone already gives us value over logging: as mentioned above, the spans automatically come with timing data and parent-child relationships.

Step 2: Wrap the Errors

OTEL Spans support a status attribute, along with a status message which is used when there is a non-success status. By creating a small wrapper function like this:

func Error(s trace.Span, err error) error {
  s.RecordError(err)
  s.SetStatus(codes.Error, err.Error())

  return err
}

We can wrap all error returns so that we capture the error itself (SetStatus) and there is an error event recorded on the trace too (RecordError):

if err := MarkReadyForUsage(ctx, container, templateIds); err != nil {
- return nil, err
+ return nil, tracing.Error(span, err)
}

Step 3: Add Attributes and Replace Messages

The next step is to replace any logger messages with attributes by turning them into statements that can be filtered on.

- logger.Info(fmt.Sprintf(`Filling %s page template`, source.Name))
+ tracing.String(span, "source_name", source.Name)

We also want to add attributes for any parameters we might want to filter on later.

	ctx, span := tr.Start(ctx, "prepare_container")
	defer span.End()

	tracing.StringSlice(span, "locales", locales)
	tracing.Bool(span, "dry_run", dryRun)
	tracing.Bool(span, "locales_mandatory", allLocalesRequired)

Finally, we can simplify a code block: there is no point in the logger.Warn call here, as we can have all the required information as filterable properties:

+ allLocalesRendered := len(page.Locales) == len(locales)

+ tracing.Bool(span, "all_locales_rendered", allLocalesRendered)
+ tracing.StringSlice(span, "locales_rendered", page.Locales)


- if len(page.Locales) != len(locales) {
+ if !allLocalesRendered && allLocalesRequired {
+   return nil, tracing.Errorf(`Failed to render %s page template for some locales`, source.Name)
- message := fmt.Sprintf(`Failed to render %s page template for some locales`, source.Name)
- if allLocalesRequired {
-   return nil, tracing.Errorf(message)
- } else {
-   logger.Warn(message, "locales", locales, "pages", page.Locales)
- }
}
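To show the effect of that change in isolation, here is a self-contained sketch. The recorder type is a stand-in for a span (it is not the OpenTelemetry API, and not the tracing helper package used in this post), and checkLocales is a made-up name for the extracted logic. The partial-render case no longer needs a warning, because the attributes carry the same facts whether or not an error is returned.

```go
package main

import "fmt"

// recorder is a stand-in for a span: it only remembers attributes and
// whether an error was recorded.
type recorder struct {
	attrs  map[string]any
	failed bool
}

func newRecorder() *recorder {
	return &recorder{attrs: map[string]any{}}
}

func (r *recorder) set(key string, value any) {
	r.attrs[key] = value
}

// errorf marks the recorder as failed and returns the formatted error.
func (r *recorder) errorf(format string, args ...any) error {
	r.failed = true
	return fmt.Errorf(format, args...)
}

// checkLocales always records what happened, and only fails when every
// locale was required.
func checkLocales(r *recorder, source string, requested, rendered []string, required bool) error {
	allRendered := len(rendered) == len(requested)

	r.set("all_locales_rendered", allRendered)
	r.set("locales_rendered", rendered)

	if !allRendered && required {
		return r.errorf("failed to render %s page template for some locales", source)
	}
	return nil
}

func main() {
	r := newRecorder()
	err := checkLocales(r, "home", []string{"en", "fi"}, []string{"en"}, false)
	fmt.Println("error:", err)
	fmt.Println("all_locales_rendered:", r.attrs["all_locales_rendered"])
	fmt.Println("locales_rendered:", r.attrs["locales_rendered"])
}
```

A query such as all_locales_rendered = false then finds every partial render, including the ones that used to be a warning nobody searched for.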

The Result

The diff between the two functions shows that the tracing version is longer, by 9 lines. However, the traced version contains so much more information than the logged version.

var tr = otel.Tracer("container_api")

func PrepareContainer(ctx context.Context, container ContainerContext, locales []string, dryRun bool, allLocalesRequired bool) (*StatusResult, error) {
	ctx, span := tr.Start(ctx, "prepare_container")
	defer span.End()

	tracing.StringSlice(span, "locales", locales)
	tracing.Bool(span, "dry_run", dryRun)
	tracing.Bool(span, "locales_mandatory", allLocalesRequired)

	homePage, err := RenderPage(ctx, home, container, locales, allLocalesRequired)
	if err != nil {
		return nil, tracing.Error(span, err)
	}

	templateIds := []string{homePage.ID}

	hasFaq := container.PageSlugs.FAQ != ""
	tracing.Bool(span, "has_faq", hasFaq)

	if hasFaq {
		faqPage, err := RenderPage(ctx, faq, container, locales, allLocalesRequired)
		if err != nil {
			return nil, tracing.Error(span, err)
		}

		templateIds = append(templateIds, faqPage.ID)
	}

	tracing.StringSlice(span, "template_ids", templateIds)

	if dryRun {
		return &StatusResult{Status: StatusDryRun}, nil
	}

	if err := MarkReadyForUsage(ctx, container, templateIds); err != nil {
		return nil, tracing.Error(span, err)
	}

	return &StatusResult{Status: StatusComplete}, nil
}

// Page is an assumed name for the rendered page type, which exposes ID and Locales.
func RenderPage(ctx context.Context, source Source, container ContainerContext, locales []string, allLocalesRequired bool) (*Page, error) {
	ctx, span := tr.Start(ctx, "render_page")
	defer span.End()

	tracing.String(span, "source_name", source.Name)

	template, err := FetchAndFillTemplate(ctx, source, container, locales)
	if err != nil {
		return nil, tracing.Error(span, err)
	}

	page, err := ConfigureFromTemplate(ctx, container, template, locales)
	if err != nil {
		return nil, tracing.Error(span, err)
	}

	allLocalesRendered := len(page.Locales) == len(locales)

	tracing.Bool(span, "all_locales_rendered", allLocalesRendered)
	tracing.StringSlice(span, "locales_rendered", page.Locales)

	if !allLocalesRendered && allLocalesRequired {
		return nil, tracing.Errorf(`Failed to render %s page template for some locales`, source.Name)
	}

	return page, nil
}

If you want to view the files as individual steps: initial, step 1, step 2, step 3, final

Questions & Answers

Some questions I’ve seen come up when talking about doing this.

How do I use this?

If you’re debugging some error, taking the span name from your UI and searching the codebase is a good start. Use the attributes named in code to search your UI for things that match/don’t match.

Also, try Observability Driven Development. It involves using your trace data before, during, and after a change so you know if it’s actually behaving as you expect.

What do you recommend to view traces?

Honeycomb is the best, no question. Lightstep is a close second.

I think this is ugly!

That isn’t a question. It is also a matter of opinion: I happen to think traced code looks better than logged code. It also could be that it takes some getting used to when you first start seeing it! Remember how unit testing felt weird to start with?

I need a log message…

I try to turn log messages into statements. Like in the example above:

found user in cache => user_in_cache: true
loading template ${name} => template_loading: true, template_name: name

When I need to mark that code execution has gotten to a certain point in a method:

add an attribute, like initial_processing_complete: true
consider whether the method should be split, and thus have its own span

I have a loop, I’m overwriting attributes on each iteration

Similar to above: either move the loop body into its own method, use a closure to create a new span in the loop, or don’t have a span per iteration.

You can always add summary information after a loop:

foos := 0
bars := 0
for _, thing := range things {
  // ...
  if thing.prop == "foo" {
    foos++
  } else {
    bars++
  }
}

tracing.SetAttribute(span, "foos", foos)
tracing.SetAttribute(span, "bars", bars)

Thanks

Thanks to Aki for some of the questions and some discussions which expanded a few of the sections here.

Thanks to Lari for prompting this blog post initially!

Thanks to you, for making it to the end.
