How we sped up time series by 20-30x


Written by Nikolaus West
This is a follow-up to the post Real-time kHz time series in a multimodal visualizer. It dives into how we managed to achieve such big performance gains for time series (among other things), and why it was so hard in the first place.
Comparison of the time taken to draw a time series plot with 1M points in Rerun, between versions 0.12 and 0.13. The numbers come from profiling on a 2021 M1 MacBook Pro.
Flexible multimodal data is hard to combine with fast plots
Ever since the first release of Rerun, plotting larger time series has been painfully slow.
0.13 is the first release where we think they’re actually usable.
It’s worth taking a step back to explain why making plots fast in Rerun was hard, before getting into what we did about it.
To do that, we need a bit more background on how Rerun, the in-memory database, works.
Rerun as a multimodal time series database
One way of looking at Rerun is as an in-memory multimodal time series database with visualization on top.
- You can throw many kinds of data at it, from simple metrics to big multi-dimensional tensors, point clouds, and text documents.
- Data is indexed along multiple user-defined timelines, and can come in out of order.
- The Rerun data model is a temporal Entity Component System (ECS) that allows updating single components at a time.
Let’s look at a small example of updating a colored point cloud in parts and out of order:
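The original post illustrates this with a figure. As a rough stand-in, here’s a minimal sketch of what that kind of logging can look like with the Python SDK; the entity path is made up, and the component-batch form of the color-only update is an assumption about the SDK rather than the exact example from the post:

```python
import numpy as np
import rerun as rr

rr.init("partial_point_cloud", spawn=True)

positions = np.random.rand(100, 3)

# Frame 0: log positions only, no colors yet.
rr.set_time_sequence("frame", 0)
rr.log("world/points", rr.Points3D(positions))

# Frame 2: update only the colors. Positions are picked up from frame 0 via
# the store's latest-at semantics. (Logging a bare component batch like this
# is an assumption; the exact form may differ between SDK versions.)
rr.set_time_sequence("frame", 2)
rr.log("world/points", [rr.components.ColorBatch([(255, 0, 0)] * 100)])

# Frame 1: more positions arrive out of order and are inserted in between.
rr.set_time_sequence("frame", 1)
rr.log("world/points", rr.Points3D(positions + 0.1))
```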
For any given time point, the datastore needs to be able to collect the latest components of any entity, and join them against the primary component (3D positions in this case). Queries run every time Rerun renders a frame, which should be 60 times per second.
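As a toy mental model of that latest-at join, here’s an illustrative Python sketch (not the actual datastore code): each component has its own time-indexed column, and the query picks the last value at or before the query time for each of them.

```python
from bisect import bisect_right

# One time-indexed "column" per component of the entity (toy data).
positions = {0: "positions@frame0", 1: "positions@frame1"}
colors = {2: "colors@frame2"}  # only logged later, at frame 2

def latest_at(column: dict, t: int):
    """Return the value logged at or before time t, if any."""
    times = sorted(column)
    i = bisect_right(times, t)
    return column[times[i - 1]] if i else None

def query(t: int):
    # No primary component (positions) at or before t means nothing to draw.
    primary = latest_at(positions, t)
    if primary is None:
        return None
    # Join the other components against the primary one.
    return {"positions": primary, "colors": latest_at(colors, t)}

print(query(1))  # {'positions': 'positions@frame1', 'colors': None}
print(query(5))  # {'positions': 'positions@frame1', 'colors': 'colors@frame2'}
```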
Getting all this to be both correct and fast took considerable effort during the first year of Rerun. The combination of all these features is what makes Rerun so versatile and easy to use.
Overhead from flexibility makes simple time series the worst case
Flexibility comes at the cost of added overhead. That matters less for larger data, but dominates performance for small data like scalars. In Rerun, time series are created by querying the datastore for all scalars on an entity for a range of time.
Before 0.13, the worst case was therefore simple time series plots, since you’d have to pay all that overhead many times for very little data.
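Conceptually, such a range query walks the entity’s time index and re-collects every scalar in the visible window, and it does so on every rendered frame. A toy sketch (illustrative, not the real query code):

```python
from bisect import bisect_left, bisect_right

# Toy store: one scalar logged per time point, with a sorted time index.
times = list(range(1_000_000))
scalars = [float(t) for t in times]

def range_query(t_min: int, t_max: int) -> list[float]:
    lo = bisect_left(times, t_min)
    hi = bisect_right(times, t_max)
    # Every call re-materializes the window, and in the real store each of
    # these tiny rows also pays for indexing, joining, and bookkeeping.
    return scalars[lo:hi]

# A moving window queried at 60 fps repeats almost identical work each frame.
for frame in range(3):
    window = range_query(frame, frame + 1_000)
```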
Speeding up time series by a factor of 20-30
On Rerun 0.12, rendering a single frame of a 1M point time series plot takes ~600ms on a 2021 M1 MacBook Pro. On 0.13 it takes ~20ms, a 30x speedup. For smaller series the speedup tends to come closer to 20x, which is still huge. How did we get there?
Sources of overhead in time series plots
Let’s sum up the main sources of overhead in producing data for and rendering a time series plot:
- Bookkeeping: Keeping track of which components exist at each timestamp is relatively costly for small data.
- Data locality: For small data, the flexible data model (both in time and data type) bottoms out in a lack of data locality, which is bad for CPU cache efficiency.
- Redundant work: When plotting a moving time window, data is usually only changing at the edges of the window. Repeatedly running the full range query creates lots of redundant work.
- Rendering: For large time series, there may be more points than pixels to draw them on along the time dimension (x-axis). This leads to redundant tessellation and overdraw in the rendering pipeline.
Why we didn’t just special case metrics
The easiest way to speed up time series plots would be to special case metrics: a dedicated code path for simple metrics would make huge gains relatively easy.
Unfortunately this isn’t good enough because our users need good performance for more kinds of range queries. Here is an example using time ranges to show an aggregate point cloud in a structure from motion setting:
In addition, our motivating example for kHz plots was IMU samples, which often come as messages containing a number of values to plot. That makes special casing only simple metrics much less helpful. For instance:
typedef struct {
    float acc[3];
    float gyro[3];
    unsigned long timestamp;
} IMUSample;
Rerun doesn’t yet support visualizing time series from single fields of larger structs directly, but we will soon, and our approach needs to support this as a first-class use case.
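Until then, one workable pattern is to split the struct into one scalar entity per field when logging. A minimal sketch with the Python SDK, where the entity paths and the timestamp unit are assumptions:

```python
import rerun as rr

rr.init("imu_example", spawn=True)

# Stand-in values for one decoded IMUSample (hypothetical numbers).
acc = (0.0, 0.1, 9.8)
gyro = (0.01, 0.02, 0.00)
timestamp_us = 1_000_000

# Assuming microsecond device timestamps; adjust to whatever the sensor provides.
rr.set_time_nanos("imu", timestamp_us * 1_000)
for axis, value in zip("xyz", acc):
    rr.log(f"imu/acc/{axis}", rr.Scalar(value))
for axis, value in zip("xyz", gyro):
    rr.log(f"imu/gyro/{axis}", rr.Scalar(value))
```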
Caching is hard but necessary
There are only two hard things in Computer Science: cache invalidation and naming things.
Phil Karlton
The non-rendering sources of overhead just scream caching, but as usual the devil is in the details, in particular the details of cache invalidation. In our case, out of order insertions combined with composing multiple components over time make invalidation particularly gnarly.
Datastore changelogs make invalidation manageable
Except for garbage collection (dropping old data to free up memory), the Rerun datastore is immutable. Data is always dropped through snapshotting so that query semantics are left unchanged. Every single change results in rows being added to or removed from the store.
The first step we took after deciding it was time for caching was a refactor that turned every system that maintains a derived dataset in Rerun (timeline widget, view heuristics engine, etc) into a store subscriber, which listens to changelogs of added or removed rows in the datastore.
Cache invalidation is yet another store subscriber, and having this protocol in place is what made it manageable to deal with all the complexity.
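The actual subscriber API lives in the Rust codebase, but the shape of the pattern is roughly this (an illustrative Python sketch, not Rerun’s real types):

```python
from dataclasses import dataclass, field

@dataclass
class StoreEvent:
    """One changelog entry: a row was added to, or removed from, the store."""
    entity_path: str
    time: int
    added: bool  # False means the row was garbage-collected

class StoreSubscriber:
    """Anything that maintains a derived dataset registers as a subscriber."""
    def on_events(self, events: list[StoreEvent]) -> None:
        raise NotImplementedError

@dataclass
class CacheInvalidator(StoreSubscriber):
    # Per entity: the earliest time affected by a change. Everything at or
    # after that time is considered dirty until the cache is rebuilt.
    dirty_since: dict[str, int] = field(default_factory=dict)

    def on_events(self, events: list[StoreEvent]) -> None:
        for ev in events:
            prev = self.dirty_since.get(ev.entity_path, ev.time)
            self.dirty_since[ev.entity_path] = min(prev, ev.time)
```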
Caches are built lazily at query time
Cache invalidation only sets a dirty flag. Actually building the cache happens lazily at query time. Since Rerun uses an end to end immediate mode architecture, we query the datastore on every frame, ideally 60 times per second.
This acts as a natural micro-batching mechanism where we first accumulate changes while we render the current frame and then handle all these changes at the start of the next frame. Batching updates like this is great for performance.
Visible views query the datastore right before rendering. Building caches lazily therefore means we never spend time updating a cache that isn’t used.
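Putting invalidation and lazy building together, the query path looks roughly like this sketch (deliberately much coarser than the real cache, which tracks time ranges and components rather than whole entities):

```python
class LazySeriesCache:
    """Invalidation only records *what* is dirty; rebuilding happens on query."""

    def __init__(self, store):
        self.store = store            # anything with full_series(entity) -> [(time, value), ...]
        self.dirty: set[str] = set()  # filled in by the invalidation subscriber
        self.cached: dict[str, list] = {}

    def invalidate(self, entity: str) -> None:
        self.dirty.add(entity)        # cheap: just a flag, no rebuilding yet

    def range_query(self, entity: str, t_min: int, t_max: int) -> list:
        if entity not in self.cached or entity in self.dirty:
            # Rebuild lazily, on the first query after an invalidation, and
            # only for entities that some visible view actually asks for.
            self.cached[entity] = self.store.full_series(entity)
            self.dirty.discard(entity)
        return [v for (t, v) in self.cached[entity] if t_min <= t <= t_max]
```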
Multi-tenancy requires fine-grained locking
When not on the web, each space view (draggable visualization panel) runs and queries the datastore in parallel. Many of these queries might overlap so we need to make sure to share data and cache resources appropriately.
Supporting this multi-tenancy requires fine-grained locking on the combination of store (there are multiple), entity, set of components, and the component “point of view” (the component we join against).
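The shape of that locking is roughly the following sketch (illustrative; the key layout and types are assumptions, and the real thing is Rust, not Python):

```python
import threading

# (store_id, entity_path, component set, point-of-view component)
CacheKey = tuple[str, str, frozenset[str], str]

class ShardedCaches:
    def __init__(self) -> None:
        self._guard = threading.Lock()  # only protects the map of shards
        self._shards: dict[CacheKey, tuple[threading.Lock, dict]] = {}

    def _shard(self, key: CacheKey) -> tuple[threading.Lock, dict]:
        with self._guard:
            return self._shards.setdefault(key, (threading.Lock(), {}))

    def with_cache(self, key: CacheKey, build):
        lock, slot = self._shard(key)
        with lock:  # per-key lock: queries for unrelated data never contend
            if "value" not in slot:
                slot["value"] = build()
            return slot["value"]
```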
Aggregating sub-pixel points speeds up rendering
There are only so many pixels available to draw on along the x-axis. For large time series, that means you need to aggregate points that would show up on the same x-axis tick to avoid overdraw and redundant tessellation in the render pipeline.
We compute these aggregations on every frame and let users choose between a set of basic options.
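Min/max per x-axis pixel is one such aggregation. A sketch of how that kind of bucketing can be computed (not necessarily the exact scheme used in the viewer):

```python
import numpy as np

def minmax_per_pixel(times: np.ndarray, values: np.ndarray, width_px: int):
    """Collapse points sharing an x-axis pixel into one (min, max) pair."""
    t0, t1 = times[0], times[-1]
    px = ((times - t0) / (t1 - t0) * (width_px - 1)).astype(int)  # pixel bucket per point
    mins = np.full(width_px, np.inf)
    maxs = np.full(width_px, -np.inf)
    np.minimum.at(mins, px, values)
    np.maximum.at(maxs, px, values)
    keep = np.isfinite(mins)  # pixels that actually received points
    return np.flatnonzero(keep), mins[keep], maxs[keep]

# 1M points collapse to at most 1920 (min, max) pairs to tessellate and draw.
times = np.linspace(0.0, 10.0, 1_000_000)
values = np.sin(times * 50.0)
pixels, lo, hi = minmax_per_pixel(times, values, width_px=1920)
```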
A caveat on performance for out-of-order logs
The speedups added in 0.13 slow down ingestion for out-of-order logs. In practice this only matters for scalars logged out of order at high frequency, but it does mean that live time series plots are slow in this case.
An update that fixes this is planned for 0.13.1.
Follow the progress here.
Big wins already, but there’s more to be had
Adding a caching layer to the Rerun datastore has taken lots of effort over the last months, but the performance gains are clearly worth the added complexity.
There are still lots of gains to be had by adding secondary caches on top, for scalars and other data. For example, slowly changing 3D geometry could be cached on the GPU to avoid redundant CPU -> GPU transfers, which are currently the performance bottleneck for point clouds.
Join us on GitHub or Discord and tell us if these performance improvements made a difference for you, and what areas you’d like to see us speed up next.