Better CPU selection for timer expiration [LWN.net]
By Jonathan Corbet
November 7, 2022
On the surface, the kernel's internal timer mechanism would not appear to have changed much in a long time; the core API looks quite similar to the one present in the 1.0 release. Underneath that API, naturally, quite a bit of complexity has been added over the years. The implementation of this API looks to become even more complex (but faster) if and when this patch set from Anna-Maria Behnsen finds its way into the mainline.
Premature optimization
For context, it is worth remembering that the kernel actually has two core APIs for the management of internal timers. High-resolution timers (hrtimers) are, as their name would suggest, used for near-future events where precision is important; they are relatively expensive and only used when necessary. For everything else, there is the subsystem known simply as "kernel timers" (or, sometimes, "timer-wheel timers"), built around functions like add_timer().
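As a reminder of what that API looks like in use, a minimal kernel-timer user might resemble the following sketch. This is a kernel-context fragment, not a standalone program, and the callback, names, and one-second timeout are illustrative assumptions rather than anything from the patch set:

```c
#include <linux/timer.h>
#include <linux/jiffies.h>
#include <linux/printk.h>

/* Illustrative only: a hypothetical driver's timeout timer. */
static struct timer_list my_timeout_timer;

static void my_timeout_fn(struct timer_list *t)
{
	/* The expected interrupt never arrived; recover here. */
	pr_warn("operation timed out\n");
}

static void my_start_operation(void)
{
	timer_setup(&my_timeout_timer, my_timeout_fn, 0);
	/* Arm (or re-arm) the timer to fire one second from now. */
	mod_timer(&my_timeout_timer, jiffies + HZ);
}
```

In the common case the operation completes normally, the driver cancels the timer, and my_timeout_fn() never runs; that is why, as described below, setting and canceling timers are the operations worth optimizing.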
Behnsen's patch set changes how these ordinary kernel timers work.
Arguably, most uses of kernel timers are the kernel ensuring that it will respond properly if an expected event fails to happen. A driver may start an I/O operation on a device, confident in the knowledge that the device will raise an interrupt when the operation completes, but it will still set a timer so that, if the device goes out to lunch, the operation does not simply hang indefinitely. Other parts of the kernel, such as networking, use timers in similar ways.
There are a couple of interesting implications that arise from this usage pattern. One is that such timers need not expire at exactly their nominal expiration time; if the kernel takes a little while to get around to handling an expired timer, nothing bad happens. That allows the kernel to batch timer expirations and to defer them to avoid waking an otherwise-idle CPU. The implication that is relevant here, though, is that kernel timers rarely expire. When things are working normally, the anticipated events will occur, and the timer will either be canceled or reset to a new time further in the future. As a result, the timer subsystem should be optimized for the creation and cancellation of timer events. And, indeed, much effort has gone into that sort of optimization, as can be seen in LWN coverage from as far back as 2004 up through a major reimplementation of timers in 2015.
Behnsen has identified a place where further optimization can be done, though. When a timer is set in current kernels, the timer subsystem spends some CPU time trying to determine which CPU should handle the expiration of that timer. The intent is to push that work to a CPU that is already busy rather than waking an idle CPU just to expire a timer. So the timer subsystem scans through the system looking for a suitable CPU and adds the timer to the appropriate queue.
There are a couple of problems with this algorithm, though. One is that a CPU that is busy now may not be busy when the timer expires. So the choice of a CPU to handle expiration, if made when the timer is set, is really just a guess. Perhaps worse, though, is that this work is being done at the wrong time; since most timers never expire, any effort that is put into choosing the expiration CPU ahead of time is likely to be wasted, even if the guess turns out to be a good one. It would be far better to do no extra work when a timer is set, and to have a CPU that is actually busy at expiration time handle it then, on the relatively rare occasion when a timer actually expires.
The first half (not choosing an expiration CPU at setting time) is easy to implement; a timer is simply put into the queue of the CPU that is setting it. Having a suitable CPU actually handle expiration is harder, though. A naive implementation might just create a single queue of timers that a CPU would check occasionally if it is running and able to handle expirations. That would create a lot of locking contention and cache-line bouncing, though, slowing things down even when there were no timers to handle. So something more complex is called for.
Choosing the expiration CPU
The scheme chosen is to organize the system's CPUs into a hierarchy that resembles the hardware topology, but which is independent from it. At the lowest level, CPUs are assembled into groups of up to eight, with the constraint that all eight must be contained within the same NUMA node. The groups are, themselves, organized into groups; this process continues until all CPUs in the system have been arranged into a single tree. Consider, for example, a simple, four-CPU system organized into two NUMA nodes as shown to the right. The first two CPUs are organized into Group 1; the other two, since they are in a different NUMA node, go into a separate group. These two groups, in turn, are placed together in Group 3. A larger and more complex system could require more levels of group hierarchy, but that gets awkward to show in a simple diagram.
The timer API allows timers to be pinned to specific CPUs; that does not change in the reimplementation. Each CPU must handle expiration for its pinned timers, even if that means waking from an idle state. Most timers, though, can be executed anywhere in the system and are not pinned; the handling of these "global" timers will be different than before. A CPU that is busy will continue to handle global timers that are in its specific queue but, if that CPU goes idle, it will instead add its soonest-expiring global timer to the queue associated with its group.
Normally, each CPU group will designate one of its members as the "migrator". That CPU, which cannot be idle, will occasionally check the queue associated with its group for expiring global timers; if the CPU that queued a timer there is still idle, then the migrator will pull the timer over and handle the expiration instead, and the CPU that originally queued the timer can remain idle. So, for example, if CPU 1 in the diagram above is idle, it will have enqueued its earliest-expiring global timer in Group 1; if CPU 2 is running (and is thus the migrator), it will handle that timer when it expires.
If the migrator goes idle, then another CPU in the group must be handed the baton and become the new migrator; that is just a matter of finding the next busy CPU in that group. If all of the other CPUs are also idle, instead, then the group ends up without a migrator. In this case, the group is marked as idle in the next higher group in the hierarchy, and its first-expiring timer is queued at that next level. So, if CPU 2 also goes idle, it will take the earliest-expiring event in Group 1 and put it into the queue at Group 3.
The assignment of the migrator role happens in the higher-level groups as well. If a group contains other groups, one of those groups will be the migrator for that level. In the scenario described here, Group 2 will be the migrator for Group 3. The CPU serving as the migrator within Group 2 (either CPU 3 or CPU 4) will thus have to handle timer events for Group 3 as well. In a system with a lot of idle CPUs, this migrator role can propagate all the way to the top of the hierarchy, at which point one CPU may be responsible for handling all expiring timers in the system.
If that CPU also goes idle, the system will be left without any migrator CPU at all. At that point, the last CPU standing will set a hardware timer to wake it at the expiration time of the earliest-expiring timer. That ensures that timer events do not get dropped even in the absence of busy CPUs to handle them.
Implementing this machinery in the timer subsystem results in a patch set adding nearly 2,000 lines of code to a core kernel subsystem. The benefit that comes from this work is said to be an approximately 25% reduction in the time required to add and remove a timer. Since timers can be set (and adjusted) within performance-sensitive code, this improvement likely justifies the added complexity.
A side benefit of this work is that it should enable the removal of the deferrable timers mechanism. Deferrable timers are those for which expiration can be deferred without any ill effect; a CPU that is going idle will not wake up solely to handle a deferrable timer. Turning deferrable timers into global timers will have the same effect (they will no longer cause a sleeping CPU to wake), so there is no longer a need to handle them separately. The removal of deferrable timers, which is, according to Behnsen, coming soon, will counterbalance some of the complexity added by this work.