Link-time optimisation (LTO) | J. Ryan Stinnett
I recently began exploring link-time optimisation (LTO),
which I used to think was just a single boolean choice
in the compilation and linking workflow,
and maybe it was like that at some point in the past…
I've found that these days,
there are many different dimensions of LTO across
compilers and linkers, and more variations
are
being
proposed
all the time.
In this "living guide",
I aim to cover the LTO-related features I've encountered so far.
I intend to keep updating it going forward
as I learn about new details in this space.
I'm sure there are many more corners to cover,
so if you see something that should be added, corrected, etc.,
please contact me.
I'm not aiming to offer specific recommendations here,
as there are many tradeoffs to consider
and different applications of LTO will weigh each of those differently.
Instead, I hope this is a broadly useful overview of the landscape.
This guide focuses on common toolchains for languages like C, Rust, etc. which
typically use ahead-of-time (AOT) compilation and linking. Alternative
toolchains for these languages and common toolchains for other languages may use
other strategies like interpretation, JIT compilation, etc. These other language
implementation strategies don't offer LTO-like features (that I know of), so
I've ignored them here.
I hope this guide is useful to experienced compiler users and compiler hackers
who may not have heard about the latest work yet. I also hope it's broadly
accessible to those who may be unfamiliar with LTO.
Basics#
Normal (non-LTO) compilation compiles and optimises one file at a time.
LTO can optimise across all files at once.
The overall goal of LTO is better runtime performance through whole-program
analysis and cross-module optimisation.
Let's take a high-level look at the workflow of most AOT compile and link
toolchains. We'll start off with the "default" workflow without LTO.
In the default workflow without LTO, each unit of source code is compiled and
optimised separately to produce an object file with machine code for the target
architecture. Optimisations can only consider a single source unit at a time, so
all externally visible symbols (functions and variables) must be preserved,
even if they will end up being unused in the final linked output. The linker
then combines these object files to produce the final output (executable or
library).
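As a concrete sketch of this default workflow (the file names here are hypothetical), each source unit becomes its own object file before the linker ever runs:

```shell
# Each source unit is compiled and optimised on its own,
# producing machine code for the target architecture.
cc -O2 -c main.c -o main.o
cc -O2 -c util.c -o util.o

# The linker then combines the finished object files
# into the final output.
cc main.o util.o -o program
```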
Now let's look at a workflow with LTO.
In the LTO setup, we still initially compile each source unit separately and
perform some amount of optimisation, but the output is different: instead of
machine code, the output of LTO compilation is an object file containing
the compiler's own internal representation (IR) for that source unit.
The linking stage now performs a much more complex dance than it did before.
The IR produced from compiling each source unit is read and the compiler's
optimisation pipeline is invoked to analyse and transform the whole program (the
precise details of this vary across different LTO features, as we'll see later
in this guide). This whole-program stage unlocks further optimisation
opportunities, as we can remove symbols that are unused outside the whole
program, inline functions from other source units, etc.
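The same hypothetical build with LTO enabled looks almost identical on the command line; the difference lies in what the object files contain and what the link step does:

```shell
# With -flto, the "object files" contain the compiler's IR
# (e.g. LLVM bitcode) rather than machine code.
cc -O2 -flto -c main.c -o main.o
cc -O2 -flto -c util.c -o util.o

# The link step now runs the compiler's optimisation pipeline
# across the whole program before emitting machine code.
cc -O2 -flto main.o util.o -o program
```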
With these basics out of the way, let's look at the various features and
variants that toolchains offer to further tweak and customise this process.
Features and variants#
⚠️
Some of the features found in the LTO space have "marketing" names which don't
communicate what they actually do at a technical level.
For some descriptions below, I've used my own terminology to avoid these
issues.
Each section also lists the other names things are known by, in case you want to
search for more information.
Basic LTO#
This is the simplest of the LTO modes and matches the workflow described above
in Basics.
Each compilation task produces object files containing the compiler's internal
IR format.
The linking stage combines all compilation units into a single, large module.
Interprocedural analysis and optimisation is performed on a single thread.
With large code bases, this process is likely to consume a lot of memory and take
a considerable amount of time.
In terms of compile-time performance,
the LLVM project has shown that compilation and linking of
the Clang 3.9 codebase with basic LTO takes ~5x the non-LTO time.
This extra work achieves a mean run-time performance improvement of 2.86%.
This mode is also known as "full LTO".
Toolchain | First available | Option |
---|---|---|
Clang | 2.6 (2009) | -flto or -flto=full |
GCC | 4.5 (2010) | -flto -flto-partition=one |
Rust | 1.0 (2015) | -C lto |
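For instance, basic LTO can be requested in each toolchain roughly like this (file and crate names are placeholders):

```shell
# Clang: full LTO across all inputs
clang -O2 -flto=full main.c util.c -o program

# GCC: a single partition forces the one-big-module behaviour
gcc -O2 -flto -flto-partition=one main.c util.c -o program

# Rust (via rustc directly; Cargo exposes this as lto = "fat")
rustc -C opt-level=2 -C lto main.rs
```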
Parallel LTO#
Toolchains have come up with a variety of related techniques to speed up the
link-time work of LTO while preserving (most of) the run-time performance
gains. Instead of creating a single, huge module at link time, a much smaller
global index of functions likely to be inlined is computed. With that in hand,
each compilation unit can be processed in parallel at link time while still
benefiting from most of the same whole-program optimisations as basic
LTO.
Continuing with the same example data based on building Clang 3.9,
LLVM's implementation of parallel LTO
achieves almost all of the run-time performance improvement
seen with basic LTO:
basic LTO reached a 2.86% improvement over non-LTO,
while parallel LTO achieved a 2.63% improvement over non-LTO.
The compilation time improves dramatically: instead of 5x non-LTO, it's now only
1.2x non-LTO, which is quite impressive.
The technical details of how each toolchain
implements parallel LTO vary considerably.
LLVM-based toolchains (which includes Clang and Rust from our discussion here)
optimise different compilation units in parallel
and inline across module boundaries,
but most other cross-module optimisations are skipped.
GCC, on the other hand,
partitions (nearly) the same optimisation work
it would have done with one thread into a batch per thread.
This suggests that GCC's parallel LTO should be able to get closer than LLVM
to matching the run-time performance of basic LTO
(our recurring dataset shows a 0.23% run-time delta between
LLVM's basic and parallel modes).
At the same time, LLVM's approach may be better able to handle
incremental changes in a single module of a large program.
If you would like to see data comparing the two modes in GCC,
please let me know.
This mode is also known as "thin LTO",
particularly in the LLVM ecosystem.
The "thin" concept on the LLVM side
refers to the fact that no IR is involved in the whole-program analysis step.
Toolchain | First available | Option |
---|---|---|
Clang | 3.9 (2016) | -flto=thin |
GCC | 4.6 (2011) | -flto=auto or -flto=<threads> |
Rust | 1.25 (2018) | -C lto=thin |
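As an illustration, parallel LTO is enabled per toolchain roughly as follows (file names are hypothetical):

```shell
# Clang's ThinLTO
clang -O2 -flto=thin main.c util.c -o program

# GCC: split the link-time work across available threads
gcc -O2 -flto=auto main.c util.c -o program

# Rust (Cargo exposes this as lto = "thin")
rustc -C opt-level=2 -C lto=thin main.rs
```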
LTO with mode choice deferred to link time#
In some applications, it may be useful to support both the basic
and parallel LTO modes. In this arrangement, the compiler IR
attached to each object file stores all the data needed to run either LTO mode
at link time.
When it's time to link, you can then choose either of
the basic or parallel LTO modes via link-time compiler options
(for the toolchains mentioned here,
the program commonly known as just the "compiler"
is really the "compiler driver",
which in turn calls the other programs in the workflow,
such as the linker).
This variant is also known as "unified LTO".
Toolchain | First available | Option |
---|---|---|
Clang | 17 (2023) | -funified-lto |
GCC | 4.5 (2010) | supported by default |
Rust | — | — |
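A sketch of how this might look with Clang's unified LTO, based on my reading of the Clang 17 documentation (file names are hypothetical):

```shell
# Compile once; -funified-lto makes the stored IR
# usable by either LTO pipeline at link time.
clang -O2 -flto=full -funified-lto -c main.c -o main.o
clang -O2 -flto=full -funified-lto -c util.c -o util.o

# Select the mode when linking (here, the parallel/thin pipeline).
clang -fuse-ld=lld -flto=thin -funified-lto main.o util.o -o program
```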
LTO enablement decision deferred to link time#
It may also be useful to push the choice of whether to use LTO at all down to
the linking step of the workflow. To support this use case, the compiler
combines both machine code and internal IR in the object files produced for
each compilation unit.
This variant is also known as "fat LTO objects".
Toolchain | First available | Option |
---|---|---|
Clang | 17 (2023) | -ffat-lto-objects |
GCC | 4.9 (2014) | -ffat-lto-objects |
Rust | — | — |
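For example, with fat objects the LTO decision can wait until the final link (file names again hypothetical):

```shell
# Each object carries both machine code and the compiler's IR.
gcc -O2 -flto -ffat-lto-objects -c main.c -o main.o
gcc -O2 -flto -ffat-lto-objects -c util.c -o util.o

# Link with LTO...
gcc -O2 -flto main.o util.o -o program-lto
# ...or without, using the machine code already in the objects.
gcc -O2 -fno-lto main.o util.o -o program-nolto
```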
Other details#
There are a few other more advanced corners of LTO, including:
- Distributed build support
- Symbol visibility
- Linker caching
If you're curious about any of these or other aspects, please
let me know!
I plan to expand this guide to document additional bits of LTO
that others are interested in.
Acknowledgements#
Thanks to
Teresa Johnson,
Jan Hubička,
Iti Shree, and
Laurence Tratt
for feedback on earlier drafts.