How fast are Linux pipes anyway?
In this post, we'll explore how Unix pipes are implemented in Linux by iteratively optimizing a test program that writes and reads data through a pipe.
We'll begin with a simple program with a throughput of around 3.5GiB/s, and improve its performance twentyfold. The improvements will be informed by profiling the program using Linux's perf
tooling. The code is available on GitHub.
The post was inspired by reading a highly optimized FizzBuzz program, which pushes output to a pipe at a rate of ~35GiB/s on my laptop. Our first goal will be to match that speed, explaining every step as we go along. We'll also add an additional performance-improving measure, which isn't needed in FizzBuzz since the bottleneck is actually computing the output rather than IO, at least on my machine.
We’ll proceed as follows:
- A first slow version of our pipe test bench;
- How pipes are implemented internally, and why writing and reading from them is slow;
- How the vmsplice and splice syscalls let us get around some (but not all!) of the slowness;
- A description of Linux paging, leading up to a faster version using huge pages;
- The final optimization, replacing polling with busy looping;
- Some closing thoughts.
Part 4 is the heaviest on Linux kernel internals, so it might be interesting even if you're familiar with the other topics treated in the post. For readers not familiar with the topics treated, only basic knowledge of C is assumed.
Let’s start!
The problem, and a slow first version #
First of all, let's measure the performance of the fabled FizzBuzz program, following the rules laid down by the StackOverflow post:
% ./fizzbuzz | pv >/dev/null
422GiB 0:00:16 [36.2GiB/s]
pv
is "pipe viewer", a handy utility to measure the throughput of data flowing through a pipe. So fizzbuzz
is producing output at a rate of 36GiB/s.
fizzbuzz
writes the output in blocks as big as the L2 cache, to strike a good balance between cheap access to memory and minimizing IO overhead.
On my machine, the L2 cache is 256KiB. Throughout this post, we'll also output blocks of 256KiB, but without "computing" anything. Essentially, we'll try to measure the upper bound for programs writing to a pipe with a reasonable buffer size.
While fizzbuzz
uses pv
to measure speed, our setup will be slightly different: we'll implement the programs on both ends of the pipe. This is so that we fully control the code involved in pushing and pulling data from the pipe.
The code is available in my pipes-speed-test
repo. write.cpp
implements the writing, and read.cpp
the reading. write
repeatedly writes the same 256KiB buffer forever. read
reads through 10GiB of data and terminates, printing the throughput in GiB/s. Both executables accept a number of command line options to change their behavior.
The first attempt at reading and writing from pipes will use the write
and read
syscalls, using the same buffer size as fizzbuzz
. Here's a view of the writing end:
#include <string.h> // memset
#include <stdlib.h> // malloc
#include <unistd.h> // write

int main() {
size_t buf_size = 1 << 18; // 256KiB
char* buf = (char*) malloc(buf_size);
memset((void*)buf, 'X', buf_size); // output Xs
while (true) {
size_t remaining = buf_size;
while (remaining > 0) {
// Keep invoking `write` until we've written the entirety
// of the buffer. Remember that write returns how much
// it could write into the destination -- in this case,
// our pipe.
ssize_t written = write(
STDOUT_FILENO, buf + (buf_size - remaining), remaining
);
remaining -= written;
}
}
}
This snippet and following ones omit all error checking for brevity. The memset
ensures that the output will be printable, but also plays another role, as we’ll discuss later.
The work is all done by the write
call; the rest just makes sure the whole buffer gets written. The read end is very similar, but read
ing data into buf
, and terminating when enough has been read.
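For reference, here's a minimal sketch of what the read end could look like (the actual read.cpp in the repo also parses command line options and reports the throughput):
#include <stdlib.h> // malloc
#include <unistd.h> // read

int main() {
  size_t buf_size = 1 << 18;         // 256KiB, same as the writer
  size_t to_read = (size_t)10 << 30; // stop after 10GiB
  char* buf = (char*) malloc(buf_size);
  size_t read_so_far = 0;
  while (read_so_far < to_read) {
    // `read` returns how many bytes it could pull out of the pipe,
    // which might be less than we asked for.
    read_so_far += read(STDIN_FILENO, buf, buf_size);
  }
  return 0;
}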
After building, the code from the repo can be run as follows:
% ./write | ./read
3.7GiB/s, 256KiB buffer, 40960 iterations (10GiB piped)
We’re writing the same 256KiB buffer filled with 'X'
s 40960 times, and measuring the throughput. What’s worrying is that we’re 10 times slower than fizzbuzz
! And we’re not doing any work, just writing bytes to the pipe.
It turns out that we can’t get much faster than this by using write
and read
.
The trouble with write
#
To find out what our program is spending time on, we can use perf
:
% perf record -g sh -c './write | ./read'
3.2GiB/s, 256KiB buffer, 40960 iterations (10GiB piped)
[ perf record: Woken up 6 times to write data ]
[ perf record: Captured and wrote 2.851 MB perf.data (21201 samples) ]
The -g
option instructs perf to record call graphs: this will allow us to take a top-down look at where time is being spent.
We can take a look at where time is spent using perf report
. Here's a lightly edited excerpt, breaking down where write
spends its time:
% perf report -g --symbol-filter=write
- 48.05% 0.05% write libc-2.33.so [.] __GI___libc_write
- 48.04% __GI___libc_write
- 47.69% entry_SYSCALL_64_after_hwframe
- do_syscall_64
- 47.54% ksys_write
- 47.40% vfs_write
- 47.23% new_sync_write
- pipe_write
+ 24.08% copy_page_from_iter
+ 11.76% __alloc_pages
+ 4.32% schedule
+ 2.98% __wake_up_common_lock
0.95% _raw_spin_lock_irq
0.74% alloc_pages
0.66% prepare_to_wait_event
47% of the time is spent in pipe_write
, which is what write
resolves to when we're writing to a pipe. This isn't surprising — we're spending roughly half of the time writing, and the other half reading.
Inside pipe_write
, 3/4 of the time is spent copying or allocating pages (copy_page_from_iter
and __alloc_pages
). If we already have an idea of how communication between the kernel and userspace works, this might make some sense. Regardless, to fully understand what's going on, we must first understand how pipes work.
What are pipes made of? #
The data structure defining a pipe can be found in include/linux/pipe_fs_i.h
, and the operations on it in fs/pipe.c
.
A Linux pipe is a ring buffer holding references to the pages where the data is written to and read from:
In the picture above the ring buffer has 8 slots, but we might have more or fewer, the default being 16. Each page is 4KiB on x86-64, but it might be of a different size on other architectures. In total, this pipe can hold at most 32KiB of data. This is a key point: every pipe has an upper bound on the total amount of data it can hold before it is full.
The shaded part of the diagram represents the current pipe data, the non-shaded part the empty space in the pipe.
Somewhat counterintuitively, head
stores the write end of the pipe. That is, writers will write into the buffer pointed at by head
, and increment head
accordingly if they need to move on to the next buffer. Within the write buffer, len
stores how much we've written into it.
Conversely, tail
stores the read end of the pipe: readers will start consuming the pipe from there. offset
indicates where to start reading from.
Note that tail
can appear after head
, like in the picture, since we're working with a circular/ring buffer. Also note that some slots might be unused when we haven't filled the pipe completely — the NULL
cells in the middle. If the pipe is full (no NULL
s and no free space in the pages), write
will block. If the pipe is empty (all NULL
s), read
will block.
Here's an abridged version of the C data structures in pipe_fs_i.h
:
struct pipe_inode_info {
unsigned int head;
unsigned int tail;
struct pipe_buffer *bufs;
};
struct pipe_buffer {
struct page *page;
unsigned int offset, len;
};
We're omitting many fields here, and we're not explaining what struct page
contains yet, but this is the key data structure to understanding how reading and writing from a pipe happens.
Reading and writing to pipes #
Let's now look at the definition of pipe_write
, to attempt to make sense of the perf
output shown before.
Here's a simplified explanation of how pipe_write
works:
- If the pipe is already full, wait for space and restart;
- If the buffer currently pointed at by
head
has space, fill that space first;
- While there are free slots and there are remaining bytes to write, allocate new pages and fill them, updating
head
.
The operations described above are protected by a lock, which pipe_write
acquires and releases as necessary.
pipe_read
is the mirror image of pipe_write
, except that we consume pages, free them once we've fully read them, and update tail
.
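To fix ideas, here's a heavily simplified sketch of the write path in pseudo-C. This is not the real kernel code: helper names like pipe_full, wait_until_not_full, alloc_page, copy_from_user_to, wake_up_readers, and the constants PIPE_SLOTS and PAGE_SIZE are stand-ins for the actual machinery in fs/pipe.c.
ssize_t pipe_write_sketch(struct pipe_inode_info *pipe, const char *src, size_t len) {
  size_t written = 0;
  while (written < len) {
    if (pipe_full(pipe)) {         // every slot used and full
      wait_until_not_full(pipe);   // sleep until readers free some space
      continue;
    }
    struct pipe_buffer *buf = &pipe->bufs[pipe->head % PIPE_SLOTS];
    if (buf->page == NULL) {       // empty slot: allocate a fresh 4KiB page
      buf->page = alloc_page();
      buf->offset = buf->len = 0;
    }
    // Fill whatever space is left in the current page.
    size_t n = min(len - written, PAGE_SIZE - buf->offset - buf->len);
    copy_from_user_to(buf->page, buf->offset + buf->len, src + written, n);
    buf->len += n;
    written += n;
    if (buf->offset + buf->len == PAGE_SIZE)
      pipe->head++;                // page is full, move on to the next slot
  }
  wake_up_readers(pipe);           // there is data to consume now
  return written;
}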
So, we now have a rather unpleasant picture of what's going on:
- We copy each page twice, once from user memory to the kernel, and then back again from the kernel to user memory;
- The copying is done one 4KiB page at a time, interspersed with other activity, such as the synchronization between read and write, and page allocation and freeing;
- We're working with memory which might not be contiguous, since we're constantly allocating new pages;
- We're acquiring and releasing the pipe lock.
On this machine, sequential RAM reads clock in at around 16GiB/s:
% sysbench memory --memory-block-size=1G --memory-oper=read --threads=1 run
...
102400.00 MiB transferred (15921.22 MiB/sec)
Given all the fiddliness listed above, a 4x slowdown compared to single-threaded sequential RAM speed is not that surprising.
Tweaking the buffer size or the pipe size to reduce the amount of syscall and synchronization overhead, or tuning other parameters, won't get us very far. Luckily, there's a way to get around the slowness of write
and of read
altogether.
Splicing to the rescue #
This copying of buffers from user memory to the kernel and back is a frequent thorn in the side of people needing to do fast IO. One common solution is to simply cut the kernel out of the picture and perform IO operations directly. For example, we might interact directly with a network card and bypass the kernel for low-latency networking.
In general, when we write to a socket, or a file, or in our case a pipe, we first write to a buffer somewhere in the kernel, and then let the kernel do its work. In the case of pipes, the pipe is a series of buffers in the kernel. All this copying is undesirable if we're in the business of performance.
Luckily, Linux includes system calls to speed things up when we want to move data to and from pipes, without copying. Specifically:
- splice moves data from a pipe to a file descriptor, and vice versa;
- vmsplice moves data from user memory into a pipe.
Crucially, both operations work without copying anything.
Now that we know how pipes work, we can already vaguely imagine how the two operations work: they just "grab" an existing buffer from somewhere and put it into the pipe ring buffer, or the reverse, rather than allocating new pages as needed:
We'll soon see exactly how this works.
Splicing in practice #
Let's replace write
with vmsplice
. This is the signature of vmsplice
:
struct iovec {
void *iov_base; // Starting address
size_t iov_len; // Number of bytes
};
// Returns how much we've spliced into the pipe
ssize_t vmsplice(
int fd, const struct iovec *iov, size_t nr_segs, unsigned int flags
);
fd
is the target pipe, struct iovec *iov
is an array of buffers we’ll be moving to the pipe. Note that vmsplice
returns how much was “spliced” into the pipe, which might not be the full amount, much like how write
returns how much was written. Remember that pipes are bounded by how many slots they have in the ring buffer, and vmsplice
is not exempt from this restriction.
We also need to be a bit careful when using vmsplice
. Since the user memory is moved to the pipe without copying, we must ensure that the read-end consumes it before we can reuse the spliced buffer.
For this reason fizzbuzz
uses a double buffering scheme, which works as follows:
- Split the 256KiB buffer in two;
- Set the pipe size to 128KiB; this will have the effect of setting the pipe ring buffer to have 128KiB/4KiB = 32 slots (see the fcntl snippet below);
- Alternate between writing to the first half-buffer and using
vmsplice
to move it to the pipe and doing the same with the other half.
The fact that the pipe size is set to 128KiB, and that we wait for vmsplice
to fully output one 128KiB buffer, ensures that by the time we’re done with one iteration of vmsplice
we know that the previous buffer has been fully read — otherwise we would not have been able to fully vmsplice
the new 128KiB buffer into the 128KiB pipe.
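As an aside, resizing the pipe from the writer can be done with fcntl and the Linux-specific F_SETPIPE_SZ operation, roughly like this:
#define _GNU_SOURCE
#include <fcntl.h>  // fcntl, F_SETPIPE_SZ
#include <unistd.h> // STDOUT_FILENO

// Resize the pipe on stdout to 128KiB, i.e. 32 ring buffer slots of 4KiB each.
// The return value is the actual pipe size, which the kernel may round up.
int pipe_size = fcntl(STDOUT_FILENO, F_SETPIPE_SZ, 128 * 1024);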
Now, we’re not actually writing anything to the buffers, but we’ll keep the double buffering scheme since a similar scheme would be required for any program actually writing content.
Our write loop now looks something like this:
#define _GNU_SOURCE
#include <string.h>  // memset
#include <stdlib.h>  // malloc
#include <sys/uio.h> // struct iovec
#include <fcntl.h>   // vmsplice

int main() {
size_t buf_size = 1 << 18; // 256KiB
char* buf = (char*) malloc(buf_size);
memset((void*)buf, 'X', buf_size); // output Xs
char* bufs[2] = { buf, buf + buf_size/2 };
int buf_ix = 0;
// Flip between the two buffers, splicing until we're done.
while (true) {
struct iovec bufvec = {
.iov_base = bufs[buf_ix],
.iov_len = buf_size/2
};
buf_ix = (buf_ix + 1) % 2;
while (bufvec.iov_len > 0) {
ssize_t ret = vmsplice(STDOUT_FILENO, &bufvec, 1, 0);
bufvec.iov_base = (void*) (((char*) bufvec.iov_base) + ret);
bufvec.iov_len -= ret;
}
}
}
Here are the results writing with vmsplice
, rather than write
:
% ./write --write_with_vmsplice | ./read
12.7GiB/s, 256KiB buffer, 40960 iterations (10GiB piped)
This halves the amount of copying we need to do, and already improves our throughput more than threefold — to 12.7GiB/s. Changing the read end to use splice
, we eliminate all copying, and get another 2.5x speedup:
% ./write --write_with_vmsplice | ./read --read_with_splice
32.8GiB/s, 256KiB buffer, 40960 iterations (10GiB piped)
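For reference, the read end with splice can be sketched like this. It moves data from the pipe on stdin straight to /dev/null, so the bytes are consumed without ever being copied into a userspace buffer (the actual read.cpp also measures and reports the throughput):
#define _GNU_SOURCE
#include <fcntl.h>  // splice, open
#include <unistd.h> // STDIN_FILENO

int main() {
  size_t to_read = (size_t)10 << 30; // stop after 10GiB
  int devnull = open("/dev/null", O_WRONLY);
  size_t read_so_far = 0;
  while (read_so_far < to_read) {
    // Move up to 128KiB from the pipe directly to /dev/null, without the
    // data ever passing through this process's memory.
    read_so_far += splice(STDIN_FILENO, NULL, devnull, NULL, 1 << 17, 0);
  }
  return 0;
}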
Fishing for pages #
What next? Let's ask perf
:
% perf record -g sh -c './write --write_with_vmsplice | ./read --read_with_splice'
33.4GiB/s, 256KiB buffer, 40960 iterations (10GiB piped)
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.305 MB perf.data (2413 samples) ]
% perf report --symbol-filter=vmsplice
- 49.59% 0.38% write libc-2.33.so [.] vmsplice
- 49.46% vmsplice
- 45.17% entry_SYSCALL_64_after_hwframe
- do_syscall_64
- 44.30% __do_sys_vmsplice
+ 17.88% iov_iter_get_pages
+ 16.57% __mutex_lock.constprop.0
3.89% add_to_pipe
1.17% iov_iter_advance
0.82% mutex_unlock
0.75% pipe_lock
2.01% __entry_text_start
1.45% syscall_return_via_sysret
The lion’s share of the time is taken by locking the pipe for writing (__mutex_lock.constprop.0
), and by moving the pages into the pipe (iov_iter_get_pages
). There isn't much we can do about the locking, but we can improve the performance of iov_iter_get_pages
.
As the name suggests, iov_iter_get_pages
turns the struct iovec
s we feed into vmsplice
into struct page
s to put into the pipe. To understand what this function actually does, and how to speed it up, we must first take a detour into how the CPU and Linux organize pages.
A whirlwind tour of paging #
As you might be aware, processes don't refer to locations in RAM directly: instead, they are assigned virtual memory addresses, which get resolved to physical addresses. This abstraction is called virtual memory, and has all sorts of advantages we won't cover here — the most obvious being that it greatly simplifies running multiple processes competing for the same physical memory.
In any case, whenever we execute a program and load/store from/to memory, the CPU needs to convert our virtual addresses to physical addresses. Storing a mapping from every virtual address to every corresponding physical address would be impractical. Therefore memory is split up into uniformly sized chunks, called pages, and virtual pages are mapped to physical pages:
There's nothing special about 4KiB: each architecture picks a size, based on various tradeoffs — some of which we'll soon explore.
To make this a bit more precise, let's imagine allocating 10000 bytes using malloc
:
As we use them, our 10k bytes will look contiguous in virtual memory, but will be mapped to 3 not necessarily contiguous physical pages:
One of the tasks of the kernel is to manage this mapping, which is embodied in a data structure called the page table. The CPU specifies how the page table looks (since it needs to understand it), and the kernel manipulates it as needed. On x86-64 the page table is a 4-level, 512-way tree, which itself lives in memory. Each node of this tree is (you guessed it!) 4KiB wide, with each entry within the node leading to the next level being 8 bytes (4KiB/8bytes = 512). The entries contain the address of the next node, along with other metadata.
We have one page table per process — or in other words, each process has a reserved virtual address space. When the kernel context-switches to a process, it sets the special register CR3 to the physical address of the root of this tree. Then whenever a virtual address needs to be converted to a physical address, the CPU splits up the address in sections, and uses them to walk this tree and compute the physical address.
To make these concepts less abstract, here’s a visual depiction of how the virtual address 0x0000f2705af953c0
might be resolved to a physical address:
The search starts from the first level, called the "page global directory", or PGD, the physical location of which is stored in CR3. The first 16 bits of the address are unused. We use the next 9 bits to select the PGD entry, and traverse down to the second level, the "page upper directory", or PUD. The next 9 bits are used to select an entry from the PUD. The process repeats for the next two levels, the PMD ("page middle directory") and the PTE ("page table entry"). The PTE tells us where the actual physical page we're looking for is, and then we use the last 12 bits to find the offset inside the page.
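To make the bit twiddling concrete, here's a small userspace snippet (an illustration only, not kernel code) that splits a virtual address into the four 9-bit indices and the 12-bit page offset used in the walk described above:
#include <stdint.h>
#include <stdio.h>

int main() {
  uint64_t addr = 0x0000f2705af953c0ull;
  // x86-64 with 4KiB pages: four 9-bit indices, then a 12-bit offset.
  uint64_t pgd_ix = (addr >> 39) & 0x1ff; // bits 39-47
  uint64_t pud_ix = (addr >> 30) & 0x1ff; // bits 30-38
  uint64_t pmd_ix = (addr >> 21) & 0x1ff; // bits 21-29
  uint64_t pte_ix = (addr >> 12) & 0x1ff; // bits 12-20
  uint64_t offset = addr & 0xfff;         // bits 0-11
  // Prints: PGD 484, PUD 449, PMD 215, PTE 405, offset 960
  printf("PGD %llu, PUD %llu, PMD %llu, PTE %llu, offset %llu\n",
         (unsigned long long)pgd_ix, (unsigned long long)pud_ix,
         (unsigned long long)pmd_ix, (unsigned long long)pte_ix,
         (unsigned long long)offset);
  return 0;
}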
The sparse structure of the page table allows the mapping to be gradually built up as new pages are needed. Whenever a process needs memory, the page table will be updated with a new entry by the kernel.
The role of struct page
#
The struct page
data structure is a key piece of this machinery: it's what the kernel uses to refer to a single physical page, storing its physical address and all sorts of other metadata about it. For instance, we can get a struct page
from the information contained in the PTE (the last level of the page table described above). In general, it's used pervasively in all code dealing with pages.
In the case of pipes, struct page
is used to hold their data in the ring buffer, as we've already seen:
However, vmsplice
accepts virtual memory as input, while struct page
refers to physical memory directly.
Therefore we need to turn arbitrary chunks of virtual memory into a bunch of struct page
s. This is exactly what iov_iter_get_pages
does, and where we're spending half of our time:
ssize_t iov_iter_get_pages(
struct iov_iter *i, // input: a sized buffer in virtual memory
struct page **pages, // output: the list of pages which back the input buffers
size_t maxsize, // maximum number of bytes to get
unsigned maxpages, // maximum number of pages to get
size_t *start // offset into first page, if the input buffer wasn't page-aligned
);
struct iov_iter
is a Linux kernel data structure representing various ways of walking through chunks of memory, including struct iovec
. In our case, it will point to a 128KiB buffer. vmsplice
will use iov_iter_get_pages
to turn the input buffer into a bunch of struct page
s, and hold on to them. Now that you know how paging works, you might vaguely imagine how iov_iter_get_pages
works as well, but we’ll explain it in detail in the next section.
We’ve rapidly gone through a lot of new concepts, so to recap:
- Modern CPUs use virtual memory for their processes;
- Memory is organized in regularly-sized pages;
- The CPU translates virtual addresses into physical addresses using a page table mapping virtual pages to physical pages;
- The kernel adds and removes entries to the page table as necessary;
- Pipes are made out of references to physical pages, so
vmsplice
must convert virtual memory ranges into physical pages, and hold on to them.
The cost of getting pages #
The time spent in iov_iter_get_pages
is actually entirely spent in another function, get_user_pages_fast
:
% perf report -g --symbol-filter=iov_iter_get_pages
- 17.08% 0.17% write [kernel.kallsyms] [k] iov_iter_get_pages
- 16.91% iov_iter_get_pages
- 16.88% internal_get_user_pages_fast
11.22% try_grab_compound_head
get_user_pages_fast
is a more bare-bones version of iov_iter_get_pages
:
int get_user_pages_fast(
// virtual address, page aligned
unsigned long start,
// number of pages to retrieve
int nr_pages,
// flags, the meaning of which we won't get into
unsigned int gup_flags,
// output physical pages
struct page **pages
)
Here “user” (as opposed to “kernel”) refers to the fact that we’re turning virtual pages into references to physical pages.
To get our struct page
s, get_user_pages_fast
does exactly what the CPU would do, but in software: it walks the page table to collect all the physical pages, storing the results in struct page
s. In our case, we have a 128KiB buffer, and 4KiB pages, so we’ll have nr_pages = 32
. get_user_pages_fast
will need to walk the page table tree collecting 32 leaves, and storing the result in 32 struct page
s.
get_user_pages_fast
also needs to make sure that the physical page is not repurposed while the caller still needs it. This is achieved in the kernel using a reference count stored in struct page
, which is used to know when a physical page can be released and repurposed in the future. The caller of get_user_pages_fast
must, at some point, release the pages again with put_page
, which will decrease the reference count.
Lastly, get_user_pages_fast
behaves differently depending on whether the virtual addresses are already in the page table. This is where the _fast
suffix comes from: the kernel will first try to get an already present page table entry and corresponding struct page
by simply walking the page table, which is relatively cheap, and fall back to generating a struct page
by other, more expensive means otherwise. The fact that we memset
the memory at the beginning ensures that we never end up in the "slow" path of get_user_pages_fast
, since the page table entries will be created as our buffer is filled with 'X'
s.
Note that the get_user_pages
family of functions is not only useful for pipes — in fact, it's central to many drivers. A typical use is related to the kernel bypass we mentioned earlier: a driver for a network card might use it to turn some user memory region into physical pages, then communicate the physical page locations to the network card, and have the network card interact directly with that memory region without kernel involvement.
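As a rough illustration of the pattern (kernel-side pseudo-code, not taken from any real driver; user_buf is assumed to be a page-aligned pointer into user memory):
struct page *pages[32];

// Pin the 32 4KiB pages backing a 128KiB, page-aligned user buffer.
int got = get_user_pages_fast((unsigned long) user_buf, 32, FOLL_WRITE, pages);

// ... hand the physical pages to whoever needs them (a pipe, a device, ...) ...

// Once done, drop our references so the pages can be released and reused.
for (int i = 0; i < got; i++)
  put_page(pages[i]);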
Huge pages #
So far we've presented pages as always being of the same size — 4KiB on x86-64. However, many CPU architectures, including x86-64, include larger page sizes. In the case of x86-64, we not only have 4KiB pages (the "standard" size), but also 2MiB and even 1GiB pages ("huge" pages). In the rest of the post we'll only deal with 2MiB huge pages, since 1GiB pages are fairly uncommon, and overkill for our task anyway.
| Architecture | Standard page size | Larger page sizes |
| --- | --- | --- |
| x86 | 4KiB | 2MiB, 4MiB |
| x86-64 | 4KiB | 2MiB, 1GiB |
| ARMv7 | 4KiB | 64KiB, 1MiB, 16MiB |
| ARMv8 | 4KiB | 16KiB, 64KiB |
| RISCV32 | 4KiB | 4MiB |
| RISCV64 | 4KiB | 2MiB, 1GiB, 512GiB, 256TiB |
| Power ISA | 8KiB | 64KiB, 16MiB, 16GiB |

Page sizes available on architectures commonly used today, from Wikipedia.
The main advantage of huge pages is that their bookkeeping is cheaper, since fewer of them are needed to cover the same amount of memory. Moreover, other operations are cheaper too, such as resolving a virtual address to a physical address, since one fewer level of page table is needed: instead of a 12-bit offset into the page, we'll have a 21-bit offset, and one fewer page table level.
This relieves pressure on the parts of the CPU that handle this conversion, leading to performance improvements in many cases. However, in our case, the pressure is not on the hardware that walks the page table, but on its software counterpart running in the kernel.
On Linux, we can allocate a 2MiB huge page in a number of ways, such as by allocating memory aligned to 2MiB and then using madvise
to tell the kernel to use huge pages for the provided buffer:
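// A sketch of this approach; the exact allocation code in write.cpp may differ.
#include <stdlib.h>   // posix_memalign
#include <sys/mman.h> // madvise, MADV_HUGEPAGE

size_t buf_size = 1 << 18; // 256KiB
void* buf = NULL;
// Allocate the buffer aligned to the 2MiB huge page boundary...
posix_memalign(&buf, 1 << 21, buf_size);
// ...and advise the kernel to back this region with (transparent) huge pages.
madvise(buf, buf_size, MADV_HUGEPAGE);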
Switching to huge pages in our program yields another ~50% improvement:
% ./write --write_with_vmsplice --huge_page | ./read --read_with_splice
51.0GiB/s, 256KiB buffer, 40960 iterations (10GiB piped)
However, the reason for the improvements is not totally obvious. Naively, we might think that by using huge pages struct page
will just refer to a 2MiB page, rather than 4KiB.
Sadly this is not the case: the kernel code assumes everywhere that a struct page
refers to a page of the “standard” size for the current architecture. The way this works for huge pages (and in general for what Linux calls “compound pages”) is that a “head” struct page
contains the actual information about the backing physical page, with successive “tail” pages just containing a pointer to the head page.
So to represent 2MiB huge page we’ll have 1 “head” struct page
, and up to 511 “tail” struct page
s. Or in the case of our 128KiB buffer, 31 tail struct page
s:
Even if we need all these struct page
s, the code generating it ends up significantly faster. Instead of traversing the page table multiple times, once the first entry is found, the following struct page
s can be generated in a simple loop. Hence the performance improvement!
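One can picture that fast path as something like the hypothetical loop below (simplified pseudo-C with made-up helper names, not the actual kernel code):
// The struct pages describing a compound (huge) page are laid out contiguously,
// so once the first one has been located through the page table, the rest
// follow by plain pointer arithmetic rather than further page table walks.
struct page *first = walk_page_table_once(addr); // hypothetical helper
for (int i = 0; i < nr_pages; i++)
  pages[i] = first + i;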
Busy looping #
We're almost done, I promise! Let's look at the perf
output once again:
- 46.91% 0.38% write libc-2.33.so [.] vmsplice
- 46.84% vmsplice
- 43.15% entry_SYSCALL_64_after_hwframe
- do_syscall_64
- 41.80% __do_sys_vmsplice
+ 14.90% wait_for_space
+ 8.27% __wake_up_common_lock
4.40% add_to_pipe
+ 4.24% iov_iter_get_pages
+ 3.92% __mutex_lock.constprop.0
1.81% iov_iter_advance
+ 0.55% import_iovec
+ 0.76% syscall_exit_to_user_mode
1.54% syscall_return_via_sysret
1.49% __entry_text_start
We're now spending a significant amount of time waiting for the pipe to be writable (wait_for_space
), and waking up readers that were waiting for the pipe to have content (__wake_up_common_lock
).
To sidestep these synchronization costs, we can ask vmsplice
to return immediately if the pipe can't be written to, and busy loop until it can — and do the same when reading with splice
:
...
// SPLICE_F_NONBLOCK will cause `vmsplice` to return immediately
// if we can't write to the pipe, returning EAGAIN
ssize_t ret = vmsplice(STDOUT_FILENO, &bufvec, 1, SPLICE_F_NONBLOCK);
if (ret < 0 && errno == EAGAIN) {
continue; // busy loop if not ready to write
}
...
By busy looping we get another 25% performance increase:
% ./write --write_with_vmsplice --huge_page --busy_loop | ./read --read_with_splice --busy_loop
62.5GiB/s, 256KiB buffer, 40960 iterations (10GiB piped)
Obviously busy looping comes at the cost of fully occupying a CPU core waiting for vmsplice
to be ready. But often this compromise is worth it, and in fact it is a common pattern for high-performance server applications: we trade off possibly wasteful CPU utilization for better latency and/or throughput.
In our case, this concludes our optimization journey for our little synthetic benchmark, from 3.5GiB/s to 65GiB/s.
Closing thoughts #
We've systematically improved the performance of our program by looking at the perf
output and at the Linux sources. Pipes and splicing in particular aren't really hot topics when it comes to high-performance programming, but the themes we've touched upon are: zero-copy operations, ring buffers, paging & virtual memory, synchronization overhead.
There are some details and interesting topics I left out, but this blog post was already spiraling out of control and becoming too long:
- In the actual code, the buffers are allocated separately, to reduce page table contention by placing them in different page table entries (something that the FizzBuzz program also does). Remember that when a page table entry is taken with get_user_pages, its refcount is increased, and decreased on put_page. If we use two page table entries for the two buffers, rather than one page table entry for both of them, we have less contention when modifying the refcount.
- The tests are run by pinning the ./write and ./read processes to two cores with taskset.
- The code in the repo contains many other options I played with, but didn't end up talking about since they were irrelevant or not interesting enough.
- The repo also contains a synthetic benchmark for get_user_pages_fast, which can be used to measure exactly how much slower it runs with or without huge pages.
- Splicing in general is a slightly dubious/dangerous concept, which continues to annoy kernel developers.
Please let me know if this post was useful, interesting, or unclear!
Acknowledgements #
Many thanks to Alexandru Scvorţov, Max Staudt, Alex Appetiti, Alex Sayers, Stephen Lavelle, Peter Cawley, and Niklas Hambüchen for reviewing drafts of this post. Max Staudt also helped me understand some subtleties of get_user_pages
.