Why you should use io_uring for network I/O
io_uring is an async interface to the Linux kernel that can potentially benefit networking. It has been a big win for file I/O (input/output), but might offer only modest gains for network I/O, which already has non-blocking APIs. The gains are likely to come from the following:
- A reduced number of syscalls on servers that do a lot of context switching
- A unified asynchronous API for both file and network I/O
Many io_uring features are available in Red Hat Enterprise Linux 9, which is distributed with kernel version 5.14. The latest io_uring features are available in Fedora 37.
What is io_uring?
io_uring is an asynchronous I/O interface for the Linux kernel. An io_uring is a pair of ring buffers in shared memory that are used as queues between user space and the kernel:
- Submission queue (SQ): A user space process uses the submission queue to send asynchronous I/O requests to the kernel.
- Completion queue (CQ): The kernel uses the completion queue to send the results of asynchronous I/O operations back to user space.
The diagram in Figure 1 shows how io_uring provides an asynchronous interface between user space and the Linux kernel.
This interface enables applications to move away from the traditional readiness-based model of I/O to a new completion-based model where async file and network I/O share a unified API.
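To make the queue mechanics concrete, here is a minimal single-threaded model of one such ring in C. It is only a sketch under stated assumptions: the names toy_ring, toy_ring_push, and toy_ring_pop are invented for illustration, the slots hold plain integers rather than real SQEs or CQEs, and the memory barriers that a ring shared with the kernel needs are left out. It shows the general idea of a power-of-two ring indexed by free-running head and tail counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u                  /* hypothetical size, power of two */
#define RING_MASK    (RING_ENTRIES - 1u)

static_assert((RING_ENTRIES & RING_MASK) == 0, "size must be a power of two");

/* Toy model of one queue; this is NOT the kernel's memory layout. */
struct toy_ring {
    uint32_t head;                       /* advanced only by the consumer */
    uint32_t tail;                       /* advanced only by the producer */
    uint64_t slots[RING_ENTRIES];
};

/* Producer side: returns false when the ring is full. */
static bool toy_ring_push(struct toy_ring *r, uint64_t value)
{
    if (r->tail - r->head == RING_ENTRIES)
        return false;
    r->slots[r->tail & RING_MASK] = value;
    r->tail++;                           /* publish the new entry */
    return true;
}

/* Consumer side: returns false when the ring is empty. */
static bool toy_ring_pop(struct toy_ring *r, uint64_t *value)
{
    if (r->head == r->tail)
        return false;
    *value = r->slots[r->head & RING_MASK];
    r->head++;                           /* release the slot for reuse */
    return true;
}
```

Because the producer only writes tail and the consumer only writes head, the two sides never update the same counter, which is what allows a queue of this shape to be shared without a lock.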
The syscall API
The Linux kernel API for io_uring has 3 syscalls:
- io_uring_setup: Set up a context for performing asynchronous I/O
- io_uring_register: Register files or user buffers for asynchronous I/O
- io_uring_enter: Initiate and/or complete asynchronous I/O
The first two syscalls are used to set up an io_uring instance and optionally to pre-register buffers that will be referenced by io_uring operations. Only io_uring_enter needs to be called for queue submission and consumption. The cost of an io_uring_enter call can be amortized over several I/O operations. For very busy servers, you can avoid io_uring_enter calls entirely by enabling busy-polling of the submission queue in the kernel. This comes at the cost of a kernel thread consuming CPU.
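As a rough sketch of how submission queue polling is requested, the helper below fills in the io_uring_params structure from <linux/io_uring.h> with the IORING_SETUP_SQPOLL flag and an idle timeout. The helper name sqpoll_params is an assumption made for illustration. The resulting structure would then be passed to io_uring_setup, or to liburing's io_uring_queue_init_params.

```c
#include <assert.h>
#include <string.h>
#include <linux/io_uring.h>

/* Build setup parameters that ask the kernel to poll the submission queue.
 * sqpoll_params is a hypothetical helper name, not part of any library. */
static struct io_uring_params sqpoll_params(unsigned idle_ms)
{
    struct io_uring_params p;

    memset(&p, 0, sizeof(p));        /* fields that are not used stay zero */
    p.flags = IORING_SETUP_SQPOLL;   /* kernel thread polls the SQ */
    p.sq_thread_idle = idle_ms;      /* idle time in ms before the thread sleeps */
    return p;
}
```

A caller might write struct io_uring_params p = sqpoll_params(2000); and pass &p to the setup call. The timeout bounds the CPU cost mentioned above: once the queue has been idle for that long, the polling thread goes to sleep and has to be woken by a later submission.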
The liburing API
The liburing library provides a convenient way to use io_uring, hiding some of the complexity and providing functions to prepare all types of I/O operations for submission.
A user process creates an io_uring:
struct io_uring ring;
io_uring_queue_init(QUEUE_DEPTH, &ring, 0);
then submits operations to the io_uring submission queue:
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_readv(sqe, client_socket, iov, 1, 0);
io_uring_sqe_set_data(sqe, user_data);
io_uring_submit(&ring);
The process waits for completion:
struct io_uring_cqe *cqe;
int ret = io_uring_wait_cqe(&ring, &cqe);
and uses the response:
user_data = io_uring_cqe_get_data(cqe);
if (cqe->res < 0) {
    // handle error
} else {
    // handle response
}
io_uring_cqe_seen(&ring, cqe);
The liburing API is the preferred way to use io_uring from applications. liburing has feature parity with the latest kernel io_uring development work and is backward-compatible with older kernels that lack the latest io_uring features.
Using io_uring for network I/O
We will try out io_uring for network I/O by writing a simple echo server using the liburing API. Then we will see how to minimize the number of syscalls required for a high-rate concurrent workload.
A simple echo server
The classic echo server that appeared in Berkeley Software Distribution (BSD) Unix looks something like this:
client_fd = accept(listen_fd, &client_addr, &client_addrlen);
for (;;) {
    numRead = read(client_fd, buf, BUF_SIZE);
    if (numRead <= 0)   // exit loop on EOF or error
        break;
    if (write(client_fd, buf, numRead) != numRead) {
        // handle write error
    }
}
close(client_fd);
The server could be multithreaded or use non-blocking I/O to support concurrent requests. Whatever form it takes, the server requires at least 5 syscalls per client session: accept, read, write, a second read to detect EOF, and then close.
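For reference, the per-client part of that blocking loop can be written as a small self-contained function. This is a sketch rather than the original BSD code: the names echo_until_eof and echo_demo are invented here, a socketpair stands in for an accepted client connection, and unlike the snippet above it retries short writes instead of treating them as errors.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define BUF_SIZE 4096

/* Echo everything read from fd back to fd until EOF.
 * Returns the number of bytes echoed, or -1 on error. */
static ssize_t echo_until_eof(int fd)
{
    char buf[BUF_SIZE];
    ssize_t total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n == 0)
            return total;                /* peer closed its sending side */
        if (n < 0) {
            if (errno == EINTR)
                continue;                /* interrupted, try again */
            return -1;
        }
        /* write() may accept fewer bytes than requested, so loop. */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
}

/* Usage sketch: sv[0] plays the client, sv[1] the accepted connection. */
static int echo_demo(void)
{
    int sv[2];
    char back[8] = {0};

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    if (write(sv[0], "ping", 4) != 4 || shutdown(sv[0], SHUT_WR) != 0)
        return -1;
    if (echo_until_eof(sv[1]) != 4)      /* read, write, read returning EOF */
        return -1;
    if (read(sv[0], back, sizeof(back)) != 4 || memcmp(back, "ping", 4) != 0)
        return -1;
    close(sv[0]);
    close(sv[1]);
    return 0;
}
```

In the demo the server side issues exactly the read, write, read sequence counted above; accept and close are the remaining two syscalls of a real session.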
A naive translation of this to io_uring results in an asynchronous server that submits one operation at a time and waits for completion before submitting the next. The pseudocode for a simple io_uring-based server, omitting the boilerplate and error handling, looks like this:
add_accept_request(listen_socket, &client_addr, &client_addr_len);
io_uring_submit(&ring);
while (1) {
    int ret = io_uring_wait_cqe(&ring, &cqe);
    struct request *req = (struct request *) io_uring_cqe_get_data(cqe);
    switch (req->type) {
    case ACCEPT:
        // re-arm the accept and start reading from the new client
        add_accept_request(listen_socket,
                           &client_addr, &client_addr_len);
        add_read_request(cqe->res);
        io_uring_submit(&ring);
        break;
    case READ:
        if (cqe->res <= 0) {
            add_close_request(req);   // EOF or error
        } else {
            add_write_request(req);   // echo the bytes back
        }
        io_uring_submit(&ring);
        break;
    case WRITE:
        add_read_request(req->socket);
        io_uring_submit(&ring);
        break;
    case CLOSE:
        free_request(req);
        break;
    default:
        fprintf(stderr, "Unexpected req type %d\n", req->type);
        break;
    }
    io_uring_cqe_seen(&ring, cqe);
}
In this io_uring example, the server still requires at least 4 syscalls to process each new client. The only saving achieved here is by submitting a read and a new accept request together. This can be seen in the following strace output for the echo server receiving 1,000 client requests.
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.99    0.445109         111      4001           io_uring_enter
  0.01    0.000063          63         1           brk
------ ----------- ----------- --------- --------- ----------------
100.00    0.445172         111      4002           total
Combining submissions
In an echo server, there are limited opportunities for chaining I/O operations since we need to complete a read before we know how many bytes we can write. We could chain accept and read by using a new fixed file feature of io_uring, but we are already able to submit a read request and a new accept request together, so there is probably not much to be gained there.
We can submit independent operations at the same time, so we can combine the submission of a write and the subsequent read. This reduces the syscall count to 3 per client request:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.93    0.438697         146      3001           io_uring_enter
  0.07    0.000325         325         1           brk
------ ----------- ----------- --------- --------- ----------------
100.00    0.439022         146      3002           total
Draining the completion queue
It is possible to combine even more work into the same submission if we handle all queued completions before calling io_uring_submit. We can do this by using a combination of io_uring_wait_cqe to wait for work, followed by calls to io_uring_peek_cqe to check whether the completion queue has more entries that can be processed. This avoids spinning in a busy loop when the completion queue is empty while also draining the completion queue as fast as possible.
The pseudocode for the main loop now looks like this:
while (1) {
    int submissions = 0;
    int ret = io_uring_wait_cqe(&ring, &cqe);   // block until there is work
    while (1) {
        struct request *req = (struct request *) io_uring_cqe_get_data(cqe);
        switch (req->type) {
        case ACCEPT:
            add_accept_request(listen_socket,
                               &client_addr, &client_addr_len);
            add_read_request(cqe->res);
            submissions += 2;
            break;
        case READ:
            if (cqe->res <= 0) {
                add_close_request(req);
                submissions += 1;
            } else {
                // queue the echo and the next read together
                add_write_request(req);
                add_read_request(req->socket);
                submissions += 2;
            }
            break;
        case WRITE:
            break;
        case CLOSE:
            free_request(req);
            break;
        default:
            fprintf(stderr, "Unexpected req type %d\n", req->type);
            break;
        }
        io_uring_cqe_seen(&ring, cqe);
        if (io_uring_sq_space_left(&ring) < MAX_SQE_PER_LOOP) {
            break;      // the submission queue is full
        }
        ret = io_uring_peek_cqe(&ring, &cqe);
        if (ret == -EAGAIN) {
            break;      // no remaining work in completion queue
        }
    }
    if (submissions > 0) {
        io_uring_submit(&ring);   // one syscall for everything queued above
    }
}
Batching submissions for all available work gives a significant improvement over the previous result, as shown in the following strace output, again for 1,000 client requests:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.91    0.324226        4104        79           io_uring_enter
  0.09    0.000286         286         1           brk
------ ----------- ----------- --------- --------- ----------------
100.00    0.324512        4056        80           total
The improvement here is substantial, with more than 12 client requests being handled per syscall, or an average of more than 60 I/O ops per syscall. This ratio improves as the server gets busier, which can be demonstrated by enabling logging in the server:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 68.86    0.225228          42      5308       286 write
 31.13    0.101831        4427        23           io_uring_enter
  0.00    0.000009           9         1           brk
------ ----------- ----------- --------- --------- ----------------
100.00    0.327068          61      5332       286 total
This shows that when the server has more work to do, more io_uring operations have time to complete, so more new work can be submitted in a single syscall. The echo server is responding to 1,000 client echo requests, or completing 5,000 socket I/O operations, with just 23 syscalls.
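The ratios quoted for these runs follow directly from the io_uring_enter counts in the strace tables. A small helper makes the arithmetic explicit; per_syscall is a hypothetical name, and the figure of 5 I/O operations per client (accept, read, write, read, close) is taken from the session described earlier.

```c
#include <assert.h>

/* Average number of operations completed per syscall.
 * per_syscall is a hypothetical helper used only for this arithmetic. */
static double per_syscall(unsigned ops, unsigned syscalls)
{
    assert(syscalls > 0);
    return (double)ops / (double)syscalls;
}
```

With 79 io_uring_enter calls, per_syscall(1000, 79) is about 12.7 client requests and per_syscall(5000, 79) is about 63.3 I/O operations per syscall. With logging enabled and 23 calls, the same figures rise to about 43.5 requests and 217.4 I/O operations per syscall.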
It is worth noting that as the amount of work submitted increases, the time spent in the io_uring_enter syscall increases, too. There will come a point where it might be necessary to limit the size of submission batches or to enable submission queue polling in the kernel.
Benefits of io_uring for network I/O
The main benefit of io_uring for network I/O is a modern asynchronous API that is straightforward to use and provides unified semantics for file and network I/O.
A potential performance benefit of io_uring for network I/O is reducing the number of syscalls. This could provide the biggest benefit for high volumes of small operations where the syscall overhead and number of context switches can be significantly reduced.
It is also possible to avoid cumulatively expensive operations on busy servers by pre-registering resources with the kernel before sending io_uring requests. File slots and buffers can be registered to avoid the lookup and refcount costs for each I/O operation.
Registered file slots, called fixed files, also make it possible to chain an accept with a read or write, without any round-trip to user space. A submission queue entry (SQE) would specify a fixed file slot to store the return value of accept, which a linked SQE would then reference in an I/O operation.
Limitations
In theory, operations can be chained together using the IOSQE_IO_LINK flag. However, for reads and writes, there is no mechanism to coerce the return value from a read operation into the parameter set for the following write operation. This limits the scope of linked operations to semantic sequencing such as “write then read” or “write then close” and for accept followed by read or write.
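A “write then read” chain can be sketched by filling in two raw SQEs from <linux/io_uring.h>. This is an illustration under assumptions, not liburing's implementation: the helper name prep_writev_then_readv is invented, and the SQEs live in a caller-supplied array, whereas a real program would obtain them from the submission ring (for example with io_uring_get_sqe). Note that both buffer descriptions have to be supplied when the SQEs are prepared, which is exactly why a read result cannot feed the following write.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>
#include <linux/io_uring.h>

/* Prepare two SQEs so that the readv starts only after the writev completes.
 * prep_writev_then_readv is a hypothetical helper for illustration. */
static void prep_writev_then_readv(struct io_uring_sqe sqe[2], int sockfd,
                                   const struct iovec *out_iov,
                                   const struct iovec *in_iov)
{
    assert(out_iov != NULL && in_iov != NULL);
    memset(sqe, 0, 2 * sizeof(sqe[0]));

    sqe[0].opcode = IORING_OP_WRITEV;
    sqe[0].fd = sockfd;
    sqe[0].addr = (uint64_t)(uintptr_t)out_iov;  /* pointer to the iovec array */
    sqe[0].len = 1;                              /* number of iovecs */
    sqe[0].flags = IOSQE_IO_LINK;                /* link to the next SQE */

    sqe[1].opcode = IORING_OP_READV;
    sqe[1].fd = sockfd;
    sqe[1].addr = (uint64_t)(uintptr_t)in_iov;
    sqe[1].len = 1;
    /* No IOSQE_IO_LINK on the last entry, so the chain ends here. */
}
```

The link flag goes on every SQE in the chain except the last one; the first SQE without the flag terminates the chain.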
Another consideration is that io_uring is a relatively new Linux kernel feature that is still under active development. There is room for performance improvement, and some io_uring features might still benefit from optimization work.
io_uring is currently a Linux-specific API, so integrating it into cross-platform libraries like libuv could present some challenges.
Latest features
The most recent features to arrive in io_uring are multi-shot accept, which is available from kernel 5.19, and multi-shot receive, which arrived in 6.0. Multi-shot accept allows an application to issue a single accept SQE, which will repeatedly post a CQE whenever the kernel receives a new connection request. Multi-shot receive will likewise post a CQE whenever newly received data is available. These features are available in Fedora 37 but are not yet available in RHEL 9.
Conclusion
The io_uring API is a fully functional asynchronous I/O interface that provides unified semantics for both file and network I/O. It has the potential to provide modest performance benefits to network I/O on its own and greater benefit for mixed file and network I/O application workloads.
Popular asynchronous I/O libraries such as libuv are multi-platform, which makes it more challenging to adopt Linux-specific APIs. When adding io_uring to a library, both file I/O and network I/O should be added to gain the most from io_uring’s async completion model.
Network I/O-related feature development and optimization work in io_uring will be driven primarily by further adoption in networked applications. Now is the time to integrate io_uring into your applications and I/O libraries.
More information
Explore the following resources to learn more: