
Why you should use io_uring for network I/O

2023-04-12 16:35:48

io_uring is an async interface to the Linux kernel that can potentially benefit networking. It has been a big win for file I/O (input/output), but might offer only modest gains for network I/O, which already has non-blocking APIs. The gains are likely to come from the following:

  • A reduced number of syscalls on servers that do a lot of context switching
  • A unified asynchronous API for both file and network I/O

Many io_uring features are available in Red Hat Enterprise Linux 9, which is distributed with kernel version 5.14. The latest io_uring features are available in Fedora 37.

What’s io_uring?

io_uring is an asynchronous I/O interface for the Linux kernel. An io_uring is a pair of ring buffers in shared memory that are used as queues between user space and the kernel:

  • Submission queue (SQ): A user space process uses the submission queue to send asynchronous I/O requests to the kernel.
  • Completion queue (CQ): The kernel uses the completion queue to send the results of asynchronous I/O operations back to user space.

The diagram in Figure 1 shows how io_uring provides an asynchronous interface between user space and the Linux kernel.

Two ring buffers called the submission queue and the completion queue. An application is adding an item to the tail of the submission queue and the kernel is consuming an item from the head of the submission queue. The completion queue shows the reverse for responses from kernel to application.


Author
Donald Hunter

Figure 1: A visual representation of the io_uring submission and completion queues.

This interface allows applications to move away from the traditional readiness-based model of I/O to a new completion-based model where async file and network I/O share a unified API.

The syscall API

The Linux kernel API for io_uring has 3 syscalls:

  • io_uring_setup: Set up a context for performing asynchronous I/O
  • io_uring_register: Register files or user buffers for asynchronous I/O
  • io_uring_enter: Initiate and/or complete asynchronous I/O

The first two syscalls are used to set up an io_uring instance and optionally to pre-register files or buffers that will be referenced by io_uring operations. Only io_uring_enter needs to be called for queue submission and consumption. The cost of an io_uring_enter call can be amortized over multiple I/O operations. For very busy servers, you can avoid io_uring_enter calls entirely by enabling busy-polling of the submission queue in the kernel. This comes at the cost of a kernel thread consuming CPU.
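As a minimal illustration of the first syscall (written for demonstration only, since glibc provides no wrappers for these calls), an application can create a ring by invoking io_uring_setup directly; liburing, described below, hides all of this housekeeping:

#include <linux/io_uring.h>
#include <sys/syscall.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct io_uring_params params;
    memset(&params, 0, sizeof(params));

    // Ask the kernel for an io_uring instance with 64 submission queue entries.
    long ring_fd = syscall(__NR_io_uring_setup, 64, &params);
    if (ring_fd < 0) {
        perror("io_uring_setup");
        return 1;
    }

    // The SQ and CQ rings would now be mapped into user space with mmap(),
    // using the offsets the kernel returns in params.sq_off and params.cq_off.
    printf("sq entries: %u, cq entries: %u\n",
           params.sq_entries, params.cq_entries);
    close(ring_fd);
    return 0;
}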

The liburing API

The liburing library provides a convenient way to use io_uring, hiding some of the complexity and providing functions to prepare all types of I/O operations for submission.

A user process creates an io_uring:

struct io_uring ring;
io_uring_queue_init(QUEUE_DEPTH, &ring, 0);

then submits operations to the io_uring submission queue:

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_readv(sqe, client_socket, iov, 1, 0);
io_uring_sqe_set_data(sqe, user_data);
io_uring_submit(&ring);

The process waits for completion:

struct io_uring_cqe *cqe;
int ret = io_uring_wait_cqe(&ring, &cqe);

and uses the response:

user_data = io_uring_cqe_get_data(cqe);
if (cqe->res < 0) {
    // handle error
} else {
    // handle response
}
io_uring_cqe_seen(&ring, cqe);
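When the process has finished with the ring, it releases the resources (not shown in the snippets above):

io_uring_queue_exit(&ring);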

The liburing API is the preferred way to use io_uring from applications. liburing has feature parity with the latest kernel io_uring development work and is backward-compatible with older kernels that lack the latest io_uring features.

Using io_uring for network I/O

We'll take a look at io_uring for network I/O by writing a simple echo server using the liburing API. Then we will see how to minimize the number of syscalls required for a high-rate concurrent workload.

A simple echo server

The classic echo server that appeared in Berkeley Software Distribution (BSD) Unix looks something like this:

client_fd = accept(listen_fd, &client_addr, &client_addrlen);
for (;;) {
    numRead = read(client_fd, buf, BUF_SIZE);
    if (numRead <= 0)   // exit loop on EOF or error
        break;
    if (write(client_fd, buf, numRead) != numRead) {
        // handle write error
    }
}
close(client_fd);

The server could be multithreaded or use non-blocking I/O to support concurrent requests. Whatever form it takes, the server requires at least 5 syscalls per client session: accept, read, write, read to detect EOF, and then close.

A naive translation of this to io_uring results in an asynchronous server that submits one operation at a time and waits for completion before submitting the next. The pseudocode for a simple io_uring-based server, omitting the boilerplate and error handling, looks like this:

add_accept_request(listen_socket, &client_addr, &client_addr_len);
io_uring_submit(&ring);

while (1) {
    int ret = io_uring_wait_cqe(&ring, &cqe);

    struct request *req = (struct request *) cqe->user_data;
    switch (req->type) {
    case ACCEPT:
        add_accept_request(listen_socket,
                          &client_addr, &client_addr_len);
        add_read_request(cqe->res);
        io_uring_submit(&ring);
        break;
    case READ:
        if (cqe->res <= 0) {
            add_close_request(req);
        } else {
            add_write_request(req);
        }
        io_uring_submit(&ring);
        break;
    case WRITE:
        add_read_request(req->socket);
        io_uring_submit(&ring);
        break;
    case CLOSE:
        free_request(req);
        break;
    default:
        fprintf(stderr, "Sudden req sort %dn", req->sort);
        break;
    }

    io_uring_cqe_seen(&ring, cqe);
}
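The add_*_request helpers are not shown in the original pseudocode. A minimal sketch of what two of them might look like with liburing, assuming a global struct io_uring ring, the usual headers, and a simple request structure, is:

struct request {
    int type;                 // ACCEPT, READ, WRITE, or CLOSE
    int socket;
    char buf[BUF_SIZE];
};

void add_accept_request(int listen_socket, struct sockaddr_in *client_addr,
                        socklen_t *client_addr_len)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_accept(sqe, listen_socket,
                         (struct sockaddr *) client_addr, client_addr_len, 0);
    struct request *req = malloc(sizeof(*req));
    req->type = ACCEPT;
    io_uring_sqe_set_data(sqe, req);
}

void add_read_request(int client_socket)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    struct request *req = malloc(sizeof(*req));
    req->type = READ;
    req->socket = client_socket;
    io_uring_prep_recv(sqe, client_socket, req->buf, BUF_SIZE, 0);
    io_uring_sqe_set_data(sqe, req);
}

add_write_request and add_close_request would follow the same pattern using io_uring_prep_send and io_uring_prep_close.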

In this io_uring example, the server still requires at least 4 syscalls to process each new client. The only saving achieved here is by submitting a read and a new accept request together. This can be seen in the following strace output for the echo server receiving 1,000 client requests.

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.99    0.445109         111      4001           io_uring_enter
  0.01    0.000063          63         1           brk
------ ----------- ----------- --------- --------- ----------------
100.00    0.445172         111      4002           total

Combining submissions

In an echo server, there are limited opportunities for chaining I/O operations, since we need to complete a read before we know how many bytes we can write. We could chain accept and read by using the new fixed file feature of io_uring, but we are already able to submit a read request and a new accept request together, so there is perhaps not much to be gained there.

We can submit independent operations at the same time, so we can combine the submission of a write and the subsequent read. This reduces the syscall count to 3 per client request:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.93    0.438697         146      3001           io_uring_enter
  0.07    0.000325         325         1           brk
------ ----------- ----------- --------- --------- ----------------
100.00    0.439022         146      3002           total
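The only change relative to the naive loop is in the READ case, where the echo write and the next read are queued before a single submit; a sketch of that case (the full main loop in the next section includes this change):

    case READ:
        if (cqe->res <= 0) {
            add_close_request(req);
        } else {
            add_write_request(req);          // echo the data we just read
            add_read_request(req->socket);   // and queue the next read now
        }
        io_uring_submit(&ring);              // one io_uring_enter covers both SQEs
        break;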

Draining the completion queue

It is possible to combine even more work into the same submission if we handle all queued completions before calling io_uring_submit. We can do this by using a combination of io_uring_wait_cqe to wait for work, followed by calls to io_uring_peek_cqe to check whether the completion queue has more entries that can be processed. This avoids spinning in a busy loop when the completion queue is empty while also draining the completion queue as fast as possible.

The pseudocode for the main loop now looks like this:

while (1) {
    int submissions = 0;
    int ret = io_uring_wait_cqe(&ring, &cqe);
    while (1) {
        struct request *req = (struct request *) cqe->user_data;
        switch (req->type) {
        case ACCEPT:
            add_accept_request(listen_socket,
                              &client_addr, &client_addr_len);
            add_read_request(cqe->res);
            submissions += 2;
            break;
        case READ:
            if (cqe->res <= 0) {
                add_close_request(req);
                submissions += 1;
            } else {
                add_write_request(req);
                add_read_request(req->socket);
                submissions += 2;
            }
            break;
        case WRITE:
          break;
        case CLOSE:
            free_request(req);
            break;
        default:
            fprintf(stderr, "Unexpected req type %d\n", req->type);
            break;
        }

        io_uring_cqe_seen(&ring, cqe);

        if (io_uring_sq_space_left(&ring) < MAX_SQE_PER_LOOP) {
            break;     // the submission queue is full
        }

        ret = io_uring_peek_cqe(&ring, &cqe);
        if (ret == -EAGAIN) {
            break;     // no remaining work in completion queue
        }
    }
    if (submissions > 0) {
        io_uring_submit(&ring);
    }
}

The result of batching submissions for all available work gives a significant improvement over the previous result, as shown in the following strace output, again for 1,000 client requests:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.91    0.324226        4104        79           io_uring_enter
  0.09    0.000286         286         1           brk
------ ----------- ----------- --------- --------- ----------------
100.00    0.324512        4056        80           total

The improvement here is substantial, with more than 12 client requests being handled per syscall, or an average of more than 60 I/O ops per syscall. This ratio improves as the server gets busier, which can be demonstrated by enabling logging in the server:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 68.86    0.225228          42      5308       286 write
 31.13    0.101831        4427        23           io_uring_enter
  0.00    0.000009           9         1           brk
------ ----------- ----------- --------- --------- ----------------
100.00    0.327068          61      5332       286 total

This shows that when the server has more work to do, more io_uring operations have time to complete, so more new work can be submitted in a single syscall. The echo server is responding to 1,000 client echo requests, completing 5,000 socket I/O operations with just 23 syscalls.

It's worth noting that as the amount of work submitted increases, the time spent in the io_uring_enter syscall increases, too. There will come a point where it may be necessary to limit the size of submission batches or to enable submission queue polling in the kernel.
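Submission queue polling is requested at ring setup time. A minimal sketch with liburing (the idle timeout value is an arbitrary assumption, and older kernels may require elevated privileges for SQPOLL):

struct io_uring_params params;
memset(&params, 0, sizeof(params));
params.flags = IORING_SETUP_SQPOLL;   // a kernel thread busy-polls the SQ
params.sq_thread_idle = 2000;         // ms of idle time before the poller sleeps

struct io_uring ring;
io_uring_queue_init_params(QUEUE_DEPTH, &ring, &params);

With SQPOLL enabled, io_uring_submit only falls back to an io_uring_enter call when the kernel poller thread has gone to sleep.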

Benefits for network I/O

The main benefit of io_uring for network I/O is a modern asynchronous API that is simple to use and provides unified semantics for file and network I/O.

A potential performance benefit of io_uring for network I/O is reducing the number of syscalls. This could provide the biggest benefit for high volumes of small operations, where the syscall overhead and number of context switches can be significantly reduced.

It is also possible to avoid cumulatively expensive operations on busy servers by pre-registering resources with the kernel before sending io_uring requests. File slots and buffers can be registered to avoid the lookup and refcount costs for each I/O operation.
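For example, a buffer can be registered once up front and then referenced by index in later requests. A minimal sketch with liburing (the buffer and socket names are assumptions):

struct iovec iov = { .iov_base = buf, .iov_len = BUF_SIZE };
io_uring_register_buffers(&ring, &iov, 1);

// Later requests reference the pre-registered buffer by index (the last
// argument), so the kernel does not have to pin and look it up each time.
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read_fixed(sqe, client_socket, buf, BUF_SIZE, 0, 0);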

Registered file slots, known as fixed files, also make it possible to chain an accept with a read or write, without any round-trip to user space. A submission queue entry (SQE) would specify a fixed file slot to store the return value of accept, which a linked SQE would then reference in an I/O operation.
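A sketch of that chaining, under the assumption that the kernel and liburing in use are recent enough to support direct (fixed-file) accept:

// Reserve a sparse table of fixed file slots in the kernel.
io_uring_register_files_sparse(&ring, 16);

// Accept directly into fixed file slot 0 and link the next SQE to it.
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_accept_direct(sqe, listen_socket, NULL, NULL, 0, 0);
io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);

// The linked read uses slot 0 as its "fd", so no round-trip to user space
// is needed to learn the accepted socket's descriptor.
sqe = io_uring_get_sqe(&ring);
io_uring_prep_recv(sqe, 0, buf, BUF_SIZE, 0);
io_uring_sqe_set_flags(sqe, IOSQE_FIXED_FILE);

io_uring_submit(&ring);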

Limitations

In theory, operations can be chained together using the IOSQE_IO_LINK flag. However, for reads and writes, there is no mechanism to coerce the return value from a read operation into the parameters of the following write operation. This limits the scope of linked operations to semantic sequencing such as "write then read" or "write then close", and to accept followed by read or write.
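For example, a "write then close" sequence can be linked so that the close only runs after the write completes. A minimal sketch:

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_send(sqe, client_socket, buf, len, 0);
io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);   // run the next SQE only after this one

sqe = io_uring_get_sqe(&ring);
io_uring_prep_close(sqe, client_socket);

io_uring_submit(&ring);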

Another consideration is that io_uring is a relatively new Linux kernel feature that is still under active development. There is room for performance improvement, and some io_uring features might still benefit from optimization work.

io_uring is currently a Linux-specific API, so integrating it into cross-platform libraries like libuv may present some challenges.

Latest features

The latest features to arrive in io_uring are multi-shot accept, which is available from 5.19, and multi-shot receive, which arrived in 6.0. Multi-shot accept allows an application to issue a single accept SQE, which will repeatedly post a CQE whenever the kernel receives a new connection request. Multi-shot receive will likewise post a CQE whenever newly received data is available. These features are available in Fedora 37 but are not yet available in RHEL 9.
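A sketch of multi-shot accept, assuming liburing 2.2 or later: a single SQE keeps producing completions until the kernel stops setting IORING_CQE_F_MORE, at which point it must be re-armed.

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_multishot_accept(sqe, listen_socket, NULL, NULL, 0);
io_uring_submit(&ring);

for (;;) {
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    int client_socket = cqe->res;   // each CQE carries a newly accepted connection
    // ... queue a read for client_socket ...
    if (!(cqe->flags & IORING_CQE_F_MORE)) {
        // The multi-shot accept has stopped posting completions; re-arm it here.
    }
    io_uring_cqe_seen(&ring, cqe);
}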

Conclusion

The io_uring API is a fully functional asynchronous I/O interface that provides unified semantics for both file and network I/O. It has the potential to offer modest performance benefits to network I/O on its own and greater benefits for mixed file and network I/O application workloads.

Modern asynchronous I/O libraries such as libuv are multi-platform, which makes it more challenging to adopt Linux-specific APIs. When adding io_uring to a library, both file I/O and network I/O should be added to gain the most from io_uring's async completion model.

Network I/O-related feature development and optimization work in io_uring will be driven primarily by further adoption in networked applications. Now is the time to integrate io_uring into your applications and I/O libraries.

