io_uring – dankwiki, the wiki of nick black

Launched in 2019 (kernel 5.1) by Jens Axboe, io_uring (henceforth uring) is a system for providing the kernel with a schedule of system calls, and receiving the results as they're generated. Whereas epoll and kqueue facilitate multiplexing, where you are told when you can usefully perform a system call according to some set of filters, uring lets you specify the system calls themselves (and dependencies between them), and executes the schedule at the kernel "dataflow limit". It combines asynchronous I/O, system call polybatching, and flexible buffer management, and is IMHO the most substantial development in the Linux I/O model since Berkeley sockets (yes, I'm aware Berkeley sockets preceded Linux. Let's then say that it's the most substantial development in the UNIX I/O model to originate in Linux):
- Asynchronous I/O without the large copy overheads and restrictions of POSIX AIO
- System call batching/linking across distinct system calls
- Provide a buffer pool, and they'll be used as needed
- Both polling- and interrupt-driven I/O on the kernel side
The core system calls of uring are wrapped by the C API of liburing. Windows added a very similar interface, IoRing, in 2020. In my opinion, uring ought to largely displace epoll in new Linux code. FreeBSD seems to be sticking with kqueue, meaning code using uring won't run there, but neither did epoll (save via FreeBSD's somewhat dubious Linux compatibility layer). Both the system calls and liburing have fairly comprehensive man page coverage, including the io_uring.7 top-level page.
Rings
Central to every uring are two ringbuffers holding CQE (Completion Queue Entry) and SQE (Submission Queue Entry) descriptors (as best I can tell, this terminology was borrowed from the NVMe specification). SQEs roughly correspond to a single system call: they are tagged with an operation type, and filled in with the values that would traditionally be supplied as arguments to the appropriate function. Userspace is provided references to SQEs on the SQE ring, which it fills in and submits. Submission operates up through a specified SQE, and thus all SQEs before it in the ring must also be ready to go. The kernel places results in the CQE ring. These rings are shared between kernel- and userspace. The rings must be distinct unless the kernel specifies the IORING_FEAT_SINGLE_MMAP feature (see below). Note that SQEs are allocated externally to the SQ descriptor ring.
uring does not generally make use of errno. Synchronous functions return the negative error code as their result. Completion queue entries have the negated error code placed in their res fields.
CQEs are normally 16 bytes, and SQEs are normally 64 bytes (but see IORING_SETUP_SQE128 and IORING_SETUP_CQE32 below). Either way, SQEs are allocated externally to the submission queue, which is merely a ring of 32-bit descriptors.
System calls
The liburing interface will be sufficient for most users, and it is possible to operate almost entirely without system calls when the system is busy. For the sake of completeness, here are the three system calls implementing the uring core (from the kernel's io_uring/io_uring.c):
int io_uring_setup(u32 entries, struct io_uring_params *p);
int io_uring_enter(unsigned fd, u32 to_submit, u32 min_complete, u32 flags, const void* argp, size_t argsz);
int io_uring_register(unsigned fd, unsigned opcode, void *arg, unsigned int nr_args);
Note that io_uring_enter(2) corresponds more closely to the io_uring_enter2(3) wrapper, and indeed io_uring_enter(3) is defined in terms of the latter (from liburing's src/syscall.c):
static inline int __sys_io_uring_enter2(unsigned int fd, unsigned int to_submit,
unsigned int min_complete, unsigned int flags, sigset_t *sig, size_t sz){
return (int) __do_syscall6(__NR_io_uring_enter, fd, to_submit, min_complete, flags, sig, sz);
}
static inline int __sys_io_uring_enter(unsigned int fd, unsigned int to_submit,
unsigned int min_complete, unsigned int flags, sigset_t *sig){
return __sys_io_uring_enter2(fd, to_submit, min_complete, flags, sig, _NSIG / 8);
}
io_uring_enter(2) can both submit SQEs and wait until some number of CQEs are available. Its flags parameter is a bitmask over:
Flag | Description |
---|---|
IORING_ENTER_GETEVENTS | Wait until at least min_complete CQEs are ready before returning. |
IORING_ENTER_SQ_WAKEUP | Wake up the kernel thread created when using IORING_SETUP_SQPOLL. |
IORING_ENTER_SQ_WAIT | Wait until at least one entry is free in the submission ring before returning. |
IORING_ENTER_EXT_ARG | (Since Linux 5.11) Interpret sig as a pointer to an io_uring_getevents_arg rather than to a sigset_t. This structure can specify both a sigset_t and a timeout.
struct io_uring_getevents_arg {
__u64 sigmask;
__u32 sigmask_sz;
__u32 pad;
__u64 ts;
};
Is ts nanoseconds from now? From the Epoch? Nope! ts is actually a pointer to a __kernel_timespec, passed to u64_to_user_ptr() in the kernel. One of the uglier parts of uring. |
IORING_ENTER_REGISTERED_RING | ring_fd is an offset into the registered ring pool rather than a normal file descriptor. |
Setup
The io_uring_setup(2) system call returns a file descriptor, and accepts two parameters, u32 entries and struct io_uring_params *p:
int io_uring_setup(u32 entries, struct io_uring_params *p);
struct io_uring_params {
__u32 sq_entries; // number of SQEs, filled in by kernel
__u32 cq_entries; // see IORING_SETUP_CQSIZE and IORING_SETUP_CLAMP
__u32 flags; // see "Flags" below
__u32 sq_thread_cpu; // see IORING_SETUP_SQ_AFF
__u32 sq_thread_idle; // see IORING_SETUP_SQPOLL
__u32 features; // see "Kernel features" below, filled in by kernel
__u32 wq_fd; // see IORING_SETUP_ATTACH_WQ
__u32 resv[3]; // must be zero
struct io_sqring_offsets sq_off; // see "Ring structure" below, filled in by kernel
struct io_cqring_offsets cq_off; // see "Ring structure" below, filled in by kernel
};
resv must be zeroed out. In the absence of flags, the uring uses interrupt-driven I/O. Calling close(2) on the returned descriptor frees all resources associated with the uring.
io_uring_setup(2) is wrapped by liburing's io_uring_queue_init(3) and io_uring_queue_init_params(3). When using these wrappers, io_uring_queue_exit(3) should be used to clean up. These wrappers operate on a struct io_uring. io_uring_queue_init(3) takes an unsigned flags argument, which is passed as the flags field of io_uring_params. io_uring_queue_init_params(3) takes a struct io_uring_params* argument, which is passed directly through to io_uring_setup(2). It is best to avoid mixing the low-level API and that provided by liburing.
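Here's a minimal sketch of the liburing setup path, assuming a queue depth of 256, no setup flags, and no actual I/O:
#include <liburing.h>
#include <stdio.h>
#include <string.h>

int main(void){
  struct io_uring ring;
  // request a 256-entry submission queue; flags of 0 means interrupt-driven I/O
  int err = io_uring_queue_init(256, &ring, 0);
  if(err < 0){ // liburing returns the negative error code rather than using errno
    fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-err));
    return 1;
  }
  // ... acquire SQEs, submit them, reap CQEs ...
  io_uring_queue_exit(&ring); // unmaps the rings and closes the descriptor
  return 0;
}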
Ring structure
The details of ring structure are only relevant when using the low-level API; they are not exposed via liburing. They are primarily used to set up the three (or two, see IORING_FEAT_SINGLE_MMAP) backing memory maps. You will need to set these up yourself if you want to use huge pages.
struct io_sqring_offsets {
__u32 head;
__u32 tail;
__u32 ring_mask;
__u32 ring_entries;
__u32 flags;
__u32 dropped;
__u32 array;
__u32 resv[3];
};
struct io_cqring_offsets {
__u32 head;
__u32 tail;
__u32 ring_mask;
__u32 ring_entries;
__u32 overflow;
__u32 cqes;
__u32 flags;
__u32 resv[3];
};
As explained in the io_uring_setup(2) man page, the submission queue can be mapped thusly:
mmap(0, sq_off.array + sq_entries * sizeof(__u32), PROT_READ|PROT_WRITE,
MAP_SHARED|MAP_POPULATE, ring_fd, IORING_OFF_SQ_RING);
The submission queue consists of the internal data structure followed by an array of SQE descriptors. These descriptors are 32 bits each regardless of the architecture, implying that they are indices into the SQE map, not pointers. The SQEs are allocated:
mmap(0, sq_entries * sizeof(struct io_uring_sqe), PROT_READ|PROT_WRITE,
MAP_SHARED|MAP_POPULATE, ring_fd, IORING_OFF_SQES);
and finally the completion queue:
mmap(0, cq_off.cqes + cq_entries * sizeof(struct io_uring_cqe), PROT_READ|PROT_WRITE,
MAP_SHARED|MAP_POPULATE, ring_fd, IORING_OFF_CQ_RING);
Recall that when the kernel advertises IORING_FEAT_SINGLE_MMAP, the submission and completion queues can be allocated in a single mmap(2) call.
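A sketch of the typical mapping logic, assuming io_uring_setup(2) has already filled in p and returned ring_fd (both names are taken from the preceding discussion, not a fixed API):
// sizes are derived from the offsets and counts the kernel filled in
size_t sqsz = p.sq_off.array + p.sq_entries * sizeof(__u32);
size_t cqsz = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);
if(p.features & IORING_FEAT_SINGLE_MMAP){
  if(cqsz > sqsz){
    sqsz = cqsz; // a single mapping must cover both rings
  }
}
void *sqring = mmap(0, sqsz, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                    ring_fd, IORING_OFF_SQ_RING);
void *cqring;
if(p.features & IORING_FEAT_SINGLE_MMAP){
  cqring = sqring; // the completion ring lives in the same mapping, at the cq_off offsets
}else{
  cqring = mmap(0, cqsz, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                ring_fd, IORING_OFF_CQ_RING);
}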
Flags
The flags field is set up by the caller, and is a bitmask over:
Flag | Kernel version | Description |
---|---|---|
IORING_SETUP_IOPOLL | 5.1 | Instruct the kernel to use polled (as opposed to interrupt-driven) I/O. This is intended for block devices, and requires that O_DIRECT was provided when the file descriptor was opened. |
IORING_SETUP_SQPOLL | 5.1 (5.11 for full features) | Create a kernel thread to poll on the submission queue. If the submission queue is kept busy, this thread will reap SQEs without the need for a system call. If enough time goes by without new submissions, the kernel thread goes to sleep, and io_uring_enter(2) must be called to wake it. |
IORING_SETUP_SQ_AFF | 5.1 | Only meaningful with IORING_SETUP_SQPOLL. The poll thread will be bound to the core specified in sq_thread_cpu. |
IORING_SETUP_CQSIZE | 5.1 | Create the completion queue with cq_entries entries. This value must be greater than entries, and might be rounded up to the next power of 2. |
IORING_SETUP_CLAMP | 5.1 | Clamp entries at IORING_MAX_ENTRIES and cq_entries at IORING_MAX_CQ_ENTRIES. |
IORING_SETUP_ATTACH_WQ | 5.1 | Specify a uring in wq_fd, and the new uring will share that uring's worker thread backend. |
IORING_SETUP_R_DISABLED | 5.10 | Start the uring disabled, requiring that it be enabled with io_uring_register(2). |
IORING_SETUP_SUBMIT_ALL | 5.18 | Continue submitting SQEs from a batch even after one results in an error. |
IORING_SETUP_COOP_TASKRUN | 5.19 | Don't interrupt userspace processes to indicate CQE availability. It is usually acceptable for events to be processed at arbitrary kernel transitions, in which case this flag can be provided to improve performance. |
IORING_SETUP_TASKRUN_FLAG | 5.19 | Requires IORING_SETUP_COOP_TASKRUN. When completions are pending processing, the IORING_SQ_TASKRUN flag will be set in the submission ring. This will be checked by io_uring_peek_cqe(), which will enter the kernel to process them. |
IORING_SETUP_SQE128 | 5.19 | Use 128-byte SQEs, necessary for NVMe passthrough using IORING_OP_URING_CMD. |
IORING_SETUP_CQE32 | 5.19 | Use 32-byte CQEs, necessary for NVMe passthrough using IORING_OP_URING_CMD. |
IORING_SETUP_SINGLE_ISSUER | 6.0 | Hint to the kernel that only a single thread will submit requests, allowing for optimizations. This thread must either be the thread which created the ring, or (iff IORING_SETUP_R_DISABLED is used) the thread which enables the ring. |
IORING_SETUP_DEFER_TASKRUN | 6.1 | Requires IORING_SETUP_SINGLE_ISSUER. Don't process completions at arbitrary kernel/scheduler transitions; do so only in io_uring_enter(2) when called with IORING_ENTER_GETEVENTS by the thread that submitted the SQEs. |
When IORING_SETUP_R_DISABLED is used, the ring must be enabled before submissions can take place. If using the liburing API, this is done via io_uring_enable_rings(3):
int io_uring_enable_rings(struct io_uring *ring); // liburing 2.4
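A sketch of the disabled-start workflow using liburing, assuming a 64-entry ring and no other flags:
struct io_uring ring;
struct io_uring_params p = {0};
p.flags = IORING_SETUP_R_DISABLED; // created disabled: registration only, no submission yet
int err = io_uring_queue_init_params(64, &ring, &p);
// ... perform any registrations, hand the ring off to the submitting thread ...
err = io_uring_enable_rings(&ring); // called from the thread that will submit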
Kernel features
Various functionality was added to the kernel following the initial release of uring, and is thus not necessarily available to all kernels supporting the basic system calls. The __u32 features field of the io_uring_params parameter to io_uring_setup(2) is filled in with feature flags by the kernel, a bitmask over:
Feature | Kernel version | Description |
---|---|---|
IORING_FEAT_SINGLE_MMAP | 5.4 | A single mmap(2) can be used for both the submission and completion rings. |
IORING_FEAT_NODROP | 5.5 (5.19 for full features) | Completion queue events are not dropped. Instead, submitting results in -EBUSY until completion reaping yields sufficient room for the overflows. As of 5.19, io_uring_enter(2) additionally returns -EBADR rather than waiting for completions. |
IORING_FEAT_SUBMIT_STABLE | 5.5 | Data submitted for async can be mutated following submission, rather than only following completion. |
IORING_FEAT_RW_CUR_POS | 5.6 | Reads and writes can provide -1 as the offset to indicate the current file position. |
IORING_FEAT_CUR_PERSONALITY | 5.6 | Assume the credentials of the thread calling io_uring_enter(2), rather than the thread which created the uring. Registered personalities can always be used. |
IORING_FEAT_FAST_POLL | 5.7 | Internal polling for data/space readiness is supported. |
IORING_FEAT_POLL_32BITS | 5.9 | IORING_OP_POLL_ADD accepts all epoll flags, including EPOLLEXCLUSIVE. |
IORING_FEAT_SQPOLL_NONFIXED | 5.11 | IORING_SETUP_SQPOLL doesn't require registered files. |
IORING_FEAT_ENTER_EXT_ARG | 5.11 | io_uring_enter(2) supports struct io_uring_getevents_arg. |
IORING_FEAT_NATIVE_WORKERS | 5.12 | Async helpers use native workers rather than kernel threads. |
IORING_FEAT_RSRC_TAGS | 5.13 | Registered buffers can be updated in partes rather than in toto. |
IORING_FEAT_CQE_SKIP | 5.17 | IOSQE_CQE_SKIP_SUCCESS can be used to inhibit CQE generation on success. |
IORING_FEAT_LINKED_FILE | 5.17 | Defer file assignment until execution of a given request begins. |
Registered resources
Buffers
Since Linux 5.7, user-allocated memory can be provided to uring in groups of buffers (each with a group ID), in which each buffer has its own ID. This was originally done with the io_uring_prep_provide_buffers(3) call, operating on an SQE. Since 5.19, the "ring mapped buffers" technique (io_uring_register_buf_ring(3)) allows these buffers to be used much more effectively.
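A sketch of the older provide-buffers path, assuming sixteen 4KiB buffers in group 0 with buffer IDs starting at 0:
#define BUFCOUNT 16
#define BUFSIZE 4096
static char bufs[BUFCOUNT][BUFSIZE];

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
// hand the contiguous region to the kernel as buffer group 0, buffer IDs 0 through 15
io_uring_prep_provide_buffers(sqe, bufs, BUFSIZE, BUFCOUNT, 0, 0);
io_uring_submit(&ring);
// a later read/recv submitted with IOSQE_BUFFER_SELECT and this group ID will consume
// one of these buffers; the chosen buffer ID comes back in the CQE's flags field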
Flag | Kernel | Description |
---|---|---|
IORING_REGISTER_BUFFERS | 5.1 | |
IORING_UNREGISTER_BUFFERS | 5.1 | |
IORING_REGISTER_BUFFERS2 | 5.13 |
struct io_uring_rsrc_register {
__u32 nr;
__u32 resv;
__u64 resv2;
__aligned_u64 data;
__aligned_u64 tags;
};
|
IORING_REGISTER_BUFFERS_UPDATE | 5.13 |
struct io_uring_rsrc_update2 {
__u32 offset;
__u32 resv;
__aligned_u64 data;
__aligned_u64 tags;
__u32 nr;
__u32 resv2;
};
|
IORING_REGISTER_PBUF_RING | 5.19 |
struct io_uring_buf_reg {
__u64 ring_addr;
__u32 ring_entries;
__u16 bgid;
__u16 pad;
__u64 resv[3];
};
|
IORING_UNREGISTER_PBUF_RING | 5.19 |
Registered files
Registered (sometimes "direct") descriptors are integers corresponding to private file handle structures internal to the uring, and can be used anywhere uring wants a file descriptor via the IOSQE_FIXED_FILE flag. They have less overhead than true file descriptors, which use structures shared among threads. Note that registered files are required for submission queue polling unless the IORING_FEAT_SQPOLL_NONFIXED feature flag was returned.
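A sketch of registering descriptors and then referring to them by index, assuming fd0 and fd1 are already-open descriptors and buf is a suitable buffer:
int fds[2] = { fd0, fd1 };
// afterwards, index 0 refers to fd0 and index 1 to fd1 within this uring
int err = io_uring_register_files(&ring, fds, 2);

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, 0 /* registered index, not an fd */, buf, sizeof(buf), 0);
io_uring_sqe_set_flags(sqe, IOSQE_FIXED_FILE);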
Flag | Kernel | Description |
---|---|---|
IORING_REGISTER_FILES | 5.1 | |
IORING_UNREGISTER_FILES | 5.1 | |
IORING_REGISTER_FILES2 | 5.13 | |
IORING_REGISTER_FILES_UPDATE | 5.5 (5.12 for all features) | |
IORING_REGISTER_FILES_UPDATE2 | 5.13 | |
IORING_REGISTER_FILE_ALLOC_RANGE | 6.0 |
struct io_uring_file_index_range {
__u32 off;
__u32 len;
__u64 resv;
};
|
Personalities
Flag | Kernel | Description |
---|---|---|
IORING_REGISTER_PERSONALITY | 5.6 | |
IORING_UNREGISTER_PERSONALITY | 5.6 |
Submitting work
Submitting work consists of four steps (a minimal sketch follows the list):
- Acquiring free SQEs
- Filling in those SQEs
- Placing those SQEs at the tail of the submission queue
- Submitting the work, possibly using a system call
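Here's a minimal sketch of the four steps, assuming an initialized ring and an open descriptor fd:
char buf[4096];
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring); // step 1: returns NULL if the SQ is full
if(sqe == NULL){
  // submit (or reap) to free up entries, then retry
}
io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);    // step 2: fill in the request
io_uring_sqe_set_data64(sqe, 42);                    // tag it so the CQE can be matched up later
int submitted = io_uring_submit(&ring);              // steps 3 and 4: advance the tail, entering the kernel if necessary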
The SQE structure
struct io_uring_sqe has several large unions which I won't reproduce in full here; consult liburing.h if you want the details. The instructive elements include:
struct io_uring_sqe {
__u8 opcode; /* type of operation for this sqe */
__u8 flags; /* IOSQE_ flags */
__u16 ioprio; /* ioprio for the request */
__s32 fd; /* file descriptor to do IO on */
... various unions representing the request details ...
};
Flags can be set on a per-SQE basis using io_uring_sqe_set_flags(3), or by writing to the flags field directly:
static inline
void io_uring_sqe_set_flags(struct io_uring_sqe *sqe, unsigned flags){
sqe->flags = (__u8) flags;
}
The flags are a bitfield over:
SQE flag | Description |
---|---|
IOSQE_FIXED_FILE | References a registered descriptor. |
IOSQE_IO_DRAIN | Issue only after in-flight I/O has completed. |
IOSQE_IO_LINK | Links to the next SQE. |
IOSQE_IO_HARDLINK | Same as IOSQE_IO_LINK, but a failure does not sever the chain. |
IOSQE_ASYNC | Always operate asynchronously. |
IOSQE_BUFFER_SELECT | Select a buffer from a previously provided buffer group. |
IOSQE_CQE_SKIP_SUCCESS | Don't post a CQE on success. |
Prepping SQEs
Each SQE must be seeded with the object upon which it acts (usually a file descriptor) and any necessary arguments. You will usually also use the user data area.
User data
Each SQE provides 64 bits of user-controlled data which will be copied through to any generated CQEs. Since CQEs don't include the relevant file descriptor, you'll almost always be encoding some kind of lookup information into this area.
void io_uring_sqe_set_data(struct io_uring_sqe *sqe, void *user_data);
void io_uring_sqe_set_data64(struct io_uring_sqe *sqe, __u64 data);
void *io_uring_cqe_get_data(struct io_uring_cqe *cqe);
__u64 io_uring_cqe_get_data64(struct io_uring_cqe *cqe);
Here's an example C++ data type that encodes eight bits as an operation type, eight bits as an index, and forty-eight bits as other data. I typically use something like this to reflect the operation which was used, the index into some relevant data structure, and other details about the operation (perhaps an offset or a length):
union URingCtx {
struct rep {
rep(int8_t op, unsigned ix, uint64_t d):
type(static_cast<URingCtx::rep::uint8_t>(op)),
idx(static_cast<URingCtx::rep::uint8_t>(ix)),
data(d)
{
if(type >= MAXOP){
throw std::invalid_argument("bad uringctx op");
}
if(ix > MAXIDX){
throw std::invalid_argument("bad uringctx index");
}
if(d > 0xffffffffffffull){
throw std::invalid_argument("bad uringctx data");
}
}
enum uint8_t {
...operation types...
MAXOP // must not be used
} type: 8;
uint8_t idx: 8;
uint64_t data: 48;
} r;
uint64_t val;
static constexpr auto MAXIDX = 255;
URingCtx(uint8_t op, unsigned idx, uint64_t d):
r(op, idx, d)
{}
URingCtx(uint64_t v):
URingCtx(v & 0xffu, (v >> 8) & 0xffu, v >> 16)
{}
};
The majority of I/O-related system calls now have a uring equivalent (the one major exception of which I'm aware is directory listing; there seems to be no readdir(3)/getdents(2)). What follows is an incomplete list.
Opening and shutting file descriptors
void io_uring_prep_openat(struct io_uring_sqe *sqe, int dfd, const char *path,
int flags, mode_t mode);
void io_uring_prep_openat_direct(struct io_uring_sqe *sqe, int dfd, const char *path,
int flags, mode_t mode, unsigned file_index);
void io_uring_prep_openat2(struct io_uring_sqe *sqe, int dfd, const char *path,
int flags, struct open_how *how);
void io_uring_prep_openat2_direct(struct io_uring_sqe *sqe, int dfd, const char *path,
int flags, struct open_how *how, unsigned file_index);
void io_uring_prep_accept(struct io_uring_sqe *sqe, int sockfd, struct sockaddr *addr,
socklen_t *addrlen, int flags);
void io_uring_prep_accept_direct(struct io_uring_sqe *sqe, int sockfd, struct sockaddr *addr,
socklen_t *addrlen, int flags, unsigned int file_index);
void io_uring_prep_multishot_accept(struct io_uring_sqe *sqe, int sockfd, struct sockaddr *addr,
socklen_t *addrlen, int flags);
void io_uring_prep_multishot_accept_direct(struct io_uring_sqe *sqe, int sockfd, struct sockaddr *addr,
socklen_t *addrlen, int flags);
void io_uring_prep_close(struct io_uring_sqe *sqe, int fd);
void io_uring_prep_close_direct(struct io_uring_sqe *sqe, unsigned file_index);
void io_uring_prep_socket(struct io_uring_sqe *sqe, int domain, int type,
int protocol, unsigned int flags);
void io_uring_prep_socket_direct(struct io_uring_sqe *sqe, int domain, int type,
int protocol, unsigned int file_index, unsigned int flags);
void io_uring_prep_socket_direct_alloc(struct io_uring_sqe *sqe, int domain, int type,
int protocol, unsigned int flags);
Reading and writing file descriptors
void io_uring_prep_send(struct io_uring_sqe *sqe, int sockfd, const void *buf, size_t len, int flags);
void io_uring_prep_send_zc(struct io_uring_sqe *sqe, int sockfd, const void *buf, size_t len, int flags, int zc_flags);
void io_uring_prep_sendmsg(struct io_uring_sqe *sqe, int fd, const struct msghdr *msg, unsigned flags);
void io_uring_prep_sendmsg_zc(struct io_uring_sqe *sqe, int fd, const struct msghdr *msg, unsigned flags);
void io_uring_prep_recv(struct io_uring_sqe *sqe, int sockfd, void *buf, size_t len, int flags);
void io_uring_prep_recv_multishot(struct io_uring_sqe *sqe, int sockfd, void *buf, size_t len, int flags);
void io_uring_prep_recvmsg(struct io_uring_sqe *sqe, int fd, struct msghdr *msg, unsigned flags);
void io_uring_prep_recvmsg_multishot(struct io_uring_sqe *sqe, int fd, struct msghdr *msg, unsigned flags);
void io_uring_prep_read(struct io_uring_sqe *sqe, int fd, void *buf, unsigned nbytes, __u64 offset);
void io_uring_prep_read_fixed(struct io_uring_sqe *sqe, int fd, void *buf, unsigned nbytes, __u64 offset, int buf_index);
void io_uring_prep_readv(struct io_uring_sqe *sqe, int fd, const struct iovec *iovecs, unsigned nr_vecs, __u64 offset);
void io_uring_prep_readv2(struct io_uring_sqe *sqe, int fd, const struct iovec *iovecs,
unsigned nr_vecs, __u64 offset, int flags);
void io_uring_prep_shutdown(struct io_uring_sqe *sqe, int sockfd, int how);
void io_uring_prep_splice(struct io_uring_sqe *sqe, int fd_in, int64_t off_in, int fd_out,
int64_t off_out, unsigned int nbytes, unsigned int splice_flags);
void io_uring_prep_sync_file_range(struct io_uring_sqe *sqe, int fd, unsigned len, __u64 offset, int flags);
void io_uring_prep_tee(struct io_uring_sqe *sqe, int fd_in, int fd_out, unsigned int nbytes, unsigned int splice_flags);
void io_uring_prep_write(struct io_uring_sqe *sqe, int fd, const void *buf, unsigned nbytes, __u64 offset);
void io_uring_prep_write_fixed(struct io_uring_sqe *sqe, int fd, const void *buf,
unsigned nbytes, __u64 offset, int buf_index);
void io_uring_prep_writev(struct io_uring_sqe *sqe, int fd, const struct iovec *iovecs,
unsigned nr_vecs, __u64 offset);
void io_uring_prep_writev2(struct io_uring_sqe *sqe, int fd, const struct iovec *iovecs,
unsigned nr_vecs, __u64 offset, int flags);
Manipulating directories
void io_uring_prep_fsync(struct io_uring_sqe *sqe, int fd, unsigned flags);
void io_uring_prep_linkat(struct io_uring_sqe *sqe, int olddirfd, const char *oldpath,
int newdirfd, const char *newpath, int flags);
void io_uring_prep_link(struct io_uring_sqe *sqe, const char *oldpath, const char *newpath, int flags);
void io_uring_prep_mkdirat(struct io_uring_sqe *sqe, int dirfd, const char *path, mode_t mode);
void io_uring_prep_mkdir(struct io_uring_sqe *sqe, const char *path, mode_t mode);
void io_uring_prep_rename(struct io_uring_sqe *sqe, const char *oldpath, const char *newpath, unsigned int flags);
void io_uring_prep_renameat(struct io_uring_sqe *sqe, int olddirfd, const char *oldpath,
int newdirfd, const char *newpath, unsigned int flags);
void io_uring_prep_statx(struct io_uring_sqe *sqe, int dirfd, const char *path, int flags,
unsigned mask, struct statx *statxbuf);
void io_uring_prep_symlink(struct io_uring_sqe *sqe, const char *target, const char *linkpath);
void io_uring_prep_symlinkat(struct io_uring_sqe *sqe, const char *target, int newdirfd, const char *linkpath);
void io_uring_prep_unlinkat(struct io_uring_sqe *sqe, int dirfd, const char *path, int flags);
void io_uring_prep_unlink(struct io_uring_sqe *sqe, const char *path, int flags);
Timeouts and polling
void io_uring_prep_poll_add(struct io_uring_sqe *sqe, int fd, unsigned poll_mask);
void io_uring_prep_poll_multishot(struct io_uring_sqe *sqe, int fd, unsigned poll_mask);
void io_uring_prep_poll_remove(struct io_uring_sqe *sqe, __u64 user_data);
void io_uring_prep_poll_update(struct io_uring_sqe *sqe, __u64 old_user_data, __u64 new_user_data, unsigned poll_mask, unsigned flags);
void io_uring_prep_timeout(struct io_uring_sqe *sqe, struct __kernel_timespec *ts, unsigned count, unsigned flags);
void io_uring_prep_timeout_update(struct io_uring_sqe *sqe, struct __kernel_timespec *ts, __u64 user_data, unsigned flags);
void io_uring_prep_timeout_remove(struct io_uring_sqe *sqe, __u64 user_data, unsigned flags);
Linked operations
IOSQE_IO_LINK (since 5.3) or IOSQE_IO_HARDLINK (since 5.5) can be supplied in the flags field of an SQE to link it with the next SQE. The chain can be arbitrarily long (though it cannot cross submission boundaries), terminating in the first linked SQE without this flag set. Multiple chains can execute in parallel on the kernel side. Unless HARDLINK is used, any error terminates a chain; any remaining linked SQEs will be immediately cancelled (short reads/writes are considered errors) with return code -ECANCELED.
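A sketch of a two-element chain, a write followed by an fsync, where the fsync runs only if the write fully succeeds:
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0);
io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);  // link this SQE to the next one

sqe = io_uring_get_sqe(&ring);
io_uring_prep_fsync(sqe, fd, 0);             // last element of the chain: no link flag

io_uring_submit(&ring);
// a short write severs the chain; the fsync's CQE then arrives with res == -ECANCELED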
Sending it to the kernel
If IORING_SETUP_SQPOLL was provided when creating the uring, the kernel spawned a thread to poll the submission queue. If the thread is awake, there is no need to make a system call; the kernel will ingest the SQE as soon as it is written (io_uring_submit(3) still must be used, but no system call will be made). This thread goes to sleep after sq_thread_idle milliseconds of idleness, in which case IORING_SQ_NEED_WAKEUP will be written to the flags field of the submission ring.
int io_uring_submit(struct io_uring *ring);
int io_uring_submit_and_wait(struct io_uring *ring, unsigned wait_nr);
int io_uring_submit_and_wait_timeout(struct io_uring *ring, struct io_uring_cqe **cqe_ptr, unsigned wait_nr,
struct __kernel_timespec *ts, sigset_t *sigmask);
All of these liburing functions call the internal functions __io_uring_flush_sq() and __io_uring_submit(). The former updates the ring tail with a release-semantics write, while the latter calls io_uring_enter() if necessary. Note that timeouts are implemented internally using an SQE, and thus will kick off work if the submission ring is full pursuant to acquiring the entry.
Submission queue polling details
Using IORING_SETUP_SQPOLL will, by default, create two threads in your process, one named iou-sqp-TID, and the other named iou-wrk-TID. The former is created the first time work is submitted. The latter is created whenever the uring is enabled (i.e. at creation time, unless IORING_SETUP_R_DISABLED is used). Submission queue poll threads can be shared between urings via IORING_SETUP_ATTACH_WQ together with the wq_fd field of io_uring_params.
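A sketch of SQPOLL setup, assuming a 2000ms idle threshold and an otherwise default ring:
struct io_uring_params p = {0};
p.flags = IORING_SETUP_SQPOLL;
p.sq_thread_idle = 2000;   // milliseconds of inactivity before the poller thread sleeps
struct io_uring ring;
int err = io_uring_queue_init_params(256, &ring, &p);
// ... fill in SQEs as usual ...
// io_uring_submit(3) checks IORING_SQ_NEED_WAKEUP in the submission ring's flags, and
// only calls io_uring_enter(2) with IORING_ENTER_SQ_WAKEUP if the poller has gone to sleep
io_uring_submit(&ring);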
Reaping completions
Submitted actions result in completion events:
struct io_uring_cqe {
__u64 user_data; /* sqe->data submission passed back */
__s32 res; /* result code for this event */
__u32 flags;
/*
* If the ring is initialized with IORING_SETUP_CQE32, then this field
* contains 16 bytes of padding, doubling the size of the CQE.
*/
__u64 big_cqe[];
};
Recall that rather than using errno, errors are returned as their negative value in res.
CQE flag | Description |
---|---|
IORING_CQE_F_BUFFER | If set, the upper 16 bits of flags are the buffer ID |
IORING_CQE_F_MORE | The associated multishot SQE will generate more entries |
IORING_CQE_F_SOCK_NONEMPTY | There is more data to receive after this read |
IORING_CQE_F_NOTIF | Notification CQE for zero-copy sends |
Completions can be detected by four different means:
- Checking the completion queue speculatively. This means either a periodic check, which will suffer latency up to the interval, or a busy check, which will churn CPU, but is probably the lowest-latency solution. This is best accomplished with io_uring_peek_cqe(3), perhaps together with io_uring_cq_ready(3) (neither involves a system call).
int io_uring_peek_cqe(struct io_uring *ring, struct io_uring_cqe **cqe_ptr);
unsigned io_uring_cq_ready(const struct io_uring *ring);
- Waiting on the ring via kernel sleep. Use io_uring_wait_cqe(3) (unbounded sleep), io_uring_wait_cqe_timeout(3) (bounded sleep), or io_uring_wait_cqes(3) (bounded sleep with atomic signal blocking and batch receive). These don't require a system call if they can be immediately satisfied.
int io_uring_wait_cqe(struct io_uring *ring, struct io_uring_cqe **cqe_ptr);
int io_uring_wait_cqe_nr(struct io_uring *ring, struct io_uring_cqe **cqe_ptr, unsigned wait_nr);
int io_uring_wait_cqe_timeout(struct io_uring *ring, struct io_uring_cqe **cqe_ptr,
struct __kernel_timespec *ts);
int io_uring_wait_cqes(struct io_uring *ring, struct io_uring_cqe **cqe_ptr, unsigned wait_nr,
struct __kernel_timespec *ts, sigset_t *sigmask);
int io_uring_peek_cqe(struct io_uring *ring, struct io_uring_cqe **cqe_ptr);
- Using an eventfd together with io_uring_register_eventfd(3). See below for the full API. This eventfd can be combined with e.g. regular epoll.
- Using processor-dependent memory watch instructions. On x86, there's MONITOR+MWAIT, but they require you to be in ring 0, so you'd probably want UMONITOR/UMWAIT. This ought to allow a very low-latency wake that consumes very little power.
Once the CQE can be returned to the system, do so with io_uring_cqe_seen(3), or batch the returns with io_uring_cq_advance(3) (the former can mark CQ entries as seen out of order).
void io_uring_cqe_seen(struct io_uring *ring, struct io_uring_cqe *cqe);
void io_uring_cq_advance(struct io_uring *ring, unsigned nr);
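A sketch of a reaping loop: block for at least one CQE, then drain everything that is ready and advance the ring once:
struct io_uring_cqe *cqe;
int err = io_uring_wait_cqe(&ring, &cqe); // sleeps unless a CQE is already available
if(err < 0){
  // handle -EINTR and friends
}
unsigned head;
unsigned seen = 0;
io_uring_for_each_cqe(&ring, head, cqe){
  if(cqe->res < 0){ // negated error code, as discussed above
    fprintf(stderr, "op failed: %s\n", strerror(-cqe->res));
  }
  uint64_t ctx = io_uring_cqe_get_data64(cqe); // match the CQE back to its submission
  // ... dispatch on ctx ...
  ++seen;
}
io_uring_cq_advance(&ring, seen); // mark all drained CQEs as seen in one go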
Multishot
It is possible for a single submission to result in multiple completions (e.g. io_uring_prep_multishot_accept(3)); this is known as multishot. Errors on a multishot SQE will usually terminate the work request; a multishot SQE will set IORING_CQE_F_MORE in its generated CQEs so long as it remains active. A CQE without this flag indicates that the multishot is no longer operational, and it must be reposted if further events are desired.
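A sketch of a multishot accept and the re-arm check, assuming listenfd is a listening socket and cqe comes from a reaping loop like the one above:
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_multishot_accept(sqe, listenfd, NULL, NULL, 0);
io_uring_submit(&ring);

// in the completion loop:
if(cqe->res >= 0){
  int connfd = cqe->res;            // each CQE delivers one accepted connection
  // ... set up the connection ...
}
if(!(cqe->flags & IORING_CQE_F_MORE)){
  // the multishot has lapsed (or errored out); prep and submit a fresh accept here
}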
Corner cases
Single fd in multiple rings
If logically equivalent SQEs are submitted to different rings, only one operation appears to take place where that is logical. For instance, if two rings have the same socket added for an accept(2), a successful three-way TCP handshake will generate only one CQE, on one of the two rings. Which ring sees the event will differ from connection to connection.
Multithreaded use
urings (and especially the struct io_uring object of liburing) are not intended for multithreaded use (quoth Axboe, "don't share a ring between threads"), though they can be used in several threaded paradigms. A single thread submitting and a single thread reaping is definitely supported. Descriptors can be sent between rings with IORING_OP_MSG_RING. Multiple submitters definitely must be serialized in userspace.
If an op will be completed via a kernel task, the thread that submitted that SQE must remain alive until the op's completion. It will otherwise error out with ECANCELED. If you must submit the SQE from a thread which will die, consider creating the ring disabled (see IORING_SETUP_R_DISABLED), and enabling it from the thread which will reap the completion event, using IORING_REGISTER_ENABLE_RINGS with io_uring_register(2).
If you can restrict all submissions (and creation/enabling of the uring) to a single thread, use IORING_SETUP_SINGLE_ISSUER to enable kernel optimizations. Otherwise, consider using io_uring_register_ring_fd(3) (or io_uring_register(2) directly) to register the ring descriptor with the ring itself, and thus reduce the overhead of io_uring_enter(2).
When IORING_SETUP_SQPOLL is used, the kernel poller thread is considered to have performed the submission, providing another possible way around this problem.
I'm aware of no unannoying way to share some elements of a single uring between threads while also monitoring distinct urings for each thread.
Coexistence with epoll/XDP
If you want to monitor both an epoll and a uring in a single thread without busy waiting, you'll run into problems. You can't directly poll() a uring for CQE readiness, so it can't be added to your epoll watchset. If you set a zero timeout, you're busy waiting; if you set a non-zero timeout, one is dependent on the other's readiness. There are two solutions:
- Add the epoll fd to your uring with IORING_OP_POLL_ADD, and wait only for uring readiness. When you get a CQE for this submitted event, check the epoll.
- Register an eventfd with your uring using io_uring_register_eventfd(3), add that to your epoll, and when you get POLLIN for this fd, check the completion ring.
The full API here is:
int io_uring_register_eventfd(struct io_uring *ring, int fd);
int io_uring_register_eventfd_async(struct io_uring *ring, int fd);
int io_uring_unregister_eventfd(struct io_uring *ring);
io_uring_register_eventfd_async(3) only posts to the eventfd for events that completed out-of-line. There is not necessarily a bijection between completion events and posts even with the regular form; multiple CQEs can post only a single event, and spurious posts can occur.
Similarly, XDP's native method of notification is via poll(2); XDP can be unified with uring using either of these two methods.
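A sketch of the second option, assuming an existing epoll descriptor epfd and an initialized ring:
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

int efd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
int err = io_uring_register_eventfd(&ring, efd); // the uring now posts to efd on completions
struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = efd } };
epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);

// when epoll_wait(2) reports EPOLLIN on efd, drain the counter and reap the ring
uint64_t posts;
read(efd, &posts, sizeof(posts));
struct io_uring_cqe *cqe;
while(io_uring_peek_cqe(&ring, &cqe) == 0){
  // ... handle cqe ...
  io_uring_cqe_seen(&ring, cqe);
}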
Queue overflows
What's missing
I'd like to see signalfds and pidfds integrated for purposes of linked operations (you can read from them with the existing infrastructure, but you can't create them, and thus can't link their creation to other events).
Why is there no vmsplice(2) action? How about fork(2)/pthread_create(3)?
It would be nice to have tight integration with condition variables and even mutexes/futexes (allow me to submit a request to get a lock, and when I get the CQE, I have that lock). Bonus points if the fast (uncontended) path never needs a system call (like mutexes built atop futexes today).
It's kind of annoying that chains can't extend across submissions. If I've got a lot of data I need delivered in order, it seems I'm limited to a single chain in flight at a time, or else I risk out-of-order delivery due to one chain aborting, followed by items from a subsequent chain succeeding.
It seems the obvious next step would be allowing small snippets of arbitrary computation to be run in kernelspace, linked with SQEs. Perhaps eBPF would be a good starting point.
External links