Debugging a FUSE deadlock in the Linux kernel | by Netflix Technology Blog | May, 2023
The Compute team at Netflix is charged with managing all AWS and containerized workloads at Netflix, including autoscaling, deployment of containers, issue remediation, and so on. As part of this team, I work on fixing strange things that users report.
This particular issue involved a custom internal FUSE filesystem: ndrive. It had been festering for some time, but needed someone to sit down and look at it in anger. This blog post describes how I poked at /proc to get a sense of what was going on, before posting the issue to the kernel mailing list and getting schooled on how the kernel's wait code actually works!
We had a stuck docker API call:
goroutine 146 [select, 8817 minutes]:
net/http.(*persistConn).roundTrip(0xc000658fc0, 0xc0003fc080, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/http/transport.go:2610 +0x765
net/http.(*Transport).roundTrip(0xc000420140, 0xc000966200, 0x30, 0x1366f20, 0x162)
	/usr/local/go/src/net/http/transport.go:592 +0xacb
net/http.(*Transport).RoundTrip(0xc000420140, 0xc000966200, 0xc000420140, 0x0, 0x0)
	/usr/local/go/src/net/http/roundtrip.go:17 +0x35
net/http.send(0xc000966200, 0x161eba0, 0xc000420140, 0x0, 0x0, 0x0, 0xc00000e050, 0x3, 0x1, 0x0)
	/usr/local/go/src/net/http/client.go:251 +0x454
net/http.(*Client).send(0xc000438480, 0xc000966200, 0x0, 0x0, 0x0, 0xc00000e050, 0x0, 0x1, 0x10000168e)
	/usr/local/go/src/net/http/client.go:175 +0xff
net/http.(*Client).do(0xc000438480, 0xc000966200, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/http/client.go:717 +0x45f
net/http.(*Client).Do(...)
	/usr/local/go/src/net/http/client.go:585
golang.org/x/net/context/ctxhttp.Do(0x163bd48, 0xc000044090, 0xc000438480, 0xc000966100, 0x0, 0x0, 0x0)
	/go/pkg/mod/golang.org/x/net@v0.0.0-20211209124913-491a49abca63/context/ctxhttp/ctxhttp.go:27 +0x10f
github.com/docker/docker/client.(*Client).doRequest(0xc0001a8200, 0x163bd48, 0xc000044090, 0xc000966100, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/pkg/mod/github.com/moby/moby@v0.0.0-20190408150954-50ebe4562dfc/client/request.go:132 +0xbe
github.com/docker/docker/client.(*Client).sendRequest(0xc0001a8200, 0x163bd48, 0xc000044090, 0x13d8643, 0x3, 0xc00079a720, 0x51, 0x0, 0x0, 0x0, ...)
	/go/pkg/mod/github.com/moby/moby@v0.0.0-20190408150954-50ebe4562dfc/client/request.go:122 +0x156
github.com/docker/docker/client.(*Client).get(...)
	/go/pkg/mod/github.com/moby/moby@v0.0.0-20190408150954-50ebe4562dfc/client/request.go:37
github.com/docker/docker/client.(*Client).ContainerInspect(0xc0001a8200, 0x163bd48, 0xc000044090, 0xc0006a01c0, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/pkg/mod/github.com/moby/moby@v0.0.0-20190408150954-50ebe4562dfc/client/container_inspect.go:18 +0x128
github.com/Netflix/titus-executor/executor/runtime/docker.(*DockerRuntime).Kill(0xc000215180, 0x163bdb8, 0xc000938600, 0x1, 0x0, 0x0)
/var/lib/buildkite-agent/builds/ip-192-168-1-90-1/netflix/titus-executor/executor/runtime/docker/docker.go:2835 +0x310
github.com/Netflix/titus-executor/executor/runner.(*Runner).doShutdown(0xc000432dc0, 0x163bd10, 0xc000938390, 0x1, 0xc000b821e0, 0x1d, 0xc0005e4710)
/var/lib/buildkite-agent/builds/ip-192-168-1-90-1/netflix/titus-executor/executor/runner/runner.go:326 +0x4f4
github.com/Netflix/titus-executor/executor/runner.(*Runner).startRunner(0xc000432dc0, 0x163bdb8, 0xc00071e0c0, 0xc0a502e28c08b488, 0x24572b8, 0x1df5980)
/var/lib/buildkite-agent/builds/ip-192-168-1-90-1/netflix/titus-executor/executor/runner/runner.go:122 +0x391
created by github.com/Netflix/titus-executor/executor/runner.StartTaskWithRuntime
/var/lib/buildkite-agent/builds/ip-192-168-1-90-1/netflix/titus-executor/executor/runner/runner.go:81 +0x411
Here, our management engine has made an HTTP call to the Docker API's unix socket asking it to kill a container. Our containers are configured to be killed via SIGKILL. But this is strange. kill(SIGKILL) should be relatively fatal, so what is the container doing?
$ docker exec -it 6643cd073492 bash
OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: process_linux.go:130: executing setns process caused: exit status 1: unknown
Hmm. Seems like it is alive, but setns(2) fails. Why would that be? If we look at the process tree via ps awwfux, we see:
\_ containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/6643cd073492ba9166100ed30dbe389ff1caef0dc3d35
|   \_ [docker-init]
|       \_ [ndrive] <defunct>
Ok, so the container's init process is still alive, but it has one zombie child. What could the container's init process possibly be doing?
# cat /proc/1528591/stack
[<0>] do_wait+0x156/0x2f0
[<0>] kernel_wait4+0x8d/0x140
[<0>] zap_pid_ns_processes+0x104/0x180
[<0>] do_exit+0xa41/0xb80
[<0>] do_group_exit+0x3a/0xa0
[<0>] __x64_sys_exit_group+0x14/0x20
[<0>] do_syscall_64+0x37/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
It is in the process of exiting, but it seems stuck. The only child is the ndrive process in Z (i.e. "zombie") state, though. Zombies are processes that have successfully exited and are waiting to be reaped by a corresponding wait() syscall from their parents. So how could the kernel be stuck waiting on a zombie?
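As a side note, that reaping contract is easy to see with a minimal userspace sketch (mine, not part of the original investigation): a child that has exited stays in Z state until its parent collects it with waitpid().

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pid_t child = fork();
	if (child == 0)
		exit(0);	/* child exits immediately */

	/* For ~5 seconds the child shows up as a zombie, e.g. via
	 * "ps -o pid,stat,comm -p <child-pid>" (STAT column "Z"). */
	sleep(5);

	int status;
	if (waitpid(child, &status, 0) == child)
		printf("reaped zombie %d, exit status %d\n",
		       (int)child, WEXITSTATUS(status));
	return 0;
}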
# ls /proc/1544450/task
1544450 1544574
Ah ha, there are two threads in the thread group. One of them is a zombie, maybe the other one isn't:
# cat /proc/1544574/stack
[<0>] request_wait_answer+0x12f/0x210
[<0>] fuse_simple_request+0x109/0x2c0
[<0>] fuse_flush+0x16f/0x1b0
[<0>] filp_close+0x27/0x70
[<0>] put_files_struct+0x6b/0xc0
[<0>] do_exit+0x360/0xb80
[<0>] do_group_exit+0x3a/0xa0
[<0>] get_signal+0x140/0x870
[<0>] arch_do_signal_or_restart+0xae/0x7c0
[<0>] exit_to_user_mode_prepare+0x10f/0x1c0
[<0>] syscall_exit_to_user_mode+0x26/0x40
[<0>] do_syscall_64+0x46/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
Indeed it is not a zombie. It is trying to become one as hard as it can, but it is blocking inside FUSE for some reason. To find out why, let's look at some kernel code. If we look at zap_pid_ns_processes(), it does:
/*
 * Reap the EXIT_ZOMBIE children we had before we ignored SIGCHLD.
 * kernel_wait4() will also block until our children traced from the
 * parent namespace are detached and become EXIT_DEAD.
 */
do {
	clear_thread_flag(TIF_SIGPENDING);
	rc = kernel_wait4(-1, NULL, __WALL, NULL);
} while (rc != -ECHILD);
which is where we are stuck, but before that, it has done:
/* Don't allow any more processes into the pid namespace */
disable_pid_allocation(pid_ns);
which is why docker can't setns(): the namespace is a zombie. Ok, so we can't setns(2), but why are we stuck in kernel_wait4()? To understand why, let's look at what the other thread was doing in FUSE's request_wait_answer():
/*
 * Either request is already in userspace, or it was forced.
 * Wait it out.
 */
wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));
Ok, so we are waiting for an event (in this case, that userspace has replied to the FUSE flush request). But zap_pid_ns_processes() sent a SIGKILL! SIGKILL should be very fatal to a process. If we look at the process, we can indeed see that there is a pending SIGKILL:
# grep Pnd /proc/1544574/status
SigPnd: 0000000000000000
ShdPnd: 0000000000000100
Viewing process status this way, you can see 0x100 (i.e. the 9th bit is set) under ShdPnd, which is the signal number corresponding to SIGKILL. Pending signals are signals that have been generated by the kernel but have not yet been delivered to userspace. Signals are only delivered at certain times, for example when entering or leaving a syscall, or when waiting on events. If the kernel is currently doing something on behalf of the task, the signal may be pending. Signals can also be blocked by a task, so that they are never delivered; blocked signals will show up in their respective pending sets as well. However, man 7 signal says: "The signals SIGKILL and SIGSTOP cannot be caught, blocked, or ignored." But here the kernel is telling us that we have a pending SIGKILL, aka that it is being ignored even while the task is waiting!
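Decoding those masks by hand is mechanical: signal N is represented by bit N-1, so 0x100 is the 9th bit, i.e. signal 9, SIGKILL. Here is a small sketch of that decoding (not from the original post; the mask is just the ShdPnd value shown above):

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* ShdPnd value copied from the /proc/<pid>/status output above */
	unsigned long long mask = 0x100;

	for (int sig = 1; sig <= 64; sig++) {
		if (mask & (1ULL << (sig - 1)))
			printf("signal %d pending (%s)\n", sig, strsignal(sig));
	}
	return 0;
}

Running it prints "signal 9 pending (Killed)", matching what the kernel reported.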
Well, that's weird. The wait code (i.e. include/linux/wait.h) is used everywhere in the kernel: semaphores, wait queues, completions, and so on. Surely it knows to look for SIGKILLs. So what does wait_event() actually do? Digging through the macro expansions and wrappers, the meat of it is:
#define ___wait_event(wq_head, condition, state, exclusive, ret, cmd)	\
({									\
	__label__ __out;						\
	struct wait_queue_entry __wq_entry;				\
	long __ret = ret;	/* explicit shadow */			\
	init_wait_entry(&__wq_entry, exclusive ? WQ_FLAG_EXCLUSIVE : 0);\
	for (;;) {							\
		long __int = prepare_to_wait_event(&wq_head, &__wq_entry, state);\
		if (condition)						\
			break;						\
		if (___wait_is_interruptible(state) && __int) {		\
			__ret = __int;					\
			goto __out;					\
		}							\
		cmd;							\
	}								\
	finish_wait(&wq_head, &__wq_entry);				\
__out:	__ret;								\
})
So it loops forever, doing prepare_to_wait_event(), checking the condition, then checking to see if we need to interrupt. Then it does cmd, which in this case is schedule(), i.e. "do something else for a while". prepare_to_wait_event() looks like:
long prepare_to_wait_event(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state)
{
	unsigned long flags;
	long ret = 0;

	spin_lock_irqsave(&wq_head->lock, flags);
	if (signal_pending_state(state, current)) {
		/*
		 * Exclusive waiter must not fail if it was selected by wakeup,
		 * it should "consume" the condition we were waiting for.
		 *
		 * The caller will recheck the condition and return success if
		 * we were already woken up, we can not miss the event because
		 * wakeup locks/unlocks the same wq_head->lock.
		 *
		 * But we need to ensure that set-condition + wakeup after that
		 * can't see us, it should wake up another exclusive waiter if
		 * we fail.
		 */
		list_del_init(&wq_entry->entry);
		ret = -ERESTARTSYS;
	} else {
		if (list_empty(&wq_entry->entry)) {
			if (wq_entry->flags & WQ_FLAG_EXCLUSIVE)
				__add_wait_queue_entry_tail(wq_head, wq_entry);
			else
				__add_wait_queue(wq_head, wq_entry);
		}
		set_current_state(state);
	}
	spin_unlock_irqrestore(&wq_head->lock, flags);

	return ret;
}
EXPORT_SYMBOL(prepare_to_wait_event);
It looks like the only way we can break out of this with a non-zero exit code is if signal_pending_state() is true. Since our call site was just wait_event(), we know that state here is TASK_UNINTERRUPTIBLE; the definition of signal_pending_state() looks like:
static inline int signal_pending_state(unsigned int state, struct task_struct *p)
{
	if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
		return 0;
	if (!signal_pending(p))
		return 0;

	return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
}
Our task is not interruptible, so the first if fails. Our task should have a signal pending, though, right?
static inline int signal_pending(struct task_struct *p)
{
	/*
	 * TIF_NOTIFY_SIGNAL isn't really a signal, but it requires the same
	 * behavior in terms of ensuring that we break out of wait loops
	 * so that notify signal callbacks can be processed.
	 */
	if (unlikely(test_tsk_thread_flag(p, TIF_NOTIFY_SIGNAL)))
		return 1;
	return task_sigpending(p);
}
As the comment notes, TIF_NOTIFY_SIGNAL isn't relevant here, in spite of its name, but let's look at task_sigpending():
static inline int task_sigpending(struct task_struct *p)
{
	return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));
}
Hmm. Seems like we should have that flag set, right? To figure that out, let's look at how signal delivery works. When we're shutting down the pid namespace in zap_pid_ns_processes(), it does:
group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);
which eventually gets to __send_signal_locked(), which has:
pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
...
sigaddset(&pending->signal, sig);
...
complete_signal(sig, t, type);
Using PIDTYPE_MAX here as the type is a little weird, but it roughly indicates "this is very privileged kernel stuff sending this signal, you should definitely deliver it". There is a bit of unintended consequence here, though, in that __send_signal_locked() ends up sending the SIGKILL to the shared set, instead of the individual task's set. If we look at the __fatal_signal_pending() code, we see:
static inline int __fatal_signal_pending(struct task_struct *p)
{
	return unlikely(sigismember(&p->pending.signal, SIGKILL));
}
But it turns out this is a bit of a red herring (although it took a while for me to understand that).
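The shared versus per-thread distinction above is easy to see from userspace. The following sketch (a side demonstration, not from the original post) blocks two signals, sends one to the process and one to the calling thread, and then prints the SigPnd/ShdPnd lines from /proc: the process-directed signal lands in ShdPnd, the thread-directed one in SigPnd.

#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Block both signals so they stay pending instead of being delivered. */
	sigset_t set;
	sigemptyset(&set);
	sigaddset(&set, SIGUSR1);
	sigaddset(&set, SIGUSR2);
	sigprocmask(SIG_BLOCK, &set, NULL);

	kill(getpid(), SIGUSR1);		/* process-directed: shared pending set (ShdPnd) */
	pthread_kill(pthread_self(), SIGUSR2);	/* thread-directed: per-thread pending set (SigPnd) */

	/* Print the pending-signal lines, as with the grep above. */
	char line[256];
	FILE *f = fopen("/proc/self/status", "r");
	while (f && fgets(line, sizeof(line), f))
		if (strstr(line, "Pnd"))
			fputs(line, stdout);
	if (f)
		fclose(f);
	return 0;
}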
To understand what's really going on here, we need to look at complete_signal(), since it unconditionally adds a SIGKILL to the task's pending set:
sigaddset(&t->pending.signal, SIGKILL);
but why doesn't it work? At the top of the function we have:
/*
 * Now find a thread we can wake up to take the signal off the queue.
 *
 * If the main thread wants the signal, it gets first crack.
 * Probably the least surprising to the average bear.
 */
if (wants_signal(sig, p))
	t = p;
else if ((type == PIDTYPE_PID) || thread_group_empty(p))
	/*
	 * There is just one thread and it does not need to be woken.
	 * It will dequeue unblocked signals before it runs again.
	 */
	return;
but as Eric Biederman described, basically every thread can handle a SIGKILL at any time. Here's wants_signal():
static inline bool wants_signal(int sig, struct task_struct *p)
{
	if (sigismember(&p->blocked, sig))
		return false;

	if (p->flags & PF_EXITING)
		return false;

	if (sig == SIGKILL)
		return true;

	if (task_is_stopped_or_traced(p))
		return false;

	return task_curr(p) || !task_sigpending(p);
}
So… if a thread is already exiting (i.e. it has PF_EXITING), it doesn't want a signal. Consider the following sequence of events:
1. a task opens a FUSE file, and doesn't close it, then exits. During that exit, the kernel dutifully calls do_exit(), which does the following:
exit_signals(tsk);	/* sets PF_EXITING */
2. do_exit() continues on to exit_files(tsk), which flushes all files that are still open, resulting in the stack trace above.
3. the pid namespace exits, and enters zap_pid_ns_processes(), which sends a SIGKILL to everyone (that it expects to be fatal), and then waits for everyone to exit.
4. this kills the FUSE daemon in the pid ns, so it can never respond.
5. complete_signal() for the FUSE task that was already exiting ignores the signal, since it has PF_EXITING.
6. Deadlock. Without manually aborting the FUSE connection, things will hang forever.
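The manual escape hatch mentioned in step 6 is the abort file that FUSE exposes under /sys/fs/fuse/connections/, described in the kernel's FUSE documentation: writing to it forces outstanding requests on that connection to complete with errors, which lets the stuck flush return. A minimal sketch, with a placeholder connection id that you would have to look up for the wedged mount:

#include <stdio.h>

int main(void)
{
	/* "42" is a placeholder; the real id is the device number of the
	 * wedged FUSE mount, listed under /sys/fs/fuse/connections/. */
	FILE *f = fopen("/sys/fs/fuse/connections/42/abort", "w");
	if (!f) {
		perror("fopen");
		return 1;
	}
	fputs("1\n", f);	/* writing here aborts the connection */
	return fclose(f) ? 1 : 0;
}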
It doesn't really make sense to wait for flushes in this case: the task is dying, so there is nobody to tell the return code of flush() to. It also turns out that this bug can happen with several filesystems (anything that calls the kernel's wait code in flush(), i.e. basically anything that talks to something outside the local kernel).
Individual filesystems will need to be patched in the meantime; for example, the fix for FUSE is here, which was released on April 23 in Linux 6.3.
While this blog post addresses FUSE deadlocks, there are definitely issues in the nfs code and elsewhere, which we have not hit in production yet, but almost certainly will. You can also see it as a symptom of other filesystem bugs. Something to look out for if you have a pid namespace that won't exit.
This is just a small taste of the variety of strange issues we encounter running containers at scale at Netflix. Our team is hiring, so please reach out if you also love red herrings and kernel deadlocks!