
Screwing up my page tables

2023-06-13 02:48:03

Lately, the epic yak shave that is my life dragged me down into rectangular pits of page table corruption. I've made it out the other side, with a primary takeaway of "how the hell did this ever work?", followed closely by "what the hell is still hiding?".

Picture this: you're carefree, happy, "hey, I should rewrite axle's network stack in Rust!". Sinister violin sting. Sad oboe, crying clown.

OK, this can't be too bad. Let's start with the basics: we'll fire up the network card driver to jog our memory about where we left things off.

Right off the bat, we can see the pointer fields expect a 32-bit address space. I clearly never updated this section after upgrading axle to support x86_64.

Hopefully this section won't just silently run into bugs on x86_64, though? Let's keep looking.

Ah ha! Thanks, past-me. This driver wants to call amc_physical_memory_region_create(), but I left myself a quick note that this code path needs another look. amc_physical_memory_region_create() allows the caller to allocate a region of physically contiguous memory, and I can see that I removed this capability from axle's physical memory allocator sometime in the past. I inserted assertions so that any code path that tries to allocate a physically contiguous region will alert me, upon execution, that the time has come for more physical memory allocator work.

Well, the time has come for more physical memory allocator work. The current design, implemented in C, just maintains a stack of free physical frames. It's a bit difficult to identify and allocate contiguous regions with only a stack for storage, and adjusting our design will make things a lot easier for us. Let's reserve a region for contiguous allocations upon startup, and allocate from it separately.

However, I can't be bothered to do it in C.

Rust to the ruscue

Following my new SOP, I decided to rewrite my kernel-module-du-jour in Rust, and make further improvements to the module from there.

No worries, this module should be pretty straightforward! Let's get started by making just a one-to-one replacement of the C API. Then, we can add our contiguous chunk pool once we're sure the Rust-based module is working as intended.

The interface to the physical memory allocator is pretty simple, as is its work:

  • Upon boot in pmm_init(), read the UEFI-provided memory map to find physical memory regions.
  • For all the regions that the memory map marks as 'usable', excluding special frames we intentionally withhold from the physical memory allocator, track the frames in the allocator's free-frames stack.
  • When an allocation is requested via pmm_alloc(), pop a free frame from the free-frames stack.
  • When a frame is freed via pmm_free(), push the frame back onto the free-frames stack.

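The steps above can be sketched in miniature. This is a toy with a fixed-size stack and a made-up frame range, not axle's actual implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical constants for the sketch; axle's real values differ.
#define FRAME_SIZE 0x1000
#define MAX_FRAMES 64

static uint64_t free_frames[MAX_FRAMES];
static size_t free_frame_count = 0;

// pmm_free(): push a frame back onto the free-frames stack.
void pmm_free(uint64_t frame_addr) {
    assert(free_frame_count < MAX_FRAMES);
    free_frames[free_frame_count++] = frame_addr;
}

// pmm_alloc(): pop a free frame off the top of the stack (LIFO).
uint64_t pmm_alloc(void) {
    assert(free_frame_count > 0);
    return free_frames[--free_frame_count];
}

// pmm_init(): the real kernel walks the UEFI memory map here; this toy
// just seeds the stack with a short run of "usable" frames.
void pmm_init(void) {
    for (uint64_t addr = 0x100000; addr < 0x100000 + 8 * FRAME_SIZE; addr += FRAME_SIZE) {
        pmm_free(addr);
    }
}
```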
This can be implemented in well under a couple hundred lines, and there's nothing snazzy about it. Imagine my horror, then, when I turned on the new Rust-based allocator and started getting page faults during boot!

Bug #1: Buffer overflow in SMP bootstrap setup

Debugging memory bugs in kernel-space isn't fun, and at the worst of times can be an outright nightmare. Timing and different sequencing of events across test runs sometimes preclude bugs from manifesting, making it difficult to isolate whatever's going wrong.

We know that the bug doesn't manifest when using the C allocator, and does when using the Rust allocator. Maybe these allocators have slightly different allocation patterns?

That's interesting! The C allocator operates in LIFO: it pushes freed frames onto the top of its stack, such that the next allocation yields whatever frame was last freed. The Rust allocator, by contrast, operates in FIFO: freed frames are pushed to the back, and we'll only see them again once we exhaust all the frames we enqueued at the start.

Though noteworthy, the difference in behavior isn't surprising: in the C allocator, I'm using a small handwritten stack, which naturally has LIFO semantics. In the Rust allocator, I'm using heapless::Queue, which is FIFO by design. The difference in access pattern isn't a problem in and of itself, but the fact that different frames are now being used for early-boot structures is causing some deeper bug to crawl out of the shadowy depths.
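The contrast can be shown with two toy allocators side by side (illustrative only; the real code uses a handwritten C stack and heapless::Queue respectively):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CAP 8

// LIFO stack: the next allocation returns the most recently freed frame.
static uint64_t stack[CAP];
static size_t top = 0;
void stack_free(uint64_t f) { stack[top++] = f; }
uint64_t stack_alloc(void) { return stack[--top]; }

// FIFO queue: freed frames go to the back, and are only handed out again
// after everything enqueued earlier has been drained.
static uint64_t ring[CAP];
static size_t head = 0, tail = 0;
void queue_free(uint64_t f) { ring[tail++] = f; }
uint64_t queue_alloc(void) { return ring[head++]; }
```

Feed the same frames into both, and the very first allocation already differs, which is exactly why different frames end up backing early-boot structures.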

I started hacking at the codebase, lopping off chunks of functionality left and right, haphazardly disabling subsystems within which the bug might nest. Eventually, I managed to isolate a reproducing case to within SMP bringup.

As a refresher, whenever we boot another core on the machine during symmetric multiprocessing bringup, the core will invariably start booting in 16-bit mode, and we'll need to manually help it go through the motions of enabling 32-bit and 64-bit mode (or 'long mode'). Booting into 16-bit mode imposes some constraints: notably, we need to have some code at a known physical memory address below the 1MB mark that the core can begin executing from. In axle, frame 0x8000 is used for this purpose, and is called the SMP bootstrap code frame. axle places a hand-coded assembly routine into this frame that performs the long walk to long mode.

To make the transition to 64-bit mode more convenient, several of the data structures that this program needs are placed into another special frame, 0x9000, which axle denotes the SMP bootstrap data frame. Specific offsets within this frame hold pointers to the different data structures that the bootstrap program needs while transitioning to 64-bit mode, such as:

  • A 32-bit GDT
  • A 64-bit GDT containing 'low memory' pointers
  • A 64-bit GDT containing 'high memory' pointers
  • A 64-bit PML4
  • A 64-bit IDT
  • A high-memory stack pointer
  • A high-memory entry point to jump to once the bootstrap program completes
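A hypothetical C view of the data frame could look like this; the field names, ordering, and offsets are invented for illustration and are not axle's actual layout:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical layout of the SMP bootstrap data frame at 0x9000.
// Each field is a pointer (or value) the 16-bit bootstrap routine reads
// at a fixed offset while walking up to long mode.
typedef struct __attribute__((packed)) smp_bootstrap_data {
    uint64_t gdt32_ptr;      // 32-bit GDT descriptor
    uint64_t gdt64_low_ptr;  // 64-bit GDT with low-memory pointers
    uint64_t gdt64_high_ptr; // 64-bit GDT with high-memory pointers
    uint64_t pml4_ptr;       // top-level page table for the new core
    uint64_t idt_ptr;        // 64-bit IDT descriptor
    uint64_t stack_ptr;      // high-memory stack for the new core
    uint64_t entry_point;    // high-memory entry point to jump to at the end
} smp_bootstrap_data_t;
```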

While setting up the SMP bootstrap data frame, the kernel will create these structures, and copy the pointers to their expected locations within the data frame. For example, here's how the kernel sets up the C entry point address:

Here's how the kernel sets up the IDT address:

Wait, what? We're copying sizeof(idt_pointer_t) + current_idt->table_size bytes, but nobody was expecting us to copy the whole table! We're only meant to be copying a pointer to the table.

Double-checking the bootstrap program, we definitely didn't expect this; the bootstrap program just expects a pointer, not the full table data.

Taking a closer look at current_idt->table_size, we find that this table occupies 0xfff bytes. This means our flow above will write outside the allowed frame, which ends at 0xa000, overwriting whatever's occupying 0xa000 to 0xa408 with the back half of the kernel's interrupt descriptor table. This overflow didn't cause any visible errors with our previous frame allocation pattern, so it never came up before. However, with our new allocation pattern, this overflow is overwriting data in a way that leads directly to a page fault. Let's fix the overflow:
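The shape of the bug and the fix can be sketched as follows; the struct follows the names in the post, but the exact field definitions are assumptions:

```c
#include <assert.h>
#include <stdint.h>

// Assumed shape of the IDT descriptor; the real definition may differ.
typedef struct __attribute__((packed)) idt_pointer {
    uint16_t table_size; // size of the IDT itself, in bytes
    uint64_t table_base; // virtual address of the IDT
} idt_pointer_t;

// Buggy copy size: the 10-byte descriptor *plus* the entire table behind
// it, which spills past the end of the one-page bootstrap data frame.
uint64_t copy_len_buggy(const idt_pointer_t* current_idt) {
    return sizeof(idt_pointer_t) + current_idt->table_size;
}

// Fixed copy size: the bootstrap program only wants the descriptor, which
// itself points at the table.
uint64_t copy_len_fixed(const idt_pointer_t* current_idt) {
    (void)current_idt;
    return sizeof(idt_pointer_t);
}
```

With table_size at 0xfff, the buggy length exceeds an entire 0x1000-byte frame on its own, so any placement inside the 0x9000 frame overruns into 0xa000 and beyond.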


Changing our frame allocation strategy, an access pattern detail that ideally shouldn't drastically impact system behavior, caused several low-level memory bugs to come to light. Maybe we could intentionally randomize our frame allocation patterns, so we can deliberately seek out bugs in this class. We could also allow setting a seed for this randomization, to enable reproducible scenarios.
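A minimal sketch of what seeded allocation-order randomization could look like, using a small xorshift PRNG and a Fisher-Yates shuffle (entirely hypothetical; axle doesn't ship this):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t prng_state;

// Seed the PRNG; the same seed reproduces the same allocation order.
void prng_seed(uint64_t seed) { prng_state = seed ? seed : 1; }

// xorshift64: tiny, fast, and plenty random for shuffling.
uint64_t prng_next(void) {
    prng_state ^= prng_state << 13;
    prng_state ^= prng_state >> 7;
    prng_state ^= prng_state << 17;
    return prng_state;
}

// Fisher-Yates shuffle over the free-frame list before handing it to the
// allocator, so early-boot structures land in different frames each run.
void shuffle_frames(uint64_t* frames, size_t count) {
    for (size_t i = count - 1; i > 0; i--) {
        size_t j = (size_t)(prng_next() % (i + 1));
        uint64_t tmp = frames[i];
        frames[i] = frames[j];
        frames[j] = tmp;
    }
}
```

Two boots with the same seed would then exercise the same "random" frame pattern, which keeps a randomized-layout crash reproducible.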

All done, let's move on!

Bug #2: Overwriting paging constructions as a consequence of a truncated pointer

One bug squashed and I'm feeling buzzy. Let's keep going!

We've progressed a bit further into boot, but now we see something really odd. At some point, we get a page fault. However, the faulting address points to a location that should clearly be valid memory, such as bits of the kernel that are mapped into every address space. Furthermore, this page fault is much more resistant to the slice-things-off-until-the-bug-remains approach that we used above. The exact location of the page fault seems to jump around between runs. We're going to need to solve this quickly, as if I bite my fingernails any shorter I'll lose my elbows.

I fire up gdb and get to work tracking down what's going wrong. Eventually, I set a breakpoint in _handle_page_fault() and inspect the virtual address space state.

Here's what a normal virtual address space layout looks like for an early-boot snapshot:

All the ranges here match our expectations, and are laid out as follows:

  • 0x0000000000000000 - 0x0000001000000000: Early-boot physical RAM identity map
  • 0xffff800000000000 - 0xffff801000000000: High-memory physical RAM remap
  • 0xffff900000000000 - 0xffff900000045000: Kernel heap
  • 0xffffa00000000000 - 0xffffa00000001000: CPU core-specific storage page
  • 0xffffffff80000000 - 0xffffffff8221a000: Kernel code and data

The virtual address space state once our page fault breakpoint triggers:

That looks… terrifying, not to mention nonsensical. What are all these regions? Why are they in low memory? Why does the size of one region correspond to the base address of the next region? What am I doing with my life?

Things start to feel much more manageable once you realize that the nonsensical address space mappings we've listed above might be the result of our paging structures themselves being overwritten by incorrect memory accesses. If you overwrite your own description of how the virtual address space should be laid out, you can bet you're going to get weird violations on addresses that should clearly be valid.

This really takes the edge off those scary faults, and now we can get to work tracking down the exact subsystem responsible for the corrupted page tables.

I notice something supremely odd: when creating the structure backing a new process, the kernel heap appears to be returning addresses in very low memory, rather than addresses above 0xffff900000000000 (the base of the kernel heap).

Okay, this is starting to make sense!

  • For ~some reason~, allocating a new task block returns an address in low memory.
  • When we populate this task block, we overwrite this low memory.
  • Paging structures are some of the first frames allocated by the PMM, and therefore reside in low memory.
  • Paging structures get overwritten with contextually bogus data, resulting in mega-invalid virtual address spaces and page access violations at addresses that should be valid.

The big question now is, why is the kernel heap returning addresses in low memory, rather than addresses within the kernel heap?

After poking around in gdb further, it turns out that the return value from kmalloc() is a valid pointer within the heap. This pointer is stored in $rax, then… is truncated by the very next instruction?! Let's take a look at the assembly.

Hmm, I'm not familiar with cdqe. A quick search reveals that it sign-extends the lower 4 bytes into the full 8 bytes. Ah ha! A flash of insight illuminates my grasping waddles. Could it be that kmalloc() is implicitly declared at the call site, so the compiler assumes the return type is a 32-bit int, chucking out the top half of the pointer value?

Indeed, this is exactly what's going on. A simple #include at the top of task_small.c fixes this right up, and no more page table corruption for us.
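The truncation can be reproduced in miniature. This models what the compiler effectively emits when it believes kmalloc() returns a 32-bit int: keep only the low register half, then cdqe-style sign-extend it back to 64 bits:

```c
#include <assert.h>
#include <stdint.h>

// Model of an implicitly-declared kmalloc() call: the compiler assumes a
// 32-bit int return, keeping only $eax, then sign-extends it into $rax.
uint64_t truncate_like_cdqe(uint64_t ptr) {
    int32_t low_half = (int32_t)ptr;     // only the low 32 bits survive
    return (uint64_t)(int64_t)low_half;  // cdqe: sign-extend eax into rax
}
```

Any kernel-heap pointer above 4GB comes back mangled, which is how valid kmalloc() results turned into low-memory addresses.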

I'll finish this section with the unedited notes I took as I debugged this.


Warnings are useful, and an appetite for ignoring them will wind you up with a bellyful of pain. I enabled -Werror=implicit-function-declaration to ensure that implicitly declared functions are always a hard error, as this is evidently very dangerous: any call site invoking an implicitly declared function, expecting to receive a pointer back, is going to have a very bad time. I then cleaned up all the implicitly declared functions in the kernel.

Also, it's not great that producing an invalid low pointer causes our paging structures to get overwritten, as these structures are essential to maintaining a system with some semblance of coherence, and their corruption makes debugging far more confusing, difficult, and unpredictable. Userspace programs typically don't map the bottom region of the address space (or map it as inaccessible). This so-called 'NULL region' is very useful for debugging: it means that if a bug ever leads you to access low memory, you'll get a quick crash instead of overwriting important state. This is why you get a 'segmentation fault' when dereferencing a NULL pointer!

Since we're working in the context of physical memory, we don't have the luxury of mapping frames however we like. If we want to exclude the bottom section of the address space, we have to 'waste' that RAM: we can't use it for anything else. Maybe this is a worthwhile tradeoff. I'm imagining marking the bottom few frames as a 'scribble region': all zeroes to start, with some kind of background job that watches for any non-zero data popping up in these frames. If it does, we know that some part of the system has accessed an invalid pointer. Since accessing an invalid low pointer is only possible while the kernel's low-memory identity map is active, we could free this scribble region as soon as we disable the low-memory identity map, and give the reserved frames back to the physical memory allocator.

Surely we're done..?

Bug #3: Overwriting paging structures due to bad pointer math in the allocator

I think I'm going to throw up.

We've got another bug in which the paging structures in low memory are being overwritten while cloning a virtual address space. Let's take a look at our allocation pattern again…

Oh, gnarly! At first glance, our PMM is double-allocating frames; see how 0x0000000000001000 is allocated once near the start, then again much later on?

On closer inspection, the PMM starts this double-allocating as soon as the frame address ticks over to 0x0000000100000000 (i.e. above 4GB). This triggers my spidey-senses that something, somewhere, isn't handling 64-bit addresses correctly.

Sure enough, I blithely translated some old code into Rust:

I felt pretty silly when I saw this one. I'm actually throwing away all the bits above u32::MAX, since the mask I'm applying is 0x00000000fffff000. Therefore, any addresses above 4GB that get enqueued are truncated to their low-memory cousins. When these truncated frame addresses get allocated, I'm telling each allocator client "no really, go ahead and write to 0x8000, it's yours!". This frame is already being used by something else in the system, and chaos ensues.

The fix here is really simple! I moved away from the 'enable absolute bits' approach to a 'disable relative bits' approach, which automatically masks to the correct addresses regardless of the target's word size.
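The original code is Rust, but the same idea reads identically in C; the function names here are mine, and the mask values follow the post:

```c
#include <assert.h>
#include <stdint.h>

// Buggy: "enable absolute bits". The literal only covers bits [12..32),
// so everything above 4GB is silently chopped off.
uint64_t frame_base_buggy(uint64_t addr) {
    return addr & 0x00000000fffff000ull;
}

// Fixed: "disable relative bits". Clear the low 12 offset bits and keep
// everything else, no matter how wide the address is.
uint64_t frame_base_fixed(uint64_t addr) {
    return addr & ~0xfffull;
}
```

The fixed form also stays correct if the page size ever changes: you only ever name the bits you want to remove.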


Bug #4: Top portion of address space is missing due to bad math in the page mapper

This one was weird. Over the course of writing the contiguous chunk pool in the new Rust-based PMM, I introduced a new piece of global state to manage the pool.

Since this data is declared static, the storage for this structure will be reserved when the kernel is loaded into memory by axle's UEFI bootloader.

As it happens, the build system ended up placing this structure at the very top of the address space reserved for axle's kernel.

But when this code runs, we get a page fault when trying to access this structure.

Huh, that's odd. The bootloader sets up memory regions based on whatever is described by the kernel binary:

When we run this, the segment containing our PMM's new data structure appears to be mapped correctly.

And yet, we get a page fault upon access! What do we see when we interrogate the actual state of the virtual address space, disregarding our intention for what the address space should look like?

Yeah, that's it! We're intending for data to be mapped into the virtual address space at 0xffffffff82217fff, but, for some reason, our virtual address space stops at 0xffffffff82200000. When we try to access a data structure that happens to be stored in this bad region, we get a page fault.

But why are we missing the top region of the address space? A likely suspect for investigation is the logic that sets up a virtual address space mapping based on input parameters, and indeed this is where our problems lie.

The logic within the bootloader that maps regions of the virtual address space looks roughly like this:

The logic above essentially has two jobs, which it interleaves:

  1. Map the requested virtual address space region
  2. Work out which paging structures to use to map each frame

The loops above work just fine when all the paging tables we're touching along the way are empty. However, if one or more of the tables we're adding data to is already partially full, the loop won't populate everything we requested. It calculates how many tables it would take to satisfy the request if all of those tables were empty to begin with. If some of those tables were already storing virtual address mappings, though, we won't map as much memory as was requested, and will silently produce an unexpected virtual address space state!

The fix here is conceptually straightforward: rather than trying to compute upfront exactly how many structures we'll need to iterate over, let's just map pages one by one until we've satisfied the request, and look up the correct paging structures for each page on the fly:
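Here's a toy model of the fixed approach, using a tiny two-level scheme (4 slots per table) rather than real x86_64 four-level paging; the shape of the fix is the same:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Toy paging model: 8 "tables" of 4 entries each, 4KB pages.
#define SLOTS_PER_TABLE 4
#define TABLE_COUNT 8
#define PAGE_SIZE 0x1000

static bool table_present[TABLE_COUNT];
static bool entries[TABLE_COUNT][SLOTS_PER_TABLE];

// Map one page at a time, looking up (and creating, if absent) the table
// covering each page on the fly. No precomputed table counts, so a
// partially full table can't make us under-map the request.
void map_region(uint64_t vaddr, size_t page_count) {
    for (size_t i = 0; i < page_count; i++) {
        uint64_t page_idx = vaddr / PAGE_SIZE + i;
        size_t table_idx = (size_t)(page_idx / SLOTS_PER_TABLE);
        size_t slot_idx = (size_t)(page_idx % SLOTS_PER_TABLE);
        assert(table_idx < TABLE_COUNT);
        if (!table_present[table_idx]) {
            table_present[table_idx] = true; // allocate a fresh table
        }
        entries[table_idx][slot_idx] = true; // install the mapping
    }
}

bool is_mapped(uint64_t vaddr) {
    uint64_t page_idx = vaddr / PAGE_SIZE;
    size_t table_idx = (size_t)(page_idx / SLOTS_PER_TABLE);
    return table_present[table_idx] && entries[table_idx][page_idx % SLOTS_PER_TABLE];
}
```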

Bug #5: Transitive heap allocations cause crashes in the PMM

Over the course of debugging the Rust-based PMM, I naturally printed out some state at various interesting points:

Upon execution, though, I would get surprising failures about an exhausted virtual address space.

Taking a peek at the virtual address space mappings like we learned above, we can see telltale signs that things have definitely gone awry.

As it turns out, the issue here is that the println!() macro transitively kicks off a heap allocation.

Performing a heap allocation within the physical memory manager is decidedly not allowed, as it inverts the dependency tree. The physical memory manager sits at the bottom of the stack, and the virtual memory manager sits above it. The kernel heap is just a client of the virtual memory manager, from which it requests and frees chunks of virtual address space. In practice, trying to perform heap allocations from the physical memory allocator corrupts the virtual address space state and, consequently, produces confusing crashes. The fix for this particular case is to invoke axle's C-based printf function directly via FFI, rather than going through the formatting machinery. Of course, this means I need to do a bit of work by hand, such as remembering to null-terminate the Rust-originating strings that I pass.
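One allocation-free pattern, sketched from the C side: format into a fixed-size stack buffer and hand the NUL-terminated result to a raw output routine. Here raw_puts is a stand-in for whatever low-level output primitive is available (in axle's case, the C printf reached over FFI):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <stdint.h>

// Capture buffer standing in for the real output sink, so the sketch is
// testable; a kernel would write to the serial port or framebuffer here.
static char last_output[128];
void raw_puts(const char* msg) {
    strncpy(last_output, msg, sizeof(last_output) - 1);
}

// Allocation-free log helper: everything lives on the stack, so it's safe
// to call from inside the physical memory manager.
void pmm_log(const char* tag, uint64_t frame_addr) {
    char buf[128];
    // snprintf always NUL-terminates within the buffer bound.
    snprintf(buf, sizeof(buf), "[PMM] %s frame 0x%llx", tag,
             (unsigned long long)frame_addr);
    raw_puts(buf);
}
```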

I do wish Rust offered some way to express and enforce that a particular module isn't allowed to allocate!
