Screwing up my page tables
Recently, the epic yak shave that is my life dragged me down into rectangular pits of page table corruption. I've made it out the other side, with a primary takeaway of "how the hell did this ever work?", followed closely by "what the hell is still hiding?".
Picture this: you're carefree, healthy, "hey, I should rewrite axle's network stack in Rust!". Sinister snap of the violin. Sad oboe, crying clown.
OK, this can't be too bad. Let's start with the basics: we'll fire up the network card driver to jog our memory about where we left things off.
userspace/realtek_8139_driver/realtek_8139_driver.c
typedef struct rtl8139_state {
// Configuration encoded into the device
uint8_t tx_buffer_register_ports[4];
uint8_t tx_status_command_register_ports[4];
// Configuration read from PCI bus
uint32_t pci_bus;
uint32_t pci_device_slot;
uint32_t pci_function;
uint32_t io_base;
// Configuration assigned by this driver
uint32_t receive_buffer_virt;
uint32_t receive_buffer_phys;
uint32_t transmit_buffer_virt;
uint32_t transmit_buffer_phys;
// Current running state
uint8_t tx_round_robin_counter;
uintptr_t rx_curr_buf_off;
} rtl8139_state_t;
Right off the bat, we can see the pointer fields expect a 32-bit address space. I clearly never updated this code after upgrading axle to support x86_64.
Hopefully this code won't just silently run into bugs on x86_64, though? Let's keep looking.
void realtek_8139_init()
userspace/realtek_8139_driver/realtek_8139_driver.c
// Set up RX and TX buffers, and give them to the device
uintptr_t rx_buffer_size = 8192 + 16 + 1500;
Deprecated("revisit");
//amc_physical_memory_region_create(rx_buffer_size, &virt_memory_rx_addr, &phys_memory_rx_addr);
Ah ha! Thanks, past-me. This driver wants to call amc_physical_memory_region_create(), but I left myself a quick note that this code path needs another look. amc_physical_memory_region_create() allows the caller to allocate a region of physically contiguous memory, and I can see that I removed this capability from axle's physical memory allocator sometime in the past. I inserted assertions so that any code path that tries to allocate a physically contiguous region will alert me, upon execution, that the time has come for more physical memory allocator work.
Well, the time has come for more physical memory allocator work. The current design, implemented in C, simply maintains a stack of free physical frames. It's a bit difficult to identify and allocate contiguous regions with only a stack for storage, and adjusting our design will make things a lot easier for us. Let's reserve a region for contiguous allocations upon startup, and allocate from it separately.
However, I can't be bothered to do it in C.
Rust to the ruscue
Following my new SOP, I decided to rewrite my kernel-module-du-jour in Rust, and make further improvements to the module from there.
No worries, this module should be pretty easy! Let's get started by making just a one-to-one replacement for the C API. Then, we can add our contiguous chunk pool once we're sure the Rust-based module is working as intended.
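Jumping ahead a little, the pool itself could be as simple as a bump allocator over a region reserved at boot. Here's a rough sketch; the names and layout are illustrative guesses, not axle's actual ContiguousChunkPool:
const FRAME_SIZE: usize = 0x1000;

// A fixed physical region reserved at boot, carved up with a bump pointer
struct ContiguousPool {
    base: usize,
    size: usize,
    next_free_offset: usize,
}

impl ContiguousPool {
    const fn new(base: usize, size: usize) -> Self {
        ContiguousPool { base, size, next_free_offset: 0 }
    }

    // Hand out a physically contiguous run of frames, or None if the pool is spent
    fn alloc_contiguous(&mut self, byte_count: usize) -> Option<usize> {
        // Round the request up to a whole number of frames
        let rounded = (byte_count + FRAME_SIZE - 1) & !(FRAME_SIZE - 1);
        if self.next_free_offset + rounded > self.size {
            return None;
        }
        let addr = self.base + self.next_free_offset;
        self.next_free_offset += rounded;
        Some(addr)
    }
}

fn main() {
    let mut pool = ContiguousPool::new(0x0010_0000, 64 * FRAME_SIZE);
    // Requests get rounded up to frame granularity
    assert_eq!(pool.alloc_contiguous(8 * FRAME_SIZE + 1), Some(0x0010_0000));
    assert_eq!(pool.alloc_contiguous(FRAME_SIZE), Some(0x0010_9000));
}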
kernel/kernel/pmm/pmm.h
void pmm_init(void);
uintptr_t pmm_alloc(void);
void pmm_free(uintptr_t frame_addr);
The interface to the physical memory allocator is pretty simple, as is its work:
- Upon boot in pmm_init(), read the UEFI-provided memory map to find physical memory regions.
- For all the regions that are marked in the memory map as 'usable', excluding specific frames we deliberately keep out of the physical memory allocator, track the frames in the allocator's free-frames stack.
- When an allocation is requested via pmm_alloc(), pop a free frame from the free-frames stack.
- When a frame is freed via pmm_free(), push the frame back onto the free-frames stack.
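For reference, here's a minimal Rust sketch of that interface. It's illustrative only (fixed capacity, no real memory-map parsing), not axle's actual code:
const FRAME_SIZE: usize = 0x1000;
const MAX_FRAMES: usize = 1024; // illustrative capacity

struct Pmm {
    // The free-frame list is just a stack of physical frame addresses
    free_frames: [usize; MAX_FRAMES],
    count: usize,
}

impl Pmm {
    const fn new() -> Self {
        Pmm { free_frames: [0; MAX_FRAMES], count: 0 }
    }

    // pmm_init(): the real kernel walks the UEFI memory map; here we just take
    // one pretend 'usable' region and push each of its frames
    fn init(&mut self, usable_base: usize, usable_size: usize) {
        let mut frame = usable_base;
        while frame + FRAME_SIZE <= usable_base + usable_size {
            self.free(frame);
            frame += FRAME_SIZE;
        }
    }

    // pmm_alloc(): pop a free frame from the free-frames stack
    fn alloc(&mut self) -> Option<usize> {
        if self.count == 0 {
            return None;
        }
        self.count -= 1;
        Some(self.free_frames[self.count])
    }

    // pmm_free(): push the frame back onto the free-frames stack
    fn free(&mut self, frame_addr: usize) {
        assert!(self.count < MAX_FRAMES);
        self.free_frames[self.count] = frame_addr;
        self.count += 1;
    }
}

fn main() {
    let mut pmm = Pmm::new();
    pmm.init(0x0010_0000, 16 * FRAME_SIZE);
    let frame = pmm.alloc().unwrap();
    pmm.free(frame);
}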
This can be implemented in well under a couple hundred lines, and there's nothing snazzy about it. Imagine my horror, then, when I turned on the new Rust-based allocator and started getting page faults during boot!
Bug #1: Buffer overflow in SMP bootstrap setup
Debugging memory bugs in kernel-space isn't fun, and at the worst of times can be an outright nightmare. Timing and different sequencing of events across test runs sometimes preclude bugs from manifesting, making it difficult to isolate whatever is going wrong.
We know that the bug doesn't manifest when using the C allocator, and does when using the Rust allocator. Maybe these allocators have slightly different allocation patterns?
// Allocation pattern for the C allocator
pmm_alloc() = 0x0000000000000000
pmm_alloc() = 0x0000000000001000
pmm_alloc() = 0x0000000000002000
pmm_free( 0x00000000be411000)
pmm_alloc() = 0x00000000be411000
pmm_alloc() = 0x0000000000003000
pmm_alloc() = 0x0000000000004000
// Allocation pattern for the Rust allocator
pmm_alloc() = 0x0000000000000000
pmm_alloc() = 0x0000000000001000
pmm_alloc() = 0x0000000000002000
pmm_free( 0x00000000be411000)
pmm_alloc() = 0x0000000000003000
pmm_alloc() = 0x0000000000004000
pmm_alloc() = 0x0000000000005000
That's interesting! The C allocator operates in LIFO: it pushes freed frames to the top of its stack, such that the next allocation yields whatever frame was last freed. The Rust allocator, in contrast, operates in FIFO: freed frames are pushed to the back, and we'll only see them again once we exhaust all the frames we enqueued at the start.
Although noteworthy, the difference in behavior isn't surprising: in the C allocator, I'm using a small handwritten stack, which naturally has LIFO semantics. In the Rust allocator, I'm using heapless::Queue, which is FIFO by design. The difference in access pattern isn't a problem in and of itself, but the fact that different frames are now being used for early-boot structures is causing some deeper bug to crawl out of the shadowy depths.
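To make the difference concrete, here's a tiny illustration (using std containers, purely for demonstration) of why the two orders diverge: a stack hands the just-freed frame straight back, while a queue only returns it after draining everything enqueued before it:
use std::collections::VecDeque;

fn main() {
    // LIFO, like the C allocator's handwritten stack
    let mut stack = vec![0x1000_usize, 0x2000, 0x3000];
    stack.push(0xbe411000); // pmm_free()
    assert_eq!(stack.pop(), Some(0xbe411000)); // the next pmm_alloc() reuses it immediately

    // FIFO, like the Rust allocator's queue
    let mut queue: VecDeque<usize> = VecDeque::from([0x1000, 0x2000, 0x3000]);
    queue.push_back(0xbe411000); // pmm_free()
    assert_eq!(queue.pop_front(), Some(0x1000)); // the freed frame only comes back much later
}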
I started hacking at the codebase, lopping off chunks of functionality left and right, haphazardly disabling subsystems within which the bug might nest. Eventually, I managed to isolate a reproducing case to within SMP bringup.
As a refresher, whenever we boot another core on the machine during symmetric multiprocessing bringup, the core will invariably start booting in 16-bit mode, and we'll need to manually help it go through the motions of enabling 32-bit and 64-bit mode (or 'long mode'). Booting into 16-bit mode imposes some constraints: notably, we need to have some code at a known physical memory address beneath the 1MB mark that the core can begin executing from. In axle, frame 0x8000 is used for this purpose, and is called the SMP bootstrap code frame. axle places a hand-coded assembly routine into this frame that performs the long walk to long mode.
To make the transition to 64-bit mode more convenient, several of the data structures that this program needs are placed into another special frame, 0x9000, which axle denotes the SMP bootstrap data frame. Specific offsets within this frame hold pointers to different data structures that the bootstrap program needs while transitioning to 64-bit mode, such as:
- A 32-bit GDT
- A 64-bit GDT containing 'low memory' pointers
- A 64-bit GDT containing 'high memory' pointers
- A 64-bit PML4
- A 64-bit IDT
- A high-memory stack pointer
- A high-memory entry point to jump to once the bootstrap program completes
kernel/kernel/ap_bootstrap.h
#define AP_BOOTSTRAP_PARAM_OFFSET_PROTECTED_MODE_GDT 0x0
#define AP_BOOTSTRAP_PARAM_OFFSET_LONG_MODE_LOW_MEMORY_GDT 0x100
#define AP_BOOTSTRAP_PARAM_OFFSET_LONG_MODE_HIGH_MEMORY_GDT 0x200
#define AP_BOOTSTRAP_PARAM_OFFSET_PML4 0x300
#define AP_BOOTSTRAP_PARAM_OFFSET_IDT 0x400
#define AP_BOOTSTRAP_PARAM_OFFSET_C_ENTRY 0x500
#define AP_BOOTSTRAP_PARAM_OFFSET_STACK_TOP 0x600
While setting up the SMP bootstrap data frame, the kernel creates these structures and copies pointers to them into their expected offsets within the data frame. For example, here's how the kernel sets up the C entry point address:
kernel/kernel/smp.c
// Copy the C entry point
uintptr_t ap_c_entry_point_addr = (uintptr_t)&ap_c_entry;
memcpy((void*)PMA_TO_VMA(AP_BOOTSTRAP_PARAM_C_ENTRY), &ap_c_entry_point_addr, sizeof(ap_c_entry_point_addr));
Here's how the kernel sets up the IDT address:
kernel/kernel/smp.c
// Copy the IDT pointer
idt_pointer_t* current_idt = kernel_idt_pointer();
// It's fine to copy the high-memory IDT since the bootstrap will enable paging before loading it
memcpy(
(void*)PMA_TO_VMA(AP_BOOTSTRAP_PARAM_IDT),
current_idt,
sizeof(idt_pointer_t) + current_idt->table_size
);
Wait, what? We're copying sizeof(idt_pointer_t) + current_idt->table_size bytes, but nobody was expecting us to copy the whole table! We're only meant to be copying a pointer to the table.
Double-checking the bootstrap program, we definitely didn't expect this; the bootstrap program just expects a pointer, not the full table data.
ap_bootstrap.s.x86_64.arch_specific
idt_ptr: dd 0x9400
; Load the IDT
mov eax, [$idt_ptr]
lidt [eax]
Taking a closer look at current_idt->table_size, we find that this table occupies 0xfff bytes. The IDT pointer lives at offset 0x400 within the bootstrap data frame, i.e. at 0x9400, so copying sizeof(idt_pointer_t) + 0xfff bytes from there runs past the end of the frame at 0xa000, overwriting whatever occupies 0xa000 to 0xa408 with the back half of the kernel's interrupt descriptor table. This overflow didn't cause any visible errors with our earlier frame allocation pattern, so it never came up before. However, with our new allocation pattern, this overflow is overwriting data in a way that leads directly to a page fault. Let's fix the overflow:
kernel/kernel/smp.c
// Copy the IDT pointer
idt_pointer_t* current_idt = kernel_idt_pointer();
// It's fine to copy the high-memory IDT since the bootstrap will enable paging before loading it
memcpy(
(void*)PMA_TO_VMA(AP_BOOTSTRAP_PARAM_IDT),
current_idt,
sizeof(idt_pointer_t)
);
Lesson
Changing our frame allocation strategy, an access-pattern detail that ideally shouldn't drastically impact system behavior, caused a number of low-level memory bugs to come to light. Maybe we could intentionally randomize our frame allocation patterns, so that we can deliberately seek out bugs in this class. We could also allow setting a seed for this randomization, to enable reproducible scenarios.
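A minimal sketch of that idea: shuffle the free-frame list with a seedable PRNG at boot, so each run exercises a different but reproducible allocation order. The xorshift PRNG and shuffle below are illustrative, not existing axle code:
// A tiny seedable PRNG so the shuffle is reproducible from a boot-time seed
struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    fn new(seed: u64) -> Self {
        XorShift64 { state: seed.max(1) }
    }

    fn next(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }
}

// Fisher-Yates shuffle over the free-frame list, keyed by the seed
fn shuffle_free_frames(frames: &mut [usize], seed: u64) {
    let mut rng = XorShift64::new(seed);
    for i in (1..frames.len()).rev() {
        let j = (rng.next() % (i as u64 + 1)) as usize;
        frames.swap(i, j);
    }
}

fn main() {
    let mut frames: Vec<usize> = (0..8_usize).map(|i| i * 0x1000).collect();
    shuffle_free_frames(&mut frames, 0xdead_beef);
    println!("{:x?}", frames);
}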
All done, let's move on!
Bug #2: Overwriting paging structures due to a truncated pointer
One bug squashed and I'm feeling buzzy. Let's keep going!
We've progressed a bit further into boot, but now we see something really odd. At some point, we get a page fault. However, the faulting address points to a location that should clearly be valid memory, such as bits of the kernel that are mapped into every address space. What's more, this page fault is much more resistant to the slice-things-off-until-the-bug-remains approach that we used above. The exact location of the page fault seems to jump around between different runs. We're going to have to solve this fast, because if I bite my fingernails any shorter I'll lose my elbows.
I fire up gdb and get to work tracking down what's going wrong. Eventually, I set a breakpoint in _handle_page_fault() and inspect the virtual address space state.
Here's what a normal virtual address space layout looks like for an early-boot snapshot:
0x0000000000000000 - 0x0000001000000000 -rw
0xffff800000000000 - 0xffff801000000000 -rw
0xffff900000000000 - 0xffff900000045000 -rw
0xffffa00000000000 - 0xffffa00000001000 -rw
0xffffffff80000000 - 0xffffffff8221a000 -rw
All the ranges here match our expectations, and are further specified in memory_map.md:
- 0x0000000000000000 - 0x0000001000000000: Early-boot physical RAM identity map
- 0xffff800000000000 - 0xffff801000000000: High-memory physical RAM remap
- 0xffff900000000000 - 0xffff900000045000: Kernel heap
- 0xffffa00000000000 - 0xffffa00000001000: CPU core-specific storage page
- 0xffffffff80000000 - 0xffffffff8221a000: Kernel code and data
The virtual address space state once our page fault breakpoint triggers:
0x0000000000000000 - 0x00000000bee00000 -rw
0x00000000bee00000 - 0x00000000bf000000 -r-
0x00000000bf000000 - 0x00000000bfa5b000 -rw
0x00000000bfa5b000 - 0x00000000bfa5c000 -r-
0x00000000bfa5c000 - 0x00000000bfa5e000 -rw
0x00000000bfa5e000 - 0x00000000bfa5f000 -r-
0x00000000bfa5f000 - 0x00000000bfa61000 -rw
0x00000000bfa61000 - 0x00000000bfa63000 -r-
0x00000000bfa63000 - 0x00000000bfa65000 -rw
0x00000000bfa65000 - 0x00000000bfa66000 -r-
0x00000000bfa66000 - 0x00000000bfa68000 -rw
0x00000000bfa68000 - 0x00000000bfac3000 -r-
0x00000000bfac3000 - 0x00000000bfadf000 -rw
0x00000000bfadf000 - 0x00000000bfae0000 -r-
0x00000000bfae0000 - 0x00000000bfae3000 -rw
0x00000000bfae3000 - 0x00000000bfae4000 -r-
0x00000000bfae4000 - 0x00000000bfae7000 -rw
0x00000000bfae7000 - 0x00000000bfae8000 -r-
0x00000000bfae8000 - 0x00000000bfaeb000 -rw
0x00000000bfaeb000 - 0x00000000bfaed000 -r-
0x00000000bfaed000 - 0x00000000bfc00000 -rw
0x00000000bfc00000 - 0x00000000bfe00000 -r-
0x00000000bfe00000 - 0x0000001000000000 -rw
That looks… terrifying, not to mention nonsensical. What are all these regions? Why are they in low memory? Why does the size of one region correspond to the base address of the next region? What am I doing with my life?
Things start to feel a lot more manageable once you realize that the nonsensical address space mappings listed above might be the result of our paging structures themselves being overwritten by errant memory accesses. If you overwrite your own description of how the virtual address space should be laid out, you can bet you're going to get weird violations on addresses that should clearly be valid.
This really takes the sting out of these scary faults, and now we can get to work tracking down the exact subsystem responsible for the corrupted page tables.
I notice something supremely odd: when creating the structure backing a new process, the kernel heap appears to be returning addresses in very low memory, rather than addresses above 0xffff900000000000 (the base of the kernel heap).
task_small_t* _thread_create(void*, uintptr_t, uintptr_t, uintptr_t)
kernel/kernel/multitasking/tasks/task_small.c
task_small_t* new_task = kmalloc(sizeof(task_small_t));
printf("Debug: Allotted new process 0xpercentpn", new_task);
syslog.log
Cpu[0],Pid[10],Clk[4661]: [10] Debug: Allocated new task 0x0000000000002110
Cpu[0],Pid[10],Clk[4662]: [10] Debug: Allocated new task 0x00000000000045c0
Okay, this is starting to make sense!
- For ~some reason~, allocating a new task structure returns an address in low memory.
- When we populate this task structure, we overwrite this low memory.
- Paging structures are some of the first frames allocated by the PMM, and therefore reside in low memory.
- Paging structures get overwritten with contextually bogus data, resulting in mega-invalid virtual address spaces and page access violations at addresses that should be valid.
The big question now is: why is the kernel heap returning addresses in low memory, rather than addresses within the kernel heap?
After poking around in gdb further, it turns out that the return value from kmalloc() is a valid pointer within the heap. This pointer is stored in $rax, then… is truncated by the next instruction?! Let's take a look at the assembly.
task_small_t* _thread_create(void*, uintptr_t, uintptr_t, uintptr_t)
axle.bin
mov edi, 0xf0
mov eax, 0x0
call kmalloc
cdqe
Hmm, I'm not familiar with cdqe. A quick search reveals that it sign-extends the lower 4 bytes of rax into the full 8 bytes. Ah ha, a flash of insight! Could it be that kmalloc() is implicitly defined at the call site, so the compiler assumes the return type is a 32-bit int, chucking out the top half of the pointer value?
Indeed, that is exactly what's going on. A simple #include at the top of task_small.c fixes this right up, and no more page table corruption for us.
I'll finish this section with the unedited notes I took as I debugged this.
task_small_t* _thread_create(void*, uintptr_t, uintptr_t, uintptr_t)
kernel/kernel/multitasking/tasks/task_small.c
task_small_t* new_task = kmalloc(sizeof(task_small_t));
// This is giving out addresses in very low pages?!
// Allotted new_task 0000000000002110
// Allotted new_task 00000000000045c0
// These frames are being used to hold paging structures, and when they get overwritten, kaboom
// Compiler is emitting a `cltq` instruction after the call to kmalloc(),
// which truncates 0xffff900000000d70 to 0xd70
// Jesus christ, it's because kmalloc wasn't defined
Lesson
Warnings are useful, and an appetite for ignoring them will wind you up with a bellyful of pain. I enabled -Werror=implicit-function-declaration to ensure that implicitly defined functions are always a hard error, as this is evidently very dangerous: any call site invoking an implicitly defined function, expecting to receive a pointer back, is going to have a very bad time. I then cleaned up all the implicitly defined functions in the kernel.
Also, it's not great that producing a bogus low pointer causes our paging structures to get overwritten, as these structures are essential to maintaining a system with some semblance of coherence, and their corruption makes debugging far more confusing, difficult, and unpredictable. Userspace programs typically don't map the bottom region of the address space (or map it as inaccessible). This so-called 'NULL region' is very useful for debugging: it means that if a bug ever leads you to access low memory, you'll get a quick crash instead of overwriting important state. This is why you get a 'segmentation fault' when dereferencing a NULL pointer!
Since we're working in the context of physical memory, we don't have the luxury of mapping frames however we like. If we want to exclude the bottom section of the address space, we have to 'waste' that RAM: we can't use it for anything else. Maybe that's a worthwhile tradeoff. I'm imagining marking the bottom few frames as a 'scribble region': all zeroes to start, with some kind of background job that watches for any non-zero data popping up in those frames. If it does, we know that some part of the system has written through an invalid pointer. Since this kind of invalid access is only possible while the kernel's low-memory identity map is active, we could free this scribble region as soon as we disable the low-memory identity map, and give the reserved frames back to the physical memory allocator.
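A minimal sketch of that scribble-region check, assuming a 16-frame region in low memory and that the low-memory identity map is still active; the constants and structure are illustrative, not existing axle code:
const FRAME_SIZE: usize = 0x1000;
const SCRIBBLE_REGION_BASE: usize = 0x1000; // skip frame 0 so we never form a literal null pointer
const SCRIBBLE_REGION_FRAME_COUNT: usize = 16;

// Walk the reserved low frames and complain if anything non-zero has appeared.
// Meant to be called periodically from a background task while the identity map is live.
fn scribble_region_check() {
    for frame_idx in 0..SCRIBBLE_REGION_FRAME_COUNT {
        let frame_base = (SCRIBBLE_REGION_BASE + frame_idx * FRAME_SIZE) as *const u64;
        for word_idx in 0..(FRAME_SIZE / 8) {
            // Safety: relies on the low-memory identity map being active for these frames
            let word = unsafe { core::ptr::read_volatile(frame_base.add(word_idx)) };
            if word != 0 {
                panic!(
                    "Scribble region dirtied at {:#x}: someone wrote through a bogus low pointer",
                    SCRIBBLE_REGION_BASE + frame_idx * FRAME_SIZE + word_idx * 8
                );
            }
        }
    }
}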
Surely we're done..?
Bug #3: Overwriting paging structures due to bad pointer math in the allocator
I think I'm going to throw up.
We've got another bug in which the paging structures in low memory are being overwritten while cloning a virtual address space. Let's take a look at our allocation pattern again…
pmm_alloc() = 0x0000000000000000
pmm_alloc() = 0x0000000000001000
pmm_alloc() = 0x0000000000002000
pmm_alloc() = 0x0000000000003000
pmm_alloc() = 0x0000000000004000
// A lot of output....
pmm_alloc() = 0x00000000ffffd000
pmm_alloc() = 0x00000000ffffe000
pmm_alloc() = 0x00000000fffff000
pmm_alloc() = 0x0000000000000000
pmm_alloc() = 0x0000000000001000
Oh, gnarly! At first glance, our PMM is double-allocating frames; see how 0x0000000000001000 is allocated once near the beginning, then again much later on?
On closer inspection, the PMM starts this double-allocating as soon as the frame address ticks over to 0x0000000100000000 (ie. above 4GB). This trips my spidey-sense that something, somewhere, isn't handling 64-bit addresses correctly.
Sure enough, I had blithely translated some old code into Rust:
const FRAME_MASK: usize = 0xfffff000;
fn frame_ceil(mut addr: usize) -> usize {
if (addr & FRAME_MASK) == 0 {
// Already frame-aligned
addr
}
else {
// Drop the overflow into the frame, then add a frame
(addr & FRAME_MASK) + FRAME_SIZE
}
}
I felt pretty silly when I spotted this one. I'm actually throwing away all the bits above u32::MAX, since the mask I'm applying is 0x00000000fffff000. Therefore, any addresses above 4GB that get enqueued are truncated to their low-memory cousins. When these truncated frame addresses get allocated, I'm telling each allocator client "no really, go ahead and write to 0x8000, it's yours!". That frame is already being used by something else in the system, and chaos ensues.
The fix here is really simple! I moved away from the 'enable absolute bits' approach to a 'disable relative bits' approach, which automatically masks to the correct addresses regardless of the target's word size.
rust_kernel_libs/pmm/src/lib.rs
fn page_ceil(mut addr: usize) -> usize {
((addr) + PAGE_SIZE - 1) & !(PAGE_SIZE - 1)
}
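To convince ourselves, here's a quick standalone check (illustrative only) comparing the two approaches on an address just above 4GB:
const FRAME_SIZE: usize = 0x1000;
const FRAME_MASK: usize = 0xfffff000;

fn main() {
    let addr: usize = 0x0000000100001234; // just above 4GB

    // 'Enable absolute bits': the 32-bit mask silently drops the high bits...
    let truncated = addr & FRAME_MASK;
    assert_eq!(truncated, 0x1000); // ...and now collides with a low-memory frame

    // 'Disable relative bits': correct regardless of the address width
    let rounded = (addr + FRAME_SIZE - 1) & !(FRAME_SIZE - 1);
    assert_eq!(rounded, 0x0000000100002000);
}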
Bug #4: Top portion of the address space is missing due to bad math in the page mapper
This one was weird. Over the course of writing the contiguous chunk pool in the new Rust-based PMM, I introduced a new piece of global state to manage the pool.
rust_kernel_libs/pmm/src/lib.rs
static mut CONTIGUOUS_CHUNK_POOL: ContiguousChunkPool = ContiguousChunkPool::new();
Since this data is declared static, the storage for this structure will be reserved when the kernel is loaded into memory by axle's UEFI bootloader.
As it happens, the build system ended up placing this structure at the very top of the address space reserved for axle's kernel.
But when this code runs, we get a page fault when trying to access the structure.
syslog.log
Cpu[0],Pid[-1],Clk[0]: [-1] Page fault at 0xffffffff82217138
Cpu[0],Pid[-1],Clk[0]: |--------------------------------------|
Cpu[0],Pid[-1],Clk[0]: | Page Fault in [-1] |
Cpu[0],Pid[-1],Clk[0]: | Unmapped read of 0xffffffff82217138 |
Cpu[0],Pid[-1],Clk[0]: | RIP = 0xffffffff800d48a6 |
Cpu[0],Pid[-1],Clk[0]: | RSP = 0x00000000bfeba520 |
Cpu[0],Pid[-1],Clk[0]: |--------------------------------------|
Huh, that's odd. The bootloader sets up memory regions based on whatever is described by the kernel binary:
uint64_t kernel_map_elf(const char*, pml4e_t*, axle_boot_info_t*)
bootloader/main.c
// Iterate the program header table, mapping binary segments into the virtual address space as requested
for (uint64_t i = 0; i < elf->e_phnum; i++) {
Elf64_Phdr* phdr = (Elf64_Phdr*)(kernel_buf + elf->e_phoff + (i * elf->e_phentsize));
if (phdr->p_type == PT_LOAD) {
uint64_t bss_size = phdr->p_memsz - phdr->p_filesz;
efi_physical_address_t segment_phys_base = 0;
int segment_size = phdr->p_memsz;
int segment_size_page_padded = ROUND_TO_NEXT_PAGE(phdr->p_memsz);
int page_count = segment_size_page_padded / PAGE_SIZE;
// Allocate physical memory to store the binary data that we'll place in the virtual address space
efi_status_t status = BS->AllocatePages(AllocateAnyPages, EFI_PAL_CODE, page_count, &segment_phys_base);
// Copy the binary data, and zero out any extra bytes
memcpy((void*)segment_phys_base, kernel_buf + phdr->p_offset, phdr->p_filesz);
memset((void*)(segment_phys_base + phdr->p_filesz), 0, bss_size);
// Set up the virtual address space mapping
printf("Mapping [phys 0x%p - 0x%p] - [virt 0x%p - 0x%p]\n", segment_phys_base, segment_phys_base + segment_size_page_padded - 1, phdr->p_vaddr, phdr->p_vaddr + segment_size_page_padded - 1);
map_region_4k_pages(vas_state, phdr->p_vaddr, segment_size_page_padded, segment_phys_base);
}
}
When we run this, the segment containing our PMM's new data structure appears to be mapped correctly.
syslog.log
Mapping [phys 0x00000000adcb3000 - 0x00000000addb1fff] - [virt 0xffffffff80000000 - 0xffffffff800fefff]
Mapping [phys 0x00000000abb97000 - 0x00000000adcaffff] - [virt 0xffffffff800ff000 - 0xffffffff82217fff]
And yet, we get a page fault upon access! What do we see when we interrogate the actual state of the virtual address space, disregarding our intentions for what the address space should look like?
0x0000000000000000 - 0x0000001000000000 -rw
0xffff800000000000 - 0xffff801000000000 -rw
0xffffffff80000000 - 0xffffffff82200000 -rw
Yeah, that's it! We're intending for data to be mapped into the virtual address space up to 0xffffffff82217fff, but, for some reason, our virtual address space stops at 0xffffffff82200000. When we try to access a data structure that happens to be stored in this missing region, we get a page fault.
But why is the top region of the address space missing? A likely suspect for investigation is the logic which sets up a virtual address space mapping based on input parameters, and indeed this is where our problems lie.
The logic within the bootloader that maps regions of the virtual address space looks roughly like this:
uint64_t map_region_4k_pages(pml4e_t*, uint64_t, uint64_t, uint64_t)
bootloader/paging.c
int page_directory_pointer_table_idx = VMA_PML4E_IDX(vmem_start);
pdpe_t* page_directory_pointer_table = get_or_create_page_directory_pointer_table(page_directory_pointer_table_idx);
uint64_t first_page_directory = VMA_PDPE_IDX(vmem_start);
uint64_t page_directories_needed = (remaining_size + (VMEM_IN_PDPE - 1)) / VMEM_IN_PDPE;
for (int page_directory_iter_idx = 0; page_directory_iter_idx < page_directories_needed; page_directory_iter_idx++) {
int page_directory_idx = first_page_directory + page_directory_iter_idx;
uint64_t first_page_table = VMA_PDE_IDX(current_page);
uint64_t page_tables_needed = (remaining_size + (VMEM_IN_PDE - 1)) / VMEM_IN_PDE;
for (int page_table_iter_idx = 0; page_table_iter_idx < page_tables_needed; page_table_iter_idx++) {
int page_table_idx = first_page_table + page_table_iter_idx;
uint64_t first_page = VMA_PTE_IDX(current_page);
uint64_t pages_needed = (remaining_size + (VMEM_IN_PTE - 1)) / VMEM_IN_PTE;
for (int page_iter_idx = 0; page_iter_idx < pages_needed; page_iter_idx++) {
int page_idx = page_iter_idx + first_page;
mark_page_present(
page_directory_pointer_table,
page_directory_idx,
page_table_idx,
page_idx
);
}
}
}
The logic above essentially has two jobs, which are interleaved:
- Map the requested virtual address space region
- Figure out which paging structures to use to map each frame
The loops above work just fine when all the paging tables we touch along the way are empty. However, if one or more of the tables we're adding entries to is already partially full, the loop won't populate everything we requested. It calculates how many tables it would take to satisfy the request if all of those tables were empty to begin with. If some of those tables were already storing virtual address mappings, though, we won't map as much memory as was requested, and will silently produce an unexpected virtual address space state!
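Here's a tiny standalone illustration of the miscount; the constants assume 4KiB pages and 512-entry tables, and none of this is axle code:
const PAGE_SIZE: u64 = 0x1000;
const VMEM_IN_PDE: u64 = 512 * PAGE_SIZE; // one page table maps 2MiB

fn main() {
    // Map 2MiB starting at a virtual address whose page-table index is 256,
    // i.e. the first page table is already half-way used up
    let remaining_size = 2 * 1024 * 1024_u64;
    let first_page_table_entry = 256_u64;

    // What the original loop computes: enough tables to hold 2MiB if they were all empty
    let naive_tables_needed = (remaining_size + VMEM_IN_PDE - 1) / VMEM_IN_PDE;
    assert_eq!(naive_tables_needed, 1);

    // What the mapping actually needs: the request spills over into a second table
    let pages = remaining_size / PAGE_SIZE;
    let actual_tables_needed = (first_page_table_entry + pages + 511) / 512;
    assert_eq!(actual_tables_needed, 2);
}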
The fix here is conceptually simple: rather than trying to compute upfront exactly how many structures we'll need to iterate, let's just map pages one-by-one until we've satisfied the request, looking up the correct paging structures for each page on the fly:
uint64_t map_region_4k_pages(pml4e_t*, uint64_t, uint64_t, uint64_t)
bootloader/paging.c
uint64_t remaining_size = vmem_size;
uint64_t current_page = vmem_start;
while (true) {
// Find the correct paging structure indexes for the current virtual address
//
// Get the page directory pointer table corresponding to the current virtual address
int page_directory_pointer_table_idx = VMA_PML4E_IDX(current_page);
pdpe_t* page_directory_pointer_table = get_or_create_page_directory_pointer_table(page_directory_pointer_table_idx);
// Get the page directory corresponding to the current virtual address
int page_directory_idx = VMA_PDPE_IDX(current_page);
pde_t* page_directory = get_or_create_page_directory(page_directory_idx);
// Get the page table corresponding to the current virtual address
int page_table_idx = VMA_PDE_IDX(current_page);
pte_t* page_table = get_or_create_page_table(page_table_idx);
// Get the PTE corresponding to the current virtual address
uint64_t page_idx = VMA_PTE_IDX(current_page);
mark_page_present(
page_directory_pointer_table,
page_directory_idx,
page_table_idx,
page_idx
);
remaining_size -= PAGE_SIZE;
current_page += PAGE_SIZE;
if (remaining_size == 0x0) {
break;
}
}
Bug #5: Transitive heap allocations cause crashes in the PMM
Over the course of debugging the Rust-based PMM, I naturally printed out some state at various interesting locations:
rust_kernel_libs/pmm/src/lib.rs
if !CONTIGUOUS_CHUNK_POOL.is_pool_configured() && region_size >= CONTIGUOUS_CHUNK_POOL_SIZE
{
println!("Discovered a free reminiscence area that may function the contiguous chunk pool {base} to {}", base + region_size);
CONTIGUOUS_CHUNK_POOL.set_pool_description(base, CONTIGUOUS_CHUNK_POOL_SIZE);
// Trim the contiguous chunk pool from the region and allow the rest of the frames to
// be given to the general-purpose allocator
region_size -= CONTIGUOUS_CHUNK_POOL_SIZE;
}
Upon execution, though, I would get surprising failures about an exhausted virtual address space.
syslog.log
Cpu[0],Pid[-1],Clk[0]: Found a free memory region that can serve as the contiguous chunk pool 0000000001500000 to 00000000a5f22000
Cpu[0],Pid[-1],Clk[0]: liballoc: initialization of liballoc 1.1
Cpu[0],Pid[-1],Clk[0]: Increase kernel heap by 16 pages (64kb)
Cpu[0],Pid[-1],Clk[0]: [-1] Assertion failed: VAS will exceed max tracked ranges!
Cpu[0],Pid[-1],Clk[0]: kernel/kernel/vmm/vmm.c:296
Taking a peek at the virtual address space mappings like we learned above, we can see telltale signs that things have definitely gone awry.
syslog.log
0x0000000000000000 - 0x00000000bee00000 -rw
0x00000000bee00000 - 0x00000000bf000000 -r-
0x00000000bf000000 - 0x00000000bfa5b000 -rw
0x00000000bfa5b000 - 0x00000000bfa5c000 -r-
0x00000000bfa5c000 - 0x00000000bfa5e000 -rw
0x00000000bfa5e000 - 0x00000000bfa5f000 -r-
0x00000000bfa5f000 - 0x00000000bfa61000 -rw
0x00000000bfa61000 - 0x00000000bfa63000 -r-
0x00000000bfa63000 - 0x00000000bfa65000 -rw
0x00000000bfa65000 - 0x00000000bfa66000 -r-
0x00000000bfa66000 - 0x00000000bfa68000 -rw
0x00000000bfa68000 - 0x00000000bfac3000 -r-
0x00000000bfac3000 - 0x00000000bfadf000 -rw
0x00000000bfadf000 - 0x00000000bfae0000 -r-
0x00000000bfae0000 - 0x00000000bfae3000 -rw
0x00000000bfae3000 - 0x00000000bfae4000 -r-
0x00000000bfae4000 - 0x00000000bfae7000 -rw
0x00000000bfae7000 - 0x00000000bfae8000 -r-
0x00000000bfae8000 - 0x00000000bfaeb000 -rw
0x00000000bfaeb000 - 0x00000000bfaed000 -r-
0x00000000bfaed000 - 0x00000000bfc00000 -rw
0x00000000bfc00000 - 0x00000000bfe00000 -r-
0x00000000bfe00000 - 0x0000001000000000 -rw
As it turns out, the issue here is that the println!() macro transitively kicks off a heap allocation.
rust_kernel_libs/ffi_bindings/src/lib.rs
#[macro_export]
macro_rules! println {
() => ($crate::printf!("\n"));
($($arg:tt)*) => ({
let s = alloc::fmt::format(core::format_args_nl!($($arg)*));
for x in s.split('\n') {
let log = ffi_bindings::cstr_core::CString::new(x).expect("printf format failed");
unsafe { ffi_bindings::printf(log.as_ptr() as *const u8); }
}
})
}
Performing a heap allocation within the physical memory manager is definitely not allowed, as it inverts the dependency tree. The physical memory manager sits at the bottom of the stack, and the virtual memory manager sits above it. The kernel heap is just a client of the virtual memory manager, which requests and frees chunks of virtual address space. In practice, trying to perform heap allocations from the physical memory allocator causes corrupted virtual address space state and, consequently, confusing crashes. The fix for this particular case is to invoke axle's C-based printf function directly via FFI, rather than going through the formatting machinery. Of course, this means I need to do a bit of work by hand, such as remembering to null-terminate Rust-originating strings that I pass.
rust_kernel_libs/pmm/src/lib.rs
if !CONTIGUOUS_CHUNK_POOL.is_pool_configured() && region_size >= CONTIGUOUS_CHUNK_POOL_SIZE
{
printf("Discovered a free reminiscence area that may function the contiguous chunk pool %p to %pn ".as_ptr() as *const u8, base, base + region_size);
CONTIGUOUS_CHUNK_POOL.set_pool_description(base, CONTIGUOUS_CHUNK_POOL_SIZE);
// Trim the contiguous chunk pool from the region and allow the rest of the frames to
// be given to the general-purpose allocator
region_size -= CONTIGUOUS_CHUNK_POOL_SIZE;
}
I do wish Rust offered some way to express and enforce that a particular module isn't allowed to allocate!
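(One partial workaround: since heap allocation in a no_std crate has to go through the alloc crate, keeping the PMM in a crate that doesn't depend on alloc at all prevents that crate's own Rust code from allocating, though it can't stop allocations hidden behind FFI calls into C.)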