rePalm – Dmitry.GR
rePalm
Photo Album (constantly updated)
Table of Contents
- A LARGE warning!
- PalmOS Architecture (and a bit of history)
- Towards the first unauthorized PalmOS port
- Towards the first pirate PalmOS device
- We need hardware, but developing on hardware is … hard
- Um, but now we need a kernel…
- So, uh, what about all that pesky ARM code?
- You don’t mean…? (pt2)
- A JIT’s job is never over
- Is PACE fast enough?
- But, you promised hardware…
- Tales of more PalmOS reverse engineering
- Real hardware: reSpring
- So where does this leave us?
- Article update history
- Comments…
A LARGE warning!
This is a pre-release article about a pre-release project. I will update both, so this is not a static document. Keeping track of changes is your job, if you so choose; all I can promise is that I will keep a changelist at the bottom of the article.
PalmOS Architecture (and a bit of history)
History
PalmOS before 5.4 kept all data in RAM in databases. They came in two types: record databases (what you’d imagine them to be) and resource databases (similar to MacOS classic resources). Each database had a type and a creator ID, each a 32-bit integer, usually with each 8-bit piece being an ASCII char. Most commonly, an application would create databases with the creator ID set to its own. Certain types also had meaning: for example, appl was an application and panl was a preference panel.
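As a minimal sketch of how such four-character codes map to 32-bit integers (Python is used here purely for illustration), the helpers below pack and unpack a code. The function names are hypothetical, and the byte order (first character in the most significant byte) is an assumption based on the big-endian m68k heritage, not something stated above.

```python
def fourcc_to_int(code: str) -> int:
    """Pack a four-character ASCII code into a 32-bit integer.

    Assumption: the first character lands in the most significant byte.
    """
    raw = code.encode("ascii")
    if len(raw) != 4:
        raise ValueError("type/creator codes are exactly four ASCII characters")
    return int.from_bytes(raw, "big")


def int_to_fourcc(value: int) -> str:
    """Unpack a 32-bit integer back into its four ASCII characters."""
    return value.to_bytes(4, "big").decode("ascii")
```

Under that assumption, `fourcc_to_int("appl")` gives `0x6170706C`, and `int_to_fourcc` reverses the packing.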
PalmOS started out on Motorola 68k processors and ran on them from first development all the way to version 4.x. For version 5, Palm Inc chose to switch to ARM processors, as they allowed a lot more speed (which is always a plus). But what to do about all the software? A lot of PalmOS apps had been written for OS 4.x and compiled for the m68k processor. Palm Inc introduced PACE, the Palm Application Compatibility Environment. PACE intercepted the OsCall SysAppLaunch (and quite a few others) and emulated the m68k processor, allowing all the old software to run. When m68k apps called an OsCall, PACE would translate the parameters and call the native ARM OsCall. This meant that while the app’s logic was running in emulation, all OsCalls were native ARM and fast. Combine this with the fact that PalmOS 4.x devices usually ran at 33MHz and PalmOS 5.x devices usually ran at hundreds of MHz, and there was almost no slowdown: most older apps compiled for PalmOS 4.x ran at a perfectly good speed. It was even good enough for Palm Inc, since most built-in apps (like Calendar and Contacts) were still m68k apps, not ARM. There was also PalmOS 6.x (Cobalt), but it never really saw the light of day and is beyond the scope of this document.
Palm Inc never documented how to write full native ARM applications on PalmOS 5.x. It was possible, but not documented. The best official way to get the full speed of the new ARM processors was to use the OsCall PceNativeCall to jump into a small bit of native ARM code that Palm Inc called “ARMlet”s and later “PNOlet”s. Palm said that only the hottest pieces of code should be treated this way, and it was quite hard to call OsCalls from these bits of native ARM code (you had to call back into PACE, which would marshal the parameters for the native API and then call it). The ways to call the real native OsCalls were also not documented.
PalmOS 5.x kept a lot of the design of PalmOS 4.x, including the shared heap, lack of protected memory, and lack of proper documented multithreading. A new thing was that PalmOS 5.x supported loadable modules. In fact, every native ARM application or library in PalmOS 5.x is a module. Each module has a module ID, which is required to be system-unique and to lie in the range 0..1023. This is probably why Palm Inc never documented how to produce full native applications: they could never allow more than 1024 of them to exist.
PalmOS licensees (Sony, Handspring, etc.) got the sources to the OS and all of this knowledge, of course. They were able to customize the OS as needed and then ship it, but the architecture was always mostly the same. This also helps us a lot.
Modules? Libraries? DALs? Drivers?
The kernel of the OS, memory management, most of the drivers, and low-level CPU wrangling are handled by the DAL. DAL (Module ID 0) exports about 200 OsCalls, give or take based on the PalmOS version. These are low-level things like getting battery state, raw access to screen drawing primitives, module loading and unloading, memory map management, interrupt management, etc. Basically these are functions that no user-facing app would ever need to use. On top of the DAL lives Boot. Boot (Module ID 1) provides a lot of the lower-level user-facing OsCalls. Implemented here are things like the DataManager, MemoryManager, AlarmManager, ExchangeManager, BitmapManager, and WindowManager. Feel free to refer to the PalmOS SDK for details on all of those. On top of Boot lives UI. UI (Module ID 2) provides all the UI primitives to the user. These are things like controls (buttons, sliders, etc.), forms, menus, tables, and so on. These three modules together make up the core of PalmOS. You could, in fact, almost boot a ROM containing just these three files.
These first three modules are actually somewhat special, being the core of the OS. They are always loaded, and their exported functions are always accessible via a special shortcut. For modules 0, 1, and 2, you can call exported function number N by executing these two instructions: LDR R12, [R9, #-4 * (module_ID + 1)]; LDR PC, [R12, #4 * func_no]. This shortcut exists for easy calls to OsCalls by native modules and only works because these modules are always loaded. This is not a general rule, and it will NOT work for any other modules. You might ask if one can also write to these tables of function pointers to replace them. Yes, yes you can, and this was often done by what were called “hacks”. It is also liberally used by the OS itself (though not via direct writes but via an OsCall: SysPatchEntry).
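The two-instruction shortcut can be modelled as two dependent memory reads. The sketch below (Python, for illustration only) mirrors the addressing arithmetic of those instructions; the function name, the `read_word` callback, and every address in the toy memory are made up for the example and do not come from a real ROM.

```python
def resolve_core_oscall(read_word, r9, module_id, func_no):
    """Model of: LDR R12, [R9, #-4 * (module_ID + 1)]; LDR PC, [R12, #4 * func_no].

    read_word(addr) returns the 32-bit word stored at addr.
    Returns the address the CPU would jump to.
    """
    if module_id not in (0, 1, 2):
        raise ValueError("the shortcut only works for DAL (0), Boot (1) and UI (2)")
    table = read_word(r9 - 4 * (module_id + 1))  # pointer to the module's export table
    return read_word(table + 4 * func_no)        # entry number func_no in that table


# Toy memory image with invented addresses, purely to exercise the arithmetic.
TOY_R9 = 0x1000
TOY_MEMORY = {
    TOY_R9 - 4: 0x2000,     # DAL export table pointer
    TOY_R9 - 8: 0x3000,     # Boot export table pointer
    TOY_R9 - 12: 0x4000,    # UI export table pointer
    0x3000 + 4 * 5: 0xDEAD0,  # Boot's exported function number 5
}
```

With this toy image, `resolve_core_oscall(TOY_MEMORY.__getitem__, TOY_R9, 1, 5)` follows the Boot table pointer and yields `0xDEAD0`. Replacing a table entry, which is what the text describes SysPatchEntry doing, would amount to changing the word read by the second load.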
PalmOS lacks any memory protection, so any user code can access hardware. PalmOS actually uses this: things like SD card drivers and drivers for other peripherals are usually separate modules and not part of the DAL. The Boot module will load all PalmOS resource databases of certain types at boot, allowing them to initialize. An incomplete list of these types is: libs (slot driver), libf (filesystem driver), vdrv (serial port driver), aext (system extension), aexo (OEM extension). These things being separate is actually very convenient, since it means that they can be easily removed or replaced. There are of course corner cases, since PalmOS developers never anticipated this. For example, if NO serial drivers are loaded, the OS will crash, as it never expected this. Luckily, this is also easy to work around.
Anytime a module is loaded, its entry point is called with a special code, and the module is free to initialize, set up hardware, etc. When it is unloaded, it gets another code and can deinitialize. There is another special code modules can get, and that one comes from PACE. If you remember, I said that PACE marshals parameters from m68k apps to OsCalls and back, but PACE cannot possibly know about the parameters that a random native library takes, so the marshalling there must be done by the library itself. This special code is used to tell the library to read parameters from the emulated m68k stack, process them, and put the result into the emulated m68k registers (PACE exports functions to actually manipulate the emulated state, so the libraries do not need to know of its insides).
Towards the first unauthorized PalmOS port
So what’s so hard?
As I mentioned, none of the native API of PalmOS 5.x was ever documented. There was a small number of people who figured out some parts of it, but nobody really got all of it, or even close to it. To start with, large parts are not useful to an app developer and thus attracted no interest. This is a problem, however, if one wants to make a new device. So I had to actually do a lot of reverse engineering for this project: a lot of boring reverse engineering of very boring APIs that I still had to implement. Oh, and I needed a kernel, and actual hardware to run on.
ROM formats are hard
To start with, I wrote a tool to split apart and put back together working PalmOS ROM images. The format is pretty convoluted, and changed between versions, but after a lot of work the “splitrom” tool can now successfully split a PalmOS ROM from pre-release pre-v1.0 PalmOS devices all the way to the PalmOS 6.0 Cobalt ROMs. The “mkrom” tool can now produce valid PalmOS 5.x images; I never bothered to make it produce other versions as I did not need it. At this point I took a detour from the project to collect PalmOS ROMs. I now have one from almost every device and prototype. I’ll share them with the world later. I tested the tools by pulling apart a T|T3 ROM, replacing some files, putting it back together, and reflashing my T|T3. It booted! Cool!
So write a DAL and you’re done!
I had no hardware to test on, no kernel to use, and a lot more “maybe”s than I was willing to live with, so it was time for action. The quickest way I could think of to try it was to use a real ARM processor and an existing kernel: Linux. Since my desktop uses an x86 processor and not ARM, qemu was used. I wrote a rudimentary DAL that simply logged any function called and then crashed on purpose. At boot, it did the same as PalmOS’s DAL does: load Boot and, in a new thread, call the PalmOSMain OsCall. I then wrote a simple “runner” app that used mmap() to map an area of memory at a specific location backed by “rom.bin” and another backed by “ram.bin”, and tried to boot it. I got some logged messages and a crash, as expected. Cool! I guess the idea could work. So, what is the minimum number of functions my DAL needs to boot? Turns out: most of them! Sad day…
Minimal DAL
It took months, but I got most of the DAL implemented, and it ran inside my “runner” inside qemu. It was a very scary setup. Since it was all a userspace app under Linux, I had to call back out to the “runner” to request things like thread creation, etc. It was a mess. Current rePalm code still supports this mode, but I do not expect to use it much, for a variety of reasons. To start with, the Linux kernel lacks some APIs that PalmOS simply needs, for example the ability to disable and re-enable task switching. Yup… PalmOS sometimes asks for preemption to be disabled. Linux lacks that ability. PalmOS also needs the ability to remotely pause and resume a thread, without the thread’s consent. The pthreads library lacks such an ability as well. I hacked together some hacks using ptrace, but it was a mess. Fun story: since my machine is multi-core, and I never set any affinities, this was the first time ever that PalmOS ran on a multi-core device. I did not realize it until much later, but that is kind of cool, no?
Drawing is hard
There was one problem. For some reason, things like drawing lines, rectangles, circles, and bitmaps were all part of the DAL. Now, it is not hard to draw a line, but things like “draw a rounded rectangle with foreground color of X and a background color of Y, using drawing mode ‘mask’ on this canvas” or “draw this compressed 16-bit full-color 144ppi image on this 4-bits-per-pixel 108ppi canvas with dithering, respecting transparency colors, and using ‘invert’ mode” or even “print string ‘Preferences’ with background color X, foreground Y, text color Z, dotted-underlined, using this low-density font on this 1.5 density canvas” get convoluted quickly. And yes, the DAL is expected to handle this all. Oh, and none of this was ever documented, of course! This was a nightmare. At first I treated all drawing functions as NOPs and just logged the drawn text to know how far my boot had gotten. This allowed me to implement many of the other OsCalls that the DAL must provide, but eventually I had to face having to draw. My first approach was to just implement things myself, based on function names and some reverse engineering. This approach failed quickly: the matrix of possibilities was simply too large. There are 8 drawing modes, 3 supported densities, 4 image compression formats, 5 supported color depths, and two font formats. It was not possible to think of everything, especially with no way to make sure I had it right. I am not sure if some of these modes ever got exercised by any software in existence at all, but it did not matter: it had to be pixel exact! What to do?
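To get a rough feel for the size of that matrix, the counts quoted above can simply be multiplied. This is a naive upper bound that assumes every dimension varies independently, which is an assumption made for illustration (fonts and image compression, for instance, apply to different operations); the dictionary name is hypothetical.

```python
from math import prod

# Counts taken from the text; treating them as independent is an assumption.
DRAWING_DIMENSIONS = {
    "drawing modes": 8,
    "densities": 3,
    "image compression formats": 4,
    "color depths": 5,
    "font formats": 2,
}

total_combinations = prod(DRAWING_DIMENSIONS.values())
```

Even as a loose estimate, `total_combinations` comes to 960 cases, each of which would need to match the original pixel for pixel.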
Theft is a form of flattery, right?
I decided on a stopgap measure. I disassembled the Zire72 DAL. I copied each of the necessary functions, and all the functions they called, and all the functions those functions called, and so on. I then cleaned up their direct references to the Zire DAL’s globals, and to each other, and I stuck it all into a giant “drawing.S” file. It was over 30,000 lines long, and I mostly had no idea how it worked. Or if it worked…
It did! Not immediately, of course, but it did. Colors were messed up, artifacts were everywhere, but I saw the touchscreen calibration screen after boot! Success, yes? Well, not even remotely. To start with, it turns out that in the interest of optimization, PalmOS’s drawing code happily sticks its fingers into the display driver’s globals. My display “driver” at this point was just an area of memory backed by an SDL surface. It took a lot of work (throwaway work, the worst kind) to figure out what it was looking for and give it to it. But after a few more weeks, the Zire72 DAL’s drawing code happily ran under rePalm and I was able to see things drawn correctly. After hooking up rudimentary fake touchscreen support, I was even able to interact with the virtual device and see the home screen. Great, but this was all a waste. I do not own that code and cannot ship it. I also cannot improve it, expand it, fix it, or even claim to fully understand it. This was not a path forward.
Meticulously-performed imitation is also a form of flattery, no?
The time had come. I rewrote the drawing code. Function by function. Line by line. Assembly statement by assembly statement. I tested it after replacing every function as best as I could. Along the way I gained an understanding of how PalmOS draws, what shortcuts there are for which common cases, etc. This effort took two months, after which 30,000 lines of uncommented assembly had turned into 8,000 lines of C. rePalm finally was once again purely my own code! Along the way I optimized a few things and added support for one-and-a-half density, something that the Zire72 DAL never supported. Of all the parts of this project, this was the hardest to slog through, because at the end of every function decoded, understood, and rewritten, there was no noticeable movement forward. The goal was just to not break anything, and there were always tens of thousands of lines of code left to disassemble, understand, and rewrite in C.
Virtual SD card
For testing, it would be convenient to be able to load programs into the device more easily than by baking them into the ROM. I wrote a custom slot driver that did nothing except allow you to use my custom filesystem. That filesystem used hypercalls to reach code in the “runner” to perform filesystem ops on the host. Basically this created a shared folder between my PC and rePalm. I used this to verify that most software and games worked as expected.
Which device ROM are you using?
ANY! I tested a pre-production Tungsten T image, I tested a LifeDrive image, even a Sony TH55 ROM boots! Yes, there were custom per-device and per-OS-version tweaks, but I was able to get them to apply automatically at runtime. For example, determining which OS version is running is easily done by examining the number of exported entrypoints of Boot. And determining if the ROM is from a Sony device is easy: look for the SonyDAL module. We then refuse to load it, and fake-export equivalent functions ourselves. Why does the DAL need to know the OS version? Some DAL entrypoints changed between PalmOS 5.0 and PalmOS 5.2, and PalmOS 5.4 or later expects a few extra behaviours out of existing funcs that we need to support.
So you’re done, right? It works?
At this point, rePalm kind of worked. It was a window on my desktop that ran REAL UNMODIFIED PalmOS with only a single file in the ROM replaced: the DAL. Time to call it done and pick a new project, right? Well, not quite. Like I said, Linux was not an ideal kernel for this, and making a slightly-more-open PalmOS simulator was not my goal. I wanted to make a device…
Towards the first pirate PalmOS device
A little bit about normal PalmOS 5.x devices, their CPUs, and the progress since…
In order to understand the difficulties I faced, it is necessary to explain some more about how PalmOS 5.x devices usually worked. PalmOS 5.x targeted ARMv4T or ARMv5 CPUs. They had 4-32MB of flash or ROM to contain the ROM, and 8-128MB of RAM for runtime allocations and data storage. PalmOS 5.4 added NVFS, which I shall for now pretend does not exist (as we all wished we could when NVFS first came out). ARMv4T and ARMv5 CPUs implement two separate instruction sets: ARM and Thumb. ARM instructions are each exactly 4 bytes, and are the original instruction set for ARM CPUs. Thumb was added in v4T as a means of improving code density. It is a set of 2-byte-long instructions that implement the most common operations the code might want to do, and by being half the size they improve code density. Obviously, you do not get something for nothing. In the CPUs back then, Thumb instructions had one extra pipeline stage, which caused them to be slower in code with a lot of jumps. Also, since the instructions themselves were simpler, sometimes it took more of them to do the same thing. Thumb instructions, in most cases, also only have access to half as many registers as ARM instructions, further leading to slightly less optimal code. But, in general, Thumb code was smaller, and speed was not a factor, so large parts of PalmOS were compiled in Thumb mode. (Sony bucked this trend, having splurged for larger flash chips and compiling the entire OS in ARM mode.) Some things could also never be done in Thumb, for example a 32×32->64 bit multiply, and some were very suboptimal to do in Thumb (like a lot of the drawing code, with its many complex bit shifts and addressing). These speed-critical pieces were always compiled in ARM mode in PalmOS. Also, all library entry points were always in ARM mode with no other options, so even libraries entirely compiled as Thumb had small thunks from ARM to Thumb mode on each entrypoint.
How does one actually switch modes between ARM and Thumb in ARMv5? Certain, but not all, instructions that change control flow perform the switch. Since all ARM instructions are 4 bytes long and always aligned on a 4-byte boundary, any valid ARM instruction’s address has the low two bits cleared. Thumb instructions are 2 bytes long, and thus have the bottom one bit cleared. 32-bit-long Thumb2 instructions are also aligned on a 2-byte boundary. This means that for any instruction in any mode, the lowest bit of its address is always clear. ARM used this fact for mode switching. The BX instruction looks at the bottom bit of the register you are jumping to, and if it is 1, treats the destination as Thumb, else as ARM. Any instruction that loads PC with a word will do the same: POP, LDM, and LDR instructions. Arithmetic done on PC in Thumb mode does not switch to ARM mode ever (the low bit is ignored), and arithmetic done on PC in ARM mode is undefined if the lower 2 bits produced are nonzero (CAUTION: this is one of the things that ARMv7 changed: this now has defined behaviour). Also, an extra instruction was added for easy calls between modes: BLX. There is a form of it that takes a relative offset encoded in the instruction itself, which basically acts like a BL but also switches modes to whatever the current mode is NOT. There is also a register form of it that combines what a BX does with saving the return address. Of course, to make sure that returns to Thumb mode work as expected, Thumb instructions that save a return address, namely BL and BLX, set the lower bit of LR.
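The low-bit convention is easy to model. The sketch below (Python, for illustration only) decodes a branch target the way the text describes BX doing on ARMv5, and shows how a Thumb-mode call marks its return address. Both function names are hypothetical, and raising an error for an ARM target with bit 1 set is a modelling choice standing in for behaviour the architecture leaves unpredictable.

```python
def bx_target(value: int):
    """Split a BX operand into (mode, address) using the low-bit convention."""
    if value & 1:
        return "thumb", value & ~1          # bit 0 set: Thumb, strip the marker bit
    if value & 2:
        # Modelling choice: an ARM target must be 4-byte aligned.
        raise ValueError("ARM destination is not 4-byte aligned")
    return "arm", value


def thumb_link_value(return_address: int) -> int:
    """Value a Thumb BL/BLX places in LR: the return address with bit 0 set."""
    return return_address | 1
```

For example, `bx_target(0x8001)` gives `("thumb", 0x8000)`, while `bx_target(thumb_link_value(0x9004))` lands back in Thumb mode at `0x9004`, which is why returning through LR restores the caller's mode.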
ARMv5 at this point in time is ancient history. The ARM architecture is up to v8.x by now, with 64-bit-wide registers and a completely different instruction set. ARMv7 is still often seen around (v8 can also run in v7 mode) and is actually an almost perfect (but actually not entirely so) superset of ARMv5. So I could basically take a dev board for any ARMv7 chip, which are plentiful and cheap, and use that as my base, right? Technically yes, but I did not go this way. To start with, few of these CPUs are documented well, so unless you use the Linux kernel, you’ll never get them up; writing your own kernel and drivers for them is not feasible (I am looking at you, Allwinner). “But,” you might object, “what about Raspberry Pi, isn’t its CPU fully documented?” I considered it, but discarded the idea: RasPi is incredibly unstable, and I had no desire to build on such a shaky platform. Launch Firefox on your RasPi, open dailymail or some other complex site, go away, and come back in 2 weeks; I guarantee you’ll be greeted by a hung screen and a kernel panic on the serial console. If even Linux kernel developers cannot make this thing work stably, I had no desire to try. No thanks. So what then?
ARMv7M
The other option was to use a microcontroller: they are plentiful, documented, cheap, and available. ARM designs and sells a large number of small cores under the Cortex brand. Cortex-M0/M0+/M1 are cores based on the ARMv6M spec. Basically they run the same Thumb instruction set that ARMv5 CPUs did, with a few extra instructions to allow them to manage privileged state (MRS/MSR/CPS). Cortex-M23 is their successor, which adds a few extra instructions (DIV/CBZ/CBNZ/MOVW/MOVT/B.W), which makes it a bit less of a pain in the ass, but it is still very much a pain for complex work. Cortex-M3/M4/M7 implement the ARMv7M spec, which has a much expanded Thumb2 instruction set. It is the same instruction set that ARM introduced into the ARM cores back in the day with ARMv6T2 architecture CPUs. These instructions are a mix of 2- and 4-byte-long pieces and are actually pretty good for complex code, supporting long multiplies, complex control flow, and bitfield operations. They can also address all registers, and not just half of them like the Thumb instruction set of yore. Cortex-M33 is the successor to these, adding a few more things we do not currently care about. Optionally, these cores can also include an FPU for hardware floating point support. We also do not care about that. There is just one problem: none of these CPUs support ARM instructions. They all only run Thumb/Thumb2. This means we can run most of PalmOS’s Boot and UI, but many other things will fail. Not acceptable. Well, actually, since every library has to be entered in ARM mode, nothing will run…
My kingdom for an ARM!
It is at this point that I decided to extend PalmOS’s module format to support direct entry into Thumb mode, and converted my DAL to this new format. I also taught my module loader to understand when a library’s entry point points to a simple ARM-to-Thumb thunk, and to resolve this directly. This allowed an almost complete boot without needing ARM. But this was not a solution. Large parts of the OS were still in ARM mode (things like MemMove, MemCmp, division routines), and if the goal was to run an unmodified OS and apps, editing everything everywhere was not an option. Some things we could just patch via SysPatchEntry. This I did to the abovementioned MemMove and MemCmp for speed, providing optimal Thumb2 implementations. Other things I could do nothing about: things like integer division (which ARMv5 has no instruction for) were scattered in almost every library, and could not be patched away as they were not exported. We really did need something that ran ARM instructions.
But what if we try?
What exactly will happen if we try to switch an ARMv7M microcontroller into ARM mode? The manual luckily is very clear on that. It WILL switch, clear the status bit that indicates we are in Thumb mode, and then, when it tries to execute the next instruction, it will take a UsageFault since it cannot execute in this mode. The Thumb BLX instruction of the form that always switches modes is undefined in ARMv7M, and if executed, the CPU will take a UsageFault as well, indicating an invalid instruction. This all sounds grim, but this is actually fantastic news! We can catch a UsageFault… If you see where I am going with this, and are appropriately horrified, thanks for paying attention! We’ll come back to this story arc later, to give everyone a chance to catch up.
We need hardware, but developing on hardware is … hard
CortexEmu to the rescue
I thought I could make this all work on a Cortex-M class chip, but I did not want to develop on one: too slow and painful. I also did not find any good emulators for Cortex-M class chips. At this point, I took a two-week-long break from this project to write CortexEmu. It is a fully functional Cortex-M0/M3/M23 emulator that faithfully emulates real Cortex hardware. It has a GDB stub so I can attach GDB to it to debug the running code. It has rudimentary hardware emulated to show a screen, and supports an RTC, a console, and a touchscreen. It supports privileged and unprivileged mode, and emulates the memory protection unit (MPU) as well. CortexEmu remains the best way to develop rePalm.
Waaaah! You promised real hardware
Yes, yes, we’ll get to that, and a lot more later, but that is still months later in the story, so be patient!
Um, but now we need a kernel…
Need a kernel? Why not Linux?
PalmOS needs a kernel with a particular set of primitives. We already discussed some (but definitely not all) of the reasons why Linux is a terrible choice. Add to that the fact that Cortex-M3 compatible Linux is slow AND huge, and it was simply not an option. So, what is?
I ended up writing my own kernel. It is simple, and works well. It will run on any Cortex-M class CPU, and supports multithreading with priorities, precise timers, mutexes, semaphores, event groups, mailboxes, and all the primitives PalmOS wants, like the ability to force-pause threads and the ability to disable task switching. It also takes advantage of the MPU to add some basic safety like stack guards. Also, there is great (& fast) support for thread local storage, which comes in handy later. Why write my own kernel, aren’t there enough out there? None of the ones out there really had the primitives I needed, and bolting them on would take just as long.
So, uh, what about all that pesky ARM code?
The ARM code still was a problem
PalmOS still would not boot all the way to UI because of the ARM code. But, if you remember, a few paragraphs ago I pointed out that we can trap attempts to get into ARM mode. I wrote a UsageFault handler that did that, and then… I emulated it.
You don’t mean…?
Oh, but I do. I wrote an ARM emulator that would read each instruction and execute it, until the code exited ARM mode, at which point I would exit the emulation and resume native execution. The actual details of how this works are interesting, since the emulator needs its own stack and cannot run on the stack of the emulated code. There also needs to be a place to stash the emulated registers, since we cannot just keep them in the real registers (not enough registers for both). Exiting emulation is also kind of fun, since you need to load ALL registers and the status register as well, atomically. Not actually trivial on Cortex-M. Well, in any case, “emu.c” and “emuC.c” have the code; go wild and explore.
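The fetch-and-execute loop described above can be sketched in a few lines. This toy model (Python, for illustration only, and not the code in “emu.c”) understands just three always-executed instruction forms, MOV and ADD with an immediate operand and BX, and stops as soon as a BX targets Thumb code. The function names and the `read_word` callback are hypothetical; condition codes, flags, and PC-relative operands are deliberately left out.

```python
MASK32 = 0xFFFFFFFF


def ror32(value, amount):
    """Rotate a 32-bit value right by amount bits."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & MASK32


def run_arm_until_mode_exit(read_word, regs):
    """Interpret ARM code until a BX leaves ARM mode; regs[15] is the next instruction."""
    while True:
        instr = read_word(regs[15])
        regs[15] = (regs[15] + 4) & MASK32
        if instr >> 28 != 0xE:
            raise NotImplementedError("only the 'always' condition is modelled")
        if instr & 0x0FFFFFF0 == 0x012FFF10:          # BX Rm
            target = regs[instr & 0xF]
            if target & 1:                            # Thumb target: emulation is over
                regs[15] = target & ~1 & MASK32
                return regs
            regs[15] = target & ~3 & MASK32           # still ARM: keep interpreting
            continue
        if instr & 0x0E100000 == 0x02000000:          # data processing, immediate, S=0
            opcode = (instr >> 21) & 0xF
            rn, rd = (instr >> 16) & 0xF, (instr >> 12) & 0xF
            if rn == 15 or rd == 15:
                raise NotImplementedError("PC operands are not modelled")
            imm = ror32(instr & 0xFF, 2 * ((instr >> 8) & 0xF))
            if opcode == 0xD:                         # MOV Rd, #imm
                regs[rd] = imm
            elif opcode == 0x4:                       # ADD Rd, Rn, #imm
                regs[rd] = (regs[rn] + imm) & MASK32
            else:
                raise NotImplementedError("opcode not modelled")
            continue
        raise NotImplementedError("instruction not modelled")
```

Feeding it the three words `0xE3A00005` (MOV R0, #5), `0xE2800001` (ADD R0, R0, #1), and `0xE12FFF1E` (BX LR) with LR holding an odd (Thumb) address leaves 6 in R0 and the PC at the Thumb return address. A real emulator additionally has to solve the stack, register-stash, and atomic-exit problems mentioned in the text, none of which this model touches.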
But isn’t writing an emulator in C kind of slow?
You have no idea! The emulator was slow. I instrumented CortexEmu to count cycles, and came up with an average of 170 host CPU cycles to emulate a single ARM instruction. Not good enough. Not even remotely. It is well known that emulators written in C are slow. C compilers kind of suck at optimizing emulator code. So what next? Well, I went ahead and rewrote the emulator core in assembly. Actually I did it twice. Once for ARMv7M (Cortex-M3 target) and once for ARMv6M (Cortex-M0 target). The speed improved a lot. Now for the M3 core I was averaging 14 cycles per emulated instruction, and for the M0 it was 19. A very respectable emulator performance, if I do say so myself.
So, is it quick sufficient now?
As talked about earlier than, on unique PalmOS gadgets, ARM code was typically quicker than Thumb, so many of the hottest, tightest, quickest code was written in ARM. For us, ARM is 14x slower than Thumb. So the code that was meant to be quickest is gradual. However allow us to take a list of this code and see what it truly is. Division routines are a part of it. ARMv7M implements division in {hardware}, however ARMv5 didn’t (nor does ARMv6M). These routines are 100 cycles or so in ARM mode. MemMove, MemMSet and MemCmp We spoke about already, and we don’t care as a result of we changed them, however a lot of libraries had their very own inner copies we can’t substitute. My guess is that the compiler prefers to inline its personal “memset” and “memcpy” most often. That made up a big a part of the boot course of’s ARM code utilization. Fortunately, all of those features are the identical all over the place…
So, can we pattern-match some of these in the emulator code and execute faster native routines? I did this, and the boot process did go faster. The average per-instruction overhead rose due to matching, but boot time shrank. Cool. But what happens after boot? After boot we meet the real monster… PACE’s m68k emulator is written in ARM: 60 kilobytes of what is clearly hand-written assembly with lots of clever tricks. Clever tricks suck when you are stuck emulating them… So that means every single m68k application (which is most of them) is now running under double emulation. Gross… Oh, also: slow. Something had to be done. I considered rewriting PACE, but that is a poor solution - there are a lot of ARM libraries and I cannot rewrite them all. Plus, in what way can I claim to be running an unmodified OS if I replace every bit of it?
There is one other way to make non-native code fast…
You do not mean…? (pt2)
Just in time: this
PACE contains a lot of hot code that is static. On real devices it lives in ROM and does not change. Most libraries are the same. So, what can we do to make it run faster? Translate it to something we can run natively, of course. Most people would not take on the task of writing a just-in-time translator alone. But that is just because they are wimps 🙂 (Or maybe they reasonably assume that it is a huge time sink with more corner cases than one could shake a stick at.)
JITs: how do we start?
Basically the same way we did for the emulator. We create a per-thread translation cache (TC) which will hold our translations. Why per thread? Because this avoids the problem of one thread flushing the cache while another is running in it with no end in sight. The TC will contain translation units (TUs), each of which represents some translated code. Each TU contains its original “source” ARM address, and then just valid Thumb2 code. There will also be a hashtable which maps source ARM addresses to a bucket where the first TU for that hash value is stored. Each bucket is a linked list, and 4096 buckets are used. This is configurable. A fast and simple hash is used; tested on a representative sample of addresses, it gave good distribution. Now, every time we take a UsageFault that indicates an attempted entry to ARM mode, we look up the desired address in the hashtable. If we get a hit, we simply replace the PC in the exception frame with the “code” pointer of the matching TU and return. The CPU proceeds to execute native code quickly. Wonderful! What if we do not get a hit? We then save the state and replace the PC in the exception frame with the address of the translation code (we do not want to translate in kernel mode).
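To make the lookup concrete, here is a minimal sketch of a TC hashtable with chained buckets. It is not the code from “emuJit.c”: the struct layout, the names (`struct TU`, `tcHash`, `tcInsert`, `tcLookup`), and the hash function itself are all assumptions made for illustration. Only the bucket count of 4096 and the "source address, then code" idea come from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TC_NUM_BUCKETS 4096u    /* the configurable default mentioned in the text */
static_assert((TC_NUM_BUCKETS & (TC_NUM_BUCKETS - 1u)) == 0u,
              "this sketch masks the hash, so the bucket count must be a power of two");

/* Hypothetical TU layout: chain link, source ARM address, then translated code. */
struct TU {
    struct TU *next;      /* next TU in the same bucket (linked list) */
    uint32_t   srcAddr;   /* original ARM address this TU was translated from */
    uint16_t   code[];    /* Thumb2 halfwords follow */
};

struct TC {
    struct TU *buckets[TC_NUM_BUCKETS];
};

/* Placeholder hash (an assumption): ARM code is word aligned, so drop the low
 * two bits and fold some higher bits in before masking. */
static uint32_t tcHash(uint32_t armAddr)
{
    uint32_t v = armAddr >> 2;
    v ^= v >> 12;
    return v & (TC_NUM_BUCKETS - 1u);
}

/* Link a freshly translated TU at the head of its bucket. */
static void tcInsert(struct TC *tc, struct TU *tu)
{
    uint32_t b = tcHash(tu->srcAddr);
    tu->next = tc->buckets[b];
    tc->buckets[b] = tu;
}

/* Returns the translated code for armAddr, or NULL if it must be translated first. */
static const uint16_t *tcLookup(const struct TC *tc, uint32_t armAddr)
{
    for (const struct TU *tu = tc->buckets[tcHash(armAddr)]; tu != NULL; tu = tu->next)
        if (tu->srcAddr == armAddr)
            return tu->code;
    return NULL;
}
```

In a fault handler, a non-NULL result would be written into the saved PC of the exception frame; a NULL result would redirect to the translator instead.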
Parlez-vous ARM?
The front end of a JIT basically just needs to ingest ARM instructions and understand them. We will trap on any we do not understand, and try to translate all those that we do. Here we hit our first snag. Some games use instructions that are not valid. Bejeweled, I am looking at you! The game “Bejeweled” has some ARM code included in it, and it likes to return by executing LDMDB R11, {R0-R12, SP, PC}^. Ignoring the fact that R0-R2 and R12 do not need to be saved and they are being inefficient, that is also not a valid instruction to execute in user mode at all. That little caret at the end means “also transfer SPSR to CPSR”. That request is invalid in user mode, and the ARM Architecture Reference Manual is very clear that executing this in user mode will have undefined results. This explains why Bejeweled did not run under rePalm under QEMU. QEMU correctly refused to execute this insanity. Well, I dragged a Palm device out of a drawer and tested to see what actually happens if you execute this. Turns out that it is just ignored. Well, I guess my JIT will do that too. My emulator cores had no trouble with this instruction: since it is undefined, treating it as if it had no caret was safe, and thus they never even checked the bit that indicated it.
Luckily for us, ARM only has a few instruction formats. Unluckily for us, they are all pretty complex. Luckily, decoding is easy. Almost every ARM instruction is conditional, and the top 4 bits determine whether it executes at all. Data-processing operations are always 3-operand: destination reg, source reg, and “operand”, which is ARM’s addressing mode 1. It can be an immediate of certain forms, a register, a register shifted by an immediate, or a register shifted by a register. Say what?! Yup, you can do things like ADD R0, R1, R2, ROR R3. Be scared. Be very scared! Setting flags is optional. Loading/storing bytes or words uses addressing mode 2, which allows the use of a register plus/minus an immediate, a register plus/minus a register, or a register plus/minus a register shifted by an immediate. All of these modes can be indexed, postindexed, or indexed with writeback, so scary things like LDR R0, [R1], R2, LSL #12 can be concocted. Loading/storing halfwords or signed data uses addressing mode 3, which is just like mode 2 except that no register shifts are available. This mode is also used for the LDRD and STRD instructions that some ARMv5 cores implement (this is part of the optional DSP extension). Addressing mode 4 is used for the LDM and STM instructions, which are terrifying in their complexity and number of corner cases. They can load or store any subset of registers to a given base address with pre-or-post increment-or-decrement and optional writeback. They are used for stack ops. And last, but not least, there are branches, which are all encoded simply and decode easily. Phew…
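As an illustration of how little it takes to tell the addressing mode 1 forms apart, here is a small decoding sketch. The function and enum names are made up for this example and are not taken from rePalm’s sources; the bit positions follow the standard ARM data-processing encoding. It deliberately ignores the unconditional (condition 0xF) instruction space.

```c
#include <assert.h>
#include <stdint.h>

enum Op2Kind {
    OP2_IMMEDIATE,            /* 8-bit value rotated right by an even amount */
    OP2_REG_SHIFT_IMM,        /* Rm, <shift> #imm5 (a plain Rm is LSL #0)    */
    OP2_REG_SHIFT_REG,        /* Rm, <shift> Rs, e.g. ADD R0, R1, R2, ROR R3 */
    OP2_NOT_DATA_PROCESSING   /* some other instruction class                */
};

/* The top four bits decide whether the instruction executes at all. */
static uint32_t armCond(uint32_t insn)
{
    return insn >> 28;
}

static enum Op2Kind armOp2Kind(uint32_t insn)
{
    if (((insn >> 26) & 3u) != 0u)
        return OP2_NOT_DATA_PROCESSING;      /* loads/stores, LDM/STM, branches... */
    if (((insn >> 23) & 3u) == 2u && !(insn & (1u << 20)))
        return OP2_NOT_DATA_PROCESSING;      /* TST/TEQ/CMP/CMN without S: BX, MRS, MSR... */
    if (insn & (1u << 25))
        return OP2_IMMEDIATE;
    if (!(insn & (1u << 4)))
        return OP2_REG_SHIFT_IMM;
    if (!(insn & (1u << 7)))
        return OP2_REG_SHIFT_REG;
    return OP2_NOT_DATA_PROCESSING;          /* multiplies and mode 3 loads/stores */
}

/* Value of an immediate operand: imm8 rotated right by twice the 4-bit rotate field. */
static uint32_t armOp2Immediate(uint32_t insn)
{
    assert(armOp2Kind(insn) == OP2_IMMEDIATE);
    uint32_t imm8 = insn & 0xFFu;
    uint32_t rot  = ((insn >> 8) & 0xFu) * 2u;
    return rot ? (imm8 >> rot) | (imm8 << (32u - rot)) : imm8;
}
```

For example, 0xE0810372 is ADD R0, R1, R2, ROR R3 and classifies as the register-shifted-by-register form, which is exactly the one Thumb2 has no equivalent for.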
2 Thumbs don’t make an ARM
Initially the thought was: the translation cannot be all that hard, can it? The instructions look similar, so it should not be all that bad. Then reality hit. Hard. Thumb2 has a lot of restrictions on operands; for example, SP cannot be treated like a general register at all, and LR and PC cannot ever be loaded together. It also lacks anything matching addressing mode 1’s ability to shift a register by a register as the third operand to an ALU operation. It lacks the ability to shift a third register by more than 3, like mode 2 can in ARM. I am not even going to talk about LDM and STM! Oh, and then there is the issue of not letting the translated code know it is being translated. This means it must still think it is running from its original place, and if it reads itself, it must see ARM instructions. This means we cannot ever leak the PC’s real value into any executable state. The practical upshot of that is that we can never emit a BL instruction, and whenever PC is read, we must instead produce an immediate value equal to what PC would have been, had the actual ARM code run from its actual place in memory. Not fun…
Thumb2’s LDM/STM actually lack half the modes that ARM has (modes IB and DA), so we would need to expand those instructions into a lot more code. Oh, and Thumb has limits on writeback that do not match ARM’s (they are more strict), and you can never use SP in the register set, nor can you ever store PC this way in Thumb2. At this point it becomes abundantly clear that this will not be an easy instruction in -> instruction out job. We will need places to store temporary immediates, we will need to rewrite lots of instructions, and we will need to do it all without causing side effects. Oh, and it needs to be fast too!
A JIT’s job is never over
LDM and STM, may they burn in hell forever!
How LDM/STM work in ARM
ARM has two multiple-register ops: LDM and STM. Each has a few addressing modes. First is the order: up or down in addresses (that is, does the base register address where to store the lowest-numbered register or the highest?). Next is whether the base register itself is to be used, or whether it should be incremented/decremented first. This gives us the four basic modes: IA (“increment after”), IB (“increment before”), DA (“decrement after”), and DB (“decrement before”). Besides that, it is optional to write back the updated base address to the base register. There are, of course, corner cases, like what value gets stored if the base register is stored with writeback enabled, or what value the base register will have if it is loaded while writeback is also specified. The ARM spec explicitly defines some of these cases as having unpredictable consequences.
For the stack, ARM uses a full-descending stack. That means that at any point, the SP register points to the last ALREADY USED stack position. So, to pop a value, you load it from [SP], and then increment SP by 4. This would be done using an LDM instruction with the IA addressing mode. To push a value onto the stack, one should first decrement SP by 4, and then store the desired value into [SP]. This corresponds to an STM instruction with the DB addressing mode. The IB and DA modes are not used for the stack in normal ARM code.
How LDM/STM work in Thumb2
So why did I tell you all this? Well, while designing the Thumb2 instruction set, ARM decided what to support and what not to. This basically meant that uncommon things did not get carried forward. Yup… you see where this is going. Thumb2 does not support the IB and DA modes. At all. Not cool. But there is more. Thumb2 forbids using the PC or SP registers in the list of registers to be stored by STM. Thumb2 also forbids ever loading SP using LDM. Also, if an LDM loads PC, it may not also load LR, and if it loads LR, it may not also load PC. There is more yet… PC is not allowed as the base register, and the register list must be at least two registers long. This is a somewhat complete list of what Thumb2 is missing compared to ARM.
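The restrictions listed above can be written down as a single predicate over the ARM encoding. This is only a sketch of those listed rules, with a made-up function name; it is not the check rePalm performs, it does not cover the stricter Thumb writeback limits mentioned earlier, and it does not look at the S bit (the caret).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define REG_SP 13u
#define REG_LR 14u
#define REG_PC 15u

/* True if an ARM LDM/STM passes the Thumb2 restrictions listed in the text. */
static bool fitsThumb2LdmStm(uint32_t insn)
{
    assert(((insn >> 25) & 7u) == 4u);      /* must be an LDM/STM encoding */

    bool     pre  = (insn >> 24) & 1u;      /* P: adjust base before the transfer */
    bool     up   = (insn >> 23) & 1u;      /* U: addresses go upwards */
    bool     load = (insn >> 20) & 1u;      /* L: LDM rather than STM */
    uint32_t base = (insn >> 16) & 15u;
    uint32_t list = insn & 0xFFFFu;

    if (pre == up)
        return false;                       /* IB (P=1,U=1) or DA (P=0,U=0) */
    if (base == REG_PC)
        return false;                       /* PC may not be the base */
    if ((list & (list - 1u)) == 0u)
        return false;                       /* fewer than two registers */
    if (list & (1u << REG_SP))
        return false;                       /* SP can be neither loaded nor stored */
    if (!load && (list & (1u << REG_PC)))
        return false;                       /* PC cannot be stored */
    if (load && (list & (1u << REG_PC)) && (list & (1u << REG_LR)))
        return false;                       /* PC and LR cannot be loaded together */
    return true;
}
```

Bejeweled’s LDMDB R11, {R0-R12, SP, PC}^ encodes as 0xE95BBFFF and fails this check because it loads SP, while a plain PUSH {R4, LR} (0xE92D4010) passes.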
But wait, there is more. Even the instructions that map nicely from ARM to Thumb2 and comply with all the restrictions of Thumb2 are not that simple to translate. For example, storing PC is, as always, hard: we need a spare register to hold the expected PC value so we can push it. But registers are pushed in order, so depending on which register we pick as our temporary reg, it might be out of order relative to the others, and we might need to split the store into a few stores. But there is more yet. What if the store was to SP or included SP? We changed SP by pushing our temp reg, so we need to adjust for that. But what if this was an STMDB SP! (aka PUSH)? Then we cannot pre-push a temp register that easily…
But wait, there’s more … pain
There is another complication. LDM/STM is expected to behave as an atomic instruction to userspace. It is either aborted or resumable at system level. But in Thumb2 on Cortex-M chips, SP is special since the exception frame gets stored there. This means that SP must always be valid, and any data stored BELOW SP is not guaranteed to ever persist (since an interrupt may happen at any time). Luckily, on ARM it was also discouraged to store data below SP, and this was rarely done. There is one common piece of PalmOS code that does this: the code around SysLinkerStub that is used to lazy-load libraries. For other reasons rePalm replaced this code anyway, though. In all other cases the JIT will emit a warning if an attempt is made to load/store below SP.
As you see, this is very, very, very complex. In fact, the complete code to translate LDM/STM ended up being just over four thousand lines long, and the worst-case translation can be 60-ish bytes. Luckily that is only for very weird instructions, the likes of which I have never seen in real code. “So,” you might ask, “how could this be tested if no code uses it?” I actually used a modified version of my uARM emulator to emulate both the original code and the translated code, to verify that each destination address is loaded/stored exactly once and only with proper values, and then made a test program that would generate a lot of random valid LDM/STM instructions. It was then left to run for a few weeks. All bugs were exterminated with extreme prejudice, and I am now satisfied that it works. So here is how the JIT handles it, in general (look in “emuJit.c” for details).
Translating LDM/STM
- Check whether the instruction triggers any undefined behaviour, or is otherwise not defined to behave in a particular way as per the ARM Architecture Reference Manual. If so, log an error and bail out.
- Check whether it can be emitted as a Thumb2 LDM/STM, that is: does it comply with ALL the restrictions Thumb2 imposes? If so, and if PC is not being stored, emit a Thumb2 LDM/STM.
- Check whether it can be emitted as an LDR/STR/LDRD/STRD while complying with Thumb2 limits on those. If so, that is emitted.
- A few special fast paths emit translations for common cases that are not covered by the above (for example, ADS liked to use STMIB for storing function parameters to the stack).
- For the unsupported modes IB and DA, if no writeback is used, the instruction can be rewritten in terms of the supported modes.
- If the instruction loads SP, it is impossible to emit a valid translation due to how ARMv7-M uses SP. For this one special case, the JIT emits a special undefined instruction, and we trap it and emulate it. Luckily, no common code ever uses this!
- Finally, the generic slow path is taken:
  - Generate a list of registers to be loaded/stored, and at what addresses.
  - Calculate writeback if needed.
  - If needed, allocate a temporary register or two (we need two if storing PC and SP) and spill their contents to the stack.
  - For all registers left to be loaded/stored, see how many we can load/store at once, and do so. This involves emitting a set of instructions: LDR/STR/LDRD/STRD/LDM/STM until all is done.
  - If we had allocated temporary registers, restore them.
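The first two steps of the slow path (which register goes to which address, and what the writeback is) follow directly from the four addressing modes. Below is a sketch of that computation, expressed as offsets from the base register since the base value is not known at translation time. The struct and function names are invented for this example and are not those used in “emuJit.c”.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: one entry per transferred register. */
struct LdmStmPlan {
    uint32_t count;            /* number of registers transferred */
    uint8_t  reg[16];          /* register numbers, ascending */
    int32_t  offset[16];       /* byte offset from the ORIGINAL base value */
    int32_t  writebackDelta;   /* amount added to the base, 0 if no writeback */
};

static void planLdmStm(uint32_t insn, struct LdmStmPlan *plan)
{
    assert(((insn >> 25) & 7u) == 4u);      /* must be an LDM/STM encoding */

    uint32_t pre  = (insn >> 24) & 1u;      /* P */
    uint32_t up   = (insn >> 23) & 1u;      /* U */
    uint32_t wb   = (insn >> 21) & 1u;      /* W */
    uint32_t list = insn & 0xFFFFu;

    int32_t n = 0;
    for (uint32_t r = 0; r < 16u; r++)
        n += (int32_t)((list >> r) & 1u);

    /* The lowest-numbered register always lands at the lowest address:
     * IA: base, IB: base+4, DA: base-4n+4, DB: base-4n. */
    int32_t first;
    if (up)
        first = pre ? 4 : 0;
    else
        first = pre ? -4 * n : -4 * n + 4;

    plan->count = 0;
    for (uint32_t r = 0; r < 16u; r++) {
        if (list & (1u << r)) {
            plan->reg[plan->count] = (uint8_t)r;
            plan->offset[plan->count] = first + 4 * (int32_t)plan->count;
            plan->count++;
        }
    }
    plan->writebackDelta = wb ? (up ? 4 * n : -4 * n) : 0;
}
```

For PUSH {R4, LR} (STMDB SP!, encoded 0xE92D4010) this yields R4 at offset -8, LR at offset -4, and a writeback of -8. A translator would then walk such a plan and pick the widest legal Thumb2 instruction for each run of entries.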
Slightly less hellish instructions
Addressing mode 1 was hard as well. Basically, thanks to those rotate-by-register modes, we need a temporary register to calculate that value so we can then use it. If the destination register is not otherwise used, we can use it as temp storage, since it is about to be overwritten by the result anyway, unless it is also one of the other source operands… or SP… or PC… oh god, this is becoming a mess. Now what if PC is also an operand? We need a temporary register to load the “fake” PC value into before we can operate on it. But once again we have no temporary registers. This got messy very quickly. Feel free to look in “emuJit.c” for details. Long story short: we do our best not to spill things to the stack, but sometimes we do have to.
The same applies to some complex addressing modes. Thumb2 optimized its instructions for common cases, which makes uncommon cases very hard to translate. Here it is even harder to find temporary registers, because if we push anything, we might need to account for that if our base register is SP. Once again: long story, scary story, see “emuJit.c”. Basically: common things get translated well, uncommon ones do not. A special case is PC-based loads. These are used to load constant data. In most cases we inline the constant data into the produced translations for speed.
Conditional instructions
Thumb2 does have a way to make instructions conditional: the IT instruction, which makes the next 1-4 instructions conditional. I chose not to use it because it also changes how flags get set by 2-byte Thumb instructions, and I did not want to special-case it. Also, sometimes 4 instructions are not enough for a translation; e.g., some STMDA instructions expand to 28 instructions or so. I just emit a branch of opposite polarity (condition) over the translation. This works since these branches are also just 2 bytes long for all possible translation lengths.
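The "branch of opposite polarity" trick can be sketched as follows, using the 16-bit Thumb conditional branch encoding (0xD000 | cond << 8 | imm8). The function name is made up for this example. ARM condition codes come in complementary pairs that differ only in the lowest bit, so inverting a condition is a single XOR.

```c
#include <assert.h>
#include <stdint.h>

/* Build the 16-bit Thumb B<cond> that jumps over `skipBytes` bytes of
 * translated code placed right after it, taken when `armCond` is FALSE. */
static uint16_t thumbBranchOver(uint32_t armCond, uint32_t skipBytes)
{
    assert(armCond < 14u);                  /* AL (14) needs no guard at all */
    assert(skipBytes >= 2u && skipBytes <= 256u && (skipBytes & 1u) == 0u);

    uint32_t inverted = armCond ^ 1u;       /* EQ<->NE, CS<->CC, GE<->LT, ... */

    /* Branch target = branch address + 4 + 2*imm8, and we want to land at
     * branch address + 2 + skipBytes, so 2*imm8 = skipBytes - 2. */
    uint32_t imm8 = (skipBytes - 2u) / 2u;

    return (uint16_t)(0xD000u | (inverted << 8) | imm8);
}
```

With a signed 8-bit offset the forward reach is 256 bytes of translation, which comfortably covers the translation sizes quoted in this article.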
Jumps & Calls
This is where it gets interesting. Basically there are two types of jumps/calls: those whose destinations are known at translation time, and those whose destinations are not. Those whose addresses are known at translation time are pretty simple to handle. We look up the destination address in our TC. If it is found, we literally emit a direct jump to that TU. This makes hot loops fast - no exit from translated code is needed. Indirect or computed jumps are not common, so one would think that they are not that important. This is wrong, because there is one kind of such jump that happens a lot: function return. We do not, at translation time, know where the return is going to go. So how do we handle it? Well, if the code directly loads PC, everything will work as expected. Either it will be an ARM address and our UsageFault handler will do its thing, or it will be a Thumb address and our CPU will jump to it directly. An optimization exists in case an actual BX LR instruction is seen. We then emit a direct jump to a function that looks up LR in the hash - this saves us the time needed to take an exception and return from it (~60 cycles). Obviously more optimizations are possible, and more will be added, but for now, this is how it is. So what do we do for a jump whose destination is known but which we have not yet translated? We leave ourselves a marker, namely an instruction we know is undefined, and we follow that up with the target address. This way, if the jump is ever actually taken (not all are), we will take the fault, translate, and then replace that undefined instruction and the word following it with an actual jump. Next time, that jump will be fast, taking no faults.
Translating a TU
The process is simple: translate instructions until we reach one that we decide is terminal. What is terminal? An unconditional branch is terminal. A call is too (conditional or not). Why? Because someone might return from it, and we would rather have the code being returned to live in a new TU so we can find it when the return happens. An unconditional write to PC of any kind is terminal as well. There is also a bit of cleverness for jumps to nearby places. As we translate a TU, we keep track of the last few dozen instructions we translated and where their translations ended up. This way, if we see a short jump backwards, we can literally inline a jump to that translation right in there, thus creating a wonderfully fast translation of this small loop. But what about short jumps forward? We remember those as well, and if, before we reach our terminal instruction, we translate an address to which we remembered a past jump from this same TU, we go back and replace that jump with a short one to here.
And if the TC is full?
You might notice that I said we emit jumps between TUs. “Doesn’t this mean,” you might ask, “that you cannot just delete a single TU?” This is correct. It turns out that keeping track of which TUs are used a lot and which are not is too much work, and the benefits of inter-TU jumps are too big to ignore. So what do we do when the TC is full? We flush it - literally throw it all away. This also helps make sure that old translations that are no longer needed eventually do get tossed. Each thread’s TC grows up to a maximum size. Some threads never run a lot of ARM and end up with small TCs. The TC of the main UI thread will basically always grow to the maximum (currently 32KB).
Growing up
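A flush-everything policy pairs naturally with a bump allocator: allocation is a pointer increment, and a flush is resetting one counter. The sketch below shows only that idea. The names are hypothetical, it uses a fixed capacity rather than a TC that grows up to a maximum, and a real flush would also have to empty the hashtable so no bucket points at discarded TUs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-thread arena backing one translation cache. */
struct TcArena {
    uint8_t *base;      /* start of the cache memory */
    size_t   used;      /* bytes handed out so far */
    size_t   cap;       /* fixed capacity in this sketch */
    unsigned flushes;   /* how many times everything was thrown away */
};

/* Reserve `bytes` for a new TU, flushing the whole cache if it does not fit. */
static void *tcAlloc(struct TcArena *a, size_t bytes)
{
    assert(bytes <= a->cap);            /* a single TU must fit in an empty cache */

    if (a->cap - a->used < bytes) {
        a->used = 0;                    /* flush: every old TU is now invalid */
        a->flushes++;                   /* caller must also clear the hashtable */
    }

    void *p = a->base + a->used;
    a->used += bytes;
    return p;
}
```

Because TUs jump straight into each other, throwing them all away at once is the only moment at which no dangling inter-TU jump can survive.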
After the JIT worked, I rewrote it. The initial version was full of magic values and holes (cases that could happen in legitimate code but would be mistranslated). It also sometimes emitted invalid opcodes that the Cortex-M4 would still execute (despite docs saying they were not allowed). The JIT was split into two pieces. The first was the frontend, which ingested ARM instructions, maintained the TC, and kept track of various other state. The second was the backend. The backend had a function for each possible ARMv5 addressing mode or instruction format, and given ANY valid ARMv5 instruction, it could produce a sequence of ARMv7-M instructions to perform the same task. For common cases the sequence was well optimized; for uncommon ones, it was not. Nevertheless, the backend handles ANY possible valid ARMv5 request, even insane things like, for example, RSBS PC, SP, PC, ROR SP. No sane person would ever produce this instruction, but the backend will properly translate it. I wrote tests and ran them automatically to verify that all possible inputs are handled, and correctly so. I also optimized the hottest path in the whole system: the emulation of the BLX instruction in Thumb. It is now a whopping 50 cycles faster, which noticeably impacted performance. As an extra small optimization, I noticed that oftentimes Thumb code would use a BLX merely to jump to an OsCall (which, due to using R12 and R9, cannot be written in Thumb mode). The new BLX handler detects this and skips emulation by calling the requisite OsCall directly.
I then wrote a sub-backend for the EDSP extension (ARMv5E instructions), since some Sony apps use them. The reason for a separate sub-backend is that ARMv7E-M (Cortex-M4) has instructions we can use to translate EDSP instructions very efficiently, while ARMv7-M (Cortex-M3) does not, and requires longer instruction sequences to do the same work. rePalm supports both.
Later, I went back and, despite it being a huge pain, worked out a way to use the IT instruction on Cortex-M3+. This resulted in a huge amount of code refactoring - basically passing a “condition code” to every backend function and expecting it to conditionalize itself however it wants. This produced a change with an over-4000-line diff, but it works very well and resulted in a noticeable speed increase!
The Cortex-M0 backend
Why this is insane
It was quite an undertaking, but I wanted to see if I could make a working Cortex-M0 backend for my JIT. Cortex-M0 executes the ARMv6-M instruction set. This is basically just Thumb-1, with a few minor additions. Why is this scary? In Thumb-1, most instructions only have access to half the registers (r0..r7). Only three instructions have access to high registers: CMP, MOV, and ADD. Almost all Thumb-1 instructions always set flags. There are also no long-multiply instructions in Thumb-1. And there is no RRX rotation mode at all. The confluence of all these issues makes attempting a one-to-one instruction-to-instruction translation from ARM to Thumb-1 a non-starter.
To make it all work, we will need some temporary working space: a few registers. It is all doable with three with a lot of work, and comfortable with four. So I decided to use four work registers. We will also need a register to point to our context (the place where we store extra state). And, for speed, we will want a register to store the virtual status register. Why do we need one of those? Because almost all of our Thumb-1 instructions clobber flags, while the ARM code we are translating expects flags to stick around during long instruction sequences. So our total is 6. We need 6 registers. They need to be low registers since, as we had discussed, high registers are basically useless in Thumb-1.
The basics
Registers r0 through r3 are temporary work registers for us. The r4 register is where we keep our virtual status register, and r5 points to our context. We use r12 as another temporary. Yes, it is a high register, but sometimes we really just need to stash something, so only being able to MOV something in and out of it is enough. So, what is in a context? Well, the state of the virtual r0 through r5 registers, as well as the virtual r12 and the virtual lr register. There obviously needs to be a separate context for every thread, since they may each run different ARM code. We allocate one the first time a thread runs ARM (it is actually part of the JIT state, and we copy it if we reallocate the JIT state).
“But,” you might say, “if PalmOS’s Thumb code expects register values in registers, and our translated ARM code keeps some of them in a weird context structure, how will they work together?” This is actually complex. Before every translation unit, we emit a prologue. It saves the registers from our real registers into the context. At the end of every translation unit, we emit an epilogue that restores registers from the context into the real registers. When we generate jumps between translation units, we jump past these pieces of code, so as long as we are running in translated code, we take no penalty for saving/restoring contexts. We only need to take that penalty when switching between translated code and real Thumb code. Actually, it turns out that the prologue and epilogue are large enough that emitting them inside every TU is a huge waste of space, so we just keep a copy of each in a special place in the context, and have each TU call them as needed. A later speed improvement I added was to have multiple epilogues, based on whether we know that the code is jumping to ARM code, Thumb code, or “not sure which”. This allows us to save a few cycles on exiting translated code. Every cycle counts!
Fault dispatching
There is just one more problem: those BLX instructions in Thumb mode. If you remember, I wrote about how they do not exist in ARMv7-M. They also do not exist in ARMv6-M. So we also need to emulate them. But, unlike ARMv7-M, ARMv6-M has no real fault handling ability. All faults are considered unrecoverable and cause a HardFault. Clearly something had to be done to work around that. This actually led to a rather large side project, which I published separately: m0FaultDispatch. In short: I found a way to completely and correctly determine the fault cause on the Cortex-M0, and recover as needed from many types of faults, including invalid memory accesses, unaligned memory accesses, and invalid instructions. With this final puzzle piece found, the Cortex-M0 JIT was functional.
Is PACE fast enough?
Those indirect jumps…
Sadly, emulation almost always involves a lot of indirect jumps. Basically, that is how one does instruction decoding. 68k being a CISC architecture with variable-length instructions means that the decoding stage is complex. PACE’s emulator is clearly hand-written in assembly, with some tricks. It is all ARM. It is actually the same, instruction for instruction, from PalmOS 5.0 to PalmOS 5.4. The surrounding code changed, but the emulator core did not. This is actually good news: it means it was good as is. My JIT properly and correctly handles translating PACE, as evidenced by the fact that rePalm works on ARMv7-M. The main problem is that every instruction emulated requires at least one indirect jump (for common instructions), two for medium-commonness ones, and up to three for some rare ones. Due to how my JIT works, each indirect jump that is not a function return requires an exception to be taken (14 cycles in, 12 out), some glue code (~30 cycles), and a hash lookup (~20 cycles). So even when the target code has already been translated, this adds 70-ish cycles to each indirect jump. This puts a ceiling on the efficiency of the 68k emulator at 1/70th the speed. Not great. PACE is usually about 1/15 the speed of native code, so that is quite a slowdown. I considered writing better translation just for PACE, but it is quite nontrivial to do fast. Simply put, there is no simple fast way to translate something like LDR R0, [R11, R1, LSL #2]; ADD PC, R11, R0. There simply is no way to know where that jump will go, or even that R11 points to a location that is immutable. Sadly, that is what PACE’s top-level dispatch looks like.
A special solution for a special problem
I had already fulfilled my goal of running PalmOS unmodified - PACE does work with my JIT, and the OS is usable and not slow - but I wanted a better solution and decided that PACE is a unique enough problem to warrant one. The emulator core in PACE has a single entry point, and only calls out to other code in ten clear cases: Line1010 (instruction starting with 0xA), Line1111 (instruction starting with 0xF), TRAP0, TRAP8, TRAPF (OsCall), Division By Zero, Illegal Instruction, Unimplemented Instruction, Trace Bit being set, and hitting a PC value of precisely 0xFFFFFFF0. So what to do? I wrote a tool, “patchpace”, that takes in a PACE.prc from any PalmOS device, analyzes it to find where these handlers are in the binary, and finds the main emulator core. It then replaces the core (in place if there is enough space, appended to the binary if not) with code you provide. The handler addresses are inserted into your code at offsets the header provides, and a jump to your code is placed where the old emulator core was. The header is very simple (see “patchpace.c”) and just consists of halfword offsets from the start of the binary to the entry point, and to the places where jumps to each of the abovementioned handlers are to be inserted (as BL or BLX instructions). The only parameter to the emulator is the state. It is structured thusly: the first word is free for the emulator to use as it pleases, then the 8 D-regs, then the 8 A-regs, then PC, and then SR. No further data is allowed (PACE uses the data after this). This same state must be passed to all the handlers. The TRAPF handler also needs the next word passed to it (the OsCall number). Yes, you understand this correctly: this allows you to bring your own 68k emulator to the party. Any 68k emulator will do; it does not need to know anything about PalmOS at all. Pretty sweet!
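For anyone writing a replacement core, the state layout described above might be pictured as the following C struct. This is a guess at the shape, not a copy of anything in “patchpace.c”: the struct and field names are invented, and it assumes every field, including SR, occupies a full 32-bit word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the emulator state handed to a replacement 68k core.
 * Assumption: all fields are 32-bit words, SR included. */
struct PaceEmuState {
    uint32_t scratch;   /* first word: free for the emulator to use */
    uint32_t d[8];      /* D0..D7 */
    uint32_t a[8];      /* A0..A7 */
    uint32_t pc;        /* 68k program counter */
    uint32_t sr;        /* 68k status register */
    /* nothing may be added here: PACE keeps its own data after this point */
};

/* Under the stated assumptions the layout is fixed; check it at compile time. */
static_assert(offsetof(struct PaceEmuState, pc) == 68, "PC follows 17 words");
static_assert(sizeof(struct PaceEmuState) == 76, "19 words in total");
```

The compile-time checks are there because the replacement core and PACE’s handlers share this memory, so any padding or reordering would silently corrupt the handlers’ view of the registers.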
Any 68k emulator…
So the place can we get us a 68k emulator? Nicely, wherever? I wrote a easy one in C to check this concept, and it labored effectively, however actually for this form of factor you need meeting. I took PACE’s emulator as a mode information, and did a LOT of labor to supply a thumb2 68k emulator. It’s far more environment friendly than PACE ever was. That is included within the “mkrom” folder as “PACE.0003.patch”. As acknowledged earlier than, that is completely optionally available and never required. However it does enhance uncooked 68k velocity by about 8.4x within the typical case.
But, you promised hardware…
Hardware has bugs
I needed a dev board to play with. The STM32F429 discovery board seemed like a good start. It has 8MB of RAM, which is enough, 2MB of flash, which is good, and a display with a touchscreen. Basically, it is perfect on paper. Oh, if only I had known how imperfect the reality was. Reading the STM32F429 reference manual, it does sound like the perfect chip for this project. And ST does not quite go out of their way to tell you where to find the problems. The errata sheet is damning. Basically, if you make the CPU run from external memory, put the stack in external memory, and the SDRAM FIFO is on, exceptions will crash the chip (incorrect vector address read). OK, I can work around that: just turn off the FIFO. Next erratum: same story, but if the FIFO is off, sometimes writes will be ignored and not actually write. Ouch! Fine! I’ll move my stacks to internal RAM. It is quite a rearchitecting, but OK, fine! Still crashes. No errata about that! What gives? I removed rePalm and created a 20-line repro scenario. This is not in ST’s errata sheet, but here is what I found: if PC points to external RAM, and a WFI instruction is executed (to wait for interrupts in a low-power mode), and then an interrupt happens after more than 60ms, the CPU will take a random interrupt vector instead of the correct one after waking up! Just imagine how long that took to figure out! How many sleepless nights spent ripping my hair out at random crashes in interrupt handlers that simply could not possibly be executing at that time! I worked around this by not using WFI. Power is obviously wasted this way, but that is OK for development for now, until I design a board with a chip that actually works!
Next issue: RAM address. The STM32F429 supports two banks of RAM, 0 and 1. Bank 0 starts at 0xC0000000 and Bank 1 at 0xD0000000. This is a problem because PalmOS needs both RAM and flash to be below 0x80000000. Well, we are lucky: RAM Bank 0 is remappable to 0x00000000. Sweet… Until you realize that whoever designed this board hated us! The board only has one RAM chip connected, so logically it is Bank 0. Right? Nope! It is Bank 1, and that one is not remappable. Well, damn! Now we are stuck, and this board cannot be used to boot PalmOS. The 0x80000000 limit is pretty much set in stone.
So why the 0x80000000 limit?
PalmOS has two kinds of memory chunks: movable and nonmovable. This is what an OS without access to an MMU does to avoid too much memory fragmentation. Basically, when a movable chunk is not locked, the OS can move it, and one references it using a “handle”. One can then lock it to get a pointer, use it, and then unlock it when done. So what has this got to do with 0x80000000? PalmOS uses the top bit of a pointer to indicate whether it is a handle or an actual pointer. The top bit being set indicates a handle; clear indicates a pointer. So now you see that we cannot really live with RAM and ROM above 0x80000000. But then again, maybe…
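The top-bit rule can be sketched in a few lines of C. The function and macro names here are invented for illustration, and references are modeled as 32-bit integers so the example also behaves on a 64-bit host; this is not PalmOS source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Top bit set means "handle", top bit clear means "real pointer". */
#define HANDLE_FLAG 0x80000000u

bool ref_is_handle(uint32_t ref)
{
    return (ref & HANDLE_FLAG) != 0;
}

bool ref_is_pointer(uint32_t ref)
{
    return !ref_is_handle(ref);
}
```

With this check, an address in flash at 0x08000000 is classified as a pointer, while any address in the 0xD0000000 RAM bank would be misread as a handle, which is exactly why RAM above 0x80000000 is a problem.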
Two wrongs don’t make a right, but do two nasty hacks?
Given that I had already decided that this board was only for temporary development, why not go further? Handle-vs-pointer disambiguation is only done in a few places. Why not patch them to invert the condition? At least for now. No, not at runtime. I actually disassembled and hand-patched 58 places total. Most were in Boot, where the MemoryManager lives; a few were in UI, since the code for text fields likes to find out if a pointer passed to it is a pointer (noneditable) or a handle (editable). There were also a few in PACE, since m68k had a SysTrap to determine the kind of pointer, which PACE implemented internally. Yes, this is no longer “unmodified PalmOS”, but this is only temporary, so I am willing to live with it! But, you might ask, didn’t you also say that ROM and RAM both need to be below 0x80000000? If we invert the condition, we need them both above. But flash is at 0x08000000… Oops. Yup, we cannot use flash anymore. I changed the RAM layout again, carving out 2MB at 0xD0600000 to be the fake “ROM”, and I copy the flash to it at boot. It works!
Tales of more PalmOS reverse engineering
SD-card Support
Luckily, I had written a slot driver for PalmOS before, so writing an SD card driver was not hard. In fact, I reused some PowerSDHC source code! rePalm supports SD cards now on the STM32F469 dev board. On the STM32F429 board, they are also supported, but since the board lacks a slot, you need to wire them up yourself (CLK -> C12, CMD -> D2, DAT_0 -> C8). Due to how the board is already wired, only a one-bit-wide bus will work (DAT_1 and DAT_2 are used for other things and cannot be remapped to other pins), so that limits the speed. Also, since your wires will be long and floppy, the maximum speed will be limited too. This means that on the STM32F429 the speed is about 4Mbit/sec. On the STM32F469 board the speed is a much more respectable 37Mbit/sec. Higher speeds could be reached with DMA, but this is good enough for now. While writing the SD card support for the STM32F4 chips, I found a hardware bug, one that was very hard to debug. The summary is this: the SD bus allows the host to stop the clock anytime. So the controller has a function to stop it whenever it is not sending commands or sending/receiving data. Good so far. But the data lines can also be used to signal that the card is busy. Specifically, the DAT_0 line is used for that. The problem is that most cards use the clock line as a reference as to when they can change the state of the DAT lines. This means that if you do something that the card will be busy after, like a write, and then shut down the clock, the card will keep the DAT_0 line low forever, since it is waiting for the clock to tick to raise it. “So,” you will ask, “why not enable clock auto-stopping except for this one command?” It does not work, since clock auto-stopping cannot be easily flipped on and off. Somehow it confuses the module’s internal state machine if it is flipped while the clock is running. So, why stop the clock at all? Minor power savings. Definitely not enough to warrant this mess, so I just disabled the auto-stopping function. A week to debug, and a one-line fix! The slot driver can be seen in the “slot_driver_stm32” directory.
Serial Port Support
Palm Inc did document how to write a serial port driver for PalmOS 4. There were two types: virtual drivers and serial drivers. The former was for ports that were not hardwired to the external world (like the port connected to the Bluetooth chip or the infrared port), and the latter for ports that were (like the cradle serial port). PalmOS 5 merged the two types into a unified “virtual” type. Sadly, this was not documented. It borrowed from both port types in PalmOS 4. I had to reverse engineer the OS for a long time to figure it out. I produced a working theory of how this works on PalmOS 5, and you can see it in the “vdrvV5.h” include file. This information is enough to produce a working driver for a serial port, IrDA SIR port, and USB for HotSync purposes.
Actually making the serial port work on the STM32F4 hardware was a bit hard. The hardware has only a single one-byte buffer. This means that to not lose any received data at high data rates, one needs to use hardware flow control or make the serial port interrupt the highest priority and hope for the best. This was unacceptable to me. I decided to use DMA. This was a fun chance to write my first PalmOS 5 library that can be used by other libraries. I wrote a DMA library for STM32F4-series chips. The code is in the “dma_driver_stm32” directory. With this, one would think that all would be easy. No. DMA needs to know how many bytes you expect to receive. In the case of generic UART data receive, we do not know this. So how do we solve this? With cleverness. DMA can interrupt us when half of a transfer is done, and again when it is all done. DMA can also be circular (restart from the beginning when done). This gets us almost as far as we need to go. Basically, as long as data keeps arriving, we will keep getting one of these interrupts, and then the other, in order. In our interrupt handler, we just need to see how far into the buffer we are, and report the bytes since the last time we checked as new data. As long as our buffer is big enough that it does not overflow in the time it takes us to handle these interrupts, we are all set, right? Not quite. What if we get just one byte? This is less than half a transfer, so we will never get an interrupt at all, and thus will never report this to the clients. This is unacceptable. How do we fix it? The STM32F4 UART has an “IDLE detect” mode. This will interrupt us if, after a byte has been RXed, 4 bit times have expired with no further character starting. This is basically just what we need. If we wire this interrupt to our earlier handling code for the circular buffer, we will always be able to receive data as fast as it comes, no matter the sizes. Cool! The serial driver I produced does this, and can be seen in the “uart_driver_stm32” directory. I was able to successfully HotSync over it! IrDA is supported too. It works well. See the photo album for a video demo!
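The bookkeeping behind this scheme can be sketched in portable C. This is only an illustration, not the code from “uart_driver_stm32”: the type names, the `remaining` parameter (standing in for whatever count of not-yet-transferred items the DMA controller exposes), and the demo collector are all assumptions. The same service routine would be called from the half-transfer, transfer-complete, and IDLE interrupts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RX_BUF_SIZE 64u

struct RxRing {
    uint8_t  buf[RX_BUF_SIZE]; /* circular DMA target buffer */
    uint32_t lastPos;          /* first byte not yet reported */
};

typedef void (*RxSink)(const uint8_t *data, size_t len, void *ctx);

/* Report every byte written since the previous call.
 * remaining: items the DMA still has to write before wrapping. */
void rx_ring_service(struct RxRing *r, uint32_t remaining,
                     RxSink sink, void *ctx)
{
    uint32_t writePos = (RX_BUF_SIZE - remaining) % RX_BUF_SIZE;

    if (writePos == r->lastPos)
        return;                                   /* nothing new */
    if (writePos > r->lastPos) {
        sink(r->buf + r->lastPos, writePos - r->lastPos, ctx);
    } else {                                      /* DMA wrapped around */
        sink(r->buf + r->lastPos, RX_BUF_SIZE - r->lastPos, ctx);
        if (writePos != 0)
            sink(r->buf, writePos, ctx);
    }
    r->lastPos = writePos;
}

/* Demo sink that appends reported bytes to a flat array. */
struct Collector {
    uint8_t data[256];
    size_t  len;
};

void collect(const uint8_t *data, size_t len, void *ctx)
{
    struct Collector *c = ctx;

    assert(c->len + len <= sizeof c->data);
    memcpy(c->data + c->len, data, len);
    c->len += len;
}
```

Because the routine only compares positions, it cannot tell an empty interval from a full lap of the buffer; the half-transfer and transfer-complete interrupts are what keep it from ever falling a full lap behind.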
Yes, you can try it!
If you want to try it, on the STM32F429 discovery board, the “RX” unpopulated 0.1 inch hole is the STM32’s transmit (yes, I know, weird label for a transmit pin). B7 is the STM32’s receive pin. If you connect a USB-to-serial adapter there, you can HotSync over serial. If you instead connect an IrDA SIR transceiver there, you will get working IR. I used the MiniSIR2 transceiver from Novalog, Inc. It is the same one that most Palm devices use.
Vibrate & LED support
Adding vibration and LED support was never documented, since those are hardware features that vendors handle. Luckily, I had reverse engineered this a long time ago, when I was adding vibration support to T|X. Turns out that I almost got it all right back then. A bit more reverse engineering yielded a complete picture of the correct API. The LED follows the same API as the vibrator: one “GetAttributes” function and one “SetAttributes” function. The settable things are the pattern, speed, delay in between repetitions, and number of repetitions. The OS uses them as needed and automatically adds “Vibrate” and “LED” settings to the “Sounds and Alerts” preferences panel if it notices the hardware is supported. And rePalm now supports both! The code is in “halVibAndLed.c”, feel free to peruse it at your leisure.
Networking support (WIP)
False starts
I really wanted to add support for networking to rePalm. There were a few ways I could think of to do that, such that all existing apps would work. One could simply replace Net.lib with one with a similar interface but controlled by me. I could then wire it up to any interface I wanted to, and all would be magical. This is a poor approach. To start with, while large parts of Net.lib are documented, there are many parts that are not. Having to figure them out would be hard, and proving correctness and staying bug-compatible even more so. Then there is the issue of wanting to run an unmodified PalmOS. Replacing random libraries diminishes the ability to claim that. No, this approach would not work. The next possibility was to make a fake serial interface, and tell PalmOS to connect via it, via SLIP or PPP, to a fake remote machine. The other end of this serial port could go to a thread that talks to our actual network interface. This can be made to work. There would be the overhead of encoding and decoding PPP/SLIP frames, and the UI would be confusing and all wrong. Also, I would need to find ways to make the config UI. This is also quite a mess. But at least this mess is achievable. But maybe there is a better approach?
The scary way forward
Conceptually, there is a better approach. PalmOS’s Net.lib supports pluggable network interfaces (I call one a NetIF driver). You can see a few on all PalmOS devices: PPP, SLIP, Loopback. Some devices also have one for WiFi or Cellular. So all I have to do is produce a NetIF driver. Sounds simple enough, no? Just as you would expect, the answer is a strong, resounding, and unequivocal “no!” Writing NetIF drivers was never documented. And a network interface is a lot harder than a serial port driver (which was the previous plug-in driver interface of PalmOS that I had reverse engineered). Reverse engineering this would be hard.
Those who study history…
I started with some PalmOS 4.x devices and looked at the SLIP/PPP/Loopback NetIF drivers. Why? Like I had mentioned earlier, on 68k, the compiler tends to leave function names around in the binary unless that is turned off. This is a huge help in reverse engineering. Now, do not let this fool you: function names alone are not that much help. You still need to guess structure formats, parameters, etc. Thus, even though Net.lib and the NetIF driver interface both changed between PalmOS 4.x and PalmOS 5.x, figuring out how NetIF drivers worked in PalmOS 4.x would still provide some foundational knowledge. It took a few weeks until I thought I had that knowledge. Then I asked myself: “Was there a PalmOS 4.x device with WiFi?” Hm… There was. The Alphasmart Dana Wireless had WiFi. Now that I thought I had a grip on the basics of how these NetIF drivers worked, it was time to look at a more complex one, since PPP, SLIP, and Loopback are all very simple. Sadly, Alphasmart’s developers knew how to turn off the insertion of function names into the binary. Their WiFi driver was still helpful, but it took weeks of massaging to make sense of it. It is roughly at this point that I realized that Net.lib had many versions and I had to look at others. I ended up disassembling each version of Net.lib that existed to see the evolution of the NetIF driver interface and of Net.lib itself. Thus I looked at the Palm V’s version, the Palm Vx’s, the Palm m505’s, and the Dana’s. The most interesting changes were in v9, where support for ARP & DHCP was merged into Net.lib, while previously each NetIF driver that needed these embedded its own logic for them.
On to OS 5’s Net.lib
This was all well and good, but I was not really in this to understand how NetIF drivers worked in PalmOS 4.x. The time had come to move on to reverse engineering how PalmOS 5.x did it. I grabbed a copy of Net.lib from the T|T3, and started tracing out its functions, matching them up to their PalmOS 4.x equivalents. It took a few more weeks, but I more or less understood how PalmOS 5.x Net.lib worked.
I found a bug!
Along the way I found an actual bug: a use-after-free in arp_close()
NETLIB_T3:0001F580 CMP R4, #0 ; Linked list is empty?
NETLIB_T3:0001F584 BEQ loc_1F5A4 ; if so, just skip this entire thing
NETLIB_T3:0001F588 B loc_1F590 ; else go free it one-by-one
NETLIB_T3:0001F58C
NETLIB_T3:0001F58C loc_1F58C:
NETLIB_T3:0001F58C BEQ loc_1F598 ; this instr here is harmless, but makes no sense! We only get here on “NE” condition
NETLIB_T3:0001F590
NETLIB_T3:0001F590 loc_1F590:
NETLIB_T3:0001F590 MOV R0, R4 ; free the node
NETLIB_T3:0001F594 BL MemChunkFree ; after this, memory pointed to by R4 is invalid (freed)
NETLIB_T3:0001F598
NETLIB_T3:0001F598 loc_1F598:
NETLIB_T3:0001F598 LDR R4, [R4] ; load “->next” from now-invalid memory…
NETLIB_T3:0001F59C CMP R4, #0 ; see if it is NULL
NETLIB_T3:0001F5A0 BNE loc_1F58C ; and if not, loop to free that node too
NETLIB_T3:0001F5A4 loc_1F5A4:
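For contrast, here is a minimal C sketch of the same loop without the use-after-free: the link is read before the node is released. The node type and function names are invented for illustration, and standard malloc/free stand in for the PalmOS memory manager; this is not a reconstruction of the original Net.lib source.

```c
#include <stdlib.h>

/* Hypothetical node; "next" is the first field, matching LDR R4, [R4]. */
struct ArpNode {
    struct ArpNode *next;
    int             payload;
};

/* Demo helper: push a new node onto the front of the list. */
struct ArpNode *arp_list_push(struct ArpNode *head, int payload)
{
    struct ArpNode *n = malloc(sizeof *n);

    if (n == NULL)
        return head;
    n->next = head;
    n->payload = payload;
    return n;
}

/* Free every node; returns how many were freed. */
unsigned arp_list_free(struct ArpNode *head)
{
    unsigned freed = 0;

    while (head != NULL) {
        struct ArpNode *next = head->next; /* read the link BEFORE freeing */
        free(head);
        head = next;
        freed++;
    }
    return freed;
}
```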
Well, that was easy…
Then I started disassembling the PalmOS 5.x SLIP/PPP/Loopback NetIF drivers to see how they had changed from PalmOS 4.x. I assumed that nobody really changed their logic, so any changes I saw could be hints about what had changed in Net.lib and the NetIF structures between PalmOS 4.x and PalmOS 5.x. It turned out that not that much had changed. Structures got realigned, a few attribute values got changed, but otherwise it was pretty close. It is at this point that I congratulated myself, and decided to start writing my own NetIF driver to test my understanding.
NOT!
The self-congratulating did not last long. It turned out that in my notes I had marked a few things I had thought inconsequential as “to do: look into this later”. Well, it turns out that they were not inconsequential. For example: the callback from DHCP to the NetIF driver to notify it of DHCP status was NOT purely informative as I had thought, and in fact a large amount of logic has to exist inside it. That logic, in turn, touches the insides of the DhcpState structure, half of which I had not fully understood, since I had thought it was opaque to the NetIF driver. Damn. Well, back to IDA and more reverse engineering. At some point here, to understand what the various callbacks between Net.lib and the NetIF driver did, I realized that I needed to understand DHCP and ARP a lot better than I did. After sinking some hours into reading the DHCP and ARP RFCs, I dove back into the disassembled code. It all sort of made sense. I’ll summarize the rest of the story: it took another three weeks to document every structure and function that the ARP and DHCP code uses.
More reverse engineering
There was just one more thing left. As the NetIF driver comes up, it is expected to show UI and call back into Net.lib at various times. The different NetIF drivers I disassembled did this in very different ways, so I was not clear as to what the correct way to do this was. At this point I went to my archive of all the PalmOS ROMs, and wrote a tool to find all the files with the type neti (NetIF drivers have this type), skip all that are PPP, SLIP, or Loopback, and copy the rest to a folder, after deduplicating them. I then disassembled them all, producing diagrams and notes about how each brought itself up and down, where UI was shown or hidden, and when each step was taken. While doing this, I saw some (but not much) logging in some of those drivers, so I was able to replace my own names for various values and structs with the more accurate ones that the writers of those NetIF drivers had been kind enough to leak in their log statements. I ended up disassembling: Sony’s “CFEtherDriver” from the UX50, Hagiwara’s WiFi memorystick driver “HNTMSW_neti”, Janam’s “WLAN NetIF” from the XP30, Sony’s “CFEtherDriver” from the TH55, PalmOne’s “PxaWiFi” from the Tungsten C, PalmOne’s “WiFiLib” from the TX, and PalmOne’s “WiFiLib” from their WiFi SD card. Phew, that was a lot! Long story short: the reverse engineered NetIF interface is documented in “netIfaceV5.h”, and it is enough that I think a working NetIF driver can be written using it.
“You think?” you might ask, “have you not tested it?” Nope, I am still writing my NetIF driver, so stay tuned…
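The type filter used when collecting NetIF drivers from the ROM archive could look roughly like the sketch below. It is not the tool described above: the function name is invented, and it assumes each file is a standalone PRC whose header follows the usual layout (a 32-byte name followed by fixed fields, with the four-character type at byte offset 60 and the creator at offset 64).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PRC_TYPE_OFFSET 60u /* 32-byte name + 28 bytes of fixed fields */
#define PRC_HEADER_MIN  68u /* enough to cover the type and creator fields */

/* Returns true if the header's four-character type matches `type`. */
bool prc_has_type(const uint8_t *hdr, size_t len, const char *type)
{
    if (hdr == NULL || type == NULL || len < PRC_HEADER_MIN)
        return false;
    return memcmp(hdr + PRC_TYPE_OFFSET, type, 4) == 0;
}
```

A scanner would read the first bytes of each candidate file and keep those for which `prc_has_type(hdr, len, "neti")` holds; how PPP, SLIP, and Loopback were recognized and skipped is not stated in the text, so that step is left out here.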
1.5 density support
Density basics
PalmOS has supported multiple screen densities since version 4.2. That is to say that one could have a device with a screen of the same size, but with more pixels in it, and still see things rendered at the same size, just with more detail. Sony did have high-res screens before Palm, and HandEra did before both of them, but Palm’s solution was the first OS-scale one, so that is the one that PalmOS 5 used. The idea is simple. Each Bitmap/Window/Font/etc has a coordinate system associated with it, and all operations use that to decide how to scale things. 160×160 screens were termed 72ppi (no relation to actual points or inches), and the new 320×320 ones were 144ppi (double density). This made life easy: when the correct-density image/font/etc was missing, one could pixel-double the low-res one. The reverse worked too. Pen coordinates also had to be adjusted, of course, since the developer could now request to work in a specific coordinate system, and the entire system API then had to honor it.
How was this implemented? A few coordinate systems are always in play: native (what the display is), standard (UI layout uses this), and active (what the user set using WinSetCoordinateSystem). So given three systems, there are at any point in time 6 scaling factors to convert from any one to any other. PalmOS 5.0 used just one. This was messy and we will not talk about it further. Let’s just say this solution did not stick. PalmOS 5.2 and later use 4 scaling factors, representing bidirectional transforms between active and native, and between native and standard. Why not the third pair? It is used rarely enough that doing two transformations is OK. Since floating-point math is slow on ARMv5, fixed-point numbers are used. Here there is a difference between PalmOS 5.2 and PalmOS 5.4. The former uses 16-bit fixed-point numbers in 10.6 format, the latter uses 32-bit numbers in 16.16 format. I’ll let you read up on fixed-point numbers on your own time, but the crux of the matter is that the number of fraction bits limits the precision of the number itself and of the math you can do with it. Now, for exact powers of two, one does not need that many bits, so while there were only 72ppi and 144ppi screens, 10.6 was good enough, with scale factors always being 0x20 (x0.5), 0x40 (x1.0), and 0x80 (x2.0). PalmOS 5.4 added support for one-and-a-half density due to the overabundance of cheap 320×240 displays at the time. This new resolution was specified as 108ppi, or exactly 1.5 times the standard resolution. Technically everything in PalmOS 5.2 will work as is, and if you give PalmOS 5.2 such a screen, it will more or less sort of work. To the right you can see what that looks like. Yes, not pretty. But it does not crash, and things sort of work as you would expect. So why does it look like crap? Well, that scaling thing. Let’s see what scale factors we might need now. First of all, PalmOS will not ever scale between 108 and 144ppi for bitmaps or fonts, so those scale factors are not necessary (rePalm will in one special case: to draw 144ppi bitmaps on a 108ppi screen, when no 72ppi or 108ppi bitmap is available). So the only new scale factors introduced are between standard and 1.5 densities. From standard to 108ppi the scale factor is 1.5, which is representable as 0x60 in 10.6 fixed-point format. So far so good: that is exact, and the math will work perfectly every time. But from 108ppi to 72ppi the scale factor is 2/3, which is NOT representable exactly in binary (no matter how many bits of precision you have). The simple rule with fixed-point math is that when your numbers are not representable exactly, your rounding errors will accumulate to more than one once the values you operate on are greater than one over your LSB. So for 10.6, the LSB is 1/64, so once we start working with numbers over 64, rounding will produce errors of over one. This is a problem, since PalmOS routinely works with numbers over 64 when doing UI. Hell, the screen’s standard-density width is 160. Oops… These accumulated rounding errors are what you see in that screenshot. Off by one here, off by one there, and they add up to that mess. 108ppi density became officially supported in PalmOS 5.4. So what did they do to make it work? Switch to 16.16 format. The LSB there is 1/65536, so math on numbers up to 65536 will round correctly. This is good enough, since all of PalmOS UI uses 16-bit numbers for coordinates.
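The precision argument is easy to reproduce with a small C sketch. The helper names and the round-to-nearest policy are assumptions made for this example; the text does not say how PalmOS itself rounds, only which fixed-point formats it uses.

```c
#include <stdint.h>

/* Build a fixed-point factor for num/den with `fracBits` fraction bits,
 * rounded to the nearest representable value. */
uint32_t fx_from_ratio(uint32_t num, uint32_t den, unsigned fracBits)
{
    return (uint32_t)((((uint64_t)num << fracBits) + den / 2) / den);
}

/* Multiply a non-negative coordinate by a fixed-point factor,
 * rounding the result to the nearest integer. */
uint32_t fx_scale(uint32_t value, uint32_t factor, unsigned fracBits)
{
    uint64_t half = (uint64_t)1 << (fracBits - 1);

    return (uint32_t)(((uint64_t)value * factor + half) >> fracBits);
}
```

With 6 fraction bits, 3/2 becomes 0x60 exactly, so `fx_scale(160, 0x60, 6)` yields 240. Going the other way, 2/3 becomes 43/64, and `fx_scale(240, 43, 6)` yields 161 instead of 160: an off-by-one on a full screen width. With 16 fraction bits the factor is 43691/65536 and the same conversion yields 160.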
How does it all fall apart?
So why am I telling you all this? Well, PalmOS 5.4 has a few other things in it that make it undesirable for rePalm (rePalm can run PalmOS 5.4, but I am not interested in supporting it) due to NVFS, which is mandatory in 5.4. I wanted PalmOS 5.2 to work, but I also wanted 1.5 density support, since 320×240 screens are still quite cheap, and in fact my STM32F427 dev board sports one. We cannot just take Boot.prc from PalmOS 5.4 and move it over, since that also brings NVFS. So what to do? I decided to take an inventory of every part of the OS that uses these scaling values. They are hidden inside the “Window” structure, so mostly this was inside Boot. But there are other ways to fuck up. For example, in a few places in UI, sequences like this can be seen: BmpGetDensity(WinGetBitmap(WinGetDisplayWindow())). This is clearly a recipe for trouble, because code that was never written to see anything other than a 72 or a 144 as a reply is about to see a 108. Then again, some of that is harmless, if math is not being done with it. It can be quite bad, however, if it is used in math. I disassembled the Boot from a PalmOS 5.4 device (Treo 680) and one from a PalmOS 5.2 device (Tungsten T3). For each place I found in the T3 ROM that looked weird, I checked what the PalmOS 5.4 Boot did. That provided most of the places of worry. I then searched the PalmOS 5.4 ROM for any references to 0x6C, as that is 108 in hex, and a very unlikely constant to occur in code naturally for any other reason (luckily). I also looked at every single division to see if coordinate scaling was involved. This produced a complete list of all the places in the ROM that needed help. There were over 150…
How do we fix it?
Patching this many places is doable, but what if tomorrow I decide to use the Boot from another device? No, this was not a good solution. I opted instead to write an OEM extension (a module that the OS will load at boot no matter what) and fix things there. But how? If the ROM is read-only, and we do not have an MMU to map a page over the areas we want to fix, how do we fix them? Well, every such place is logically inside a function. And every function gets called by something. It may be called by a timer or a notification, be a thread, or be a part of what the user does. Luckily, PalmOS only expects UI work from the UI thread, so ALL of them were only called from user-facing functions. Sadly, some were buried quite deep. I got started writing replacement functions, basing them on what the Boot from PalmOS 5.4 did. For most functions I wrote full patches (that is, my patch completely replaces the original function in the dispatch table, never calling back to the original).
I wrote 73 of these: FntBaseLine, FntCharHeight, FntLineHeight, FntAverageCharWidth, FntDescenderHeight, FntCharWidth, FntWCharWidth, FntCharsWidth, FntWidthToOffset, FntCharsInWidth, FntLineWidth, FntWordWrap, FrmSetTitle, FrmCopyTitle, CtlEraseControl, CtlSetValue, CtlSetGraphics, CtlSetSliderValues, CtlHandleEvent, WinDrawRectangleFrame, WinEraseRectangleFrame, WinInvertRectangleFrame, WinPaintRectangleFrame, WinPaintRoundedRectangleFrame, WinDrawGrayRectangleFrame, WinDrawWindowFrame, WinDrawChar, WinPaintChar, WinDrawChars, WinEraseChars, WinPaintChars, WinInvertChars, WinDrawInvertedChars, WinDrawGrayLine, WinEraseLine, WinDrawLine, WinPaintLine, WinInvertLine, WinFillLine, WinPaintLines, WinGetPixel, WinGetPixelRGB, WinPaintRectangle, WinDrawRectangle, WinEraseRectangle, WinInvertRectangle, WinFillRectangle, WinPaintPixels, WinDisplayToWindowPt, WinWindowToDisplayPt, WinScaleCoord, WinUnscaleCoord, WinScalePoint, WinUnscalePoint, WinScaleRectangle,
WinUnscaleRectangle, WinGetWindowFrameRect, WinGetDrawWindowBounds, WinGetBounds, WinSetBounds, WinGetDisplayExtent, WinGetWindowExtent, WinGetClip, WinSetClip, WinClipRectangle, WinDrawBitmap, WinPaintBitmap, WinCopyRectangle, WinPaintTiledBitmap, WinCreateOffscreenWindow, WinSaveBits, WinRestoreBits,
WinInitializeWindow. A few things were a bit too messy to replace entirely. An example of that was PrvDrawControl, a function that makes up the guts of CtlDrawControl, but is also used in a number of other places, like event handling for controls. What to do? Well, I could replace all callers of it, FrmHandleEvent and CtlDrawControl, but that does not help, since PrvDrawControl itself has issues and is HUGE and complex. After tracing it very carefully, I realized that it only really cares about density in one special case: when drawing a frame of type 0x4004, in which case it sets the coordinate system to native, draws the frame manually, and then resets the coordinate system. So, what I did is set a special global before calling it if the frame type requested is that special one, and the frame drawing function, the one I had already rewritten (WinDrawRectangleFrame), then sees that flag and does this one special thing instead. The same had to be done for erasing frame type 0x4004, and the same approach was employed. The results? It worked!
There was one more complex case left: drawing a window title. It was buried deep inside FrmDrawForm, since a title is technically a type of form object. To intercept this without rewriting the entire function, before it runs, I converted the title object to a special kind of list object, and saved the original object in my globals. Why a list? FrmDrawForm will call LstDrawList on a list object, and will not peek inside. I then intercept LstDrawList and check for our magic pointer; if it matches, I draw the title, else I let the original LstDrawList function run. On the way out of FrmDrawForm, this is all undone. For the form title setting functions, I just replaced them, since they redraw the title manually, and I had already written a title drawing function. There was one small thing left: the little (i) icon on forms that have help associated with them. It looked bad when tapped. My title drawing function drew it perfectly, but the tap response was handled by FrmHandleEvent, another behemoth I did not want to replace. I looked at it, and saw that the handling of user taps on the help (i) icon was pretty early on. So, I duplicated that logic (and some that preceded it) in my patch for FrmHandleEvent and did not let the original function get that event. It worked perfectly! So thus we have 4 more partial patches: LstDrawList, FrmDrawForm, FrmHandleEvent, and CtlDrawControl.
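The “disguise the object, then intercept the drawing call” trick can be modeled in plain C with a function pointer standing in for a dispatch table slot. Everything below is hypothetical: the names, the single-slot table, and the counters exist only to make the control flow visible, and none of it is taken from the “Fix1.5DD” source.

```c
#include <stddef.h>

typedef void (*DrawListFn)(void *obj);

struct PatchState {
    DrawListFn origDrawList;   /* saved original table entry */
    void      *disguisedTitle; /* object we swapped in, or NULL */
    unsigned   titlesDrawn;    /* demo counters only */
    unsigned   listsDrawn;
};

struct PatchState gPatch;

/* Stand-in for the stock list drawing routine. */
void stock_draw_list(void *obj)
{
    (void)obj;
    gPatch.listsDrawn++;
}

/* Stand-in for a density-aware title drawing routine. */
void draw_title_replacement(void *obj)
{
    (void)obj;
    gPatch.titlesDrawn++;
}

/* Partial patch: handle the magic pointer, defer everything else. */
void patched_draw_list(void *obj)
{
    if (obj != NULL && obj == gPatch.disguisedTitle)
        draw_title_replacement(obj);
    else
        gPatch.origDrawList(obj);
}

/* Swap the table slot to point at the patch, remembering the original. */
void install_draw_list_patch(DrawListFn *tableSlot)
{
    gPatch.origDrawList = *tableSlot;
    *tableSlot = patched_draw_list;
}
```

The point of the design is that the large caller never needs to be rewritten: it keeps calling through the same slot, and only the one object marked beforehand takes the new path.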
And now, for some polish
Still, one thing was left to do: proper support for the 1.5 density feature set as defined by the SDK. So: I modified the DAL to allow me to patch functions that do not exist in the current OS version at all, since some new ones were added after 5.2 to make this feature set work: WinGetScalingMode and WinSetScalingMode. Then I modified PACE’s 68k dispatch handler for sysTrapHighDensityDispatch to handle the new 68k trap selectors HDSelectorWinSetScalingMode and HDSelectorWinGetScalingMode, letting the rest of the old ones be handled by PACE as they were. I also got a hold of 108ppi fonts and wrote some code to replace the system fonts with them, and I got a hold of 108ppi system images (like the alert icons) and made my extension put them in the right places.
The result? The system looks pretty good! There are still things left to patch, technically, and “main.c” in the “Fix1.5DD” folder has a comment listing them, but they are all minor and the system looks great as is. The “Fix1.5DD” extension is part of the source code that I am releasing with rePalm, and you can see the comparison “after” screenshot just above to the right. It is about 4000 lines of code, in 77 patches and a bit of glue and install logic.
Dynamic Input Area/Pen Input Manager Services support
DIA/PINS basics
PalmOS initially supported square screens. A few OEMs (Handera, Sony) did produce non-square screens, but this was not standard. Sony made quite a headway with their 320×480 Sony Clie devices. But their API was Sony-only and was not adopted by others. When PalmOS 5.2 added support for non-square screens, Palm made an API that they called PINS (or alternatively DIA or AIA). It was not as good as Sony’s API but it was official, and thus everyone migrated to it. Later Sony devices were forced to support it too. Why was it worse? Sony’s API was simple: collapse the dynamic input area, or bring it back. Enable or disable the button to do so. Easy. Palm’s API tries to be smart, with things like per-form policies, and a whole lot of mess. It also has the simple things: put the area down or up, or enable or disable the button. But all these settings get randomly mutated/erased anytime a new form comes onscreen, which makes it a huge pain! Well, in any case, that is the public API. How does it all work? In PalmOS 5.4, this is all part of the OS proper, and integrated into Boot.
How it works pre-Garnet
But, as I had said, I was targeting PalmOS 5.2. There, it was not a part of the OS, it was an extension. The DAL presents to the system a raw screen of whatever the actual resolution is (commonly 320×480) and the extension hides the bottom area from the apps and draws the dynamic input area on it. This requires interception of some OS calls, like FrmDrawForm (to apply the new policy), FrmSetActiveForm (to apply policy to re-activated already drawn forms), SysHandleEvent (to handle events in the dynamic input area), and UIReset (to reset the settings to defaults on app switching). There are also some things we want to be notified about, like a screen color depth change. When that happens, we may need to redraw the input area. That is the gist of it. There are a lot of small but significant specifics though.
The intricacies of writing a DIA implementation
Before embarking on writing my own DIA implementation, I tried all the existing ones to see if they could support resolutions other than 320×480. I do not want to write pointless code, after all. None of them worked well. Even such simple things as 160×240 (direct 2x downscaling) were broken. Screens with different aspect ratios like the common 240×320 and 160×220 were even more broken. Why? I guess nobody ever writes generic code. It is simpler to just hack things up for “now” with no plan for “later”. Well, I decided to write a DIA implementation that could support almost any resolution.
When the DIA is collapsed, a status bar is shown. It shows small icons like the home button and menu button, as well as the button to unhide the input area. I tried to make everything as generic as possible. For every screen resolution possible, one can make a skin. A skin is a set of graphics depicting the DIA, as well as some integers describing the areas on it, and how they act (what key codes they send, what they do). The specifics are described in the code and comments and samples (3 skins designed to look similar to Sony’s UIs). They also define a “notification tray” area. Any app can add icons there. Even normal 68k apps can! I am including an example of this too. The clock you see in the status bar is actually a 68k app called “NotifGeneral” and its source is provided as part of rePalm’s source code! My sample DIA skins currently support 320×480 in double-density, 240×320 in 1.5 density, and 160×220 in single density. The cool part? The same codebase supports all of these resolutions despite them having different aspect ratios. NotifGeneral also runs on all of those unmodified. Cool, huh? The source code for the DIA implementation is also published with rePalm, of course!
Audio support
PalmOS Audio basics
Since PalmOS 1.0, there was support for simple sound via a piezo speaker. That means simple beeps. The official API allows one to: play a MIDI file (one channel, square waves only), play a tone of a given frequency and amplitude (in background or in foreground), and stop the tone. In PalmOS 5.0, the low level API that backs this simple sound API is almost the same as the high-level official API. HALSoundPlay is used to start a tone for a given duration. The tone runs in the background, and the func itself returns immediately. If another tone had previously been started, it is replaced with the new one. A negative duration value means that the tone will never auto-stop. HALSoundOff stops a currently-playing tone, if there is one. HALPlaySmf plays a MIDI tune. This one is actually optional. If the DAL returns an error, Boot will interpret the MIDI file itself, and make a series of calls to HALSoundPlay. This means that unless you have special hardware that can play MIDI better than simple one-channel square waves, it makes no sense to implement HALPlaySmf in your DAL.
PalmOS sampled audio support
Around the time PalmOS 5.0 came out, the sampled sound API made an appearance. Technically it does not require PalmOS 5.0, but I am not aware of any PalmOS 4 device that implements this API. There were earlier vendor-specific audio APIs in older PalmOS releases, but they were nonstandard and generally relied on custom hardware accelerator chips, since the 68k processor is not really fast enough to decode any complex audio formats. The sampled sound API is obviously more complex than the simple sound API, but it is easily explained with the concept of streams. One can create an input or output stream, set volume and pan for it, and get a callback when data is available (input) or needed (output). For output streams, the system is expected to mix them together. That means that more than one audio stream may play at the same time and they should all be heard. The simple sound API should also work concurrently. PalmOS never really required support for more than one input stream, so at least that’s nice.
A stream (in or out) has a few immutable properties. The three most important ones are the sample rate, the channel number, and the sample format. The sample rate is basically how many samples per second there are. CD audio uses 44,100 per second, most DVDs use 48,000 per second, and cheap voice recorders use 8,000 (approximately telephone quality). PalmOS supports only two channel counts: 1 and 2. These are commonly called “mono” and “stereo”. The sample type describes how each sample is represented in the data stream. The PalmOS API documents the following sample types: signed and unsigned 8-bit values, signed 16-bit values of any endianness, signed 32-bit values of any endianness, and single-precision floating point values of any endianness. As far as I can tell, the only formats ever supported by actual devices were the 8 and 16-bit ones.
Why audio is hard & how PalmOS makes it easy
Mixing audio is hard. Doing it in good quality is harder, and doing it fast is harder yet. Why? The audio hardware can only output one stream, so you need to mix multiple streams into one. Mixing may involve format conversion, for example if the hardware needs signed 16-bit little-endian samples and one of the streams is in float format. Mixing almost certainly involves scaling, since each stream has a volume and may have a pan applied. And, hardest of all, mixing may involve resampling. If, for example, the hardware runs at 48,000 samples per second, and a client requested to play a stream with 44,100 samples per second, more samples are needed than are provided: one needs to generate more samples. This is all pretty simple to do if you have large buffers to work with, but that is also a bad idea, since that adds a lot of latency: the larger your buffer, the more time passes between the app providing audio data and the audio coming out of the speaker. In the audio world, you are forced to work with relatively small buffers. Users will also notice if you are late delivering audio samples to the hardware (they’ll hear it). This means that you are always on a very tight schedule when dealing with audio.
What do existing PalmOS DALs do to address all this difficulty? Mostly, they shamelessly cut corners. All existing DALs have a very bad resampler: it simply duplicates samples as needed to upsample (convert audio to a higher sampling rate), and drops samples as needed to downsample (convert audio to a lower sampling rate). Why is this bad? Well, when resampling in this manner between sample rates that are close to each other, this method will introduce noticeable artifacts. What about format conversions? Well, only supporting four formats is pretty easy: the mixing code was duplicated four times in the DAL, once for each format.
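For illustration, the duplicate-or-drop resampling described above boils down to picking, for every output sample, the nearest earlier input sample. The following is only a minimal sketch of that idea, not code taken from any real DAL; the function name and signature are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Naive resampler (hypothetical name): duplicates samples when going to a
 * higher rate and drops samples when going to a lower one.
 * Returns the number of output samples produced. */
size_t resample_nearest(const int16_t *in, size_t n_in, uint32_t rate_in,
                        int16_t *out, size_t max_out, uint32_t rate_out)
{
    /* how many output samples the input covers */
    size_t n_out = (size_t)(((uint64_t)n_in * rate_out) / rate_in);

    if (n_out > max_out)
        n_out = max_out;

    for (size_t i = 0; i < n_out; i++) {
        /* index of the input sample that is "playing" at output time i */
        size_t src = (size_t)(((uint64_t)i * rate_in) / rate_out);
        out[i] = in[src];
    }
    return n_out;
}
```

Going from 8,000 to 16,000 samples per second this simply doubles every sample. Going from 44,100 to 48,000 it doubles roughly one sample in eleven, and that irregular pattern is where the audible artifacts come from.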
How rePalm does audio mixing
I wanted rePalm to produce good audio quality, and I wanted to support all the formats that the PalmOS API claimed were supported. Actually, I ended up supporting even more formats: signed and unsigned 8, 16, and 32-bit integer, as well as single-precision floating-point samples in any endianness. For sample rates, rePalm’s mixer supports: 8,000, 11,025, 16,000, 22,050, 24,000, 32,000, 44,100, and 48,000 samples per second. The format the output hardware uses is decided by the hardware driver at runtime in rePalm. Mono and stereo hardware is supported, any sample rate is supported, and any sample format is supported for native hardware output. If you now consider the matrix of all the possible stream input and output formats, sample rates, and channel numbers, you will see that it is a very large matrix. Obviously the PalmOS approach of duplicating the code four times will not work, since we’d need to duplicate it hundreds or thousands of times. The alternative approach of using generic code that switches based on the types is too slow (the switching logic simply wastes too many cycles per sample). No easy solutions here. But before we even get to resampling and mixing, we need to work out how to deal with buffering.
The initial approach involved each channel having a single circular buffer that the client would write and the mixer would read. This turned out to be too difficult to manage in assembly. Why in assembly? We’ll get to that soon. The final approach I settled on was actually simpler to manage. Each stream has a few buffers (buffer depth is currently defined to be four), and after any buffer is 100% filled, it is sent to the mixer. If there are no free buffers, the client blocks (as PalmOS expects). If the mixer has no buffers for a stream, the stream does not play, as the PalmOS API specifies. This setup is easy to manage from both sides, since the mixer now never has to deal with partially-filled buffers or sort out the circular-buffer wraparound criteria. A semaphore is used to block the client conveniently when there are no buffers to fill. “But,” you might ask, “what if the client does not give a full buffer’s worth of data?” Well, we do not care. Eventually, if the client wants the audio to play, they’ll have to give us more samples. And in any case, remember how above we discussed that we have to use small buffers? Any useful audio will be big enough to fill at least a few buffers.
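The bookkeeping for such a per-stream set of buffers can be very small. Below is a rough sketch, assuming a fixed depth of four and tracking only which slots hold full buffers. All names are hypothetical, the sample storage itself is left out, and where this sketch returns -1 to the client the real code would block on a semaphore. For brevity a slot is treated as free as soon as the mixer takes it, whereas a real mixer would release it only after consuming the data.

```c
#include <stdint.h>

#define STREAM_BUF_DEPTH 4   /* matches the depth mentioned in the text */

struct stream_queue {
    uint8_t head;    /* slot holding the oldest full buffer */
    uint8_t count;   /* number of full buffers waiting for the mixer */
};

/* Client side: hand one completely filled buffer to the mixer.
 * Returns the slot used, or -1 if every slot is already full
 * (this is where a real client would block instead). */
int stream_queue_push(struct stream_queue *q)
{
    if (q->count == STREAM_BUF_DEPTH)
        return -1;

    int slot = (q->head + q->count) % STREAM_BUF_DEPTH;
    q->count++;
    return slot;
}

/* Mixer side: take the oldest full buffer.
 * Returns its slot, or -1 if there is none (the stream stays silent). */
int stream_queue_pop(struct stream_queue *q)
{
    if (q->count == 0)
        return -1;

    int slot = q->head;
    q->head = (uint8_t)((q->head + 1) % STREAM_BUF_DEPTH);
    q->count--;
    return slot;
}
```

Because only whole buffers ever change hands, neither side needs to reason about partially filled buffers or wraparound inside a buffer.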
One must not forget that supporting the sampled sound API does not absolve you from having to support the simple sound functions. rePalm creates a sound stream for simple sound support, and uses it to play the required tones. They are generated from an interpolated sine wave at request time. To support doing this without any pesky callbacks, the mixer supports special “looped” channels. This means that once the data buffer is filled, it is played repeatedly until stopped. Since at least one full wave must fit into the buffer, rePalm refuses to play any tones under 20Hz. This is acceptable to me.
How do assembly and audio mix?
The problem of resampling, mixing, and format conversion loomed large over me. The naive approach of taking a sample from each stream, mixing it into the output stream, and then doing the same for the next stream is too slow, because of the constant “switch”ing required based on sample types and sample rates. Resampling is also complex if done in good (or at least passable) quality. So what does rePalm’s DAL do? For resampling, a large number of tables are used. For upsampling, a table tells us how to linearly interpolate between input samples to produce output samples. One such carefully-tuned table exists for each pair of frequencies. For downsampling, a table tells us how many samples to average and at what weight. One such table exists for each pair of frequencies. Both of these approaches are strictly better than what PalmOS does. But, if mixing was already hard, now we just made it harder. Let’s try to split it into chewable chunks. First, we need an intermediate format: a format we can work with efficiently and quickly, without serious data loss. I picked signed 32-bit fixed point with 8 integer bits and 24 fraction bits. Since no PalmOS device ever produced audio at more than 24-bit resolution, this is acceptable. The flow is conceptually simple: first zero-fill an intermediate buffer. Then, for each stream for which we have buffers of data, mix said buffer(s) into the intermediate buffer, with resampling as needed. Then clip the intermediate buffer’s samples, since mixing two loud streams can produce values over the maximum allowed. And, finally, convert the intermediate buffer into the format the hardware supports, and hand it off to the hardware. rePalm does not bother with a stereo intermediate buffer if the audio hardware is mono only. The intermediate buffer is only in stereo if the hardware is! How do we get this much flexibility? Because of how we mix things into it.
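To make the intermediate format concrete, here is a small sketch of the accumulate-then-clip flow in plain C, using signed 16-bit input and no resampling. This is not rePalm’s generated code, just an illustration under stated assumptions: all names are invented, and the linear interpolation helper takes its fraction as an argument where the real mixer reads weights from its per-rate-pair tables.

```c
#include <stddef.h>
#include <stdint.h>

#define Q24_ONE (1 << 24)   /* 1.0 in signed 8.24 fixed point */

/* Full-scale signed 16-bit maps onto roughly [-1.0, 1.0) in 8.24. */
static inline int32_t s16_to_q24(int16_t s)
{
    return (int32_t)s * 512;
}

/* Linear interpolation between two 8.24 samples; frac runs 0..Q24_ONE. */
static inline int32_t lerp_q24(int32_t a, int32_t b, int32_t frac)
{
    return a + (int32_t)(((int64_t)(b - a) * frac) / Q24_ONE);
}

/* Add one stream's samples into the (previously zeroed) intermediate buffer. */
void mix_add_s16(int32_t *acc, const int16_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        acc[i] += s16_to_q24(src[i]);
}

/* Clip the summed samples back into the representable output range. */
void clip_q24(int32_t *acc, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (acc[i] > Q24_ONE - 1)
            acc[i] = Q24_ONE - 1;
        else if (acc[i] < -Q24_ONE)
            acc[i] = -Q24_ONE;
    }
}
```

The 8 integer bits act as headroom: well over a hundred full-scale streams can be summed before the 32-bit accumulator could overflow, so clipping can safely be left until the very end.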
The only hard part from above is that “mix buffers into the intermediate buffer with resampling” step. In fact, not only do we need to resample, but we also need to apply volume, pan, and possibly convert from mono to stereo or from stereo to mono. The most optimal approach is to write a custom well-tuned mix function for every possible combination of inputs and outputs. The number of combinations is dizzying. Input has 8 possible rates, 2 possible channel configs, and 12 possible sample types. Output has 8 possible rates and 2 possible channel configs. This means that there is a total of just over 3,000 combinations (8 * 2 * 12 * 8 * 2 = 3,072). I was not going to write 3,072 functions by hand. In fact, even auto-generating them at build time (if I were to somehow do that) would bloat the code size of rePalm’s DAL to megabytes. No, another approach was needed.
I decided that I could reuse some things I learned while I was writing the JIT, and also reuse some of its code. That’s right! When you create a stream, a custom mix function is created just for that stream’s configuration, and for your hardware’s output configuration. This custom assembly code uses all the registers optimally and, in fact, it manages to use no stack at all! The benefit is clear! The mixing code is always optimal since it is custom for your configuration. For example, if the hardware only supports mono output, the mixing code will downmix before upsampling (to do it to fewer samples), but will only downmix after downsampling (once again, so less math is needed). Since there are three major cases (upsampling, downsampling, and no resampling), there are three paths through the codegen to produce mix functions. Each mix function matches a very simple prototype: int32_t* (*MixInF)(int32_t* dst, const void** srcP, uint32_t maxOutSamples, void* resampleStateP, uint32_t volumeL, uint32_t volumeR, uint32_t numInSamples). It returns the pointer to the first intermediate buffer sample NOT written. srcP is updated to point to the first input audio sample not consumed, maxOutSamples limits how many audio samples may be produced, and numInSamples limits how many audio samples may be consumed. Mix functions return when either limit is reached. Resampling logic may have long-lived state, so that is stored in a per-stream data structure (5 words), and passed in as resampleStateP. The actual resample table pointer is encoded in the function itself (for speed), since it will never change. Why? Because the stream’s sample rate is constant, and the hardware will not magically grow the ability to play at another sample rate at a later time.
The stream’s volume and pan, however, may be changed anytime, so they are not hardcoded into the function body. They are provided as parameters at mixing time. I actually considered hardcoding them in, and re-generating the mix function anytime the volume or pan changed, but the gain would have been too small to matter, so I decided against it. Instead we simply pre-calculate “left volume” and “right volume” from the user settings of “volume” and “pan” and pass them to the mix function.
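As a point of reference, here is what one of those functions could look like if written by hand in C for the simplest case: signed 16-bit mono in, mono out, no resampling. The MixInF prototype is the one quoted above. Everything else is an assumption made for this sketch: the volume scale (1024 meaning unity gain), the pan range (-1024 for full left to 1024 for full right), the linear pan law, and the choice to average the two channel volumes for mono output. rePalm’s generated assembly may well differ on all of these.

```c
#include <stdint.h>

/* Prototype as given in the text. */
typedef int32_t *(*MixInF)(int32_t *dst, const void **srcP, uint32_t maxOutSamples,
                           void *resampleStateP, uint32_t volumeL, uint32_t volumeR,
                           uint32_t numInSamples);

#define VOL_UNITY 1024   /* assumed scale: 1024 leaves the sample unchanged */

/* Hand-written stand-in for one generated function:
 * signed 16-bit mono in, mono 8.24 out, no resampling. */
int32_t *mix_s16_mono_to_mono(int32_t *dst, const void **srcP, uint32_t maxOutSamples,
                              void *resampleStateP, uint32_t volumeL, uint32_t volumeR,
                              uint32_t numInSamples)
{
    const int16_t *src = (const int16_t *)*srcP;
    uint32_t n = maxOutSamples < numInSamples ? maxOutSamples : numInSamples;
    /* assumption: for mono output, use the average of both channel volumes */
    int64_t vol = ((int64_t)volumeL + volumeR) / 2;

    (void)resampleStateP;   /* no state needed when not resampling */

    for (uint32_t i = 0; i < n; i++)
        *dst++ += (int32_t)(((int64_t)*src++ * 512 * vol) / VOL_UNITY);

    *srcP = src;            /* first input sample not consumed */
    return dst;             /* first intermediate sample not written */
}

/* Pre-calculate per-channel volumes from the user's volume and pan.
 * Assumed pan law: panning towards one side linearly quiets the other. */
void volume_pan_to_lr(uint32_t volume, int32_t pan, uint32_t *volL, uint32_t *volR)
{
    uint32_t keepL = (uint32_t)(pan > 0 ? 1024 - pan : 1024);
    uint32_t keepR = (uint32_t)(pan < 0 ? 1024 + pan : 1024);

    *volL = (uint32_t)(((uint64_t)volume * keepL) / 1024);
    *volR = (uint32_t)(((uint64_t)volume * keepR) / 1024);
}
```

The generated versions do the same job, only with the format conversion, channel handling, and resampling specialised for one exact configuration instead of being chosen at run time.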
Having a mix function that good makes the rest of the mixer easy. Simply: call the mix function for each non-paused stream as long as there are buffers to consume and the output buffer is not full. If we fully consume a buffer, release it to the client. If not, just remember for later how many samples in there we have not yet used. That’s all! So does all this over-complex machinery work? Yes it does! The audio mixer is about 1,500 lines, BUT it can resample and mix streams in realtime at under 3 million cycles per stream per second, which is much better than PalmOS did, and with better quality to boot! The code is in “audio.c”.
rePalm’s audio hw driver architecture
rePalm’s audio hardware layer is very simple. For simple sound support, one just provides the funcs for that and the sound layer calls them directly. For sampled audio, the audio init function tells the audio mixer the native channel number and sample rate. What about native sample format? The code provides an inline function to convert a sample from the mixer’s intermediate format (8.24 signed integer) to whatever format the hardware needs. Thus, the hardware’s native sample format is defined by this inline function. At init time the hw layer provides to the mixer all this info, as well as the size of the hardware audio buffer. This buffer is needed since interrupts have latency and we need the audio hw to always have some audio to play.
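As a rough idea of what such a conversion hook might look like, here is a sketch for hardware that takes signed 16-bit samples. The function name is hypothetical, and the clamp is purely defensive, since the mixer is described as clipping before conversion.

```c
#include <stdint.h>

#define Q24_ONE (1 << 24)   /* 1.0 in the mixer's signed 8.24 format */

/* Hypothetical per-board hook: convert one 8.24 mixer sample into the
 * format the audio hardware takes, here signed 16-bit. */
static inline int16_t hw_sample_from_q24(int32_t s)
{
    if (s > Q24_ONE - 1)
        s = Q24_ONE - 1;
    if (s < -Q24_ONE)
        s = -Q24_ONE;

    /* 8.24 full scale is 2^24, 16-bit full scale is 2^15: divide by 2^9 */
    return (int16_t)(s / 512);
}
```

Swapping this one function is, per the description above, all it takes to retarget the mixer output to a different native sample format.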
On the STM32F429 board, audio output is on pin A5. The audio is generated using a PWM channel, running at 48,000 samples per second, in mono mode. Since the PWM clock runs at 192MHz, if we want to output 48,000 samples per second, the PWM unit will only be able to count to 4000. Yes, indeed, for this board, since it lacks any real audio output hardware, we are stuck with just about 12-bit precision. This is good enough for testing purposes and actually does not sound all that bad. The single-ended output directly from the pin of the microcontroller cannot provide much power, but with a small speaker, the sound is clear and sounds great! I will upload an image with audio support soon.
On reSpring, the CPU clock (and thus PWM clock) is at 196.6MHz. Why this weird frequency? Because it is precisely 48,000 x 4096. This allows us to not have to scale audio in a complex fashion, like we do on the STM32F429 board. Just saturating it to 12 bits will work. Also, on reSpring, two pins are used to output audio, in opposite polarity. This gives us twice the voltage swing, producing louder sounds.
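The difference between the two boards shows up directly in the sample-to-duty-cycle conversion. The sketch below is an assumption about how that conversion could be written, with invented function names: with a period of 4000 counts a multiply is needed, while with a period of 4096 counts it is enough to keep the top 12 bits.

```c
#include <stdint.h>

#define Q24_ONE (1 << 24)   /* 1.0 in the mixer's signed 8.24 format */

/* Clamp an 8.24 sample to [-1.0, 1.0) and shift it to unsigned [0, 2^25). */
static inline uint32_t q24_to_unsigned(int32_t s)
{
    if (s > Q24_ONE - 1)
        s = Q24_ONE - 1;
    if (s < -Q24_ONE)
        s = -Q24_ONE;
    return (uint32_t)(s + Q24_ONE);
}

/* PWM period of 4000 counts (192 MHz / 48,000): scale into 0..3999. */
uint32_t pwm_duty_4000(int32_t s)
{
    return (uint32_t)(((uint64_t)q24_to_unsigned(s) * 4000) >> 25);
}

/* PWM period of 4096 counts (196.608 MHz / 48,000): keep the top 12 bits. */
uint32_t pwm_duty_4096(int32_t s)
{
    return q24_to_unsigned(s) >> 13;
}
```

A silent sample (zero) lands on a 50% duty cycle in both cases: 2000 out of 4000, or 2048 out of 4096.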
Microphone
I did not implement a mixer/resampler for the microphone. PalmOS never supported more than one user of a microphone at a time, and no apps will do so, so why bother? Instead, whichever sampling rate was requested, I pass that to the hardware driver and have it actually run at that sampling rate. As for sample type, same as for audio out, a custom function is generated to convert the sample format from the input (16-bit little-endian mono) to whatever the requested format was. The generated code is pretty tight and works well!
Zodiac support
Tapwave Zodiac primer
Tapwave Zodiac was a rather unusual PalmOS device released in 2003. It was designed for gaming and had some special hardware just for that: a landscape screen, an analog stick, a Yamaha Midi chip, and an ATI Imageon W4200 graphics accelerator with dedicated graphics RAM. There were a number of Tapwave-exclusive titles released that used the new hardware well, including some fancy 3D games. Of course this new hardware needed OS support. Tapwave introduced a number of new APIs, and, luckily, documented them quite well. The new API was pretty well designed and easy to follow. The documentation was almost perfect. Kudos, Tapwave! Of course, I wanted to support Tapwave games in rePalm.
The reverse engineering
Tapwave’s custom APIs were all exposed via a giant table of function pointers given to all Tapwave-targeting apps, after they pass the signature checks (Tapwave required approvals and app signing). But, of course, somewhere they had to go to some library or hardware. Digging in, it became clear that most of them go to the Tapwave Application Layer (TAL). This module is special, in that on the Zodiac, like the DAL, Boot, and UI, the TAL can be accessed directly off of R9 via LDR R12, [R9, #-16]; LDR PC, [R12, #4 * tal_func_no]. But, after spending a lot of time in the TAL, I realized that it was just a wrapper. All the other libraries were too: Tapwave Midi Library and Tapwave Multiplayer Library. All the special sauce was in the DAL. And, boy, was there a lot of special sauce. Normal PalmOS DALs have about 230 entrypoints. Tapwave’s has 373!
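For readers who do not read ARM assembly, the two-instruction sequence above is a double table lookup followed by a jump. Here is a hedged C rendering of the same idea. The stub functions and the table are invented purely for illustration, and the index of -4 assumes 4-byte pointers as on 32-bit ARM (a byte offset of -16).

```c
typedef int (*tal_func_t)(int);

/* Two made-up entry points, only here to populate the example table. */
static int tal_stub_negate(int x) { return -x; }
static int tal_stub_double(int x) { return 2 * x; }

static tal_func_t const tal_table[] = { tal_stub_negate, tal_stub_double };

/* C rendering of:
 *   LDR R12, [R9, #-16]                ; fetch the module's table pointer
 *   LDR PC,  [R12, #4 * tal_func_no]   ; jump through entry tal_func_no
 * With 4-byte pointers, byte offset -16 is four pointer slots below R9. */
int call_tal(void *const *r9, unsigned tal_func_no, int arg)
{
    const tal_func_t *table = (const tal_func_t *)r9[-4];

    return table[tal_func_no](arg);
}
```

The real sequence is a jump rather than a call, so the callee receives the caller’s registers untouched and returns straight to the original caller.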
Lots of tracing through the TAL, and a lot of trawling through the CPU docs got me the names and params to most of the extra exported DAL funcs. I was able to deduce what all but 14 functions do! And as for those 14: I could find no uses of any of them anywhere in the device’s software! The actual implementations underneath matter a bit less since I am just reimplementing them. My biggest worries were, of course, the graphics acceleration APIs. It turned out that that part was the easiest!
The “GPU”
Zodiac’s graphics accelerator was pretty fancy for a handheld device at the time, but it is also pretty basic. It has 8MB of memory built in, and accelerates only 2D operations. Basically, it can: copy rectangles of image data, blend rectangles between layers with constant or parametric alpha blending, do basic bilinear resizing, and draw lines, rectangles, and points. It operates only on 16-bit RGB565LE layers. This was actually pretty easy to implement. Of course doing this in software would not be fast, but for the purposes of my proof of concept, it was good enough. A few days of work, and … it works! A few games ran.
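To give a feel for the software fallback, here is a minimal sketch of constant-alpha blending of one RGB565 rectangle onto another. It is not rePalm’s code: the function names, the 0..255 alpha range, and the rounding are choices made for this example, and strides are given in pixels.

```c
#include <stddef.h>
#include <stdint.h>

/* Blend one RGB565 pixel over another; alpha 0 keeps dst, 255 gives src. */
uint16_t blend565(uint16_t dst, uint16_t src, uint8_t alpha)
{
    uint32_t inv = 255u - alpha;
    uint32_t r = (((src >> 11) & 0x1Fu) * alpha + ((dst >> 11) & 0x1Fu) * inv + 127) / 255;
    uint32_t g = (((src >> 5) & 0x3Fu) * alpha + ((dst >> 5) & 0x3Fu) * inv + 127) / 255;
    uint32_t b = ((src & 0x1Fu) * alpha + (dst & 0x1Fu) * inv + 127) / 255;

    return (uint16_t)((r << 11) | (g << 5) | b);
}

/* Blend a w x h rectangle of src onto dst with one constant alpha. */
void blend_rect565(uint16_t *dst, size_t dst_stride,
                   const uint16_t *src, size_t src_stride,
                   size_t w, size_t h, uint8_t alpha)
{
    for (size_t y = 0; y < h; y++)
        for (size_t x = 0; x < w; x++)
            dst[y * dst_stride + x] =
                blend565(dst[y * dst_stride + x], src[y * src_stride + x], alpha);
}
```

Parametric (per-pixel) alpha would look the same, except that alpha would come from a third buffer instead of a single argument.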
The next step is still in progress: using the DMA2D unit in the STM32 to accelerate most of the things the ATI chip can do. Except for image resizing, it can do them all in one pass or two! For extra credit, it can also operate in the background, like the ATI chip did relative to the CPU in the Zodiac. But that is for later…
Other Tapwave APIs
The input subsystem in the Zodiac was quite special and required some work. Instead of the usual PalmOS methods of reading keys, touch, etc., they introduced a new “input queue” mechanism that allowed all of these events to be delivered into one place. I had to reimplement this from nothing but the documented high level API and disassembly. It worked: rePalm now has a working implementation of TwInput, which can be used as a reference by anyone who also, for some reason, wants to implement it.
TwMidi was mostly reverse engineered in a week. But I did not write a midi sequencer. I could and shall, but not yet. The API is known, and that is as far as I needed to go to return proper error codes and allow the rest of the system to go on.
Real hardware: reSpring
The ultimate Springboard accessory
Back when Handspring first released the Visor, its Springboard Expansion Slot was one of the most revolutionary features. It allowed a few very cool expansion devices, like cellular phones, GPS receivers, barcode readers, expansion card readers, and cameras. The Springboard slot is cool because it is a literal direct connection to the CPU’s data and address bus. This provides a lot of expansion opportunities. I decided that the first application of rePalm should be a Springboard accessory that will, when plugged in, upgrade a Visor to PalmOS 5. The idea is that reSpring will run rePalm on its CPU, and the Visor will act as the screen, touch, and buttons. I collaborated with George Rudolf Mezzomo on reSpring, with me setting the specs, him doing the schematics and layout, and me doing the software and drivers.
Interfacing with the Visor
To the Visor, the Springboard module looks like two memory areas (two chip select lines), each a few megabytes large at most. The first must have a valid ROM image for the Visor to find, structured like a PalmOS ROM memory, with a single heap. Usually that heap contains a single application: the driver for this module. The second chip select is usually used to interface to whatever hardware the Springboard unit has. For reSpring I decided to do things differently. There were a few reasons. The first reason was that a NOR flash to store the ROM would take up board space. The second was that I really did not want to manage so many different flashable parts on the board. There was a third reason too, but we’ll have to get back to that in a bit.
The Visor expects to interface with the Springboard by doing memory accesses to it (reads and writes) and the module is expected to basically behave like a synchronous memory device. That means that there is no “I am ready to reply” line; instead you have a fixed number of cycles to reply to any request. When a module is inserted, the Visor configures that number to be six, but it can then be lowered by the module’s driver app. Trying to reply to requests coming in with a fixed (and very short) deadline would be a huge CPU load for our ARM CPU. I decided that the easiest way to accomplish this is to actually put a RAM there, and let the Visor access that. But, then, how do we access it, if the Visor can do so anytime? Well, there are special types of RAM that allow this.
Yes, the elusive (and expensive) dual-ported RAM. I decided that reSpring would use a small amount of dual-ported RAM as a mailbox between the Visor and rePalm’s CPU. This way the Visor could access it anytime, and so could rePalm. The Springboard slot also has two interrupt request lines, one to the Visor, one to the module. These can be used to signal when a message is in the mailbox. There are two problems. The first is that dual-ported RAMs are usually large, mostly due to the large number of pins needed. Since the Visor needs a 16-bit-wide memory in the Springboard slot, our hypothetical dual-ported RAM would need to be 16-bit wide. And then we need address lines, control lines, byte lane select lines, and chip select lines. If we were to use a 4KB memory, for example, we would need 11 address lines, 16 data lines, 2 byte lane select lines, one chip select line, one output enable line, and one write enable line, PER PORT! Add in at least two power pins, and our hypothetical chip is a 66-pin monstrosity. Since 66-pin packages do not exist, we are all in for a 100-pin part. And 4KB is not even much. Ideally we’d like to fit our entire framebuffer in there to avoid complex piecewise transfers. Sadly, as the great philosopher Jagger once said, “You can’t always get what you want.” The second problem is that dual-ported RAMs are very expensive. There are only two companies making them, and they charge a lot. I settled on the 4KB part purely based on price. Even at this measly 4KB size, this one RAM is by far the most expensive component on the board at $25. Given that the costs of putting in a 64KB part (my preferred size) were beyond my imagination (and beyond my wallet’s abilities), I decided to invent a complex messaging protocol and make it work over a 4KB RAM used as a bidirectional mailbox.
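The pin-count arithmetic above can be written down as a small helper, which also shows what a larger part would have required. This is just the same back-of-the-envelope tally expressed in C. The function name is made up, and real parts typically add more power, ground, and arbitration pins on top of this minimum.

```c
/* Minimum pin count for a dual-ported RAM, following the tally in the text:
 * per port: address + data + byte lane selects + CS + OE + WE,
 * plus two power pins for the whole chip. */
unsigned dual_port_ram_min_pins(unsigned size_bytes, unsigned data_bits)
{
    unsigned words = size_bytes / (data_bits / 8);
    unsigned addr_lines = 0;

    /* smallest number of address lines that can select every word */
    while ((1u << addr_lines) < words)
        addr_lines++;

    unsigned per_port = addr_lines + data_bits + data_bits / 8 + 3;

    return 2 * per_port + 2;
}
```

For the 4KB, 16-bit part this gives the 66 pins mentioned above. A 64KB part needs four more address lines per port, for a minimum of 74.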
However, allow us to get again to our want for a ROM to carry our driver program. Nowhere within the Sprinboard spec is there really a requirement for a ROM, only a reminiscence. So what does that imply? We are able to keep away from that further chip by having the reSpring CPU include the ROM picture inside it, and rapidly write it into the dual-ported RAM on powerup. For the reason that Visor provides the module as much as three seconds to supply a legitimate card header, we’ve got loads of time as well up and write the ROM to our RAM. One chip fewer to purchase and place on the board is fantastic!
Version 1
I admit: there was a bit of feature creep, but the final hardware design for version 1 ended up being: 8MB of RAM, 128MB of NAND flash, a 192MHz CPU with 2MB of flash for the OS, a microSD card slot, a speaker for audio out, and an amplifier to use the in-Visor microphone for audio in. Audio out will be done the same way as on the STM32F429 board, audio in will be done via the real ADC. The main RAM is on a 32-bit-wide bus running at 96MHz (384MB/s bandwidth). The NAND flash is on a QSPI bus at 96MHz (48MB/s bandwidth). The OS will be stored in the internal flash of the STM32F469 CPU. The onboard NAND is just an exploration I would like to do. It will either be an internal SD card, or maybe storage for something like NVFS (but not as unstable), when I have had time to write it.
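The two bandwidth figures follow directly from bus width times clock. A minimal sketch, assuming one transfer per clock cycle and four data lines for QSPI (the helper name is invented for this example):

```python
def bus_bandwidth_mb_per_s(width_bits, clock_mhz):
    """Peak bandwidth in MB/s, assuming one transfer per clock cycle."""
    return width_bits * clock_mhz / 8


sdram_bandwidth = bus_bandwidth_mb_per_s(32, 96)  # 32-bit SDRAM bus
qspi_bandwidth = bus_bandwidth_mb_per_s(4, 96)    # QSPI: 4 data lines
print(sdram_bandwidth, qspi_bandwidth)
```

These are peak numbers; command overhead and refresh cycles are ignored here.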
So, when is this happening? Five version 1 boards were delivered to me in late November 2019!
Bringup of v1
Having hardware in-hand is great. It is greater yet when it works right the very first time. Great like unicorns, and just as likely. Nope… nothing worked right away. The boards did not want to talk to the debugger at all, and after weeks of torture, I realized some pull ups and downs were missing from the boards. This was not an issue on STM’s dev boards since they include these pull ups/downs. Once the CPU started talking to me, it became evident very quickly that it was very very unstable. It is specified to run at 180MHz (yes, that means normally we are overclocking it by 9.2% to 196.6MHz). On the reSpring boards the CPU would not run with any stability over 140MHz. I checked the power supply and the decoupling caps. All seemed to be in place, until… No VCAP1 and VCAP2. The CPU core runs at a lower voltage than 3.3V, so the CPU has an internal regulator. This regulator needs capacitors to stabilize its output in the face of variable consumption by the CPU. That is what the VCAP1 and VCAP2 pins are for. Well, the board had no capacitors on VCAP1 and VCAP2. The internal regulator output was swinging wildly (+/- 600mV on a 1.8V supply is a lot of swing!). In fact, it is amazing that the CPU ran at all with such an unstable supply! Well, after another rework under the microscope, in which two capacitors were added, the board was stable. On to the next problem…
The next issue was SDRAM: the main place the code runs from and data is stored. The interface seemed completely borked. In any word that was written, the 15th bit would always read as a 1, and the 0th and 1st bits would always read as zero. Needless to say, this is not acceptable for a RAM which I hoped to run code from. This was a massive pain to debug, but in the end it turned out to be a typo in the GPIO config not mapping the two lower bits to be SDRAM DQ0 and DQ1. This left only bit 15 stuck high to resolve. That issue did not replicate on other boards, so it was a local issue on one board. Lots of careful microscoping revealed a gob of solder under the pin left from PCBA, which was shorting to a nearby pin that was high. Lifting the pin, wicking the solder off, and reconnecting the pin to the PCB resolved this issue. SDRAM now worked. Since this SDRAM was quite different from the one on the STM32F429 discovery board, I had to dig up the configs to use for it, and translate between the timings STM uses and the ones the RAM datasheet uses to come up with proper settings. The result was quite fast SDRAM which seems stable. Awesome!
Of course this was not nearly the end of it. I could not access the dual-ported SRAM at all. A quick check of the board layout revealed that its chip select pin was never wired to the STM. Out came the microscope and soldering iron, and a wire was added. Lo and behold, SRAM was accessible. More datasheet reading ensued to configure it properly. While doing that, I noticed that its power consumption is listed as “low”, just 380 mW!!! So not only is this the most expensive chip on the board, it is also the most power hungry! It really needs to go!
I can tell you of more reworks that followed after some in-Visor testing, just to keep the whole rework story together. It turned out that the line to interrupt the Visor was never connected anywhere, so I wired that up to PA4, so that reSpring could send an IRQ to the Visor. Also it turned out that the SRAM has a lot of “modes” and it was configured for the wrong one. Three separate pins had to be reworked to switch it from “master” mode into “slave” mode. These modes configure how multiple such SRAMs can be used together. As reSpring only has one, logically it was configured as master. This turns out to have been wrong. Whoops.
Let’s stick it into a Visor?
Getting recognized
So simple, right? Just stick it into the Visor and be done with it? Reading and re-reading the Handspring Springboard Development Guide provided almost all the information needed, in theory. Practice was different. For some reason, no matter how I formatted the fake ROM in the shared SRAM, the Visor would not recognize it. Finally I gave up on this approach, and wrote a test app to just dump what the Visor sees to the screen, in a series of messageboxes. Springboard ROM is always mapped at 0x28000000. I quickly realized the issues. First, the Visor’s Springboard slot byteswaps all accesses. This is because most of the world is little-endian, while the 68k CPU is big-endian. To allow peripheral designers to not worry, Handspring byteswaps the bus. “But,” you might say, “what about non-word accesses?” There are no such accesses. The Visor always accesses 16 bits at a time. There are no byte-select lines. For us this is actually kind of cool. As long as we communicate using only 16-bit quantities, no byteswapping in software is needed. There was another issue: the Visor saw only every other word that reSpring wrote. This took some investigation, but the result was both hilarious and sad at the same time. Despite all accesses to Springboard being 16 bits wide, address line 0 is wired to the Springboard connector. Why? Who knows? But it is always low. On the reSpring board, the Springboard connector’s A0 was wired to the RAM’s A0. But since it is always 0, this means the Visor can only access every other word of RAM – the even addresses. …sigh… So we do not have 4K of shared RAM. We have 2K… But, now that we know all this, can we get the Visor to recognize reSpring as a Springboard module? YES! The image on the right was taken the first time the reSpring module was recognized by the Visor.
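Both discoveries are easy to model. The sketch below is purely illustrative (none of these helper names come from rePalm): it shows why a 16-bit value survives the trip from a little-endian STM through the byteswapping bus to a big-endian 68k, and why tying A0 low leaves only the even-numbered SRAM words reachable from the Visor side.

```python
import struct


def stm_store_word(value):
    """A little-endian CPU stores 0x1234 as the bytes 34 12."""
    return struct.pack("<H", value)


def bus_byteswap(two_bytes):
    """The Springboard bus swaps the two bytes of every 16-bit access."""
    return two_bytes[::-1]


def visor_load_word(two_bytes):
    """The big-endian 68k interprets the first byte as the high byte."""
    return struct.unpack(">H", two_bytes)[0]


def visor_visible_words(sram_words):
    """With A0 stuck low, only even-numbered SRAM words can be addressed."""
    return sram_words[0::2]


seen_by_visor = visor_load_word(bus_byteswap(stm_store_word(0x1234)))
sram = list(range(8))                # word 0..7 as written by the STM
visible = visor_visible_words(sram)  # what the Visor can actually reach
print(hex(seen_by_visor), visible)
```

So the STM side has to place anything meant for the Visor at even word indices, which is exactly how 4K of SRAM turns into 2K of usable shared memory.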
Saving valuable space
Of course, this was only the beginning of the difficulties. Applications run right from the ROM of the module. This is good and bad. For us this is mostly bad. What does this mean? The ROM image we put in the SRAM must remain there, forever. So we need to make it as small as possible. I worked very hard to minimize the size, and got it down to about 684 bytes. Most of my attempts to overlap structures to save space did not work – the Visor code that validates the ROM on the Springboard module is merciless. The actual application is tiny. It implements the simplest possible messaging protocol (one word at a time) to communicate with the STM. It implements no graphics support and no pen support. So what does it do? It downloads a larger piece of code, one word at a time, from the STM. This code is stored in the Visor’s RAM and can run from there. It then simply jumps to that code. Why? This allows us to save valuable SRAM space. So we end up with 2K – 684 bytes = 1.3K of RAM for sending data back and forth. Not much but probably adequate.
Communications
So, we have 1.3KB of shared RAM and an interrupt going each way. How do we communicate? I designed two communications protocols: a simple one and a complex one. The simple one is used only to bootstrap the larger code into Visor RAM. It sends a single 16-bit message and gets a single 16-bit response. The messages implemented are quite basic: a request to reply (just to check comms), a few requests to get information on where in the shared memory the large mailboxes for the complex protocol are, a request for how big the downloaded code is, and the message to download the next word of code. Once the code is downloaded and knows what the locations and sizes of the mailboxes are, it uses the complex protocol. How does it differ? A large chunk of data is placed in the mailbox, and then the simple protocol is used to indicate a request and get a response. The mailboxes are unidirectional, and sized very differently. The STM-to-Visor mailbox occupies about 85% of the space, while the mailbox in the other direction is tiny. The reason is obvious – screen data is large.
All requests always originate from the Visor and get a response from the reSpring module. If the module has something to tell the Visor, it will raise an IRQ, and the Visor will send a request for the data. If the Visor has nothing to send, it will simply send an empty NOP message. How does the Visor send a request? First, the data is written to the mailbox, then the message type is written to a special SRAM location, and then a special marker indicating that the message is complete is written to another SRAM location. An IRQ is then raised to the module. The IRQ handler in the STM looks for this “message valid” marker, and if it is found the message is read and replied to: first the data is written to the mailbox, then the message type is written to the shared SRAM location for message type, and then the “this is a reply” marker is written to the marker SRAM location. This whole time, the Visor is simply loop-reading the marker SRAM location waiting for it to change. Is this busy waiting a problem? No. The STM is so fast, and the code to handle the IRQ does so little processing, that the replies usually come in microseconds.
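The request/reply ordering described above (data first, then type, then marker) can be modelled in a few lines. This is a toy simulation, not the rePalm code: the class, the marker values, and the message numbers are all invented for illustration, and the "IRQ" is just a direct function call.

```python
# Hypothetical marker and message values; the real ones are not documented here.
MARKER_IDLE, MARKER_REQUEST, MARKER_REPLY = 0x0001, 0x0002, 0x0003
MSG_NOP, MSG_PING = 0x0010, 0x0011


class SharedSram:
    """Stand-in for the shared SRAM: two mailboxes, a type word, a marker word."""

    def __init__(self):
        self.visor_to_stm = []  # small mailbox
        self.stm_to_visor = []  # large mailbox
        self.msg_type = MSG_NOP
        self.marker = MARKER_IDLE


def stm_irq_handler(sram):
    """STM side: runs when the Visor raises the interrupt."""
    if sram.marker != MARKER_REQUEST:
        return  # no valid message, nothing to do
    # Toy behaviour: PING echoes its payload, everything else replies empty.
    reply = list(sram.visor_to_stm) if sram.msg_type == MSG_PING else []
    sram.stm_to_visor = reply         # 1. reply data
    sram.msg_type = sram.msg_type     # 2. reply type (same as the request here)
    sram.marker = MARKER_REPLY        # 3. marker last, so the reply is complete


def visor_request(sram, msg_type, payload, raise_irq, max_spins=1000):
    """Visor side: send one request and busy-wait for the reply."""
    sram.visor_to_stm = list(payload)  # 1. request data
    sram.msg_type = msg_type           # 2. request type
    sram.marker = MARKER_REQUEST       # 3. marker last
    raise_irq(sram)
    for _ in range(max_spins):         # loop-read the marker location
        if sram.marker == MARKER_REPLY:
            break
    else:
        raise TimeoutError("no reply from module")
    reply = (sram.msg_type, list(sram.stm_to_visor))
    sram.marker = MARKER_IDLE
    return reply


sram = SharedSram()
ping_reply = visor_request(sram, MSG_PING, [1, 2, 3], stm_irq_handler)
nop_reply = visor_request(sram, MSG_NOP, [], stm_irq_handler)
print(ping_reply, nop_reply)
```

Writing the marker last on both sides is the important part: whoever sees the marker change can assume the data and type words are already in place.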
A careful reading of the Handspring Springboard Development Guide might leave you with a question: “what exactly do you mean when you say ‘interrupt to the module’? There are no pins that are there for that!” Indeed. There are, however, two chip-select lines going to the module. The first must address the ROM (SRAM for us). The second chip-select line is free for the module to use. Its base address in the Visor’s memory map is 0x29000000. We use that as the IRQ to the STM, and simply access 0x29000000 to cause an interrupt to the STM.
Early Visor support
At this point, some basic things could be tested, but they all failed on the Visor Deluxe and Visor Solo. In fact, everything crashed shortly after the module was inserted. Why? Actually the reason is obvious – they run PalmOS 3.1, while all other Visors ran PalmOS 3.5. A surprising number of APIs one comes to rely on in PalmOS programming are simply not available on PalmOS 3.1. Such simple things as ErrAlertCustom(), BmpGetBits(), WinPalette(), and WinGetBitmap() simply do not exist. I had to write code to avoid using these on PalmOS 3.1. But some of them are needed. For example, how do I directly copy bits into the display framebuffer if I cannot get a pointer to the framebuffer via BmpGetBits(WinGetBitmap(WinGetDisplayWindow()))? I attempted to just dig into the structures of windows and bitmaps myself, but it turns out that the display bitmap is not a valid bitmap in PalmOS 3.1 at all. In the end, I realized that PalmOS 3.1 only supported the MC68EZ328 and MC68328 processors, and both of them configure the display controller base address in the same register, so I just read it directly. As for palette setting, it is not needed since PalmOS 3.1 does not support color or palettes. Easy enough.
Making it work well
Initial data
Some data is needed by rePalm before it can properly boot: screen resolution and supported depths, hardware flags (eg: whether the screen has brightness or contrast adjustment), and whether the device has an alert LED (yes, you read that right, more on this later). Thus rePalm does not boot until it gets a “continue boot” message that is sent by the code on the Visor once it collects all this data.
Sending display data
The highest-bandwidth data we need to transfer between the Visor and the reSpring module is the display data. For example, for a 160×160 screen at 16 bits per pixel at 60 FPS, we would need to transfer 160x160x16x60 = 23.44Mbps. Not a low data rate at all to attempt on a 33MHz 68k CPU. In fact, I do not think this is even possible. For 4 bits-per-pixel greyscale the numbers look a little better: 160x160x4x60 = 5.86Mbps. But there is a second problem. Each message needs a full round trip. We are limited by the Visor’s interrupt latency and our general round-trip latency. Sadly that latency is as high as 2-4ms. So we need to minimize the number of packets sent. We will come back to this later. Initially I just sent the data piecewise and displayed it onscreen. Did it work the first time? Actually, almost. The image to the right shows the results. All it took was a single byteswap to get it to work perfectly!
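Here is the same arithmetic as a small Python sketch. The function names are invented for this example. One observation rather than a documented fact: the figures quoted above come out exactly when a megabit is taken as 2^20 bits, so that is what the helper assumes.

```python
def display_mbps(width, height, bits_per_pixel, fps):
    """Raw framebuffer bandwidth, with 1 Mbit taken as 2**20 bits."""
    return width * height * bits_per_pixel * fps / 2**20


def max_round_trips_per_second(latency_ms):
    """Upper bound on request/reply exchanges at a given round-trip latency."""
    return 1000 / latency_ms


color_mbps = display_mbps(160, 160, 16, 60)
grey_mbps = display_mbps(160, 160, 4, 60)
best_case_trips = max_round_trips_per_second(2)
worst_case_trips = max_round_trips_per_second(4)
print(round(color_mbps, 2), round(grey_mbps, 2), best_case_trips, worst_case_trips)
```

With only a few hundred round trips available per second, the number of packets per frame matters far more than the raw bit rate, which is why reducing the packet count is the goal.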
It was quite slow, however – about 2 frames per second. Looking into it, I realized that the call to MemMove was one of the reasons. I wrote a routine optimized to move the large chunks of data, given that it was not overlapped and always aligned. This improved the refresh rate to about 8 frames per second on the greyscale devices. More improvement was needed. The major issue was the round-trip time of copying data, waiting, copying it out, and so on. How do we minimize the number of round trips? Yup – compress the data. I wrote a very very fast lossless image compressor on the STM. It works somewhat like LZ, with a hashtable to find previous occurrences of a data pattern. The compression ratios were very good, and refresh rates went up to 30-40 FPS on the greyscale devices. Color Bejeweled became playable even!
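To give a feel for the "LZ with a hashtable" idea, here is a deliberately simple Python sketch. It is not the compressor used in rePalm, and the token format is invented for readability: a dictionary remembers where each 3-byte sequence was last seen, and a match is emitted as a (distance, length) copy instead of literal bytes.

```python
MIN_MATCH = 3  # shortest sequence worth encoding as a copy


def compress(data):
    """Return a list of tokens: ('lit', byte) or ('copy', distance, length)."""
    last_seen = {}  # 3-byte sequence -> most recent position it started at
    tokens = []
    i = 0
    while i < len(data):
        key = data[i:i + MIN_MATCH]
        candidate = last_seen.get(key) if len(key) == MIN_MATCH else None
        if len(key) == MIN_MATCH:
            last_seen[key] = i
        if candidate is None:
            tokens.append(("lit", data[i]))
            i += 1
            continue
        # Extend the match as far as it goes (it may overlap position i).
        length = 0
        while i + length < len(data) and data[candidate + length] == data[i + length]:
            length += 1
        tokens.append(("copy", i - candidate, length))
        i += length
    return tokens


def decompress(tokens):
    """Rebuild the original bytes from the token list."""
    out = bytearray()
    for token in tokens:
        if token[0] == "lit":
            out.append(token[1])
        else:
            _, distance, length = token
            for _ in range(length):  # byte by byte, so overlapping copies work
                out.append(out[-distance])
    return bytes(out)


# A fake 4bpp-style frame: long runs, as in a typical mostly-flat UI screen.
frame = (bytes([0x11]) * 160 + bytes([0x22]) * 160) * 10
tokens = compress(frame)
restored = decompress(tokens)
print(len(frame), "bytes ->", len(tokens), "tokens")
```

Screens with large flat areas collapse to a handful of copy tokens, which is where the reduction in round trips comes from. A real implementation would pack the tokens into bytes and bound the hashtable size.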
Actually getting the display data was also quite interesting. PalmOS 5 expects the display to just be a framebuffer that may be written to freely. While there are APIs to draw, one may also just write to the framebuffer. This means that there is not really a way to get notified when the image onscreen changes. We could send screen data constantly. In fact, this is what I did initially. This depletes the Visor battery at about two percent a minute since the CPU is constantly busy. Clearly this is not the way to go. But how do we get notified when someone draws? The solution is a fun one: we use the MPU. We can protect the framebuffer from writes. Reads are allowed but any write causes an exception. We handle the exception by setting a timer for 1/60 of a second later, and then enable the writes and return. The code that was drawing then resumes, none the wiser. When our timer fires, we re-lock the framebuffer, and request to transfer a screenful of data to the Visor. This allows us to not send the same data over and over. Sometimes writes to the screen also change nothing, so I later added a second layer where anytime we send a screenful of data, we keep a copy, and next time we are asked to send, we compare, and do nothing if the image is the same. Together with compression, these two techniques bring us to a reasonable power usage and screen refresh rate.
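The two layers of change detection can be sketched as a small state machine. This is a simulation only: Python has no MPU, so the "write fault" is modelled as a check inside a write() method, and the class and attribute names are invented for illustration.

```python
class DirtyTracker:
    """Toy model of MPU-based framebuffer change detection."""

    def __init__(self, framebuffer):
        self.fb = framebuffer
        self.locked = True       # framebuffer starts write-protected
        self.timer_armed = False
        self.last_sent = None    # copy of the last screenful sent
        self.faults = 0
        self.frames_sent = []

    def write(self, offset, value):
        if self.locked:
            # Stands in for the MPU exception: arm the timer, unlock, resume.
            self.faults += 1
            self.timer_armed = True
            self.locked = False
        self.fb[offset] = value

    def timer_fired(self):
        # 1/60 s after the first write: re-lock and consider sending.
        self.timer_armed = False
        self.locked = True
        snapshot = bytes(self.fb)
        if snapshot != self.last_sent:  # second layer: skip identical frames
            self.last_sent = snapshot
            self.frames_sent.append(snapshot)


fb = bytearray(4)
tracker = DirtyTracker(fb)
tracker.write(0, 7)
tracker.write(1, 9)    # no second fault: writes are already enabled
tracker.timer_fired()  # screen changed, so one frame is sent
tracker.write(0, 7)    # faults again, but writes the same value
tracker.timer_fired()  # nothing changed, so nothing is sent
print(tracker.faults, len(tracker.frames_sent))
```

The first layer bounds the cost to one exception per 1/60 s of drawing activity; the second layer catches redraws that leave the pixels unchanged.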
Buttons, pen, brightness, contrast, and battery info
Since the Visor can send data to the reSpring module anytime it wishes, sending button and pen data is easy: just send a message with the data. For transferring data the other way, the design is also simple. If the module raises an IRQ, the Visor will send a NOP message, and in reply the module will send its request. There are requests for setting the display palette, brightness, or contrast, and for getting battery info. The Visor will perform the requested action, and perhaps reply (eg: for battery info).
Microphone support
The audio amp turned out to be quite miswired on v1 boards, but after some complicated reworks, it was possible to test basic audio recording functionality. It worked! Due to how the reworks worked, the quality was not stellar, but I could recognize my voice as I said “1 2 3 4 5 6 7” to the voice memo app. But, in reality, amplifying the Visor mic is a huge pain – we need a 40dB gain to get anything useful out of the ADC. The analog components for doing this properly and noise-free are just too expensive and numerous, so for v2 it was decided to just populate a digital mic on the board – it is actually cheaper. Plus, no analog is the best amount of analog for a board!
Polish
Serial/IrDA
I support forwarding the Visor’s serial port to reSpring. What is this for? HotSync (works) and IR beaming (mostly works). This is actually quite a hard problem to solve. To start with, in order to support PalmOS 3.1, one must use the Old Serial Manager API. I had never used it, since PalmOS 4.5 introduced the New Serial Manager and I had almost never written any code for PalmOS before 4.1. The APIs are actually similar, and both quite hostile to what we need. We need to be able to be told when data arrives, without busy-waiting for it. Seemingly there is no API for this. Repeatedly and constantly checking for data works, but wastes battery. Finally I found that by using the “receive window” and “wakeup handler”, both of which are halfway-explained in the manual, I can get what I want – a callback when data arrives. I also found that, while lightly documented, there is a way to give the Serial Manager a larger receive buffer. This allows us to not drop received data even if we take a few milliseconds to get it out of the buffer. I was able to use all of this to wire up the Visor’s serial port to a driver in reSpring. Unfortunately, beaming requires a rather quick response rate, which is hard to reach with our round-trip latency. Beaming works, but not every time. HotSync does work, even over USB.
Alarm LED
Since rePalm supports alarm LEDs and some Visors have LEDs (Pro, Prism, and Edge), I wanted to wire one up to the other. There are no public APIs for LED access on the Handspring devices. Some reverse engineering showed that the Handspring HAL does have a function to set the LED state: HalLEDCommand(). It does precisely what I want, and can be called simply as TRAP #1; dc.w 0xa014. There is an issue. Earlier versions of the Handspring HAL lack this function, and if you attempt to call it, they will crash. “Surely,” you might say, “all devices that support the LED implement this function!” Nope… Visor Prism devices sold in the USA do not. The EFIGS version does, as do all later devices. This convenient hardware-independent function was thus not available to me. What to do? Well, there are only three devices that have an LED, and I can detect them. Let’s go for direct hardware access then! On the Visor Edge the LED is on GPIO K4, on the Pro it is K3, and on the Prism it is C7. We can write this GPIO directly and it works as expected.
There are two driver modes for LED and vibrator in rePalm – simple and complex. Simple mode has rePalm give the LED/vibrator very simple “turn on now” and “turn off now” commands. This is suitable for a directly wired LED/vibrator. In the reSpring case we actually prefer to use the complex driver, where the OS tells us “here is the LED/vibrator pattern, here is how fast to perform it, this many times, with this much time in between”. This is suitable for when you have an external controller that drives the LED/vibrator. Here we do have one: the Visor is our external controller. So we simply send these commands to the Visor and our downloaded code performs the proper actions using a simple state machine.
Software update
I wanted reSpring to be able to self-update from the SD card. How could this be achieved? Well, the flash in the STM32 can be written by code running on the STM32, so logically it should not be hard. A few complications exist: first of all, the entire PalmOS is running from flash, including drivers for various hardware pieces. Our comms layer to talk to the Visor is also in there. So to perform the update we need to stop the entire OS and disable all interrupts and drivers. OK, that is easy enough, but among those drivers are the drivers for the SD card, where our update is. We need that. Easy to solve: copy the update to RAM before starting the update – RAM needs no drivers. But how do we show the progress to the user? Our framebuffer is not real, and making the Visor show it requires a lot of code and working interrupts. There was no chance this would work as usual.
I decided that the best way to do this was to have the Visor draw the update UI itself, and just use a single SRAM location to show progress. Writing a single SRAM location is something our update process can do with no issues since the SRAM needs no drivers – it is just memory mapped. The rest was easy: a program to load the update into RAM, send the “update now” message, and then flash the ROM, all the while writing the “percent completed” value to the proper SRAM location. This required exporting the “send a message” API from the rePalm DAL for applications to use. I did that.
Onboard NAND
You wanted pain? Here’s some NAND
The reSpring board has 256MB of NAND flash on a QSPI bus. Why? Because at the time it was designed, I thought it would be cool, and it was quite cheap. NAND is the storage technology underlying most modern storage – your SD cards, your SSD, and the storage in your phone. But NAND is hard – it has a number of anti-features that make it quite difficult to use for storage. First, NAND may not properly store data – error correction is needed as it may occasionally flip a bit or two. Worse, more bit flips may accumulate over time, to a point where error correction may not be enough, necessitating moving the data when such a time approaches. The smallest addressable unit of NAND is a page. That is the size of NAND that may be read or programmed. Programming only flips one bits to zero, not the reverse. The only way to get one bits back is an erase operation. But that operates on a block – a large collection of pages. Because you need error correcting codes, AND bits can only be flipped from one to zero, overwriting data is hard (since the ECC code you use almost certainly will need more ones). There are usually limits to how many times a page may be programmed between erases anyway. There are also usually requirements that pages in a block be programmed in order. And, for extra fun, blocks may go bad (failing to erase or program). In fact a NAND device may ship with bad blocks directly from the factory! Clearly this is not at all what you think of when you imagine block storage. NAND requires careful management to use for storage. Since blocks die due to wear, caused by erasing, you want to wear evenly across the entire device. This may in turn necessitate moving more data.
At the same time, while you move data, power may go out, so you need to be careful about when and what is erased and where it is written. Keeping a consistent idea of what is stored where is hard. This is the job of an FTL – a flash translation layer. An FTL takes the mess that is NAND and presents it as a normal block device with a number of sectors which may be read and written randomly, with no concern for things like error correction, erase counts, and page partial programming limits.
To write an FTL…
I had written an FTL long ago, so I had some basic idea of the process involved. This was, however, more than a decade ago. It was fun to try to do it again, but better. This time I set out with a few goals. The main priority was to absolutely never lose any data in the face of random power loss, since the module may be removed from the Visor randomly at any time. The FTL I produced will never lose any data, no matter when you randomly cut its power. A secondary priority was to minimize the amount of RAM used, since, after all, reSpring only has 8MB of it!
The pages in the NAND on reSpring are 2176 bytes in size. Of that, 4 are reserved for the “bad block marker”, 28 are free to use however you wish, with no error correction protection, and the rest is split into 4 equal parts of 536 bytes, which, if you desire, the chip can error-correct (by using the last 16 of those bytes for the ECC code). This means that per page we have 2080 error-corrected bytes and 28 non-error-corrected bytes. Blocks are 64 pages each, and the device has 2048 blocks, of which they promise at least 2008 will be good from the factory. Having the chip do the ECC for us is nice – it has a special hardware unit and can do it much faster than our CPU ever could in software. It will even report to us how many bits were corrected on each read. This information is vital because it tells us about the health of this page and thus informs our decision as to when to relocate the data before it becomes unreadable.
I decided that I would like my FTL to present itself as a block device with 4K blocks. This is the cluster size FAT16 should optimally use on our device, and having larger blocks allows us to have a smaller mapping table (the map from virtual “sector number” to real “page number”). Thus we would always treat two pages together as one. This means that each of our virtual pages will have 4160 bytes of error-corrected data and 56 bytes of non-error-corrected data. Since our flash allows writing the same page twice, we will use the un-error-corrected area ourselves with some homemade error correction to store some data we want to persist. This will be things like how many times this block has been erased, the same for the previous and next blocks, and the current generation counter to figure out how old the information is. The homemade ECC was trivial: a Hamming code to correct up to one bit of error, and then replicate the data plus the Hamming code three times. This should provide enough protection. Since this only uses the un-error-corrected part of the pages, we can then just write error-corrected data over this with no issues. Whenever we erase a page, we write this data to it immediately. If we are interrupted, the pages around it have the data we need and we can resume said write after power is back on.
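The page layout numbers are easy to lose track of, so here they are as a short Python check. The constant names are made up for this sketch; the values are the ones quoted in the text.

```python
PAGE_BYTES = 2176
BAD_BLOCK_MARKER_BYTES = 4
UNPROTECTED_BYTES = 28      # free to use, no hardware ECC
ECC_CHUNKS = 4
CHUNK_BYTES = 536           # each chunk includes its own ECC bytes
ECC_BYTES_PER_CHUNK = 16
PAGES_PER_VIRTUAL_PAGE = 2  # the FTL always pairs two physical pages
USER_BYTES = 4096           # sector size the FTL presents

# The three areas must add up to the physical page size.
layout_total = BAD_BLOCK_MARKER_BYTES + UNPROTECTED_BYTES + ECC_CHUNKS * CHUNK_BYTES

protected_per_page = ECC_CHUNKS * (CHUNK_BYTES - ECC_BYTES_PER_CHUNK)
virtual_protected = PAGES_PER_VIRTUAL_PAGE * protected_per_page
virtual_unprotected = PAGES_PER_VIRTUAL_PAGE * UNPROTECTED_BYTES

# Error-corrected bytes left over for FTL bookkeeping next to the user data.
service_bytes = virtual_protected - USER_BYTES

print(layout_total, protected_per_page, virtual_protected,
      virtual_unprotected, service_bytes)
```

So after carving out a 4096-byte sector, each virtual page still has 64 error-corrected bytes available for metadata, in addition to the 56 unprotected bytes.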
The error-corrected data contains the user data (4096 bytes of it) and our service data, such as what virtual sector this data is for, the generation counter, info on this and a few neighboring blocks, and some other info. This info allows us to rebuild the mapping table after a power cycle. But clearly reading the entire device on each power on is slow and we do not want to do this. We thus support checkpoints. Whenever the device is powered off, or the FTL is unmounted, we write a checkpoint. It contains the mapping info and some other info that allows us to quickly resume operation without scanning the entire device. Of course in case of an unexpected power off we do need to do a scan. For those cases there is an optimization too – a directory at the end of each block tells us what it contains – this allows the scan to read just 1/32nd of the device instead of 100% of it – a 32x speedup!
Read and write requests from PalmOS directly map to the FTL layer’s read and write. Except there is a problem – PalmOS only supports block devices with sector sizes of 512 bytes. I wrote a simple translation layer that does read-modify-write as needed to map my 4K sectors to PalmOS’s 512-byte sectors, if PalmOS’s request did not perfectly align with the FTL’s 4K sectors. This is not as scary or as slow as you imagine, because PalmOS uses FAT16 to format the device. When it does, it asks the device about its preferred block size. We reply with 4K and from then on, PalmOS’s FAT driver only writes full 4K clusters – which align perfectly with our 4K FTL sectors. The runtime memory usage of the FTL is only 128KB – not bad at all, if I do say so myself! I wrote a very torturous set of tests for the FTL and ran it on my computer over a few nights. The tests simulated data going bad, power going off randomly, etc. The FTL passed. There is actually a lot more to this FTL, and you are free to go look at the source code to see more.
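The 512-byte to 4K translation is a classic read-modify-write shim. Below is a minimal sketch of the idea, with a fake in-memory FTL standing in for the real one; every name here is invented for illustration and the real layer surely differs in detail.

```python
FTL_SECTOR = 4096
OS_SECTOR = 512


class FakeFtl:
    """In-memory stand-in for the FTL, counting reads and writes."""

    def __init__(self, sectors):
        self.data = [bytes(FTL_SECTOR) for _ in range(sectors)]
        self.reads = 0
        self.writes = 0

    def read(self, index):
        self.reads += 1
        return self.data[index]

    def write(self, index, buf):
        assert len(buf) == FTL_SECTOR
        self.writes += 1
        self.data[index] = bytes(buf)


def write_os_sectors(ftl, first_sector, payload):
    """Write 512-byte sectors, using read-modify-write only when unaligned."""
    assert len(payload) % OS_SECTOR == 0
    pos = first_sector * OS_SECTOR
    end = pos + len(payload)
    src = 0
    while pos < end:
        index, offset = divmod(pos, FTL_SECTOR)
        chunk = min(FTL_SECTOR - offset, end - pos)
        if chunk == FTL_SECTOR:
            # Whole aligned 4K cluster: no need to read the old contents.
            ftl.write(index, payload[src:src + chunk])
        else:
            buf = bytearray(ftl.read(index))
            buf[offset:offset + chunk] = payload[src:src + chunk]
            ftl.write(index, buf)
        pos += chunk
        src += chunk


def read_os_sectors(ftl, first_sector, count):
    """Read 512-byte sectors out of the 4K FTL sectors that contain them."""
    out = bytearray()
    pos = first_sector * OS_SECTOR
    end = pos + count * OS_SECTOR
    while pos < end:
        index, offset = divmod(pos, FTL_SECTOR)
        chunk = min(FTL_SECTOR - offset, end - pos)
        out += ftl.read(index)[offset:offset + chunk]
        pos += chunk
    return bytes(out)


ftl = FakeFtl(sectors=4)
write_os_sectors(ftl, 8, b"\xAA" * 4096)  # aligned full cluster
reads_after_aligned, writes_after_aligned = ftl.reads, ftl.writes
write_os_sectors(ftl, 1, b"\x55" * 512)   # single unaligned sector
reads_after_partial, writes_after_partial = ftl.reads, ftl.writes
sector_1 = read_os_sectors(ftl, 1, 1)
sector_0 = read_os_sectors(ftl, 0, 1)
```

The aligned cluster write costs a single FTL write and no read, which is the common case once FAT16 is formatted with 4K clusters. Only the odd unaligned request pays for the extra read.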
One final WTF
Amidst all this work, rePalm worked well, mostly. Occasionally it would lose a message from the Visor to the module or vice-versa. I spent a lot of time debugging this and came to a startling realization. The dual-ported SRAM does not actually support simultaneous access to the same address by both ports at once. This is documented in its datasheet as a “helpful feature” but it is anything but. Now, it might be reasonable to not allow two simultaneous writes to the same word, sure. But two reads should work, and a read and a write should work too (with a read returning the old data or the new data, or even a mix of the two). This SRAM instead signals “busy” (which it otherwise never does) to one side. Since it is not supposed to ever be busy, and the Springboard slot does not even have a BUSY pin, these signals were wired nowhere. This is when I found the relevant footnote in the manual. It said that switching the chip to SLAVE mode and raising the BUSY pins (which are now inputs) to HIGH will allow simultaneous access. Well, it kind of does. There is no more busy signalling, but sometimes a write will be DROPPED if it is executed concurrently with a read. And a read will sometimes return ZERO if executed concurrently with another read or write, even if the old and new data were both not zero. There seems to be no way around this. Another company’s dual-ported SRAM had the same nonsense limitation, leading me to believe that nobody in the industry makes REAL dual-ported SRAMs. This SRAM has something called “semaphores” which can be used to implement actual semaphores that are truly shared by both devices, but otherwise it is not true dual-ported RAM. Damn!
Using these semaphores would require significant rewiring: we would need a new chip select line going to this chip, and would need to invent a new way to interrupt the STM, since the second chip select line would now be used to access the semaphores. This was beyond my rework abilities, so I just beefed up the protocol to avoid these issues. Now the STM writes each data word that might be concurrently read 64 times, and then reads it back to verify that it was written. The comms protocol was also modified to never ever use zeroes, so if a zero is seen, it is clear that a re-read is necessary. With these hacks the communication is stable, but in the next board rev I think we'll wire up the semaphores to avoid this nasty hack!
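The workaround can be illustrated with a small simulation. This is a sketch under stated assumptions, not the real firmware. `FlakyDualPortRam` is a toy model of the misbehaviour described above, with an arbitrary fault probability. The text does not say how the protocol avoids zero words, so the `VALID_BIT` encoding here is my own assumption. Only the 64 repeated writes, the read-back verification, and the "a zero means re-read" rule come from the description.

```python
import random

WRITE_REPEATS = 64       # from the text: each word is written 64 times
VALID_BIT = 0x8000       # ASSUMPTION: top bit always set, so no word is ever zero


class FlakyDualPortRam:
    """Toy model of the faults described above (not the real chip).

    While `contended` is True, a write may be silently dropped and a read
    may return zero. The fault rate is an arbitrary assumption.
    """

    def __init__(self, size, seed=0, fault_rate=0.3):
        self.words = [0] * size
        self.rng = random.Random(seed)
        self.fault_rate = fault_rate
        self.contended = True

    def _fault(self):
        return self.contended and self.rng.random() < self.fault_rate

    def write(self, addr, value):
        if not self._fault():              # a faulted write is simply lost
            self.words[addr] = value & 0xFFFF

    def read(self, addr):
        return 0 if self._fault() else self.words[addr]


def encode(payload):
    """Pack a 15-bit payload into a 16-bit word that can never be zero."""
    assert 0 <= payload < 0x8000
    return VALID_BIT | payload


def decode(word):
    """Recover the 15-bit payload from an encoded word."""
    return word & 0x7FFF


def robust_read(ram, addr, max_tries=1000):
    """Re-read until a nonzero word appears; zero means a glitched read."""
    for _ in range(max_tries):
        word = ram.read(addr)
        if word != 0:
            return word
    raise RuntimeError("no valid word read")


def robust_write(ram, addr, payload, max_rounds=100):
    """Write the word 64 times, then read it back to verify it landed."""
    word = encode(payload)
    for _ in range(max_rounds):
        for _ in range(WRITE_REPEATS):
            ram.write(addr, word)
        if robust_read(ram, addr) == word:
            return
    raise RuntimeError("write did not stick")
```

The two rules cover the two failure modes separately: repeating and verifying the write handles dropped writes, while banning zero as a legal value lets the reader tell a glitched read from real data. The cost in this sketch is one payload bit per word plus the extra bus traffic.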
So where does this leave us?
There is still a lot to do: implement BT, WiFi, USB, debug NVFS some more, and possibly many more things. However, I am releasing a little preview image to try, if you happen to have an STM32F429 discovery board. It has a minimal PalmOS 5.2.8 image running with a 108ppi 240×300 display. It is minimal so that it fits into the flash on the board. There is no support for USB. Anyhow, if you want to play with it, here: LINK. I am also continuing to work on the reSpring, and you might even be able to get your hands on one soon 🙂 If you already have a reSpring module (you know who you are), the archive linked above has an update to 1.3.0.0 for you too.
Article update history
- image above was updated to v00001: JIT is now on (much faster), RTC works (time), notepad added, touch response improved
- image above was updated to v00002: Graffiti area now drawn, Graffiti works, more apps added (Bejeweled removed for space reasons)
- image above was updated to v00003: ROM is now compressed to allow more things to be in it. This is OK since we unpack it to RAM anyway. Some work done on SD card support
- Explained how LDM/STM are translated
- Wrote a section about SD card support
- Wrote a section about serial port support
- Wrote a section about Vibrate & LED support
- Wrote the first part about NetIF drivers
- image above was updated to v00004: some drawing issues fixed (underline under memopad text field), alert LED now works, SD card works (if you wire it up to the board)
- image above was updated to v00005: some support for 1.5-density displays works, so the image now uses the full screen
- Wrote the document section on 1.5-density display support
- Wrote the document section on DIA support and uploaded v000006 image with it
- Wrote a section on PACE, uploaded image v000007 with much faster 68k execution and some DIA fixes
- Uploaded image v000008 with IrDA support
- Wrote about audio support
- Wrote about reSpring
- Uploaded image v000009 with preliminary audio support
- Uploaded image v000010 with new JIT backend and multiple JIT fixes
- Uploaded image v000011 with an improved JIT backend and more JIT fixes, and an SD-card based updater. Wrote about the Cortex-M0 backend
- Wrote a lot about reSpring hardware v1 bring-up and current status
- Uploaded STM32F429 discovery image v000012 with significant speedups and some fixes (Graffiti, notepad)! (this corresponds to rePalm v 1.1.1.8)
- Uploaded STM32F429 and, for the first time ever, reSpring images for v 1.3.0.0 with many speedups, wrote about mic support and Zodiac support