xorvoid

SectorC (github) is a C compiler written in x86-16 meeting that matches throughout the 512 byte boot sector of an x86 machine. It helps a
subset of C that’s massive sufficient to jot down actual and fascinating packages. It’s fairly doubtless the smallest C compiler ever written.
In a base64 encoding, it seems to be like this:
6gUAwAdoADAfaAAgBzH/6DABPfQYdQXoJQHr8+gjAVOJP+gSALDDqluB+9lQdeAG/zdoAEAfy+gI
AegFAYnYg/hNdFuE9nQNsOiqiwcp+IPoAqvr4j3/FXUG6OUAquvXPVgYdQXoJgDrGj0C2nUGV+gb
AOsF6CgA68Ow6apYKfiD6AKrifgp8CaJRP7rrOg4ALiFwKu4D4Srq1fonP9ewz2N/HUV6JoA6BkA
ieu4iQRQuIs26IAAWKvD6AcAieu4iQbrc4nd6HkA6HYA6DgAHg4fvq8Bra052HQGhcB19h/DrVCw
UKroWQDoGwC4WZGrW4D/wHUMuDnIq7i4AKu4AA+ridirH8M9jfx1COgzALiLBOucg/j4dQXorf/r
JIP49nUI6BwAuI0G6wyE0nQFsLiq6wa4iwarAduJ2KvrA+gAAOhLADwgfvkx2zHJPDkPnsI8IH4S
weEIiMFr2wqD6DABw+gqAOvqicg9Ly90Dj0qL3QSPSkoD5TGidjD6BAAPAp1+eu86Ln/g/jDdfjr
slIx9osEMQQ8O3QUuAACMdLNFIDkgHX0PDt1BIkEMcBaw/v/A8H9/yvB+v/34fb/I8FMAAvBLgAz
wYQA0+CaANP4jwCUwHf/lcAMAJzADgCfwIUAnsCZAJ3AAAAAAAAAAAAAAAAAAAAAAAAAAAAAVao=
Supported language
A reasonably large subset is supported: international variables, features, if statements, whereas statements, a number of operators, pointer dereference, inline machine-code, feedback, and so on.
All of those options make it fairly succesful.
For instance, the next program animates a shifting sine-wave:
int y;
int x;
int x_0;
void sin_positive_approx()
{
y = ( x_0 * ( 157 - x_0 ) ) >> 7;
}
void sin()
{
x_0 = x;
whereas( x_0 > 314 ){
x_0 = x_0 - 314;
}
if( x_0 <= 157 ){
sin_positive_approx();
}
if( x_0 > 157 ){
x_0 = x_0 - 157;
sin_positive_approx();
y = 0 - y;
}
y = 100 + y;
}
int offset;
int x_end;
void draw_sine_wave()
{
x = offset;
x_end = x + 314;
whereas( x <= x_end ){
sin();
pixel_x = x - offset;
pixel_y = y;
vga_set_pixel();
x = x + 1;
}
}
int v_1;
int v_2;
void delay()
{
v_1 = 0;
whereas( v_1 < 50 ){
v_2 = 0;
whereas( v_2 < 10000 ){
v_2 = v_2 + 1;
}
v_1 = v_1 + 1;
}
}
void important()
{
vga_init();
offset = 0;
whereas( 1 ){
vga_clear();
draw_sine_wave();
delay();
offset = offset + 1;
if( offset >= 314 ){ // mod the worth to keep away from 2^16 integer overflow
offset = offset - 314;
}
}
}
Screenshot
After I began eager about SectorC, I had simply completed Deobfuscating OTCC with numerous its concepts freshly loaded into my head.
I additionally simply had some wholesome doses of justine.lol and Tom7 to encourage the absurdity of all of it.
Did I feel I’d succeed? I suspected NO. Match a complete C compiler in 510 bytes of instruction reminiscence? Good luck (sarcasm).
Tokenizing
The primary drawback got here rapidly. In C, the tokenizer/lexer alone appears bigger than one 512 byte sector! We have to eat an arbitrary stream of bytes and produce “tokens”.
For instance:
int important()
{
if( a < 5 ){
func();
}
}
Could be consumed and transformed into:
'int' TOKEN_KEYWORD_INT
'important' TOKEN_IDENTIFIER
'(' TOKEN_LPAREN
')' TOKEN_RPAREN
'{' TOKEN_LBRACE
'if' TOKEN_KEYWORD_IF
'(' TOKEN_LPAREN
'a' TOKEN_IDENTIFIER
'<' TOKEN_OPERATOR
'5' TOKEN_NUMBER
')' TOKEN_RPAREN
'{' TOKEN_LBRACE
'func' TOKEN_IDENTIFIER
'(' TOKEN_LPAREN
')' TOKEN_RPAREN
';' TOKEN_SEMI
'}' TOKEN_RBRACE
'}' TOKEN_RBRACE
We have to particularly acknowledge key phrases
, identifiers
, operators
, and numbers
. After which we have to convert numbers from string to integer with one thing like atoi()
:
int atoi(const char *s)
{
int n = 0;
whereas (1) {
char c = *s++;
if (!c) break;
n = 10 * n + (c - '0');
}
return n;
}
I wrote a reasonably straight-forward and minimalist lexer and it took >150 strains of C code. A crude estimate of the identical code in x86-16 would require 300-450 bytes minimal (e.g. a easy add ax,bx
instruction encodes as 2 bytes). And this doesn’t embody any image desk, recursive-descent parser, code-generator, branch-patching, and so on.
No Probability.
So, naturally … I continued. At all times decide the losers. The lolz are extra enjoyable that manner.
Large Perception #1
Large Perception #1 got here whereas eager about different languages comparable to Forth. The tokenizer in Forth is almost trivial. Each token is solely space-delimited. Each token is simply referred to as a WORD
and nothing is particular (slight lie). Hmm, how a few C that does that? So dreamed up a C that’s technically nonetheless a C, might be turing-complete, and will certainly make each code maintainer terrified. ????
I’ll name it the Barely C Programming Language:
int achieved , a , b , c , p , cond ;
int(important)(){whereas(!achieved){
a = b - c ;
*(int*) p = b - c ;
a = *(int*) p ;
if(cond) a = b - 45 ;
}}
Right here we’ve got spacing strategically positioned to create “mega-tokens”
For instance: int(important)(){whereas(!achieved){
is one such “mega-token”.
In a way, we even have a language extra like:
VAR_BEGIN achieved AND a AND b AND c AND p AND cond END
MAIN_BEGIN
a = b - c END
DEREF p = b - c END
a = DEREF p END
COND a = b - 45 END
MAIN_END
However, a standard C compiler may even acknowledge it as C!
Even after utilizing space-delimiters, we nonetheless have numerous tokens and want to search out extra methods to reduce the tokenizer. What is crucial? Properly it’s fairly exhausting to keep away from the atoi()
if we need to even have integer literals. What else do we’d like? How about nothing.
Large Perception #2
Large Perception #2 is that atoi()
behaves as a (unhealthy) hash perform on odd textual content. It consumes characters and updates a 16-bit integer. Hashes are maybe the holy-grail of computer-science.
With a great hash, we are able to simply side-step all of the exhausting issues by buying and selling them for a fair more durable drawback (hash collisions), after which we simply ignore that more durable drawback. Good. (sticks fingers in ears) ????
So we’ve got this:
Token Kind | Which means of atoi() |
---|---|
Integer Literal | uint16 quantity |
Key phrase | token ”enum” worth |
Identifier | hash worth right into a 64K array |
Implementing Barely C
The primary implementation of Barely C slot in 468 bytes. It was a easy recursive-descent parser over the atoi
tokens. There was no image desk of any sort. Variables merely entry a 64K phase utilizing the hash worth. Codegen is emitted considerably much like OTCC, utilizing ax
because the consequence register and shuffling values to the stack after which to cx
for binary operators.
Minimizing with Byte-Threaded Code
In an try and steal each good concept Forth ever had, I then dreamed up what I’ll name “byte-threaded-code”. Since a sector is 512 bytes, if we merely align handle on a 2-byte boundary, we are able to do addressing with a single byte! We will have a sequence of “devices” and do forth-style threading:
bits 16
cpu 386
jmp 0x07c0:entry
entry:
push cs
pop ds
lea si,operations
subsequent:
xor ax,ax
lodsb
add ax,ax
push subsequent
jmp ax
putch:
mov ah,0x01
mov al,bl
mov dx,0
int 0x14
ret
align 2
cling:
jmp cling
align 2
print_F:
mov bx,'F'
jmp putch
align 2
print_G:
mov bx,'G'
jmp putch
operations:
db 17 ; print_F
db 20 ; print_G
db 17 ; print_F
db 17 ; print_F
db 20 ; print_G
db 17 ; print_F
db 17 ; print_F
db 17 ; print_F
db 20 ; print_G
db 16 ; cling
Annoyingly, nasm
received’t allow you to do one thing like db print_F/2
so I needed to write a customized little assembler to do it.
Alas, this concept didn’t work out. In 512 bytes, the overhead of this Forth-style computation mannequin doesn’t pay for itself. There are numerous little overheads: 2 byte alignment, further ret
directions, calling different “threads”, the subsequent
perform, and so on. The byte-threaded model of Barely C ended up on the identical dimension because the straight-forward model
Nevertheless, the concept is enjoyable and I made a decision to doc it anyhow within the occasion that another person finds utility.
Minimizing the Straight-Ahead model
As an alternative, I returned to the straight-forward model and minimized it as a lot as doable. From 468 bytes ⇒ 303 bytes (165 bytes saving): 510 – 303 ⇒ 207 spare bytes to make use of for brand spanking new options!
Some tips:
- Reorganize code to permit “fall-through” as a substitute of
jmp
orname
- Use tail-calls by way of
jmp
wherever doable - Carry out call-fusion (e.g.
name tok_next2
as a substitute ofname tok_next; name tok_next
) - Make the most of
stosw
andlodsw
extensively - Eradicate machine code tables for cheaper inline
stosw
variations - Choose
cmp ax,imm
overcmp bx,imm
- Hold leap offsets inside 8-bits to encode extra effectively
Look Ma, A Actual C!
Because it seems, quite a bit might be achieved in 200 bytes if you have already got a tokenizer, parser, and code-generator in 300 bytes. With these 200 bytes, Barely C grew to become a correct C:
- Arbitrarily nested
if
assertion block with an arbitrary expression situation - Arbitrarily nested
whereas
assertion block with an arbitrary expression situation - Numerous operators:
+
,-
,*
,&
,|
,^
,<<
,>>
,==
,!=
,<
,>
,<=
,>=
- Grouping expressions:
( expression )
- Operate definitions and recursive perform calls (utilizing
func()
as a hash worth into a logo desk at phase 0x3000) - A particular
asm
assertion for inline machine-code - Single-line
//
feedback - Multi-line
/*
feedback - A trick to do “space-injection” earlier than semicolons to make code look extra regular
The largest enabler right here is the binary_oper_tbl
which permits for a really low cost manner so as to add a number of operations. Every operator is solely a <16-bit token-value> <16-bit-machine-code>
pair, costing simply 4 bytes.The above 14 operators price simply 56 bytes plus a little bit overhead to scan the desk.
Grammar
This is the complete grammer specification:
program = (var_decl | func_decl)+
var_decl = "int" identifier ";"
func_decl = "void" func_name "{" assertion* "}"
func_name = <identifier that ends in "()" with no area>
assertion = "if(" expr "){" assertion* "}"
| "whereas(" expr "){" assertion* "}"
| "asm" integer ";"
| func_name ";"
| assign_expr ";"
assign_expr = deref? identifier "=" expr
deref = "*(int*)"
expr = unary (op unary)?
unary = deref identifier
| "&" identifier
| "(" expr ")"
| indentifier
| integer
op = "+" | "-" | "&" | "|" | "^" | "<<" | ">>"
| "==" | "!=" | "<" | ">" | "<=" | ">="
As well as, each // remark
and /* multi-line remark */
types are supported.
(NOTE: This grammar is 704 bytes in ascii, 38% bigger than it is implementation!)
Inline Machine-Code
A programming language with out I/O is ineffective. And, because the C language is outlined in an I/O agnostic manner, we’d like a way out. Thus, an asm
extension is supported. This enables packages to generate uncooked x86-16 machine code literals inline. Utilizing asm
, packages cans entry any low-level element of the machine. That is used extensively within the instance code.
Error-Dealing with
What’s “error-handling”? ????
In conventional C model, we belief the programmer to jot down appropriate and well-formed packages. We’re sure they’re all minor gods and goddesses with the flexibility of perfection. Clearly, spending bytes on error-checking could be silly. Absolutely all will agree that this can be a very affordable commonplace.
For the much less divine amongst us, a lint
was additionally written (that doesn’t slot in a sector) to detect errors. The creator actually didn’t require this software for improvement.
Runtime
If C compiler writers had been a secret shadow group just like the Free Masons, Illuminati, Lizard Peoples, or Pizzagaters our inner-secret could be “C truly has a runtime”.
SectorC has a runtime beneath rt/
consisting of two recordsdata applied in C itself:
rt/lib.c:
A set of library routines, typically coded in inlineasm
rt/_start.c:
The precise entry-point_start()
The runtime code is concatenated with program supply to assemble the complete supply to compile and run.
Examples
A couple of examples are offered that leverage the distinctive {hardware} features of the x86-16 IBM PC:
examples/howdy.c:
Print a textual content greeting on the display writing to reminiscence at0xB8000
examples/sinwave.c:
Draw a shifting sine wave animation with VGA Mode 0x13 utilizing an appropriately unhealthy approximation ofsin(x)
examples/twinkle.c:
Play “Twinkle Twinkle Little Star” by the PC Speaker (Warning: LOUD)
Conclusion
It appears becoming to finish an article with “takeaways” or “what did we study”. So.. umm.. what did we study? Truthfully, I’m undecided. However within the curiosity of enjoyable, right here’s a Selection Your Personal Journey model of What Did We Study:
What Did We Study | Your Chosen Journey |
---|---|
Issues that appear inconceivable typically aren’t and we should always Simply Do It anyway | Transfer to the South Pole with completely no gear on a Homesteading Mission |
Software program is just too bloated nowadays, we solely want a number of KBs | Go examine your self into the know-how hippie-commune of suckless |
Error checking is overrated | Take Elon up on his pitch to be a mars astronaut as a result of Earth actually doesn’t want extra software program that ignores errors. |
Something X can do, C can do higher | One thing like this? link. Monzy, we’d like a brand new rap! (name me) |
That was all gibberish nonsense and thanks for losing my life (passive-aggression) | Really feel remorse that you simply wasted the time as a result of there’s a lot higher content material on this planet you’d slightly eat and determine to get a therapist to work by your points with studying nonsense web gibberish |
This xorvoid individual/robotic/AI is ridiculous/absurd/dumb and does arguably pointless issues for enjoyable | Comply with, like, subscribe, ring the bell.. ???? |