Now Reading
xorvoid

xorvoid

2023-05-24 18:02:56

SectorC (github) is a C compiler written in x86-16 meeting that matches throughout the 512 byte boot sector of an x86 machine. It helps a
subset of C that’s massive sufficient to jot down actual and fascinating packages. It’s fairly doubtless the smallest C compiler ever written.

In a base64 encoding, it seems to be like this:

6gUAwAdoADAfaAAgBzH/6DABPfQYdQXoJQHr8+gjAVOJP+gSALDDqluB+9lQdeAG/zdoAEAfy+gI
AegFAYnYg/hNdFuE9nQNsOiqiwcp+IPoAqvr4j3/FXUG6OUAquvXPVgYdQXoJgDrGj0C2nUGV+gb
AOsF6CgA68Ow6apYKfiD6AKrifgp8CaJRP7rrOg4ALiFwKu4D4Srq1fonP9ewz2N/HUV6JoA6BkA
ieu4iQRQuIs26IAAWKvD6AcAieu4iQbrc4nd6HkA6HYA6DgAHg4fvq8Bra052HQGhcB19h/DrVCw
UKroWQDoGwC4WZGrW4D/wHUMuDnIq7i4AKu4AA+ridirH8M9jfx1COgzALiLBOucg/j4dQXorf/r
JIP49nUI6BwAuI0G6wyE0nQFsLiq6wa4iwarAduJ2KvrA+gAAOhLADwgfvkx2zHJPDkPnsI8IH4S
weEIiMFr2wqD6DABw+gqAOvqicg9Ly90Dj0qL3QSPSkoD5TGidjD6BAAPAp1+eu86Ln/g/jDdfjr
slIx9osEMQQ8O3QUuAACMdLNFIDkgHX0PDt1BIkEMcBaw/v/A8H9/yvB+v/34fb/I8FMAAvBLgAz
wYQA0+CaANP4jwCUwHf/lcAMAJzADgCfwIUAnsCZAJ3AAAAAAAAAAAAAAAAAAAAAAAAAAAAAVao=

Supported language

A reasonably large subset is supported: international variables, features, if statements, whereas statements, a number of operators, pointer dereference, inline machine-code, feedback, and so on.
All of those options make it fairly succesful.

For instance, the next program animates a shifting sine-wave:

int y;
int x;
int x_0;
void sin_positive_approx()
{
  y = ( x_0 * ( 157 - x_0 ) ) >> 7;
}
void sin()
{
  x_0 = x;
  whereas( x_0 > 314 ){
    x_0 = x_0 - 314;
  }
  if( x_0 <= 157 ){
    sin_positive_approx();
  }
  if( x_0 > 157 ){
    x_0 = x_0 - 157;
    sin_positive_approx();
    y = 0 - y;
  }
  y = 100 + y;
}


int offset;
int x_end;
void draw_sine_wave()
{
  x = offset;
  x_end = x + 314;
  whereas( x <= x_end ){
    sin();
    pixel_x = x - offset;
    pixel_y = y;
    vga_set_pixel();
    x = x + 1;
  }
}

int v_1;
int v_2;
void delay()
{
  v_1 = 0;
  whereas( v_1 < 50 ){
    v_2 = 0;
    whereas( v_2 < 10000 ){
      v_2 = v_2 + 1;
    }
    v_1 = v_1 + 1;
  }
}

void important()
{
  vga_init();

  offset = 0;
  whereas( 1 ){
    vga_clear();
    draw_sine_wave();

    delay();
    offset = offset + 1;
    if( offset >= 314 ){ // mod the worth to keep away from 2^16 integer overflow
      offset = offset - 314;
    }
  }
}

Screenshot

Moving Sinwave

After I began eager about SectorC, I had simply completed Deobfuscating OTCC with numerous its concepts freshly loaded into my head.
I additionally simply had some wholesome doses of justine.lol and Tom7 to encourage the absurdity of all of it.

Did I feel I’d succeed? I suspected NO. Match a complete C compiler in 510 bytes of instruction reminiscence? Good luck (sarcasm).

Tokenizing

The primary drawback got here rapidly. In C, the tokenizer/lexer alone appears bigger than one 512 byte sector! We have to eat an arbitrary stream of bytes and produce “tokens”.

For instance:

int important()
{
  if( a < 5 ){
    func();
  }
}

Could be consumed and transformed into:

'int'  TOKEN_KEYWORD_INT
'important' TOKEN_IDENTIFIER
'('    TOKEN_LPAREN
')'    TOKEN_RPAREN
'{'    TOKEN_LBRACE
'if'   TOKEN_KEYWORD_IF
'('    TOKEN_LPAREN
'a'    TOKEN_IDENTIFIER
'<'    TOKEN_OPERATOR
'5'    TOKEN_NUMBER
')'    TOKEN_RPAREN
'{'    TOKEN_LBRACE
'func' TOKEN_IDENTIFIER
'('    TOKEN_LPAREN
')'    TOKEN_RPAREN
';'    TOKEN_SEMI
'}'    TOKEN_RBRACE
'}'    TOKEN_RBRACE

We have to particularly acknowledge key phrases, identifiers, operators, and numbers. After which we have to convert numbers from string to integer with one thing like atoi():

int atoi(const char *s)
{
  int n = 0;
  whereas (1) {
    char c = *s++;
    if (!c) break;
    n = 10 * n + (c - '0');
  }
  return n;
}

I wrote a reasonably straight-forward and minimalist lexer and it took >150 strains of C code. A crude estimate of the identical code in x86-16 would require 300-450 bytes minimal (e.g. a easy add ax,bx instruction encodes as 2 bytes). And this doesn’t embody any image desk, recursive-descent parser, code-generator, branch-patching, and so on.

No Probability.

So, naturally … I continued. At all times decide the losers. The lolz are extra enjoyable that manner.

Large Perception #1

Large Perception #1 got here whereas eager about different languages comparable to Forth. The tokenizer in Forth is almost trivial. Each token is solely space-delimited. Each token is simply referred to as a WORD and nothing is particular (slight lie). Hmm, how a few C that does that? So dreamed up a C that’s technically nonetheless a C, might be turing-complete, and will certainly make each code maintainer terrified. ????

I’ll name it the Barely C Programming Language:

int achieved , a , b , c , p , cond ;
int(important)(){whereas(!achieved){
     a = b - c ;
     *(int*) p = b - c ;
     a = *(int*) p ;
     if(cond) a = b - 45 ;
}}

Right here we’ve got spacing strategically positioned to create “mega-tokens”

For instance: int(important)(){whereas(!achieved){ is one such “mega-token”.

In a way, we even have a language extra like:

VAR_BEGIN achieved AND a AND b AND c AND p AND cond END
MAIN_BEGIN
  a = b - c END
  DEREF p = b - c END
  a = DEREF p END
  COND a = b - 45 END
MAIN_END

However, a standard C compiler may even acknowledge it as C!

Even after utilizing space-delimiters, we nonetheless have numerous tokens and want to search out extra methods to reduce the tokenizer. What is crucial? Properly it’s fairly exhausting to keep away from the atoi() if we need to even have integer literals. What else do we’d like? How about nothing.

Large Perception #2

Large Perception #2 is that atoi() behaves as a (unhealthy) hash perform on odd textual content. It consumes characters and updates a 16-bit integer. Hashes are maybe the holy-grail of computer-science.
With a great hash, we are able to simply side-step all of the exhausting issues by buying and selling them for a fair more durable drawback (hash collisions), after which we simply ignore that more durable drawback. Good. (sticks fingers in ears) ????

So we’ve got this:

Token Kind Which means of atoi()
Integer Literal uint16 quantity
Key phrase token ”enum” worth
Identifier hash worth right into a 64K array

Implementing Barely C

The primary implementation of Barely C slot in 468 bytes. It was a easy recursive-descent parser over the atoi tokens. There was no image desk of any sort. Variables merely entry a 64K phase utilizing the hash worth. Codegen is emitted considerably much like OTCC, utilizing ax because the consequence register and shuffling values to the stack after which to cx for binary operators.

Minimizing with Byte-Threaded Code

In an try and steal each good concept Forth ever had, I then dreamed up what I’ll name “byte-threaded-code”. Since a sector is 512 bytes, if we merely align handle on a 2-byte boundary, we are able to do addressing with a single byte! We will have a sequence of “devices” and do forth-style threading:

bits 16
  cpu 386

  jmp 0x07c0:entry

entry:
  push cs
  pop  ds

  lea si,operations
subsequent:
  xor ax,ax
  lodsb
  add ax,ax
  push subsequent
  jmp ax

putch:
  mov ah,0x01
  mov al,bl
  mov dx,0
  int 0x14
  ret

  align 2
cling:
  jmp cling

  align 2
print_F:
  mov bx,'F'
  jmp putch

  align 2
print_G:
  mov bx,'G'
  jmp putch

operations:
  db 17  ; print_F
  db 20  ; print_G
  db 17  ; print_F
  db 17  ; print_F
  db 20  ; print_G
  db 17  ; print_F
  db 17  ; print_F
  db 17  ; print_F
  db 20  ; print_G
  db 16  ; cling

Annoyingly, nasm received’t allow you to do one thing like db print_F/2 so I needed to write a customized little assembler to do it.

Alas, this concept didn’t work out. In 512 bytes, the overhead of this Forth-style computation mannequin doesn’t pay for itself. There are numerous little overheads: 2 byte alignment, further ret directions, calling different “threads”, the subsequent perform, and so on. The byte-threaded model of Barely C ended up on the identical dimension because the straight-forward model

Nevertheless, the concept is enjoyable and I made a decision to doc it anyhow within the occasion that another person finds utility.

Minimizing the Straight-Ahead model

As an alternative, I returned to the straight-forward model and minimized it as a lot as doable. From 468 bytes ⇒ 303 bytes (165 bytes saving): 510 – 303 ⇒ 207 spare bytes to make use of for brand spanking new options!

Some tips:

  • Reorganize code to permit “fall-through” as a substitute of jmp or name
  • Use tail-calls by way of jmp wherever doable
  • Carry out call-fusion (e.g. name tok_next2 as a substitute of name tok_next; name tok_next)
  • Make the most of stosw and lodsw extensively
  • Eradicate machine code tables for cheaper inline stosw variations
  • Choose cmp ax,imm over cmp bx,imm
  • Hold leap offsets inside 8-bits to encode extra effectively

Look Ma, A Actual C!

Because it seems, quite a bit might be achieved in 200 bytes if you have already got a tokenizer, parser, and code-generator in 300 bytes. With these 200 bytes, Barely C grew to become a correct C:

  • Arbitrarily nested if assertion block with an arbitrary expression situation
  • Arbitrarily nested whereas assertion block with an arbitrary expression situation
  • Numerous operators: +, -, *, &, |, ^, <<, >>, ==, !=, <, >, <=, >=
  • Grouping expressions: ( expression )
  • Operate definitions and recursive perform calls (utilizing func() as a hash worth into a logo desk at phase 0x3000)
  • A particular asm assertion for inline machine-code
  • Single-line // feedback
  • Multi-line /* feedback
  • A trick to do “space-injection” earlier than semicolons to make code look extra regular

The largest enabler right here is the binary_oper_tbl which permits for a really low cost manner so as to add a number of operations. Every operator is solely a <16-bit token-value> <16-bit-machine-code> pair, costing simply 4 bytes.The above 14 operators price simply 56 bytes plus a little bit overhead to scan the desk.

Grammar

This is the complete grammer specification:

program     = (var_decl | func_decl)+
var_decl    = "int" identifier ";"
func_decl   = "void" func_name "{" assertion* "}"
func_name   = <identifier that ends in "()" with no area>
assertion   = "if(" expr "){" assertion* "}"
            | "whereas(" expr "){" assertion* "}"
            | "asm" integer ";"
            | func_name ";"
            | assign_expr ";"
assign_expr = deref? identifier "=" expr
deref       = "*(int*)"
expr        = unary (op unary)?
unary       = deref identifier
            | "&" identifier
            | "(" expr ")"
            | indentifier
            | integer
op          = "+" | "-" | "&" | "|" | "^" | "<<" | ">>"
            | "==" | "!=" | "<" | ">" | "<=" | ">="

As well as, each // remark and /* multi-line remark */ types are supported.

(NOTE: This grammar is 704 bytes in ascii, 38% bigger than it is implementation!)

Inline Machine-Code

A programming language with out I/O is ineffective. And, because the C language is outlined in an I/O agnostic manner, we’d like a way out. Thus, an asm extension is supported. This enables packages to generate uncooked x86-16 machine code literals inline. Utilizing asm, packages cans entry any low-level element of the machine. That is used extensively within the instance code.

Error-Dealing with

What’s “error-handling”? ????

In conventional C model, we belief the programmer to jot down appropriate and well-formed packages. We’re sure they’re all minor gods and goddesses with the flexibility of perfection. Clearly, spending bytes on error-checking could be silly. Absolutely all will agree that this can be a very affordable commonplace.

For the much less divine amongst us, a lint was additionally written (that doesn’t slot in a sector) to detect errors. The creator actually didn’t require this software for improvement.

Runtime

If C compiler writers had been a secret shadow group just like the Free Masons, Illuminati, Lizard Peoples, or Pizzagaters our inner-secret could be “C truly has a runtime”.

SectorC has a runtime beneath rt/ consisting of two recordsdata applied in C itself:

  • rt/lib.c: A set of library routines, typically coded in inline asm
  • rt/_start.c: The precise entry-point _start()

The runtime code is concatenated with program supply to assemble the complete supply to compile and run.

Examples

A couple of examples are offered that leverage the distinctive {hardware} features of the x86-16 IBM PC:

  • examples/howdy.c: Print a textual content greeting on the display writing to reminiscence at 0xB8000
  • examples/sinwave.c: Draw a shifting sine wave animation with VGA Mode 0x13 utilizing an appropriately unhealthy approximation of sin(x)
  • examples/twinkle.c: Play “Twinkle Twinkle Little Star” by the PC Speaker (Warning: LOUD)

Conclusion

It appears becoming to finish an article with “takeaways” or “what did we study”. So.. umm.. what did we study? Truthfully, I’m undecided. However within the curiosity of enjoyable, right here’s a Selection Your Personal Journey model of What Did We Study:

What Did We Study Your Chosen Journey
Issues that appear inconceivable typically aren’t and we should always Simply Do It anyway Transfer to the South Pole with completely no gear on a Homesteading Mission
Software program is just too bloated nowadays, we solely want a number of KBs Go examine your self into the know-how hippie-commune of suckless
Error checking is overrated Take Elon up on his pitch to be a mars astronaut as a result of Earth actually doesn’t want extra software program that ignores errors.
Something X can do, C can do higher One thing like this? link. Monzy, we’d like a brand new rap! (name me)
That was all gibberish nonsense and thanks for losing my life (passive-aggression) Really feel remorse that you simply wasted the time as a result of there’s a lot higher content material on this planet you’d slightly eat and determine to get a therapist to work by your points with studying nonsense web gibberish
This xorvoid individual/robotic/AI is ridiculous/absurd/dumb and does arguably pointless issues for enjoyable Comply with, like, subscribe, ring the bell.. ????

Source Link

What's Your Reaction?
Excited
0
Happy
0
In Love
0
Not Sure
0
Silly
0
View Comments (0)

Leave a Reply

Your email address will not be published.

2022 Blinking Robots.
WordPress by Doejo

Scroll To Top