Practical parsing with Flex and Bison
Though parsing is usually described from the perspective of writing a compiler,
there are many common smaller tasks where it's useful. Reading file formats,
communicating over the network, creating shells, and analyzing source code are all
easier with a robust parser.
By taking time to learn general-purpose parsing tools, you can go beyond
fragile homemade solutions, and inflexible third-party libraries. We'll cover
Lex and Yacc in this guide because they're mature and portable. We'll also
cover their later incarnations, Flex and Bison.
Above all, this guide is practical. We'll see how to properly integrate parser
generators into your build system, how to create thread-safe parsing modules,
and how to parse real data formats. I'll motivate each feature of the parser
generator with a concrete problem it can solve. And, I promise, none of the
typical calculator examples.
Lexical scanning
People typically use two stages to process structured text. The first stage,
lexing (aka scanning), breaks the input into meaningful chunks of characters.
The second, parsing, groups the scanned chunks following potentially recursive
rules. However, a good lexing tool like Lex can be useful on its own, even
when not paired with a parser.
The simplest way to describe Lex is that it runs user-supplied C code blocks
for regular expression matches. It reads a list of regexes and constructs a
big state machine which attempts to match them all "simultaneously."
A lex input file consists of three possible sections: definitions, rules,
and helper functions. The sections are delimited by %%. Lex transforms its
input file into a plain C file that can be built with an ordinary C compiler.
Here's an example. We'll match the strings cot, cat, and cats. Our
actions will print a substitution for each.
/* catcot.l */
%{
#include <stdio.h>
%}
%%
cot { printf("portable bed"); }
cat { printf("thankless pet"); }
cats { printf("anti-herd"); }
To build it:
# turn the input into an intermediate C file
lex -t catcot.l > catcot.c
# compile it
cc -o catcot catcot.c -ll
(Alternately, build it in one step with make catcot. Even in the absence of a
Makefile, POSIX make has suffix rules that handle .l files.)
The program outputs simple substitutions:
echo "the cat on the cot joined the cats" | ./catcot
the thankless pet on the portable bed joined the anti-herd
The reason it prints non-matching words (such as "the") is that there's an
implicit rule matching any character (.) and echoing it. In most real parsers
we'll want to override that.
Here’s what’s happening inside the scanner. Lex reads the regexes and generates
a state machine to consume input. Below is a visualization of the states, with
transitions labeled by input character. The circles with a double outline
indicate states that trigger actions.
Note there's no notion of word boundaries in our lexer; it's operating on
characters alone, so patterns can match inside larger words. The result
sounds rather like an insult.
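Since these rules amount to longest-match substitution, we can mimic a run of the scanner with sed as a rough stand-in for ./catcot (putting cats before cat plays the role of the longest-match rule):

```shell
# rough stand-in for: echo "concatenate" | ./catcot
echo "concatenate" \
  | sed -e 's/cats/anti-herd/g' \
        -e 's/cat/thankless pet/g' \
        -e 's/cot/portable bed/g'
# prints: conthankless petenate
```

The "cat" buried inside "concatenate" is replaced mid-word, exactly because the scanner has no concept of word boundaries.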
An important subtlety is how Lex handles multiple eligible matches. It picks
the longest possible match available, and in the case of a tie, picks the
matching pattern defined earliest.
To illustrate, suppose we add a looser regex, c.t, first.
%%
c.t { printf("mumble mumble"); }
cot { printf("portable bed"); }
cat { printf("thankless pet"); }
cats { printf("anti-herd"); }
Lex detects that the rule masks cat and cot, and outputs a warning:
catcot.l:10: warning, rule cannot be matched
catcot.l:11: warning, rule cannot be matched
It still compiles though, and behaves like this:
echo "the cat on the cot joined the cats" | ./catcot
the mumble mumble on the mumble mumble joined the anti-herd
Notice that it still matched cats, because cats is longer than c.t.
Compare what happens if we move the loose regex to the end of our rules. It can
then pick up whatever strings get past the others.
%%
cot { printf("portable bed"); }
cat { printf("thankless pet"); }
cats { printf("anti-herd"); }
c.t { printf("mumble mumble"); }
It acts like this:
echo "cut the cot" | ./catcot
mumble mumble the portable bed
Now's a good time to take a detour and observe how our user-defined code acts
in the generated C file. Lex creates a function called yylex(), and inserts
the code blocks verbatim into a switch statement. When using lex with a parser,
the parser will call yylex() to retrieve tokens, named by integers. For now,
our user-defined code isn't returning tokens to a parser, but doing simple
print statements.
/* catcot.c (generated by lex) */
int yylex (void)
{
/* ... */
switch ( yy_act )
{
/* ... */
case 1:
YY_RULE_SETUP
#line 9 "catcot.l"
{ printf("portable bed"); }
YY_BREAK
case 2:
YY_RULE_SETUP
#line 10 "catcot.l"
{ printf("thankless pet"); }
YY_BREAK
case 3:
YY_RULE_SETUP
#line 11 "catcot.l"
{ printf("anti-herd"); }
YY_BREAK
/* ... */
}
/* ... */
}
As mentioned, a lex file consists of three sections:
DEFINITIONS
%%
RULES
%%
HELPER FUNCTIONS
The definitions section is where you can embed C code to include headers and
declare functions used in rules. The definitions section can also define
friendly names for regexes that can be reused in the rules.
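As a small sketch (with hypothetical names), the definitions section might name a couple of regexes, which rules can then expand by wrapping the name in curly brackets:

```lex
DIGIT [0-9]
ID    [a-zA-Z_][a-zA-Z_0-9]*
%%
{DIGIT}+ { printf("number"); }
{ID}     { printf("identifier"); }
```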
The rules section, as we saw, contains a list of regexes and associated user
code.
The final section is where to put the full definitions of helper functions.
This is also where you'd put the main() function. If you omit main(), the
Lex library provides one that simply calls yylex(). This default main()
implementation (and implementations for a few other functions) is available by
linking your lex-generated C code with the -ll compiler flag.
Let’s see a short, fun example: converting Roman numerals to decimal. Thanks to
lex’s behavior of matching longer strings first, it can read the single-letter
numerals, but look ahead for longer subtractive forms like “IV” or “XC.”
/* roman-lex.l */
/* the %{ ... %} enclose C blocks that are copied
into the generated code */
%{
#include <stdio.h>
#include <stdlib.h>
/* globals are visible to user actions and main() */
int total;
%}
%%
 /*<- notice the whitespace before this comment, which
     is necessary for comments in the rules section */
 /* the basics */
I { total += 1; }
V { total += 5; }
X { total += 10; }
L { total += 50; }
C { total += 100; }
D { total += 500; }
M { total += 1000; }
 /* special cases match with preference
    because they are longer strings */
IV { total += 4; }
IX { total += 9; }
XL { total += 40; }
XC { total += 90; }
CD { total += 400; }
CM { total += 900; }
 /* ignore final newline */
\n ;
 /* but die on anything else */
. {
    fprintf(stderr, "unexpected: %s\n", yytext);
    exit(EXIT_FAILURE);
}
%%
/* provide our own main() rather than the implementation
from lex's library linked with -ll */
int main(void)
{
/* only have to call yylex() once, since our
actions don't return */
yylex();
fprintf(yyout, "%d\n", total);
return EXIT_SUCCESS;
}
More realistic scanner
Now that we’ve seen Lex’s basic operation in the previous section, let’s
consider a useful example: syntax highlighting. Detecting keywords in syntax is
a problem that lex can handle by itself, without help from yacc.
Because lex and yacc are so old (predating C), and used in so many projects,
you can find grammars already written for many languages. For example, we'll
take quut's C specification for lex, and modify it to do syntax highlighting.
This relatively short program accurately handles the full complexity of the
language. It's understandable by reading it in full. See the inline
comments for new and subtle details.
/* c.l syntax highlighter */
%{
/* POSIX for isatty, fileno */
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
/* declarations are visible to user actions */
enum FG
{
fgRED = 31, fgGREEN = 32,
fgORANGE = 33, fgCYAN = 36,
fgDARKGREY = 90, fgYELLOW = 93
};
void set_color(enum FG);
void reset_color(void);
void color_print(enum FG, const char *);
void consume_comment(void);
%}
/* named regexes we can use in rules */
O [0-7]
D [0-9]
NZ [1-9]
L [a-zA-Z_]
A [a-zA-Z_0-9]
H [a-fA-F0-9]
HP (0[xX])
E ([Ee][+-]?{D}+)
P ([Pp][+-]?{D}+)
FS (f|F|l|L)
IS (((u|U)(l|L|ll|LL)?)|((l|L|ll|LL)(u|U)?))
CP (u|U|L)
SP (u8|u|U|L)
ES (\\(['"\?\\abfnrtv]|[0-7]{1,3}|x[a-fA-F0-9]+))
WS [ \t\v\n\f]
%%
/* attempting to match and capture an entire multi-line
comment could strain lex's buffers, so we match the
beginning, and call consume_comment() to deal with
the ensuing characters, in our own less resource-
intensive way */
"/*" {
set_color(fgDARKGREY);
/* For greater flexibility, we'll output to lex's stream, yyout.
It defaults to stdout. */
fputs(yytext, yyout);
consume_comment();
reset_color();
}
/* single-line comments can be handled the default way.
The yytext variable is provided by lex and points
to the characters that match the regex */
"//".* {
color_print(fgDARKGREY, yytext);
}
^[ \t]*#.* {
color_print(fgRED, yytext);
}
/* you can use the same code block for multiple regexes */
auto |
bool |
char |
const |
double |
enum |
extern |
float |
inline |
int |
long |
register |
restrict |
short |
size_t |
signed |
static |
struct |
typedef |
union |
unsigned |
void |
volatile |
_Bool |
_Complex {
color_print(fgGREEN, yytext);
}
break |
case |
continue |
default |
do |
else |
for |
goto |
if |
return |
sizeof |
switch |
while {
color_print(fgYELLOW, yytext);
}
/* we use the named regexes heavily below; putting
them in curly brackets expands them */
{L}{A}* {
/* without this rule, keywords within larger words
would be highlighted, like the "if" in "life" --
this rule prevents that because it's a longer match */
fputs(yytext, yyout);
}
{HP}{H}+{IS}? |
{NZ}{D}*{IS}? |
"0"{O}*{IS}? |
{CP}?"'"([^'\\\n]|{ES})+"'" |
{D}+{E}{FS}? |
{D}*"."{D}+{E}?{FS}? |
{D}+"."{E}?{FS}? |
{HP}{H}+{P}{FS}? |
{HP}{H}*"."{H}+{P}{FS}? |
{HP}{H}+"."{P}{FS}? {
color_print(fgCYAN, yytext);
}
({SP}?\"([^"\\\n]|{ES})*\"{WS}*)+ {
color_print(fgORANGE, yytext);
}
/* explicitly mention the default rule */
. ECHO;
%%
/* definitions of the functions we declared earlier */
/* the color functions use ANSI escape codes, and may
not be portable across all terminal emulators. */
void set_color(enum FG c)
{
fprintf(yyout, "