Now Reading
Sensible parsing with Flex and Bison

Sensible parsing with Flex and Bison

2023-08-27 00:25:34

Though parsing is usually described from the attitude of writing a compiler,
there are various frequent smaller duties the place it’s helpful. Studying file codecs,
speaking over the community, creating shells, and analyzing supply code are all
simpler utilizing a sturdy parser.

By taking time to be taught general-purpose parsing instruments, you possibly can transcend
fragile do-it-yourself options, and rigid third-party libraries. We’ll cowl
Lex and
Yacc in
this information as a result of they’re mature and transportable. We’ll additionally cowl their later
incarnations as Flex and Bison.


Above all, this information is sensible. We’ll see find out how to correctly combine parser
turbines into your construct system, find out how to create thread-safe parsing modules,
and find out how to parse actual information codecs. I’ll inspire every characteristic of the parser
generator with a concrete drawback it may possibly clear up. And, I promise, not one of the
typical calculator examples.

Desk of contents

Lexical scanning

Individuals often use two levels to course of structured textual content. The primary stage,
lexing (aka scanning), breaks the enter into significant chunks of characters.
The second, parsing, teams the scanned chunks following doubtlessly recursive
guidelines. Nonetheless, a pleasant lexing instrument like Lex may be helpful by itself, even
when not paired with a parser.

The best solution to describe Lex is that it runs user-supplied C code blocks
for normal expression matches. It reads an inventory of regexes and constructs a
big state machine which makes an attempt to match all of them “concurrently.”

A lex enter file consists of three attainable sections: definitions, guidelines,
and helper capabilities. The sections are delimited by %%. Lex transforms its
enter file right into a plain C file that may be constructed utilizing an odd C compiler.

Right here’s an instance. We’ll match the strings cot, cat, and cats. Our
actions will print a substitute for every.

suffix
rules

that deal with .l recordsdata.)

This system outputs easy substitutions:

predating
C
),
and utilized in so many tasks, you will discover grammars already written for many
languages. For example, we’ll take quut’s C
specification
for lex, and modify
it to do syntax highlighting.

This comparatively brief program precisely handles the total complexity of the
language. It’s best to know by studying in full. See the inline
feedback for brand new and delicate particulars.

There are some comments in the example below about portability between yacc
variants. The three most prominent variants, in order of increasing features,
are: the POSIX
interface

matching roughly the AT&T yacc functionally,
byacc (Berkeley Yacc), and
GNU Bison.

S-expressions, restricted to string and
integer atoms. There’s extra happening on this one, corresponding to reminiscence administration,
completely different semantic varieties per token, and packaging the lexer and parser collectively
right into a single thread-safe library. This instance requires Bison.

Augmented Backus-Naur
Form
(ABNF)
grammars, which we will convert into sturdy yacc parsers.

Let’s look at RFC4181, which
describes the comma-separated worth (CSV) format. It’s fairly easy, however has
problematic edge instances: commas in quoted values, quoted quotes, uncooked newlines in
quoted values, and blank-as-a-value.

Right here’s the total grammar from the RFC. Discover how alternate options are specified
with “/” moderately than “|”, and the way ABNF has the constructions
*(zero-or-more-things) and [optional-thing]:

file = [header CRLF] file *(CRLF file) [CRLF]

header = title *(COMMA title)

file = subject *(COMMA subject)

title = subject

subject = (escaped / non-escaped)

escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE

non-escaped = *TEXTDATA

COMMA = %x2C

CR = %x0D

DQUOTE =  %x22

LF = %x0A

CRLF = CR LF

TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E

The grammar makes no distinction between lexing and parsing, though the
uppercase identifiers trace at lexer tokens. Whereas it could be tempting to
translate to yacc top-down, beginning on the file degree, I’ve discovered probably the most
productive means is to start out with lexing.

We are able to mix many of the grammar into two lex guidelines to match fields:

  1. section
    5.6

    offers an awesome instance.

Now that we’ve seen the construction of the grammar, let’s fill within the skeleton to
course of the CSV content material. To any extent further, examples on this article will use my
libderp library for primary information
constructions like maps and vectors.

libderp model 0.1.0, which you
can see find out how to do within the mission readme.

Subsequent, confirm with check instances:

IRCv3 extends the Web Relay Chat (IRC) protocol
with helpful options. Its core syntactical change to help new options is
message tagging. We’ll write
a parser to extract info from RFC
1459
messages,
together with IRCv3 tags.

The BNF from this commonplace is written in a barely completely different dialect than that
of the CSV RFC.

<message>       ::= ['@' <tags> <SPACE>] [':' <prefix> <SPACE> ] <command> [params] <crlf>

<tags>          ::= <tag> [';' <tag>]*
<tag>           ::= <key> ['=' <escaped_value>]
<key>           ::= [ <client_prefix> ] [ <vendor> '/' ] <key_name>
<client_prefix> ::= '+'
<key_name>      ::= <non-empty sequence of ascii letters, digits, hyphens ('-')>
<escaped_value> ::= <sequence of zero or extra utf8 characters besides NUL, CR, LF, semicolon (`;`) and SPACE>
<vendor>        ::= <host>
<host>          ::= see RFC 952 [DNS:4] for particulars on allowed hostnames

<prefix>        ::= <servername> | <nick> [ '!' <user> ] [ '@' <host> ]
<nick>          ::= <letter>  <particular> 
<command>       ::= <letter> { <letter> } | <quantity> <quantity> <quantity>
<SPACE>         ::= ' ' { ' ' }
<params>        ::= <SPACE> [ ':' <trailing> | <middle> <params> ]
<center>        ::= <Any *non-empty* sequence of octets not together with SPACE
                    or NUL or CR or LF, the primary of which will not be ':'>
<trailing>      ::= <Any, presumably *empty*, sequence of octets not together with
                      NUL or CR or LF>

<person>          ::= <nonwhite> { <nonwhite> }
<letter>        ::= 'a' ... 'z' | 'A' ... 'Z'
<quantity>        ::= '0' ... '9'
<crlf>          ::= CR LF

As earlier than, it’s useful to start out from the underside up, making use of the facility of lex
regexes. Nonetheless, we run into the issue that many of the tokens match virtually
something. The identical string may conceivably be a bunch, nick, person, key_name,
and command unexpectedly. Lex would match the string with whichever rule comes
first within the grammar.

Yacc can’t simply go lex any clues about what tokens it expects, given what
tokens have come earlier than. Lex is by itself. For that reason, the designers of
lex gave it a solution to maintain a reminiscence. Guidelines may be tagged with a begin
situation
, saying they’re eligible solely in sure states. Rule actions can
then enter new states previous to returning.

  • Lex
    and Yacc.
    (To view POSIX docs regionally, attempt
    begriffs/posix-man.)
  • Lex & Yacc, 2nd ed by John R. Levine, Tony Mason, Doug Brown. Levine
    subsequently wrote an up to date e book known as flex & bison: Textual content Processing
    Instruments
    . Nonetheless I bought the older model to get a greater really feel for historical past and
    portability.
  • To bridge the hole between core information and the newest options, seek the advice of the
    GNU Bison manual and the
    Flex handbook. (You’ll be able to construct the Flex handbook from
    source
    , or obtain version
    2.6.4
    that I’ve pre-built for you as PDF.)
  • Efficient Flex & Bison by Chris L. verBurg is a group of suggestions for
    “correctness, effectivity, robustness, complexity, maintainability and
    usability.” It’s clear Chris has loads of expertise writing real-world
    parsers.
  • Vim has basic yacc highlighting inbuilt, however you possibly can add help for Bison
    extensions with
    justinmk/vim-syntax-extra.

Source Link

What's Your Reaction?
Excited
0
Happy
0
In Love
0
Not Sure
0
Silly
0
View Comments (0)

Leave a Reply

Your email address will not be published.

2022 Blinking Robots.
WordPress by Doejo

Scroll To Top