AWK technical notes | Volodymyr Gubarkov
In the earlier article Fascination with AWK we discussed why AWK is great for prototyping and is often the best alternative to the shell and Python. In this article I want to show you some interesting technical details I learned about AWK.
Lack of GC
AWK was designed so that its implementation does not require a GC (garbage collector). By the way, just like sh/bash.
(I learned this remarkable fact from the oilshell blog – a rather interesting technical blog where the author describes his progress in creating the “better bash”.)
The most substantial consequence is that it is forbidden to return an array from a function; you can only return a scalar value.
function f() {
    a[1] = 2
    return a # error
}
However, you can pass an array to a function and fill it there:
BEGIN {
    fill(arr)
    print arr[0] " " arr[1]
}
function fill(arr,    i) { arr[i++] = "hello"; arr[i++] = "world" }
The thing is, in the absence of a GC all heap allocations must be deterministic. That is, an array declared locally in a function must be destroyed at the moment the function returns. That’s why it is disallowed to escape the function’s declaration scope (via return).
The absence of GC allows keeping the language implementation very simple, thus fast and portable, and with predictable memory consumption. To me, this qualifies AWK as a good embeddable language, although, for some reason, this niche is firmly occupied by (GC-equipped) Lua.
Local variables
All variables are global by default. However, if you add a variable to the function parameters (like i above) it becomes local. JavaScript works in a similar way, although there are the more appropriate var/let/const keywords. In practice, it’s customary to separate “real” function parameters from “local” parameters with extra spaces for readability.
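For illustration, here is a minimal sketch of the convention (the function name and values are made up): a, b, c are the “real” parameters, while tmp after the extra spaces is a local that exists only during the call.
function sum3(a, b, c,    tmp) {   # a, b, c: real parameters; tmp: a local
    tmp = a + b
    return tmp + c
}
BEGIN { print sum3(1, 2, 3); print tmp }   # prints 6, then an empty line: tmp stayed local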
Although Brian Kernighan (the K in AWK) regrets this design, in practice it works just fine.
The notation for function locals is appalling (all my fault too, which makes it worse).
So it turns out, the use of local variables is also a mechanism for automatic release of resources. A small example:
function NUMBER(    res) { return tryParse1("123456789", res) && tryParseDigits(res) }
The NUMBER function parses the number. res is a temporary array that will be automatically deallocated when the function exits.
Autovivification
An associative array is declared simply by the fact of using the corresponding variable arr as an array.
Likewise, a variable that is treated as a number (i++) will be implicitly declared as a numeric type, and so on.
To Perl connoisseurs, this feature may be known as autovivification. Essentially, AWK is quite unambiguously a prototype of Perl. You could even say that Perl is a kind of AWK overgrown on steroids… However, I digress.
This is done, obviously, in order to make it possible to write the most compact code in one-liners, for which many of us are used to using AWK.
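A small sketch of this in action (the names are mine): neither the array nor the counter below is ever declared, they spring into existence from use.
BEGIN {
    words["apple"]++              # words springs into existence as an associative array
    words["apple"]++
    total += 5                    # total is implicitly numeric, starting from 0
    print words["apple"], total   # prints: 2 5
}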
About AWK syntax/grammar
I want to tell you about a couple of findings I encountered while implementing the AWK parser for the intellij-awk project.
$ is a unary operator
If you’ve used AWK, most likely you’ve used $0, $1, $2, and so on. Some have even used $NF.
But did you know that $ is an operator that can be applied to an expression?
So it’s perfectly valid to write
{ second=2; print $second }
with the same result as
{ print $2 }
Also, it’s interesting to note that $ is the only operator that is allowed to appear on the left side of an assignment, that is, you can write
{ $length("xx")="hello" }
(same as { $2="hello" }, since length("xx") is 2)
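To see an assignment to a computed field in action from the command line (the input string is just for illustration):
echo "aa bb cc" | awk '{ $(1+1) = "hello"; print }'   # prints: aa hello cc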
Quiz. What will be the output of
echo "2 3 4 hello" | awk '{ print $$$$1 }'
and why? Try to answer without running it. Try adding even more $. Explain the behavior.
function call f() doesn’t allow space before ( …
… but only for user-defined functions:
awk 'BEGIN { fff () } function fff(){ }' # syntax error
awk 'BEGIN { fff() } function fff(){ }' # OK
You can have a space for built-in functions:
awk 'BEGIN { print substr ("abc",1,2) }' # OK, outputs ab
Why such a strange inconsistency? It’s because of AWK’s decision to use the empty operator for string concatenation:
BEGIN { a = "hello"; b = "world"; c = a b; print c } # helloworld
This means that AWK tries to parse fff (123) as a concatenation of the variable fff and (123).
Obviously fff () is just a syntax error, the same as fff (1,2).
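You can actually observe the concatenation interpretation for a plain variable (the value "abc" here is just for illustration):
awk 'BEGIN { fff = "abc"; x = fff (123); print x }'   # prints abc123: variable fff concatenated with (123)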
As for built-in functions, AWK knows in advance that the name is not a variable name, so it can disambiguate.
built-in functions are parsed as part of syntax
If you take a look at the AWK specification in POSIX, at the Grammar section (yes, the AWK grammar is a part of the POSIX standard!), you’ll find that AWK’s built-in functions are a part of it. To be precise, they are parsed at the lexer step, so they enter the parser step as ready-to-use tokens.
The implication here is that you are not allowed to name your own function or variable with the name of any built-in function. It will be a syntax error!
BEGIN { length = 1 } # syntax error
Compare to Python, where shadowing a built-in is allowed: len = 1 is perfectly legal (if ill-advised).
Why is this? For flexibility. Remember, AWK’s main goal was to be an extremely terse yet productive language, well suited for one-liners. So:
- it’s allowed to omit () for built-in functions when no arguments are passed, like in echo "hello" | awk '{ print length }' – same as echo "hello" | awk '{ print(length()) }'
- the same function can be used with a different number of arguments, like sub(/regex/, "replacement", target) and sub(/regex/, "replacement") – the omitted target is implied to be $0 (see the sketch after this list)
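A quick sketch of both forms of sub() (the input strings are just for illustration):
echo "hello world" | awk '{ sub(/world/, "AWK"); print }'                 # target omitted: $0 is modified, prints "hello AWK"
echo "hello world" | awk '{ s = $2; sub(/world/, "AWK", s); print s }'    # explicit target: prints "AWK"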
All these nuances require quite ad-hoc parsing for built-in functions. That’s why they are part of the grammar. If we take the getline keyword, it’s not even a function, but rather a very flexible syntax construct.
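To illustrate just how flexible getline is, here are a few of its forms (each one is its own piece of syntax, not an ordinary function call; file.txt and the date command are just examples):
{ getline }                     # read the next input record into $0
{ getline line }                # read the next input record into the variable line
{ getline line < "file.txt" }   # read a line from a file
{ "date" | getline today }      # read a line of a command's output into a variable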
ERE vs DIV lexing ambiguity
AWK’s ad-hoc syntax, optimized for succinct code, has some inherent ambiguities in its grammar.
The problem resides in the lexing ambiguity between the tokens ERE (extended regular expression, /regex/) and DIV (/). Naturally, a lexer prefers the longest matching token, so a slash that is meant as division can be swallowed as the start of a regular expression literal, and the code parses the wrong way instead of as the correct division.
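For example, in a line like the one below, the slashes are meant as division, but greedy lexing could read /b/ as a regex (my own illustration):
BEGIN { a = 4; b = 2; print a /b/ 2 }
# intended (division) reading:  (a / b) / 2  -> prints 1
# wrong (regex) reading:        a (/b/) 2    -> concatenation "4" ($0 ~ /b/) "2" -> "402"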
This kind of problem is well known, and usually the implementation requires the lexer hack:
The solution generally consists of feeding information from the semantic symbol table back into the lexer. That is, rather than functioning as a pure one-way pipeline from the lexer to the parser, there is a backchannel from semantic analysis back to the lexer. This mixing of parsing and semantic analysis is generally regarded as inelegant, which is why it is called a “hack”.
In the original AWK (also known as the One True Awk), recognizing regular expressions is the job of the parser, which explicitly sets the lexer into “regex mode” when it has figured out that it should expect to read a regex:
reg_expr:
      '/' {startreg();} REGEXPR '/'   { $$ = $3; }
    ;
(startreg() is a function defined in lex.c.) The reg_expr rule itself is only ever matched in contexts where a division operator would be invalid.
However, in intellij-awk I managed to disambiguate this at the lexer level, but this required creating a (somewhat complicated) lexer with multiple states (note the usage of the state DIV_POSSIBLE).
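The idea behind it can be sketched roughly in AWK itself (a toy illustration of my own, not the actual intellij-awk code): whether a slash means division depends only on what kind of token came right before it.
# After a token that can end an operand (number, name, ')', ']', a field),
# the lexer is in a DIV_POSSIBLE-like state and "/" means division;
# everywhere else "/" opens a regex literal.
function slash_kind(prev_token) {
    if (prev_token ~ /^(NUMBER|NAME|STRING|RPAREN|RBRACKET|FIELD)$/)
        return "DIV"
    return "ERE_START"
}
BEGIN { print slash_kind("NUMBER"), slash_kind("LPAREN") }   # DIV ERE_START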
You can check some other (Gawk-related) nuances I found in parser_quirks.md.
Overall, I’ve noticed that many old programming languages have very ad-hoc syntax, and therefore ad-hoc parsing.
I think this is partly because they wanted to make the programming language very flexible (PL/1, Ada, C++, AWK, Perl, shell).
Partly because some languages tried to be as close to human language as possible (SQL, and even COBOL – almost every language feature in them is a separate syntax construct).
Or maybe because parsing theory wasn’t that strong back then, so it was common to write ad-hoc parsers instead of using something like lex + yacc.
Nowadays, programming languages tend to have a much more regular syntax. The most prominent example in this regard is probably Go.
If you noticed a typo or have other feedback, please email me at xonixx@gmail.com