Packrat parser generator for C99
quxer generates dependency-free C99-compliant PEG packrat parsers from text
files in a domain-specific language similar to EBNF. Semantic actions may be
optionally embeded to construct an abstract syntax tree. The backend of the
parser writer was used to boostrap the parser, so the core of the program is
self-hosted.
Compile quxer with the standard cmake procedure:
mkdir build && cd build
CC=gcc cmake ..
make
The parsers quxer generates must also be compiled with internal.c/h
inshared
.
quxer is self-hosted, so a brief rundown of its inner workings are a good
example of its usage. Internally and prior to generation, grammars are
represented as a set of parser trees comprised of operators, terminals,
nonterminals, etc. A procedure traverses these trees with a bunch of templates
to write the resulting source file. Writing grammars symbolically in C is
tedious, so the challenge is to convert a text file into the symbolic
representation. PEGs themselves can be used to solve this problem! The following
grammar parses the grammar of PEGs and adds semantic actions to produce the
symbolic representation (technically, a wrapper around it):
~grammar = (ws r : rule)+ ws {{ result = ast_grammar(r, countr); }}
~ws = (" " / "\n" / "\t")*
~wsp = (" " / "\n" / "\t")+
~rule = "~" <name> ws "=" ws b : body {{
result = ast_rule(c(0), *b);
}}
~name = [A-Za-z0-9_]+
~body = o : alt {{ result = *o; }}
~alt = first : seq (ws "/" ws rest : seq)* {{
result = ast_Alt(first, rest, countrest + 1);
}}
~seq = first : operator (wsp rest : operator)* {{
result = ast_Seq(first, rest, countrest + 1);
}}
~operator = o : prefix (ws <code>)? {{
if (ccount) { result = ast_with_code(c(0), *o); } else { result = *o; }
}}
/ o : postfix (ws <code>)? {{
if (ccount) { result = ast_with_code(c(0), *o); } else { result = *o; }
}}
/ o : group (ws <code>)? {{
if (ccount) { result = ast_with_code(c(0), *o); } else { result = *o; }
}}
~prefix = and ws o : group {{ result = ast_And(*o); }}
/ not ws o : group {{ result = ast_Not(*o); }}
~and = "&"
~not = "!"
~postfix = o : group ws star {{ result = ast_Star(*o); }}
/ o : group ws plus {{ result = ast_Plus(*o); }}
/ o : group ws option {{ result = ast_Option(*o); }}
~star = "*"
~plus = "+"
~option = "?"
~group = "(" ws o : body ws ")" {{ result = *o; }}
/ "<" ws o : body ws ">" {{ result = ast_Capture(*o); }}
/ o : final {{ result = *o; }}
~final = o : nonterminal {{ result = *o; }}
/ o : literal {{ result = *o; }}
/ o : cclass {{ result = *o; }}
/ o : dot {{ result = *o; }}
~nonterminal = <name> ws ":" ws <name> {{
result = ast_AliasedNontm(c(1), c(0));
}}
/ <name> {{
result = ast_Nontm(c(0));
}}
~literal = "\"" <(&"\\" . . / !"\"" .)*> "\"" {{
result = ast_Literal(c(0));
}}
~cclass = "[" <(&"\\" . . / !"]" .)*> "]" {{
result = ast_CClass(c(0));
}}
~dot = "." {{ result = ast_dot(); }}
~code = "{{" (&"\\" . . / !"}}" .)* "}}"
This grammar was first written symbolically (in src/peg/pegspec.c
) but the
current parser.c
was generated directly from this text.
There are many resources for PEGs online to explain the basic stuff. Otherwise:
{{ ... }}
.~
but are referenced with their base namealias :
nonterminal
. A single nonterminal node may parse multiple times (if its*
or +
operator), so alias
is an array of the AST type. Itscountalias
. Results are stored by assigning result
. < ... >
. Captures are anonymous and doneuint8_t*
) is given by c(n)
, itss(n)
, and its end index e(n)
. The number of captures is given byccount
.parse_file(char*
filename)
. This links against internal.h
which will start parsing at the