May 24, 2024 • Arnout Roemers
This post introduces the Crustimoney library - a Clojure idiomatic “packrat” PEG parser. The initial version of Crustimoney was from 11 years ago and was incomplete. The recent rewrite of it was mostly a mental exercise, to get it feature complete and a bit beyond that. I like to think it turned out well, and perhaps you like it too. And if not, there’s always Instaparse 😊.
So, the features are as follows:
There’s a lot to discover, but to keep it short and simple, let’s implement a little calculator.
We start with the grammar itself. We can choose between combinators, clojure data or text as our format.
This time we choose the latter, as it is probably most familiar.
We will bind the following to grammar
:
Some remarks are in order.
Firstly, the (
..)
groups can (and do, in our grammar) start with a capture name.
Crustimoney expressly does not capture anything by default.
So if something needs to be in the parse result, it needs to be captured.
Secondly, the number
rule is postfixed with an equals =
sign.
This is a shortcut for the following:
Lastly, the grammar contains a >
three times.
These are soft cuts, a novel concept in Crustimoney, and PEG parsers in general.
In short, it means that the parser won’t backtrack after passing the soft cut, for the duration of the chain (i.e. a consecutive list of elements) it is in.
Take the sum
rule as an example.
If a sum-op
has been parsed, a sum
must follow, or fail otherwise.
It won’t backtrack to try the second choice, the product
.
Having soft cuts improves error messages to be more local.
For example, parsing "1+"
with the soft cut will properly fail with a message about a missing number or missing parenthesis.
Without the soft cut, it would fail with a rather generic message about an unexpected ‘+’ sign.
This is also the time to talk about hard cuts, which would be denoted by a >>
.
A hard cut won’t have the parser backtrack beyond it at all, ever.
It has the same advantage as soft cuts; better error messages.
But it also allows the packrat cache to ditch everything before the hard cut.
Normally, a major downside of packrat PEG parsers is that they are memory hungry. But if your grammar has repeating structures, you can use hard cuts. A well placed hard cut can alleviate the memory requirements to be constant!
Now that we have our grammar, we can create a parser from it and use it.
A successful parse result is in “hiccup”-style.
Only captured nodes are in the result.
Only the root node can be “nameless” nil
.
Notice that the result nodes only have the :start
(inclusive) and :end
(exclusive) indices of matches.
This is to keep the result tree and string processing to a minimum, also during the parsing.
It is up to the user - not the parser - to decide which texts are important.
If you do want all the texts, for debugging purposes or quick parses, there’s the quick/parse
function (or quick/transform
if you already have a result tree).
It takes a grammar directly and transforms the result to contain the matched texts:
Let’s have a quick look at errors too, which is a set of error records:
Crustimoney has a results
namespace to deal with the parse results.
It contains functions like success->text
and errors->line-column
.
And while it is easy to deal with hiccup-style data yourself, the namespace also comes with a “batteries included” transform
function.
It performs a postwalk, transforming the nodes by applying a function to it.
Let’s see it in action, and finish our calculator:
Each transformation function takes the success node and the full text.
The coerce
and collect
macros help with creating such functions.
The coerce
takes a function, which is applied on the node’s matched text.
The collect
takes a function, which is applied on the children of the node.
Both can also take a binding vector, like the :operation
transformation above.
And, does it work?
Yup! And there’s more to discover about Crustimoney - like built-in parsers, or lexically-scoped nested recursive grammars, or EDN support, or streaming, or why it’s called that way - but all that can be found on github.
As always, have fun!