Functional Bytes Clojure, Scala en Java specialist

Crustimoney - a Clojure PEG parser

May 24, 2024 • Arnout Roemers

This post introduces the Crustimoney library - a Clojure idiomatic “packrat” PEG parser. The initial version of Crustimoney was from 11 years ago and was incomplete. The recent rewrite of it was mostly a mental exercise, to get it feature complete and a bit beyond that. I like to think it turned out well, and perhaps you like it too. And if not, there’s always Instaparse 😊.

So, the features are as follows:

  • Create parsers from combinators
  • .. or from string-based definitions
  • .. or from data-driven definitions
  • Packrat caching, optimizing cpu usage
  • Novel concept of cuts, optimizing memory usage - and error messages
  • Focus on capture groups, resulting in shallow parse trees
  • Minimal parse tree data, fetch only what’s needed on post-processing
  • Virtual stack, preventing overflows
  • Infinite loop detection (runtime)
  • Missing rule references detection (compile time)
  • Streaming support (experimental)

There’s a lot to discover, but to keep it short and simple, let’s implement a little calculator.

The grammar

We start with the grammar itself. We can choose between combinators, clojure data or text as our format. This time we choose the latter, as it is probably most familiar. We will bind the following to grammar:

root       <- sum $
sum        <- (:operation product sum-op > sum) / product
product    <- (:operation value product-op > product) / value
value      <- number / '(' > sum? ')'
sum-op     <- (:operand [+-])
product-op <- (:operand [*/])
number=    <- [0-9]+

Some remarks are in order. Firstly, the (..) groups can (and do, in our grammar) start with a capture name. Crustimoney expressly does not capture anything by default. So if something needs to be in the parse result, it needs to be captured.

Secondly, the number rule is postfixed with an equals = sign. This is a shortcut for the following:

number <- (:number [0-9]+)

Lastly, the grammar contains a > three times. These are soft cuts, a novel concept in Crustimoney, and PEG parsers in general. In short, it means that the parser won’t backtrack after passing the soft cut, for the duration of the chain (i.e. a consecutive list of elements) it is in.

Take the sum rule as an example. If a sum-op has been parsed, a sum must follow, or fail otherwise. It won’t backtrack to try the second choice, the product. Having soft cuts improves error messages to be more local.

For example, parsing "1+" with the soft cut will properly fail with a message about a missing number or missing parenthesis. Without the soft cut, it would fail with a rather generic message about an unexpected ‘+’ sign.

This is also the time to talk about hard cuts, which would be denoted by a >>. A hard cut won’t have the parser backtrack beyond it at all, ever. It has the same advantage as soft cuts; better error messages. But it also allows the packrat cache to ditch everything before the hard cut.

Normally, a major downside of packrat PEG parsers is that they are memory hungry. But if your grammar has repeating structures, you can use hard cuts. A well placed hard cut can alleviate the memory requirements to be constant!

The parsing

Now that we have our grammar, we can create a parser from it and use it.

(require '[crustimoney.core :refer [parse]]
         '[crustimoney.string-grammar :refer [create-parser]])

(def parser (create-parser grammar))

(parse parser "1+(2-42)*8")

=> [nil {:start 0, :end 10}
    [:operation {:start 0, :end 10}
     [:number {:start 0, :end 1}]
     [:operand {:start 1, :end 2}]
     [:operation {:start 2, :end 10}
      [:operation {:start 3, :end 7}
       [:number {:start 3, :end 4}]
       [:operand {:start 4, :end 5}]
       [:number {:start 5, :end 7}]]
      [:operand {:start 8, :end 9}]
      [:number {:start 9, :end 10}]]]]

A successful parse result is in “hiccup”-style. Only captured nodes are in the result. Only the root node can be “nameless” nil.

Notice that the result nodes only have the :start (inclusive) and :end (exclusive) indices of matches. This is to keep the result tree and string processing to a minimum, also during the parsing. It is up to the user - not the parser - to decide which texts are important.

If you do want all the texts, for debugging purposes or quick parses, there’s the quick/parse function (or quick/transform if you already have a result tree). It takes a grammar directly and transforms the result to contain the matched texts:

(require '[crustimoney.quick :as quick])

(quick/parse grammar "1+(2-42)*8")

-> [nil "1+(2-42)*8"
    [:operation "1+(2-42)*8"
     [:number "1"]
     [:operand "+"]
     [:operation "(2-42)*8"
      [:operation "2-42"
       [:number "2"]
       [:operand "-"]
       [:number "42"]]
      [:operand "*"]
      [:number "8"]]]]

Let’s have a quick look at errors too, which is a set of error records:

(parse parser "1+")

=> #{ {:key :expected-literal, :at 2, :detail {:literal "("}}
      {:key :expected-match, :at 2, :detail {:regex #"[0-9]+"}} }

The transforming

Crustimoney has a results namespace to deal with the parse results. It contains functions like success->text and errors->line-column. And while it is easy to deal with hiccup-style data yourself, the namespace also comes with a “batteries included” transform function. It performs a postwalk, transforming the nodes by applying a function to it.

Let’s see it in action, and finish our calculator:

(require '[crustimoney.results :refer [transform coerce collect]])

(def transformations
  {:number    (coerce parse-long)
   :operand   (coerce {"+" + "-" - "*" * "/" /})
   :operation (collect [[v1 op v2]] (op v1 v2))
   nil        (collect first)})

(defn calculate [text]
  (-> (parse parser text)
      (transform text transformations)))

Each transformation function takes the success node and the full text. The coerce and collect macros help with creating such functions. The coerce takes a function, which is applied on the node’s matched text. The collect takes a function, which is applied on the children of the node. Both can also take a binding vector, like the :operation transformation above.

And, does it work?

(calculate "1+(2-42)*8")

=> -319

Yup! And there’s more to discover about Crustimoney - like built-in parsers, or lexically-scoped nested recursive grammars, or EDN support, or streaming, or why it’s called that way - but all that can be found on github.

As always, have fun!


Clojure - Scala - Java - JavaEE - Datomic - Reagent - Figwheel - HugSQL - JavaScript - Node.js - Maven - SBT - XML - XSD - XSLT - JSON - jQuery - HTML - HTMX - React - Redux - OAuth - REST - GraphQL - ZooKeeper - Kafka - Akka HTTP - PostgreSQL - ElasticSearch - Cassandra - Redis - Mule - RabbitMQ - MQTT - SOAP - Linux - macOS - Git - Scrum - Emacs - Docker - Kubernetes - Ansible - Terraform - Jenkins - GitHub - GitLab - Devops - Raspberry Pi - Event Sourcing - Functional Reactive Programming - Ports and Adapters (Hexagonal)


Functional Bytes, 2013-2024

Boekelo

06 267 145 02

KvK: 59562722

Algemene voorwaarden