Using Parslet to Write Your Own Language

Parsing text is such a common task that most developers encounter it every day, probably every hour, maybe even every minute. In this part of my series about awesome Ruby libraries we will not be diving into theory or low-level compiler stuff. I would like to introduce a great library for parsing structured text and for interpreting it: Parslet. In the next part we will talk about interpretation; today we will focus solely on parsing.

Introduction

Parslet is a way to define a PEG – Parsing Expression Grammar. You can learn more about PEGs on Wikipedia.

Lots of theory, but what is it good for? Imagine you want to write your own new awesome language. In my example here I use JSON for its simplicity, but grammars may be used for any other language. Do you know LESS or Sass? They are both extensions to CSS. Do you want variables or inheritance in CSS? They will help! Now you are probably asking why I am telling you about CSS. It's simple: Parslet would be an awesome tool to implement those. You would write the grammar for your awesome language (like LESS or Sass) using Parslet, and then also a transformer that translates the parsed document into good old CSS. In today's article we will talk only about the grammar and the parsing part. My next article will be about transformers.

Also, I am not using Parslet to parse JSON in practice – there are awesome libraries for that. I use it for parsing specific inputs that I could not find parsers for. JSON is just an example of a very simple language with a grammar that can be covered in one blog post.

Now, let’s cover some basics.

Ambiguity

Even though PEGs bear some similarities to CFGs (Context-Free Grammars), PEGs cannot be ambiguous. In essence that means there is exactly one way to parse a text according to the grammar. The parser will either accept the text and parse it, or it will reject it. This is, of course, a much more complex topic, but for our needs let's stick with this definition:

Parslet will either fail on parsing or provide a valid parse tree, and the parse tree will always be the same for the same text.

Top-Down

The grammar is built from top to bottom. It starts with a root (check the example below: we start with the most generic things and go down to the most specific ones) and then analyses the input in smaller and smaller chunks. Even though the grammar works in a top-down manner, there is no problem in writing the rules bottom-up. I prefer bottom-up, as it seems more natural to start from basic types and forms and build more complex structures on top of them.
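The reason the order of the rules does not matter is that rule bodies are evaluated lazily, only when parsing actually happens. A minimal sketch to illustrate this – the Greeting class and its rule names are made up purely for the illustration:

require 'parslet'

class Greeting < Parslet::Parser
  # The root rule references :word, which is only defined further down;
  # that is fine, because rule bodies are evaluated lazily at parse time.
  root(:greeting)
  rule(:greeting) { word >> str('!') }
  rule(:word)     { match('[a-z]').repeat(1) }
end

Greeting.new.parse('hello!')  # succeeds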

Resources

PEGs are good for parsing structured text, but not that awesome for parsing natural language (although it is still possible to use them for that). Also, in the case of Parslet, the memory needed for huge documents may be pretty big. I assume this is because the parser needs to keep state in memory, and that may not be the most efficient thing in Ruby.

Regular Expressions

PEGs are more powerful than regular expressions, but on the other hand they require more resources. It is always best to think about your use case and check what the best tool for solving the problem at hand is.

An interesting question was asked during the review of this blog post: why would you use a PEG grammar instead of regular expressions? It depends. When your regular expression is getting too complex, it might be simplified by using a grammar, but that will cost you something somewhere else.

You may remember some theory from college, like the Chomsky hierarchy. Regular expressions generally correspond to regular grammars, while PEGs are closer to CFGs (Context-Free Grammars), which are more powerful and let you handle more cases – recursive, nested structures in particular.
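To make that concrete, here is a tiny sketch (not part of the JSON parser; the Parens class is made up for the illustration) of something a plain regular expression cannot express: arbitrarily nested, balanced parentheses. A PEG rule can simply refer to itself:

require 'parslet'

class Parens < Parslet::Parser
  root(:parens)
  # One or more groups, each of which may contain further nested groups.
  rule(:parens) { (str('(') >> parens.maybe >> str(')')).repeat(1) }
end

Parens.new.parse('(()())')   # succeeds
# Parens.new.parse('(()')    # would raise Parslet::ParseFailed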

I really do not want to go too deep into theory here, but if there is demand, I could write a separate blog post on the topic.

How Does It Work?

Parslet is a library written in Ruby for defining a grammar and then executing the parsing process on some input – usually a String. Parsing either throws an error or provides a valid parse tree, and the parse tree can then be transformed into something useful for the application. But let's not spend so much time on theoretical stuff and let's start building something.

JSON

JSON is an extremely simple language with a well-defined grammar that is easy to comprehend and also easy to write using Parslet. I have chosen JSON exactly for those reasons, and also because all my readers either use it or know it.

There is more than one specification defining a valid JSON document. Based on the RFC, a JSON document should have an Object or an Array as its root element; the ECMA standard does not have this restriction. I have been using the definitions on json.org as the guide for implementing the parser.

Let's start by writing some basic code (the whole project is on Github, the link is provided below):

require 'parslet'
 
class JSON < Parslet::Parser
 
  root(:value)
 
  rule(:space) { (match('\s')).repeat(1) }
  rule(:space?) { space.maybe }
 
end

In the code we require the library and create a new class called JSON that inherits from Parslet::Parser and will hold our grammar. The first directive says that the root of the grammar will be defined by a rule called value; we will declare that rule later.

In the example there are two rule definitions. The first rule declares a space. It says to match a whitespace character (using a regular expression) repeated at least once. The second rule declares a possible space – this space might be there, or it might not.

What does might or might not be there mean? As our parser analyses the input, if it sees a space it will be happy and continue. And in case there is no space it will not fail, as it would if we had not defined the space as optional.
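Here is a throwaway standalone sketch of that behaviour (the Hi class and its greeting rule are invented just for the illustration, they are not part of our JSON grammar):

require 'parslet'

class Hi < Parslet::Parser
  root(:greeting)
  rule(:space)    { match('\s').repeat(1) }
  rule(:space?)   { space.maybe }
  rule(:greeting) { space? >> str('hi') }
end

Hi.new.parse('hi')     # succeeds
Hi.new.parse('   hi')  # succeeds as well, the optional space is consumed
# Hi.new.parse('hi ')  # would fail, we did not allow trailing whitespace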

In the following text I will not be repeating the class definition. The rules will be placed inside the class body, in the same manner as the rules we have seen already.

Let's move on to see where the two rules we just defined will be used:

  rule(:value) { space? >>
    (
      string |
      number.as(:number) |
      object |
      array |
      str('true').as(:true) |
      str('false').as(:false) |
      str('null').as(:null)
    ) >>
    space?
  }

Complex? Probably a bit! But let's try to understand it together. We define a new rule called value (this will be the root of our grammar, as defined previously). It starts with a possible whitespace (space?), followed by (>>) a list of alternative rules that are exclusive (|), i.e. only one of them can be matched, and is again followed by possible whitespace.

string, number, object and array are rules we will define next. These rules are not part of Parslet, but they will be part of our grammar once we implement them in a moment.

Calling the as method on a rule will name the matched string in the resulting parse tree; this will come in handy in the next part, when we start transforming the tree.

str('true') will match a static string. This is similar to the match we have seen before, but it matches a literal text instead of a regular expression. This way we parse the true, false and null values, as those are represented by simple strings in JSON. You can use str instead of match whenever the string you need to capture is static and can be defined by an unchanging series of characters.
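To see the difference between str and match in isolation, the Parslet DSL helpers can be pulled into the top level with include Parslet (a common trick from the Parslet docs); a quick sketch:

require 'parslet'
include Parslet

str('null').parse('null')              # matches exactly these four characters
match('[0-9]').parse('7')              # matches any single digit
match('[0-9]').repeat(1).parse('42')   # one or more digits
# str('null').parse('nil')             # would raise Parslet::ParseFailed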

At this point our parser can parse booleans (true/false) and the null value. Let's move forward and add numbers:

  rule(:digit) { match('[0-9]') }
 
  rule(:number) {
    str('-').maybe >>
    (str('0') | (match('[1-9]') >> digit.repeat)) >>
    (str('.') >> digit.repeat(1)).maybe >>
    (
      (str('e') | str('E')) >>
      (str('+') | str('-')).maybe >>
      digit.repeat(1)
    ).maybe
  }

First we define a rule for digits. That is straightforward and we will reuse it for numbers. It matches any character in the range from 0 to 9 – just once, no repetition and no optionality involved.

So, what is a number? It starts with a possible minus (-) sign, followed either by a single 0 or by a digit from 1 to 9 that may be followed by any further digits. In other words, we either see exactly the digit 0, or a longer number that cannot start with 0. That gives us the integer part, which may be followed by a dot (.) and at least one digit. This covers simple decimal numbers. Still simple, right?

Now we will allow floating point numbers to be written with an exponent. The mantissa may be followed by a delimiter in the form of a case-insensitive character e (e or E) and the exponent itself, which is at least one digit prefixed by an optional plus or minus sign.

And that's it. These few lines cover all the possible numbers the JSON specification allows.
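Since every rule becomes a method on the parser instance, we can already try the number rule on its own (a quick IRB-style check; it only needs the rules we have written so far):

parser = JSON.new
parser.number.parse('0')         # succeeds
parser.number.parse('-12.5e+3')  # succeeds
# parser.number.parse('01')      # would raise Parslet::ParseFailed,
#                                # a leading zero is not allowed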

Strings

Again, let's start with the code:

  rule(:string) {
    str('"') >>
    ((str('\\') >> any) | (str('"').absent? >> any)).repeat.as(:string) >>
    str('"')
  }

It's shorter than numbers, but more complex. At the beginning and at the end we ensure the string is surrounded by quote characters ("). But how does it work inside the quotes?

First we check whether there is an escaping backslash; when there is one, we consume the following character together with it as an escape sequence. If there is no backslash, we check for a quote; if there is no quote, we consume the character as part of the string. A quote will terminate the string. The crucial thing here is the absent?. It's called a negative lookahead, and this one might need a bit longer explanation.

Lookaheads

The parsing process works like a simple tape. The reading head moves along the tape, which contains the characters of the parsed content one after another. Each simple rule we define moves the head and consumes the characters it matches. This is repeated again and again until the end of the tape is reached. The important thing here is that each rule consumes its characters by default.

Lookaheads allow conditions. A lookahead checks whether a rule would match (or, in the case of absent?, would not match) at the current position, but it does not consume anything: the head stays where it was, and the characters inspected by the lookahead are left for the following rules to read.

(str('"').absent? >> any)

Check the character for a quote; in case it is not a quote, consume it. In case of a quote, the rule is not satisfied. Let's combine it with the escaping:

(str('\\') >> any) | (str('"').absent? >> any)

Check for a backslash followed by one more character (no lookahead here), or for anything but a quote. Simple? Yeah? Magic? Might be. Let's build up the whole rule:

((str('\\') >> any) | (str('"').absent? >> any)).repeat

Repeat and consume either an escape sequence – a backslash and any possible character – or any possible character except a quote. If there is no more escaped character or non-quote character present, it's the end of the string.
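We can try the string rule in isolation the same way as numbers (the output shape is shown only roughly; note that the captured :string still contains the raw backslashes – unescaping is a job for the transformer in the next part):

parser = JSON.new
parser.string.parse('"hello"')          # => roughly {:string=>"hello"@1}
parser.string.parse('"say \"hi\""')     # escaped quotes are consumed as well
# parser.string.parse('"unterminated')  # would raise Parslet::ParseFailed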

No magic, you see :)

Composed Types

Our parser is now capable of parsing all basic types of our language. Let’s move to the two composed ones – Arrays and Objects.

  rule(:array) {
    str('[') >>
    (value >> (str(',') >> value).repeat.maybe).maybe.as(:array) >>
    str(']')
  }

Let's start with arrays. Arrays are enclosed in brackets ([ and ]). Inside there is a value, possibly followed by a comma and the next value; the comma-separated values may be repeated. Also, the values are optional altogether, which gives us a way to define an empty array. So we end up with a construct that allows us to parse structures like

[] // empty array
[value] // array with one value
[value,value] // array with two values
[value,value,...] // any possible number of values in array
etc.
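Note that the value alternatives inside an array include object, which we only define in the next section; once those rules are in place, the array rule can be exercised on its own as well (a rough check of which inputs pass, not exact output):

parser = JSON.new
parser.array.parse('[]')           # empty array
parser.array.parse('[1, true]')    # whitespace around values is handled by value
# parser.array.parse('[1,]')       # would raise Parslet::ParseFailed,
#                                  # JSON does not allow a trailing comma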

And let’s finish with the last type – an Object – the most complex one.

  rule(:member) {
    space? >>
    string.as(:name) >>
    space? >>
    str(':') >>
    value.as(:value)
  }
 
  rule(:members) { member >> (str(',') >> member).repeat }
 
  rule(:object) { str('{') >> members.maybe.as(:object) >> str('}') }

Let's go from top to bottom here. An object starts with { and ends with }. Inside there might be members, or there might not. Members are composed of one or more member rules separated by commas. Do you see the similarity to arrays so far? But while arrays were composed of values, objects are composed of members. What are those?

A member is a string, followed by a colon, followed by a value, with possible whitespace all over the place.

Huh, is it that simple? Yes! JSON is super simple!
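With members in place we can exercise the object rule directly, before wiring everything together in the next section (again just a smoke test showing which inputs pass and which fail):

parser = JSON.new
parser.object.parse('{"name": "Parslet", "awesome": true}')  # succeeds
parser.object.parse('{"nested": {"list": [1, 2, 3]}}')       # succeeds
# parser.object.parse('{"name" "Parslet"}')                  # would raise
#                                    # Parslet::ParseFailed, the colon is missing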

Parsing

Done, we have completed our parser. Let's use it:

require 'parslet'
 
data = "some json"
 
class JSON < Parslet::Parser
  ...
end
 
begin
  puts JSON.new.parse(data).inspect
rescue Parslet::ParseFailed => e
  puts e.cause.ascii_tree
end

We create a new instance of the JSON parser and parse some JSON string. In case parsing fails, an exception is raised and caught, and we print debugging information to check where the problem with the input JSON was: an ASCII tree showing the progress of the parsing process. If we manage to parse the input, the script outputs the parse tree of the JSON document.

Conclusion

And that's it. You can now parse JSON documents using your own JSON parser written in Ruby. So far it's not that interesting, but in the next part we will use transformations to make the parsed documents useful in applications. We will convert the JSON parse tree into native Ruby types – Hash, Array, String, Integer, Float, Boolean, Nil, etc. You can check the whole parser and also the transformer in a Github repository.

Stay tuned!
