Saturday, January 10, 2015

Using PEG.js to lex WebIDL, parsing to come later ;-)

Last weekend I spent some time with Node.js and Cheerio in order to parse HTML pages and scrape some content. Sometimes you don't need to scrape, though: you already have the precise content and you simply need it in a form that is more malleable to the task at hand. A simple example might be a file which you need to read line by line (this is so common that many APIs already have a readLine() function available). Other times you have XML, CSV or JSON. All of these are well-known formats, but they are mainly data interchange formats used to communicate between two separate entities. By themselves they tend to lack human readability and structure, and they may still prove hard to manipulate.

So how do we solve that? Well, we introduce DSLs, or Domain Specific Languages. To the human, these contain tons of relevant detail, are easy to write, and are easy to understand. They can often be modeled on natural language and remove restrictions on fixed structures, syntaxes and keywords. To create such a DSL we define its grammar, using another type of DSL ;-), and ideally a tool will generate the processing code for our inputs from that grammar, transform them into some more malleable output, and our problem is solved. Humans are happy, programmers are happy and the computer is happy (unless you've errantly created a parser with an infinite loop, in which case your computer is simply very hot).

PEG.js

That first step, creating the grammar and using some library to generate a parser, seems like it could be hard if we don't already have a library in mind. Thankfully we do. In my case I'm going to use PEG.js, which you can play with using an online grammar compiler. If you do use the online compiler, save your work often, as you can generate bad intermediate parsers which will hang your browser and cause you to lose work.

Why did I choose this library? Well, I've been following the project for a while and it has stood the test of time. It also uses a technique called PEG, or Parsing Expression Grammar, which is a way to define a recursive descent parser. Recursive descent parsers are very easy for humans to wrap their heads around, since that is how we might write such a parser by hand. However, limitations such as selecting the first match rather than the longest match of a given token sequence mean that more care is required when defining the order of the productions.

Installing

Installation is simple and I'll again be using Node.js for these samples. The following install command should do:
npm install pegjs -g
I use the -g so that the pegjs command can be used on the command line to convert my grammar into a parser before I run the rest of it through Node.js. Running that command on a grammar file will produce a new output file of the same name with a .js extension; for example, WebIDL.pegjs would produce WebIDL.js. The latter file can then be included by your Node.js based tokenizer. Let's put all of this together with a code sample which shows me loading the produced parser, loading a sample WebIDL file, and then outputting the result in JSON format.

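Roughly speaking, the test script ends up looking something like this (a minimal sketch; the file names tokenize.js, WebIDL.js and sample.idl are just placeholders):

// tokenize.js
// Assumes the parser was generated with: pegjs WebIDL.pegjs
var fs = require('fs');
var parser = require('./WebIDL.js');

// Read the sample WebIDL file and run it through the generated parser.
var idl = fs.readFileSync('sample.idl', 'utf8');
var tokens = parser.parse(idl);

// Dump the resulting token stream as JSON.
console.log(JSON.stringify(tokens, null, 2));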

Lexer vs Parser

We are now set up with a simple tokenization test script, so we need to figure out how far our parser is going to go. When parsing there are generally two phases of construction. The first phase identifies lexical tokens from the underlying characters in the stream; for instance, a series of digits might become an integer. From this lexical analysis we can create a one-dimensional stream of tokens that by themselves have no semantic meaning. For WebIDL this approach works well, as there are 7 lexical tokens defined in the grammar. These tokens can then be analyzed with parser rules to determine whether the tokens themselves are in the right order. For instance, a parser rule might define a list of integers as such:
list_of_integers: INTEGER (',' INTEGER)*
The production uses lexer rules in upper-case, like INTEGER, and combines them with further information that says: I want at least one INTEGER, and then I want 0 or more groups of a comma followed by an INTEGER. Technically even the ',' should be another type of lexer rule, perhaps COMMA, but in WebIDL it will instead be classified under the general OTHER token.
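
In PEG.js syntax the same parser rule might look something like this sketch (it assumes an INTEGER rule like the one we will define below):

list_of_integers
  = head:INTEGER tail:(',' INTEGER)* {
      // tail is an array of [',', INTEGER] pairs; keep only the INTEGER half
      return [head].concat(tail.map(function (pair) { return pair[1]; }));
    }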

Grammar To Tokens

I'm excited and I don't want this to get too long, so I'm going to jump right into the grammar conversion. I won't go into depth on the more complicated rules, but I will show how to take some of the simpler constructs in the WebIDL document and turn them into proper tokens. Let's start with OTHER, which is really simple.

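The converted rule ends up looking roughly like this (a sketch; the character class mirrors the spec's other regular expression, and the token helper in the initializer block is a stand-in for the real token construction code):

{
  // Initializer block: helpers declared here are visible from rule actions.
  // Hypothetical token helper that builds a token with a type, the raw
  // matched text (string) and a strongly typed value.
  var token = {
    other: function (value) {
      return { type: 'other', string: value, value: value };
    }
  };
}

OTHER
  = value:[^\t\n\r 0-9A-Za-z] { return token.other(value); }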

Since OTHER started as a simple regular expression, we just move that over into the PEG format. We also give the matched characters a label and a production body. Each match can be given a function body that will be executed on the results. Normally this produces a character, an array, or even a complicated structure, which you can then run some code on to produce something else. In this case the value is a single character, so we pass that value to a static function on token that returns a fully instantiated token with a type, a string value (raw) and a strongly typed value.

Let's look at something a little bit harder, the STRING token. This rule matches the quote characters and an array of 0 or more non-quote characters. In this case our production only names the array of characters, since we can take the quote character matches for granted. The final production with the two helpers required to make the token looks like the following:

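A sketch of the rule plus the two helpers looks roughly like this (flatten and token.string are illustrative names; in the full grammar they would live together in the single initializer block at the top of the file):

{
  // Helper 1: join an array of single-character matches into one string.
  function flatten(chars) {
    return chars.join('');
  }

  // Helper 2 (hypothetical): build a string token, keeping both the raw
  // matched text (with quotes) and the unquoted value.
  var token = {
    string: function (raw, value) {
      return { type: 'string', string: raw, value: value };
    }
  };
}

STRING
  = '"' chars:[^"]* '"' {
      return token.string('"' + flatten(chars) + '"', flatten(chars));
    }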

Notice how we've used the flatten function to convert an array of characters into a single string. This is required because PEG.js has stored each match separately for us so that we can do very complicated analysis and construction; in our case that is more of a hindrance, so a simple flatten is sufficient. Also notice that, to keep the original character sequence, we have both a value, which is the string without quotes, and the entire matched sequence of characters, so we can re-emit the file. Whether this matters depends on the usage of your code: if you are writing code that manipulates the file directly then it is necessary, but if you are simply producing a new output you can remove the string field and save the overhead, since it won't be needed.

I want to do one more lexer rule for you here, the INTEGER rule, since the final production doesn't look very much like the initial regular expression provided in the document. PEG.js uses a forward slash character '/' to represent the alternation group, and, to be exceptionally clear, it will try to process the first item in the group; if that succeeds, regardless of match length, that is the alternative it will continue processing the input with. If it fails in a nested context, it will then roll back and try the next. This can lead to some insane run-times, but techniques like memoization can prevent this from being a concern. For my example I noticed that the regular expression was trying to parse decimal, hex and octal values, so I broke those out into separate rules to simplify my matching and token construction.

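The result might look roughly like the following sketch (the rule names DECINT, HEXINT and OCTINT come from splitting the spec's integer regular expression; the token objects are built inline here just to keep the sketch short):

// In the alternation below, HEXINT is tried before OCTINT so that a value
// like "0x10" is not matched as the bare integer 0.
INTEGER
  = opt:'-'? num:(HEXINT / OCTINT / DECINT) {
      // If a leading '-' was matched, flip the sign of the token we got back.
      if (opt) {
        num.value = -num.value;
        num.string = '-' + num.string;
      }
      return num;
    }

HEXINT
  = prefix:('0x' / '0X') digits:[0-9A-Fa-f]+ {
      var raw = prefix + digits.join('');
      return { type: 'integer', string: raw, value: parseInt(raw, 16) };
    }

OCTINT
  = '0' digits:[0-7]* {
      var raw = '0' + digits.join('');
      return { type: 'integer', string: raw, value: parseInt(raw, 8) };
    }

DECINT
  = first:[1-9] rest:[0-9]* {
      var raw = first + rest.join('');
      return { type: 'integer', string: raw, value: parseInt(raw, 10) };
    }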

In addition to how I split the rules out, this also shows how we might use optional matches. Here opt:'-'? represents a token which may or may not be present, and it can then be tested for in our function body. If we do see a negative sign, we simply flip the sign of the current token. Also note that our variables in the INTEGER lexer have actually become the tokens that we passed back from DECINT, HEXINT, and OCTINT. When you later compose parser rules, the values you get will be lexical tokens such as these, and you can now work with your results in an object-oriented way, no longer concerning yourself with the raw characters in the original stream.

Let's finish putting this together. The last bit you need is a start rule. By default the top-level rule is your start rule, so we are going to define one called 'start' and another called 'tokens'. The 'start' rule is now a parser rule and it takes a list of tokens using tokens+. tokens is in turn our alternation group for all of our 7 lexical rules. And note, the ordering is IMPORTANT: FLOAT must precede INTEGER, because otherwise a float such as 1.5 would come out as the token sequence INTEGER OTHER INTEGER. With other parser technologies the longest possible matched token would be taken by default, but in PEG the first possible token is taken regardless of length. With a few unit tests this is pretty trivial to detect, and had you hand-written the parser you likely would have had to worry about the same thing.

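A sketch of those two rules follows (the seven rule names mirror the WebIDL token types; the FLOAT, IDENTIFIER, WHITESPACE and COMMENT rules are assumed to be defined elsewhere in the grammar):

start
  = tokens+

tokens
  = FLOAT        // must be tried before INTEGER, as discussed above
  / INTEGER
  / IDENTIFIER
  / STRING
  / WHITESPACE
  / COMMENT      // must be tried before OTHER, which would match a lone '/'
  / OTHER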

Code

