Recursive Descent Parsers in C#

Peter Sestoft and I have written a note about how to write scanners and parsers in C#. The note is based on earlier versions for SML and Java.

The note contains an thorough introduction to grammars on Backus–Naur form (BNF). This includes a description of properties your grammar should have so that it can be mechanically translated to a program. And also some prescriptions about how to transform your grammar so that it has the desired properties. In technical terms, the note describe how you can check that your grammar is an LL(1) grammar, and if not, some tricks that will usually transform it into one.

The parsers you write using our method are recursive descent parsers. For the scanners, however, we just use and add-hoc method. Both parsers and scanners makes good use of the .NET framework. For instance, the scanners creates a token stream from a TextReader. Hence, the scanners can be used to scan both files and strings. Likewise, a token stream is represented as IEnumerable<Token> and scanners uses yield to create this token stream. Thus, creating the token stream lazily.

To give an example, the simplest scanner presented in the note is the following scanner:

using TokenStream = System.Collections.Generic.IEnumerator<Token>;

class ZeroOneScan : IScanner {
  public TokenStream Scan(TextReader reader) {
    while ( reader.Peek() != -1 ) {
      if ( Char.IsWhiteSpace((char) reader.Peek()) )
        reader.Read();
      else
        switch(reader.Read()) {
        case '-': yield return Token.FromKind(Kind.SUB); break;
        case '0': yield return Token.FromKind(Kind.ZERO); break;
        case '1': yield return Token.FromKind(Kind.ONE); break;
        default: throw new ApplicationException("Illegal character");
        }
    }
    yield return Token.FromKind(Kind.EOF);
  }
}

I find this use of yield quite elegant.

Working with this note and getting my name on it has special meaning to me. A precursor for SML was actually the note Peter used when he taught me for the first time many years ago (on my second semester at university). Over the years, I have returned to the note many times when I have needed to parse a small language that did not warrant the use of a parser generator, or when a generated parser would have been inconvenient to use because the text to be scanned did not come from a file stream (modern parser generators will not generate parsers with this problem).

The note ends a bit to early, in my opinion. I would like extend the note to cover Extended BNF. And I would also like to cover parser combinators. Well, one day when time permits…

  1. Andrzej’s avatar

    Hi! Just a small remark: Barkus–Naur -> Backus-Naur, I believe. Would I be wrong?

    Reply

  2. Ken’s avatar

    You are right (and a tiny bit wrong).

    You are right: I misspelled Backus (now fixed). Thanks.

    However, according to http://en.wikipedia.org/wiki/Dash you should have used an en dash: “The en dash is used instead of a hyphen in compound adjectives for which neither part of the adjective modifies the other. That is, when each is modifying the noun.” ;-)

    Reply

  3. Michael’s avatar

    Hi Ken, what does C#’s yield actually do? I can’t appreciate its elegance because I don’t know what it does… (If it requires lots of explanation, write a separate log entry! :-)

    Reply

  4. Ken’s avatar

    Hi Michael

    Shortly explained, what yield do is just syntactic sugar for creating a class that implements the IEnumerable interface (or IEnumerator, it can be used for both). And IEnumerable is C#’s lazy lists (to use functional programming jargon).

    But it is a good idea to write a longer separate entry. I have a half finished one about tree traversal which I think can be used for this purpose.

    Reply

  5. Andrew’s avatar

    Good article. I have been looking for some materials regarding RDP (and not specifically C#, but it is a definite plus) and this is by far the best one. Very concise with just enough background.

    Reply

  6. Clifton’s avatar

    I much appreciated your note. I have been looking for something like this because I’m working on a scripting language for a client’s application, implemented in C#; however, it’s built on a very messy ad-hoc parser and while I’ve fixed some bugs, I expect there are likely more. I had been defining an EBNF grammar for it, and was planning on converting it to an RDP by hand, but I’ll certainly take a good close look at your work because it would eliminate a big chunk of tedious hand-coding.

    Reply

  7. Thomas’s avatar

    Thank you for your comprehensible yet succinct note.
    I am certain it will prove helpful in my upcoming RDP project!

    Reply

  8. Shannon Braun’s avatar

    Thought your notes were great. Did you ever publish the code as a project that could be downloaded. I have looked and can’t find it. Thanks.

    Reply

  9. Ken’s avatar

    Great idea.

    I’ll put the code on github in the near future.

    Reply