Peter Sestoft and I have written a note about how to write scanners and parsers in C#. The note is based on earlier versions for SML and Java.
The note contains an thorough introduction to grammars on Backus–Naur form (BNF). This includes a description of properties your grammar should have so that it can be mechanically translated to a program. And also some prescriptions about how to transform your grammar so that it has the desired properties. In technical terms, the note describe how you can check that your grammar is an LL(1) grammar, and if your grammar is not an LL(1) grammar, we give your some tricks that will usually transform the grammar into an LL(1) grammar.
The parsers you write using our method are recursive descent parsers. For the scanners, however, we just use an add-hoc method. Both parsers and scanners makes good use of the .NET framework. For instance, the scanners creates a token stream from a TextReader
. Hence, the scanners can be used to scan both files and strings. Likewise, a token stream is represented as IEnumerable<Token>
and scanners uses yield
to create this token stream. Thus, creating the token stream lazily.
To give an example, the simplest scanner presented in the note is the following scanner:
using TokenStream = System.Collections.Generic.IEnumerator<Token>; class ZeroOneScan : IScanner { public TokenStream Scan(TextReader reader) { while ( reader.Peek() != -1 ) { if ( Char.IsWhiteSpace((char) reader.Peek()) ) reader.Read(); else switch(reader.Read()) { case '-': yield return Token.FromKind(Kind.SUB); break; case '0': yield return Token.FromKind(Kind.ZERO); break; case '1': yield return Token.FromKind(Kind.ONE); break; default: throw new ApplicationException("Illegal character"); } } yield return Token.FromKind(Kind.EOF); } }
I find this use of yield
quite elegant.
Working with this note and getting my name on it has special meaning to me. A precursor note for SML was actually the note Peter used when he taught me for the first time many years ago (on my second semester at university). Over the years, I have returned to the note many times when I have needed to parse a small language that did not warrant the use of a parser generator, or when a generated parser would have been inconvenient to use because the text to be scanned did not come from a file stream (modern parser generators will not generate parsers with this problem).
The note ends a bit to early, in my opinion. I would like extend the note to cover Extended BNF. And I would also like to cover parser combinators. Well, one day when time permits…
Hi! Just a small remark: Barkus–Naur -> Backus-Naur, I believe. Would I be wrong?
You are right (and a tiny bit wrong).
You are right: I misspelled Backus (now fixed). Thanks.
However, according to http://en.wikipedia.org/wiki/Dash you should have used an en dash: “The en dash is used instead of a hyphen in compound adjectives for which neither part of the adjective modifies the other. That is, when each is modifying the noun.” 😉
Hi Ken, what does C#’s yield actually do? I can’t appreciate its elegance because I don’t know what it does… (If it requires lots of explanation, write a separate log entry! 🙂
Hi Michael
Shortly explained, what yield do is just syntactic sugar for creating a class that implements the IEnumerable interface (or IEnumerator, it can be used for both). And IEnumerable is C#’s lazy lists (to use functional programming jargon).
But it is a good idea to write a longer separate entry. I have a half finished one about tree traversal which I think can be used for this purpose.
Good article. I have been looking for some materials regarding RDP (and not specifically C#, but it is a definite plus) and this is by far the best one. Very concise with just enough background.
I much appreciated your note. I have been looking for something like this because I’m working on a scripting language for a client’s application, implemented in C#; however, it’s built on a very messy ad-hoc parser and while I’ve fixed some bugs, I expect there are likely more. I had been defining an EBNF grammar for it, and was planning on converting it to an RDP by hand, but I’ll certainly take a good close look at your work because it would eliminate a big chunk of tedious hand-coding.
Thank you for your comprehensible yet succinct note.
I am certain it will prove helpful in my upcoming RDP project!
Thought your notes were great. Did you ever publish the code as a project that could be downloaded. I have looked and can’t find it. Thanks.
Great idea.
I’ll put the code on github in the near future.
Did the C# RDP code ever make it to Github or some other location where it is downloadable? Granted this is becoming dated now but I only recently came across the parsenote and its still viable for educational projects. Would love to see the code. Thanks