C++

You are currently browsing the archive for the C++ category.

Peter Sestoft and I have written a note about how to write scanners and parsers in C#. The note is based on earlier versions for SML and Java.

The note contains an thorough introduction to grammars on Backus–Naur form (BNF). This includes a description of properties your grammar should have so that it can be mechanically translated to a program. And also some prescriptions about how to transform your grammar so that it has the desired properties. In technical terms, the note describe how you can check that your grammar is an LL(1) grammar, and if not, some tricks that will usually transform it into one.

The parsers you write using our method are recursive descent parsers. For the scanners, however, we just use and add-hoc method. Both parsers and scanners makes good use of the .NET framework. For instance, the scanners creates a token stream from a TextReader. Hence, the scanners can be used to scan both files and strings. Likewise, a token stream is represented as IEnumerable<Token> and scanners uses yield to create this token stream. Thus, creating the token stream lazily.

To give an example, the simplest scanner presented in the note is the following scanner:

using TokenStream = System.Collections.Generic.IEnumerator<Token>;

class ZeroOneScan : IScanner {
  public TokenStream Scan(TextReader reader) {
    while ( reader.Peek() != -1 ) {
      if ( Char.IsWhiteSpace((char) reader.Peek()) )
        reader.Read();
      else
        switch(reader.Read()) {
        case '-': yield return Token.FromKind(Kind.SUB); break;
        case '0': yield return Token.FromKind(Kind.ZERO); break;
        case '1': yield return Token.FromKind(Kind.ONE); break;
        default: throw new ApplicationException("Illegal character");
        }
    }
    yield return Token.FromKind(Kind.EOF);
  }
}

I find this use of yield quite elegant.

Working with this note and getting my name on it has special meaning to me. A precursor for SML was actually the note Peter used when he taught me for the first time many years ago (on my second semester at university). Over the years, I have returned to the note many times when I have needed to parse a small language that did not warrant the use of a parser generator, or when a generated parser would have been inconvenient to use because the text to be scanned did not come from a file stream (modern parser generators will not generate parsers with this problem).

The note ends a bit to early, in my opinion. I would like extend the note to cover Extended BNF. And I would also like to cover parser combinators. Well, one day when time permits…

Say you want to implement a class that implements the IEnumerable interface in C#. Then you have two choices, either to implement the old-style non-generic IEnumerable interface or you can implements the generic IEnumerable<T> interface. Given those choices we of course want to implement the new generic version of the interface. Because otherwise lots of boxing and unboxing will happen when T is a value type, with the non-generic version you will not be able to conveniently use the Current property of your enumerator, and for general type-safety goodness.

Thus, you set out to implement the generic version. For example, let say we want to implement a class that enumerates all the integers staring from a given offset. First we might try this:

using System.Collections.Generic;
class Ints : IEnumerable<int> {
    private readonly int offset;
    public Ints(int o) { offset = o; }
    public IEnumerator<int> GetEnumerator() {
        int i = offset;
        while( true ) yield return i++;
    }
}

But then the compiler complains:

error CS0535: 'Ints' does not implement interface member 'System.Collections.IEnumerable.GetEnumerator()'

Thank you for letting us know nice compiler. But we don’t want Ints to implement the non-generic interface. We want it to implement the generic interface.

Reading up on the documentation will reveal that the generic interface inherits from the non-generic interface. What a wonderful design. Thankfully we can make a general work-around for this design flaw in the library. Just add a non-generic method that calls the generic method:

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() { return GetEnumerator(); }

Once the author of my favorite C# book returns from vacation. I’ll complain to him that the section about IEnumerable didn’t make it clear that if you want to implement the generic interface you also have to implement the non-generic interface.

Update: It turns out that the errata to C# Precisely already mentions this curriousity.

GADTs in C++

My good friends Claudio Russo and Andrew Kennedy have been kind enough to send me a draft paper about Generalized Algebraic Data Types (GADTs) and Object-Oriented Programming. GADTs generalize the datatypes of ML and Haskell by permitting constructors to produce different type-instantiations of the same datatype. One of Andrew and Claudio’s examples is a slightly modified version of Peter Sestoft’s example of representing abstract syntax trees for typed expressions in Generic C#. To get a better feeling of the difference between the generics in C#/Java and the generics in C++ I have rewritten the the example in C++.

To recap the example: we want to represent abstract syntax trees for a tiny language of simple expressions. Futhermore, we want representation to be type safe. That is, we do not want to be able to represent the nonsense expression “true + 42“. We shall represent the abstract syntax trees the standard OO way with an abstract base class Exp for the general type of expressions and the concreate subclasses for each node type. To enforce the type safety of our embedded language we use generics. That is, a value with type Exp<R> is an abstract syntax tree for an expression that evaluates to a value of type R.

First the base class.

Peter’s Generic C# version:

abstract class Exp<R> {   // R = result type
  abstract public R eval();
}

C++ version:

template< typename R >  // R = result type
class Exp {
public:
  virtual R eval() const = 0;
  virtual ~Exp() {}
};

Note that in C++ we need to remember to define a virtual destructor because we plan to inherit from Exp and we also want to be able to dynamically allocate subclasses of Exp.

Next up is the class for literals.

Peter’s Generic C# version:

class Lit<R> : Exp<R> {
  private readonly R v;
  public Lit(R v) {
    this.v = v;
  }
  public override R eval() {
    return v;
  }
}

C++ version:

template< typename R >
class Lit : public Exp<R> {
public:
  virtual R eval() const {
    return val;
  }
  explicit Lit(R val_) : val(val_) {}
private:
  R const val;
};

No surprises here.

But for binary expressions like plus, for instances, it starts to get a bit more tricky. Here I’ll use a simpler version of the class for representing plus nodes than the version Peter uses.

Generic C# version:

class Plus : Exp<int> {
  private readonly Exp<int> lhs, rhs;
  public Plus(Exp<int> lhs, Exp<int> rhs) {
    this.lhs = lhs; this.rhs = rhs;
  }
  public override int eval() {
    return lhs.eval() + rhs.eval();
  }
}

First cut at a C++ version:

class Plus : public Exp<int> {
public:
  virtual int eval() const {
    return lhs.eval() + rhs.eval();
  }
  explicit Plus(Exp<int> const lhs_, Exp<int> const rhs_)
    : lhs(lhs_), rhs(rhs_)
  {}
private:
  Exp<int> lhs, rhs;
};

Alas, this is not valid C++ because we are not allowed to declare fields and parameters of type Exp<int> because this class has an abstract virtual function, namely eval (and besides if we could, for example by modifying the class Exp, the code would not do what we expects). These problems has nothing to do with the generics in C++, they are just the usual C++ OO quirks. And the solution is straightforward: we must use a pointer to values of type Exp. But then we have to make decisions about ownership: should we make a deep copy of subexpressions? should we share subexpressions? and if so who should take take of freeing resources? should we use ref counting? or what? Here we shall just use one of the the wonderful smart pointers from Boost: shared_ptr which will take care of the refcounting and deletion of expressions. Thus, first we make a couple of convenient typedefs for integer expressions and shared pointers to integer expressions (and similar for boolean expressions):

typedef Exp< int > IntExp;
typedef boost::shared_ptr< IntExp > IntExpPtr;

Then the class for plus nodes looks like this:

class Plus : public IntExp {
public:
  virtual int eval() const {
    return lhs->eval() + rhs->eval();
  }
  explicit Plus(IntExpPtr const & lhs_, IntExpPtr const & rhs_)
    : lhs(lhs_), rhs(rhs_)
  {}
private:
  IntExpPtr lhs, rhs;
};

and a wrapper for constructing shared pointers for new plus expressions:

IntExpPtr plus(IntExpPtr const & lhs, IntExpPtr const & rhs) {
  return IntExpPtr( new Plus( lhs, rhs ) );
}

Similar to the Plus class we can define a class for representing condtional expressions:

Generic C# version:

class Cond<R> : Exp<R> {
  private readonly E<bool> cond;
  private readonly E<R> truth, falsehood;
  public Cond(E<bool> cond, E<R> truth, E<R> falsehood) {
    this.cond = cond; this.truth = truth; this.falsehood = falsehood;
  }
  public override R eval() {
    return cond.eval() ? truth.eval() : falsehood.eval();
  }
}

C++ version:

template < typename R >
class Cond : public Exp<R> {
public:
  typedef boost::shared_ptr< Exp<R> > SubExpPtr;
  virtual R eval() const {
    return cond->eval() ? truth->eval() : falsehood->eval();
  }
  explicit Cond(BoolExpPtr const & cond_,
                SubExpPtr const & truth_,
                SubExpPtr const & falsehood_) :
    cond(cond_), truth(truth_), falsehood(falsehood_)
  {}
private:
  BoolExpPtr cond;
  SubExpPtr truth, falsehood;
};

The interesting things to note about this class are that Cond contains subexpressions of different types, that Cond is polymorphic in the result type, that Cond enforces that the two branches of a condtional expression must have the same type, and the only one of the branchs is evaluated based on the evaluation of the condition.

Using these classes we can define a function that builds an expression for calculating the n‘th fibonacci number:

IntExpPtr fib(int n) {
  IntExpPtr fib0( lit(0) ), fib1( lit(1) );
  switch ( n ) {
  case 0: return fib0;
  case 1: return fib1;
  default:
    for(int i = 1; i != n; ++i) {
      IntExpPtr tmp( fib1 );
      fib1 = plus( fib0, fib1 );
      fib0 = tmp;
    }
    return fib1;
  }
}

This function builds an expression that takes linear, O(n), space, but it will use exponential time, O(2n), to be evaluated.

Conclusion
Non-surprising this GADT example can be translated more or less straight forward from generinc C# to C++. The things that are a bit tedious are just the normal C++ oddities and has nothing to do with generics. In fact, we would have had the exact same problems without generics.

Next up is to try and translate a few more of Claudio and Andrews examples. The statically typed printf and the typed LR parsing looks nifty.

Now with Java

Following up on the benchmark from yesterday, I’ve produced a Java version of the queens benchmark (queens.java). Below is the updated graph. I have a suspision that the reason Mono and Java is performing so bad is not only due to exceptions, but rather it has something to do with garbage collection.

Graph showing running times for the different executables

I also tried to improve the C++ version based on some hints I got from Asger and Søren (colleagues at Laerdal Sophus). The C++ improvements worked really well when we tested them at work with Visual Studio 6.0. But the improvements had absolutely no effect when compiling with gcc 3.4.1. Either gcc is very smart or very stupid. Something suggest the later as a quick test reveals that gcc version 3.4.1 is producing worse code than version 3.3.4, which in turn is producing worse code than version 3.2.3 . I guess that I’ll have to install Windows and the free Microsoft compilers, so that I can compare them against gcc and Mono.

Or: SML/Moscow ML is faster than C++/GCC

Inspired by a remark by Peter about the abysmal performance of exceptions on the .NET platform. I decided to check it for myself with a micro-benchmark. I also decided to check the performance of C++ exceptions, just for the fun of it.

The micro-benchmark is to find a solution to the N-queen problem using backtracking, with exceptions used as backtracking. I tested four different languages (SML, O’Caml, C#, and C++) and five different language implementations (MLton, Moscow ML, O’Caml, Mono, and GCC). The results are a bit surprising, and definately deserves further investigation. The graph below shows the running times for the five different executables:

Graph showing running times for the different executables

From the graph we see that O’Caml and MLton are way faster than the rest (and O’Caml has a small edge compared to MLton), Moscow ML is faster than Mono, and my C++ program is slow as a slug. I’ve tried various small tweaks to make the C++ version faster, but nothing really helps. For instance, at first we might notice that there are a tiny algorithmic difference between the C++ version and the other versions, namely the order in which a new position is compared with the already determined positions. But changing this order have no real effect (I’ve tried to change C++ program in various ways, and I tried to change the other programs). Still, the C++ performance is so bad that it looks suspicious. I’m open to suggestions for how to improve the C++ program.

Now I just have to test the SML version with the MLKit and SML/NJ, test the C# version with Microsoft’s runtime, test the C++ version with Visual C++ and Intel’s C++ compiler, and make a Java version of the benchmark and test it with … Any volunteers?

The programs used in the benchmark are here: queens.sml, queens.ml, queens.cs (ran mono with option --optimize=all) , and queens.cpp (compiled with option -O3).