A Unicode editor for Michael

Michael claims that he does not have an editor that can handle Unicode.

Thus, Henning and I whipped up an editor using mGTK that can handle Unicode. Oh, and did I mention that you can compile it with either Moscow ML or MLton without changing the source?

The real story is of course that we were trying to build a “real” application using mGTK, and the editor example shows that the gtk+ widgets handle Unicode fine. I’m not so sure that SML handles Unicode “fine”, though: String.size returns the number of bytes, not the number of characters. But at least TextIO does not mess up the bytes (in Moscow ML at least; I didn’t test with MLton).
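To make the String.size point concrete, here is a minimal sketch. It assumes the string holds well-formed UTF-8, and the helper names isContinuation and utf8Length are made up for this illustration; they are not part of the Basis Library or mGTK. The idea is simply to count every byte that is not a UTF-8 continuation byte.

```sml
(* A UTF-8 continuation byte has the bit pattern 10xxxxxx,
   i.e. a value in the range 128..191. *)
fun isContinuation (c : char) : bool =
    let val b = Char.ord c
    in b >= 128 andalso b < 192 end

(* Hypothetical helper: counts code points in a string that is
   assumed to be well-formed UTF-8 by skipping continuation bytes. *)
fun utf8Length (s : string) : int =
    CharVector.foldl (fn (c, n) => if isContinuation c then n else n + 1) 0 s

(* "æøå" written as explicit UTF-8 bytes (decimal escapes), so the
   example does not depend on the encoding of the source file. *)
val s = "\195\166\195\184\195\165"

val bytes = String.size s   (* 6: the number of bytes *)
val chars = utf8Length s    (* 3: the number of code points *)
```

This only counts code points; it does not validate the input, and it says nothing about combining characters or what a user would perceive as a single character.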

Oh, and I used file and gedit to check that the file I saved really was Unicode.

5 thoughts on “A Unicode editor for Michael”

  1. If you just want to play with it, I think you should be able to just check out mgtk from the SF CVS. Then go to mgtk/src/defs2sml/release/mgtk and do a make; after that, go back to mgtk/examples and do a make editor. Some mucking about with makefiles might be needed.

    However, there will be an official release shortly. Meanwhile, just use gedit.

  2. I didn’t know about gedit. Thanks. I’m a bit confused about how I’m supposed to input non-ASCII
    characters into it though. Perhaps what I should have said in my ‘log entry is that “Emacs doesn’t
    support Unicode properly”, and this makes life painful. Apparently Emacs-22 will be better.

    As for programming in SML, presumably the “right” way to do this is to use a WideChar type, and
    to hope that the 4 byte characters in memory get written out to disk in nice compact UTF-8 form.
    (Or is there some other model implicit in the revised Basis?)

  3. Regarding non-ASCII input, I’d use gucharmap (unless they are Latin-1 letters, in which case I’d just use the keyboard) or the “Character Palette” applet (assuming Gnome). But I know that there are other input methods available.

    As for programming in SML, I don’t know what the “right” way to handle Unicode is. WideChar seems like an adoption of C’s wchar_t (BTW, where did you get the ‘4 bytes’ from? I don’t think that WideChar specifies how many bytes should be used), which is inadequate IMNSHO for handling Unicode, because you really need to deal with the encoding directly. And for a great many purposes a 4-byte encoding will be terribly wasteful. Thankfully, SML’s normal (and required) CharVectors (i.e., String) would be perfect for representing UTF-8 strings. Thus, you could implement a nice library in pure SML.

  4. Well, I want to write my programs using the String signature, so that I can ignore the encoding
    issues entirely. Whether or not the underlying representation uses a vector of 32 bit words is
    irrelevant. The question is what the return type of String.sub should be. The WideString
    signature says it has to be the WideChar type. The demands of Unicode basically require this
    to be a type capable of representing 2**32 values. (Or maybe it’s 2**20, but whatever 🙂) So,
    WideChar needs to be four bytes. If you use a nice underlying UTF-8 representation underneath
    the String signature, that’s a good implementation, but I don’t want to see that as a user of
    the API.
