Michael claims that he does not have an editor that can handle Unicode.
Thus, Henning and I whipped up an editor using mGTK that can handle Unicode. Oh, and did I mention that you can compile it with either Moscow ML or MLton without changing the source?
The real story is of course that we were trying to build a “real” application using mGTK, and the editor example shows that the gtk+ widgets handle Unicode fine. I’m not so sure that SML handles Unicode “fine”, though: String.size returns the number of bytes, not the number of characters. But at least TextIO does not mess up the bytes (in Moscow ML at least, didn’t test with MLton).
Oh, and I used file and gedit to check that the file I saved really was Unicode.
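To make the String.size remark concrete, here is a small sketch in Python (chosen only because it is easy to run; it is not mGTK or SML code). A Python bytes value stands in for an SML string holding UTF-8, so len on it counts bytes, much as String.size does. The sample word is made up for illustration.

```python
# A made-up sample word with three non-ASCII letters, written with
# escapes so the source file's own encoding cannot change the result.
text = "bl\u00e5b\u00e6rgr\u00f8d"

# Encode to UTF-8: these are the bytes that would end up in the saved
# file, and that a byte-oriented string type would hold.
raw = text.encode("utf-8")

char_count = len(text)  # number of Unicode code points
byte_count = len(raw)   # number of bytes, the byte-oriented view

# UTF-8 continuation bytes have the bit pattern 10xxxxxx. Counting all
# the other bytes recovers the code point count from the bytes alone.
lead_count = sum(1 for b in raw if (b & 0xC0) != 0x80)

print(char_count, byte_count, lead_count)
```

The two counts differ because each of the three non-ASCII letters takes two bytes in UTF-8, while the ASCII letters take one byte each.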
Cool! What do I do to download it and the library? Or should I wait for the official mGTK release?
If you just want to play with it, I think you should be able to just check out mgtk from the SF CVS. Then, go to mgtk/src/defs2sml/release/mgtk and do a make, after that go back to mgtk/examples and do a make editor. Some mucking about with makefiles might be needed.
However, there will be an official release shortly. Meanwhile just use gedit.
I didn’t know about gedit. Thanks. I’m a bit confused about how I’m supposed to input non-ASCII
characters into it though. Perhaps what I should have said in my ‘log entry is that “Emacs doesn’t
support Unicode properly”, and this makes life painful. Apparently Emacs-22 will be better.
As for programming in SML, presumably the “right” way to do this is to use a WideChar type, and
to hope that the 4 byte characters in memory get written out to disk in nice compact UTF-8 form.
(Or is there some other model implicit in the revised Basis?)
Regarding non-ASCII input, I’d use gucharmap (unless it is latin1-letters, in which case I’d just use the keyboard) or the “Character Palette” applet (assuming Gnome). But I know that there are other input methods available.
As for programming in SML, I don’t know what the “right” way to handle Unicode is. WideChar seems like an adoption of C’s wchar_t (BTW where did you get the ‘4 bytes’ from? I don’t think that WideChar specifies how many bytes should be used) which is inadequate IMNSHO for handling Unicode, because you really need to deal with the encoding directly. And for a great many purposes a 4 byte encoding will be terribly wasteful. Thankfully, SML’s normal (and required) CharVectors (i.e., String) would be perfect for representing UTF-8 strings. Thus, you could implement a nice library in pure SML.
Well, I want to write my programs using the String signature, so that I can ignore the encoding
issues entirely. Whether or not the underlying representation uses a vector of 32 bit words is
irrelevant. The question is what the return type of String.sub should be. The WideString
signature says it has to be the WideChar type. The demands of Unicode basically require this
to be a type capable of representing 2**32 values. (Or maybe it’s 2**20, but whatever 🙂)
So, WideChar needs to be four bytes. If you use a nice UTF-8 representation underneath the
String signature, that’s a good implementation, but I don’t want
to see that as a user of the API.
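The split argued for in this last comment, UTF-8 underneath and whole characters at the API, can be sketched roughly as follows. This is Python rather than SML, and the class Utf8String with its size and sub methods is hypothetical: it is not part of the SML Basis library or of mGTK. Indexing is a plain linear scan here, which a real implementation would probably want to improve on.

```python
class Utf8String:
    """Hypothetical string type: UTF-8 bytes inside, code points outside."""

    def __init__(self, raw: bytes):
        raw.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
        self._raw = raw

    def size(self) -> int:
        # Every byte that is not a 10xxxxxx continuation byte starts a
        # new code point, so counting those gives the character count.
        return sum(1 for b in self._raw if (b & 0xC0) != 0x80)

    def sub(self, i: int) -> int:
        """Return the i-th code point as an integer (a 'wide char')."""
        starts = [k for k, b in enumerate(self._raw) if (b & 0xC0) != 0x80]
        if i < 0 or i >= len(starts):
            raise IndexError(i)
        begin = starts[i]
        end = starts[i + 1] if i + 1 < len(starts) else len(self._raw)
        return ord(self._raw[begin:end].decode("utf-8"))


# Made-up sample: ASCII letters, one 2-byte letter, one 3-byte symbol.
sample = "na\u00efve \u20ac"
s = Utf8String(sample.encode("utf-8"))

utf8_bytes = len(sample.encode("utf-8"))       # compact storage
utf32_bytes = len(sample.encode("utf-32-le"))  # 4 bytes per code point

print(s.size(), hex(s.sub(2)), hex(s.sub(6)), utf8_bytes, utf32_bytes)
```

The caller only ever sees code points, so the value returned by sub has to be wide enough for any of them, while the storage stays well below four bytes per character for text that is mostly ASCII.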