A Development Note about Internationalization

I just sent this to a mailing list that I am on, but I figured I’d post it publicly too since there is some useful information here. However, it is probably very dry if you are not a programmer.

Internationalization is a requirement for XBLA (and probably for any other console deployment). Well, I have finally started on that process.

Someone mentioned GNU gettext() earlier, and after reading up on the file format it seemed simple and like the right thing. There is a text file format (suffix “.po”) that you can type stuff into, and then if you want there are tools that convert that into a binary format (suffix “.mo”) that you can just load as one chunk of memory and use (it even has a precomputed hash table inside the file). I didn’t want to link to a big library, and it seemed simple enough to implement, so I just wrote a reader for the binary file format. This did not take long; it was 166 lines of code, 20 of which are just the hashpjw function, which I pasted in after looking up the license terms.

Unfortunately it didn’t work, and it wasn’t obvious why not; it seemed like the hash function was wrong or something. So then I spent a few hours trying to compile gettext() so I could run through the hello_world program and see how it goes. Big mistake. The problem with gettext() is that there is a massive amount of code there to do, as near as I can tell, almost nothing. After a couple of hours I was not even able to get it to compile under Cygwin, using the native Cygwin packages and the new auto-installer-and-updater that Cygwin has. (I did get it to partially compile, and that took 15 minutes (!!) of constant compiling, for a program that basically just does string lookups in a hash table. What the fuck, people.) So I said fuck that, looked at the source some more, and within 5 minutes saw the bug (which had to do with the documentation not being very clear, though I am disappointed I didn’t figure it out as soon as it started happening).

So all in all, the programming part of this was a very quick task, if you don’t count the hours I spent trying to deal with crappy open source. (The RAD Game Tools model of packing everything into one file is definitely, definitely the way to go if you ever want anyone to use your code).

My 166 lines of code assume that the hash table is present in the .mo file, but don’t care about the endianness of it. If anyone wants it, just ask.
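If a reader did need to deal with byte order (say, loading the same .mo on a big-endian console), the magic number at the start of the file is enough to tell. The sketch below is a hedged illustration, not part of the code mentioned above; MoByteOrder, mo_detect_byte_order and mo_read_u32 are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

enum class MoByteOrder { Native, Swapped, NotMo };

static uint32_t byteswap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

// The .mo magic is 0x950412de. If it reads back byte-reversed, the file was
// written on a machine with the opposite byte order from this one.
MoByteOrder mo_detect_byte_order(const unsigned char *data, size_t size) {
    if (size < 4) return MoByteOrder::NotMo;
    uint32_t magic;
    std::memcpy(&magic, data, sizeof magic);
    if (magic == 0x950412deu) return MoByteOrder::Native;
    if (magic == 0xde120495u) return MoByteOrder::Swapped;
    return MoByteOrder::NotMo;
}

// Reads one 32-bit field, swapping only when the file's order differs.
uint32_t mo_read_u32(const unsigned char *p, MoByteOrder order) {
    uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return order == MoByteOrder::Swapped ? byteswap32(v) : v;
}
```

Every header field and table entry in the file is a 32-bit integer, so routing all reads through one function like this is the only place the swap has to happen; the string bytes themselves are unaffected.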

I am currently using a GUI tool called poedit to compile the .po files into .mo, but I think it just calls the command-line tool ‘msgfmt’.

So my strings are now all externalized and happy in a standard format that any contracted translation company can deal with.

gettext(), and programs that people have built around it, tend to provide some tools for scanning through your program and extracting all the strings to build the initial .po file, but this did not seem very useful to me… even if I had the confidence that I could figure out how to make it run in a reasonable amount of time. The majority of strings in the Braid source are not text displayed to the end-user (they’re stuff like asset IDs), so such a generated file would have a lot of junk in it. But also, I think it’s important to pick a label for the string lookup that is separate from the actual English string, because that helps emphasize and document that there is an external data file that you need to change… otherwise you go and fix a typo or change some punctuation in the English text in the program, and then all your translations mysteriously break. Which would be lame.
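To make the label idea concrete, here is a small sketch of a lookup keyed on labels rather than on English text. It is an illustration under assumptions, not the Braid code: StringTable and the “menu_quit” label are invented, and a plain in-memory map stands in for whatever was loaded from the .mo file.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>

// Display strings are looked up by a stable label; the English text lives
// only in the data, so editing it never changes the key.
struct StringTable {
    std::unordered_map<std::string, std::string> entries;  // label -> display text
    std::unordered_set<std::string> missing;                // labels asked for but absent

    // Returns the display text, or the label itself when there is no entry.
    // Missing labels are recorded so they can be reported in one place.
    const std::string &lookup(const std::string &label) {
        auto it = entries.find(label);
        if (it != entries.end()) return it->second;
        return *missing.insert(label).first;
    }
};
```

With this arrangement, fixing a typo in the English entry for “menu_quit” changes one value in one data file, and every other language’s entry for the same label is untouched.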

I haven’t done the rendering part of this yet, though. That’s next. Right now my plan is to use freetype2 to generate font glyphs into a texture on-demand after startup. Does anyone have experience using freetype at all?

8 thoughts on “A Development Note about Internationalization”

  1. I’ve used FreeType. It is quite nice and works well for what you’re proposing; just make sure you completely ignore the Fontconfig stuff that tends to get lumped in with FreeType. You don’t need it, and if you think trying to get gettext compiled was a chore, you ain’t seen nothing yet (this applies to Fontconfig, not the base FreeType lib).

  2. FreeType is a PITA to compile, but the results are nice.

    I’m not sure if you need to support Kanji or Korean, but there are a lot of characters and you may need to pack glyphs dynamically.

    Multisample will blur your characters, so don’t render two triangles cutting across a quad.

    You may need to reduce VRAM usage by using shader trickery.

    That’s all of the subtle details I can think of off the top of my head.

  3. As far as I know, text extraction is normally done by parsing the _() macro; that way you end up with a catalogue (.po) containing only the relevant strings.
    I’m not sure if it’s a good idea to rewrite the strings in a code/label fashion; actually the interesting part is to write everything in one language and have it automagically set for translation. Having the strings in English will be more friendly for the translators too. It also helps with printf-like format specifiers.
    You don’t have to worry if some strings get changed in the code, because you can update the catalogues any time. Poedit has a function that scans a group of files (or a directory, can’t remember) and updates the catalogue. It shows a summary with the new and obsolete strings (unchanged strings remain translated).

  4. That makes sense, but the situation is more complicated than that. For example, the game levels have strings in them (for stuff like the name of the level, which gets drawn as a caption when the player goes there). The levels are stored in their own binary format, so these strings obviously are not going to be caught by any kind of preprocessing tool. So any time one wants to change one of these strings, changes have to be made in multiple places (the level itself, then in the translation files).

    I would rather confine the changes just to the translation files. This is a general rule of good software design: Don’t make it so that small changes in one place will magically break something in another place, without ever notifying you, unless you notice by happening to run that one code path and paying attention.

    So, if there is some automated process for updating things, but it only works sometimes on some subset of the project, then I view that as a liability and would rather not use it. Because the existence of that thing makes the project more complicated, and then you have to have all this specialized knowledge about what gets tracked and what doesn’t. Whereas if nothing ever gets tracked, that is easy, and less complicated.

    About translator convenience; I have been envisioning that they would just start with the default.po (which is in English) and replace the translations. So the English would still be right there. That does override some of the features of poedit (the way it tracks untranslated strings and highlights them for you) but I don’t necessarily think that is very useful anyway, for a project like this one.

  5. Parveen: Freetype2 compiled for me immediately today, without problems, on Windows at least. (They had a Visual Studio project file folder, and everything). I have yet to try this stuff on the 360.

    I’ve got it rendering individual glyphs and then displaying them in-game, so now all I have to do is the packing. For anyone who has done this kind of thing — I need to make an index that goes from character codes to (application-specific) glyph data records. Am I better off using a hash table keyed on the utf32 character code, or else using sequential utf8 characters to navigate a trie or some other structure like that?

    The hash table would be a lot easier programming-wise, since I can just plug one in and go. But… it seems kind of bleah.

  6. One thing you might want to look out for is that the font file can be quite large. We didn’t have time or money to license fonts, so we used the Microsoft-provided Asian font, and the .ttf file was something like 50 MB!

    So, if you’re only using a couple of different font sizes it might make more sense to pre-bake the textures and ship those rather than doing the baking on startup. You could also try to hack at the .ttf file — there are thousands of characters you don’t need since it includes dozens of Asian alphabets, of which you probably only need 3-4.

    Or if you’re not incredibly cheap like us, you could just license proper Asian fonts 😉

  7. Ooh, y’know, I suppose I should have looked into the size of Asian fonts.

    I am really loving the direct-ttf-usage for day-to-day work, though… I feel that I attained good visual quality on the style of the game much more easily than I would have using my previous pre-baked-font system.

    Anyway, we’ll see how it works out… of course, I will do whatever is necessary. I have yet to talk to the internationalization guys at all so I have no idea.
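On the glyph index question raised in comment 5, the hash-table option can be sketched roughly as follows: decode the UTF-8 text into UTF-32 code points, then key a hash map on the code point. This is a minimal sketch under assumptions, not what the game ended up doing. GlyphRecord and its fields, next_code_point, and find_or_create are all hypothetical, and the decoder assumes well-formed UTF-8 and does no validation.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical application-specific glyph data: where the glyph sits in the
// texture atlas and how big it is.
struct GlyphRecord {
    uint16_t atlas_x, atlas_y, width, height;
};

using GlyphIndex = std::unordered_map<char32_t, GlyphRecord>;

// Decodes one code point from well-formed UTF-8 and advances p past it.
char32_t next_code_point(const unsigned char *&p) {
    unsigned char c = *p++;
    if (c < 0x80) return c;                            // single-byte ASCII
    int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1; // continuation bytes
    char32_t cp = c & (0x3F >> extra);                 // payload bits of lead byte
    while (extra-- > 0) cp = (cp << 6) | (*p++ & 0x3F);
    return cp;
}

// Returns the cached record for cp, calling make_glyph (a stand-in for the
// on-demand rasterize-and-pack step) only the first time cp is seen.
template <typename MakeGlyph>
const GlyphRecord &find_or_create(GlyphIndex &index, char32_t cp, MakeGlyph make_glyph) {
    auto it = index.find(cp);
    if (it == index.end()) it = index.emplace(cp, make_glyph(cp)).first;
    return it->second;
}
```

The trie-on-UTF-8 alternative would avoid the decode step, but the decode is a handful of instructions per character, so the hash table keyed on the code point is probably the simpler place to start.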
