I wish this site were powered by Django

August 08th, 2006

UTF8-encoded Unicode support

Filed under: Attitude,Cutting the crap,Django,Java,Python,Technology — jm @ 02:15

From this post at MySQL DBA:

The utf8 spec says that a utf8 character can take up to 4 bytes, mySQL currently only supports up to 3 bytes.

…holy crap. Let’s summarize how different languages and frameworks support Unicode at the moment:

For some reason, someone over at Sun decided that Java’s char data type should be 2 bytes wide (one UTF-16 code unit) and that strings should be serialized as modified UTF8, so that certain characters that require 4 bytes in standard UTF8 now require 6. This actually makes sense for compatibility with C programs (modified UTF8 has no “0-bytes” inside strings), but makes it hard to support Unicode’s supplementary planes, whose characters need 4 bytes in UTF8 and two chars (a surrogate pair) in UTF-16. The details can be found in JSR-204. Ever since then… string operations in Java cannot reliably calculate the length of a string in characters, because methods like length() actually count UTF-16 code units, not code points.
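The byte counts are easy to check for yourself. The following is only a minimal sketch in Python 3 (an assumption on my part, the code is not taken from the MySQL post or from JSR-204): it takes one supplementary-plane character and counts its size in standard UTF8, in UTF-16 and in a rough stand-in for modified UTF8. The stand-in only reproduces the surrogate-pair part of modified UTF8, not its special treatment of the 0-byte.

```python
# One character outside the Basic Multilingual Plane:
# U+1D11E MUSICAL SYMBOL G CLEF.
ch = "\U0001D11E"

# Python 3 strings count code points, so this is one character.
code_points = len(ch)

# Standard UTF-8 needs 4 bytes, one more than a 3-byte limit allows.
utf8_len = len(ch.encode("utf-8"))

# UTF-16 needs a surrogate pair, i.e. two 16-bit code units.
# This is the number a code-unit based length method would report.
utf16_units = len(ch.encode("utf-16-be")) // 2

# Rough stand-in for modified UTF-8: split the character into its
# two surrogates and encode each one as its own 3-byte sequence.
offset = ord(ch) - 0x10000
high = 0xD800 + (offset >> 10)
low = 0xDC00 + (offset & 0x3FF)
modified = (chr(high) + chr(low)).encode("utf-8", "surrogatepass")
modified_len = len(modified)

print(code_points, utf8_len, utf16_units, modified_len)
```

So the same single character is 1 code point, 4 bytes, 2 code units or 6 bytes, depending on who is counting.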

PHP… don’t even get me started. PHP just sucks.

Ruby apparently also has problems moving to Unicode (and this for a language that was designed in Japan?).

So while I’m very glad that I decided to focus on Django and Python, which have excellent Unicode support, I really feel, more than ever, the need to get the word out that text processing is incredibly hard and that there is no excuse for so many developers and teachers not caring about it.

Update (08/31/2006)

Django’s Unicode support apparently also sucks. They try to do the right thing in their MySQL driver (using SET NAMES 'utf8'), but fail to set the connection character set properly, so even if the model is in UTF8, MySQL will treat every incoming string as latin1. This leads to ugly re-encoding errors. It seems to work better with PostgreSQL, but it’s still a huge fucking bug. The developers are trying to get their act together, though, and the “unicodification” will probably be done before they hit 1.0. Tickets #1356, #1355 and #952 read very badly. At least I now have pointers on what to fix (follow the links), but out-of-the-box you’re fucked if you want to connect your legacy Windows-1252 database to the web using UTF8 with Django.
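To show what such a re-encoding error looks like, here is a small sketch in plain Python 3, without Django or MySQL involved. It assumes that Python’s latin-1 codec is a good enough stand-in for what MySQL calls latin1, and the sample string is made up for illustration.

```python
# Sample text with two non-ASCII characters (illustrative only).
text = "Grüße"

# What a UTF-8 speaking client puts on the wire: 7 bytes for 5 characters.
wire_bytes = text.encode("utf-8")

# What a connection that believes it is latin1 makes of those bytes:
# every byte becomes its own character, so "ü" turns into "Ã¼".
misread = wire_bytes.decode("latin-1")

# Storing the misread text in a UTF-8 column encodes it a second time.
stored = misread.encode("utf-8")

# The damage can be undone, but only if you know exactly what happened.
repaired = stored.decode("utf-8").encode("latin-1").decode("utf-8")

print(len(text), len(wire_bytes), len(misread), len(stored))
print(repaired == text)
```

The text survives the round trip only because the mistake is reversed step by step; anything that truncates or normalizes the stored value in between makes the data unrecoverable.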

This reminds me a bit of the sad situation with Typo3. At least, with Django, it’s not the programming language that’s the core problem.

Update (02/15/2008)

Unlike Java and PHP, Django has come a long way since this post was written. I wrote an update on Django’s Unicode capabilities that can be found here.

One Response

  1. Jonas Maurus’ maurus.net » Django 0.95 has unicode problems, too

    […] I had to revise my post titled “UTF8-encoded Unicode support“, because I found out that django’s unicode support has its own problems. Some are, of course, connected to the character-set handling of their database code, but generally they currently handle all strings as binary, assuming that everything works within the DEFAULT_ENCODING setting. […]