I wish this site were powered by Django

August 08th, 2006

New features

Filed under: General — jm @ 18:48

Finally I had time to port some features from my maurus.net staging host. For some time now, I’ve been experimenting with all kinds of JavaScript toolkits (Prototype in particular). Last week I integrated the excellent WordPress Widgets Plug-In into my theme and rolled out the changeable header graphic (you noticed that, didn’t you? ;-)). Today’s code update for maurus.net adds new JavaScript functionality.

I tend to clutter my articles with lots of notes that are important to create context, but might turn a reader away. So go and meet the new hide-the-notes button.

I know that it’s not particularly impressive, but as with all dynamic JavaScript-based features, I think a lot of thought has to go into usability and graceful fallback, so I’m not very eager to add such things before testing them extensively.

UTF8-encoded Unicode support

Filed under: Attitude, Cutting the crap, Django, Java, Python, Technology — jm @ 02:15

From this post at MySQL DBA:

The utf8 spec says that a utf8 character can take up to 4 bytes, mySQL currently only supports up to 3 bytes.
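To make the quoted limit concrete, here is a minimal Python 3 sketch. It assumes that a 3-byte cap means "nothing above U+FFFF", which is simply what three UTF-8 bytes can hold. The helper names `utf8_len` and `fits_in_three_byte_utf8` are made up for illustration and are not part of any MySQL or Django API.

```python
def utf8_len(ch: str) -> int:
    """Return how many bytes the text `ch` occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_three_byte_utf8(text: str) -> bool:
    """True if every character is at or below U+FFFF (max. 3 bytes in UTF-8)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


# ASCII letter, Latin-1 letter, euro sign, and a supplementary-plane
# character (U+1D11E, MUSICAL SYMBOL G CLEF).
for ch in ["A", "\u00fc", "\u20ac", "\U0001D11E"]:
    print(f"U+{ord(ch):04X}: {utf8_len(ch)} byte(s), "
          f"fits in 3 bytes: {fits_in_three_byte_utf8(ch)}")
```

Only the last sample needs a fourth byte, so it is the kind of character a 3-byte column cannot store.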

…holy crap. Let’s summarize how different languages and frameworks support Unicode at the moment:

For some reason, someone over at Sun decided that Java’s char data type should be 2 bytes wide (one UTF-16 code unit) and that strings should be serialized as modified UTF-8, so that certain characters that would require 4 bytes in standard UTF-8 now require 6. This actually makes sense to keep compatibility with C programs (modified UTF-8 has no “0-bytes” in strings), but makes it hard to support Unicode’s supplementary planes, whose characters need 4 bytes in UTF-8 and two chars in Java. The details can be found in JSR-204. Ever since then, string operations in Java cannot reliably calculate string length, because methods like length() actually count UTF-16 code units, not characters.
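The byte counts above can be checked without a JVM. The following Python 3 sketch counts UTF-16 code units (which is what Java’s length() reports) and emulates modified UTF-8 by encoding each UTF-16 code unit separately. The function names are hypothetical, and the emulation covers only the two differences mentioned here (surrogates encoded individually, NUL written as two bytes); it is not a complete implementation of the JVM format.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units in the UTF-16 form of `text`."""
    return len(text.encode("utf-16-le")) // 2


def modified_utf8(text: str) -> bytes:
    """Rough emulation of modified UTF-8 (illustrative, not the JVM's code)."""
    out = bytearray()
    utf16 = text.encode("utf-16-le")
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "little")
        if unit == 0:
            # NUL becomes the two-byte sequence C0 80, so no 0-byte appears.
            out += b"\xc0\x80"
        else:
            # Each surrogate half is encoded on its own as a 3-byte sequence.
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


clef = "\U0001D11E"  # one character from a supplementary plane
print("code points:      ", len(clef))
print("UTF-16 code units:", utf16_code_units(clef))
print("UTF-8 bytes:      ", len(clef.encode("utf-8")))
print("modified UTF-8:   ", len(modified_utf8(clef)))
```

One character, two code units, four bytes in standard UTF-8 and six in the modified form: that is the whole mismatch in four numbers.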

PHP… don’t even get me started. PHP just sucks.

Ruby apparently also has problems with moving to Unicode (and this language was designed in Japan!).

So while I’m very glad that I decided to focus on Django and Python, which have excellent Unicode support, I really feel, more than ever, the need to get the word out that text processing is incredibly hard and that there is no excuse for so many developers and teachers not caring about it.

Update (08/31/2006)

Django’s Unicode support apparently also sucks. They try to do the right thing in their MySQL driver (using SET NAMES 'utf8'), but fail to set the connection character set properly, so even if the model is in UTF-8, MySQL will treat every incoming string as latin1. This leads to ugly re-encoding errors. It seems to work better with PostgreSQL, but it’s still a huge fucking bug. The developers are trying to get their act together, though, and the “unicodification” will probably be done before they hit 1.0. #1356, #1355 and #952 read very badly. At least I now have pointers on what to fix (follow the links), but out of the box you’re fucked if you want to connect your legacy Windows-1252 database to the web using UTF-8 with Django.
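What “treat every incoming string as latin1” does to text can be reproduced in plain Python 3, without a database. This is only a sketch of the general failure mode, not of Django’s or MySQL’s actual code path, and the helper names are invented for the example.

```python
def misread_as_latin1(text: str) -> str:
    """Encode as UTF-8, then wrongly decode those bytes as latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled: str) -> str:
    """Undo a single round of the misreading above."""
    return garbled.encode("latin-1").decode("utf-8")


original = "\u00fc"                    # LATIN SMALL LETTER U WITH DIAERESIS
garbled = misread_as_latin1(original)  # two characters instead of one

print(repr(original), "->", repr(garbled))
print("stored again as UTF-8:", garbled.encode("utf-8"))
print("round trip restores it:", repair(garbled) == original)
```

The two UTF-8 bytes of one character are read back as two latin-1 characters, and encoding those again doubles the damage. The repair only works while you know exactly how many times the text was misread.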

This reminds me a bit of the sad situation with Typo3. At least, with Django, it’s not the programming language that’s the core problem.

Update (02/15/2008)

Unlike Java and PHP, Django has come a long way since this post was written. I wrote an update on Django’s Unicode capabilities that can be found here.