Thursday, March 19, 2009

UTF-8 automatic detection

If you have ever worked in an environment that mixes utf-8 with the 8-bit default character set in Windows, you may have wanted to autodetect utf-8 text. This is actually very easy, because many byte sequences are illegal in utf-8 but appear routinely in text using other character sets.

For instance, the Danish word "Øl" is encoded as $D8 $6C in the Danish local character set. However, $D8 (binary 11011000) indicates the start of a 2-byte utf-8 sequence, where the next byte must be in the range $80-$BF, which $6C is not. In other words, even this tiny 2-byte text can be clearly identified as not being utf-8.
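As a quick illustration, here is the same check done with Python's built-in utf-8 decoder, using the two bytes from the example above:

    # The Danish word "Øl" in Windows-1252 is the two bytes $D8 $6C
    data = b"\xd8\x6c"

    try:
        data.decode("utf-8")
        print("could be utf-8")
    except UnicodeDecodeError as error:
        # $D8 opens a 2-byte sequence, but $6C is not in $80-$BF
        print("not utf-8:", error)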

The basic method to autodetect utf-8 is to check that the byte sequence conforms to the utf-8 scheme for indicating the number of bytes in a character (a sketch of this check follows the list below):

* The range $80-$BF is used for bytes that are not the first in a character
* The range $C0-$DF is used for the first byte of 2-byte characters
* The range $E0-$EF is used for the first byte of 3-byte characters
* The range $F0-$F7 is used for the first byte of 4-byte characters
* The range $F8-$FB is used for the first byte of 5-byte characters
* The range $FC-$FD is used for the first byte of 6-byte characters
* The values $FE and $FF are never used
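As a sketch of this structural check (the function name is my own; it implements only the table above, not the stricter rules discussed next):

    def looks_like_utf8(data: bytes) -> bool:
        # Structural check only: each lead byte must be followed by the
        # right number of continuation bytes in the $80-$BF range.
        i = 0
        while i < len(data):
            b = data[i]
            if b <= 0x7F:
                trailing = 0      # plain ascii byte
            elif 0xC0 <= b <= 0xDF:
                trailing = 1      # 2-byte character
            elif 0xE0 <= b <= 0xEF:
                trailing = 2      # 3-byte character
            elif 0xF0 <= b <= 0xF7:
                trailing = 3      # 4-byte character
            elif 0xF8 <= b <= 0xFB:
                trailing = 4      # 5-byte character
            elif 0xFC <= b <= 0xFD:
                trailing = 5      # 6-byte character
            else:
                return False      # continuation byte out of place, or $FE/$FF
            for _ in range(trailing):
                i += 1
                if i >= len(data) or not 0x80 <= data[i] <= 0xBF:
                    return False
            i += 1
        return True

    print(looks_like_utf8("Øl".encode("cp1252")))  # False: $D8 $6C
    print(looks_like_utf8("Øl".encode("utf-8")))   # True:  $C3 $98 $6C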

However, there are more rules that you can use (a stricter check is sketched after the list):

* 5-byte and 6-byte characters are not used, even though they would be technically possible. If you encounter a structurally valid 5-byte or 6-byte combination, which is very unlikely, you can and may reject it as an invalid sequence.
* It is incorrect to use more bytes than necessary. For instance, if you want to encode the character 'A' (codepoint 65=$41), it is ok to encode it using 1 byte ($41) but not ok to use the overlong 2-byte form ($C1 $81).
* If your application knows that some unicode values cannot have been produced by the creator of the text, you can reject these values as well, as an application-specific rule.
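A sketch of a stricter check along these lines (the MIN_CODEPOINT table is my own formulation of the overlong rule; the application-specific exclusions from the last point are left out):

    # Smallest codepoint that legitimately needs each sequence length;
    # anything below it is an overlong encoding and must be rejected.
    MIN_CODEPOINT = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}

    def is_strict_utf8(data: bytes) -> bool:
        i = 0
        while i < len(data):
            b = data[i]
            if b <= 0x7F:
                length, codepoint = 1, b
            elif 0xC0 <= b <= 0xDF:
                length, codepoint = 2, b & 0x1F
            elif 0xE0 <= b <= 0xEF:
                length, codepoint = 3, b & 0x0F
            elif 0xF0 <= b <= 0xF7:
                length, codepoint = 4, b & 0x07
            else:
                # stray continuation byte, a 5/6-byte lead byte, or $FE/$FF
                return False
            for j in range(1, length):
                if i + j >= len(data) or not 0x80 <= data[i + j] <= 0xBF:
                    return False
                codepoint = (codepoint << 6) | (data[i + j] & 0x3F)
            if codepoint < MIN_CODEPOINT[length]:
                return False  # overlong, e.g. 'A' encoded as $C1 $81
            i += length
        return True

    print(is_strict_utf8(b"\x41"))      # True:  'A' as 1 byte
    print(is_strict_utf8(b"\xc1\x81"))  # False: 'A' overlong as 2 bytes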

One of the things that makes this autodetection so beautiful is that it works with almost 100% accuracy. In Denmark, we use Windows-1252, which is closely related to ANSI, ISO-8859-1 and ISO-8859-15:

* Byte values in the $00-$7F range can be detected as either ANSI or utf-8, and it doesn't matter, because the two encodings are identical for these values.
* The $80-$BF range contains a lot of symbols that do not appear inside words, and the $C0-$F6 range contains almost only letters. In other words, for an ANSI text to contain a valid non-ascii utf-8 byte sequence, a letter would have to be followed by exactly the right number of symbols.
* $C0-$DF range: The most likely false positive is a 2-byte sequence that starts at the end of an uppercase word ending in Æ or Ø, followed by a sign like "™", something like "TRÆ™". The last two characters are encoded $C6 $99 in ANSI, which is a valid utf-8 combination with the unicode value $0199. However, $0199 is "Latin small letter k with hook", which most Danish applications do not treat as a valid character. This way, the text can still be rejected as utf-8 with great certainty.
* $E0-$F7 range: Here it is very unlikely to get the right utf-8 byte sequences, but even if it happens, the decoded value would probably be regarded as illegal in the specific application. Remember, many applications only accept text that can be converted to the local 8-bit character set, because they are integrated with other systems or need to save all text in 8-bit character set files or databases.
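As a sketch of this application-level rejection, with cp1252 standing in for the local Danish character set (the "TRÆ™" bytes come from the example above):

    # "TRÆ™" in Windows-1252: T R Æ ™  ->  $54 $52 $C6 $99
    data = b"\x54\x52\xc6\x99"

    try:
        text = data.decode("utf-8")   # structurally valid utf-8: "TRƙ" (U+0199)
    except UnicodeDecodeError:
        text = None

    if text is None:
        print("not utf-8 at all")
    else:
        try:
            text.encode("cp1252")     # must be representable in the local charset
            print("accepted as utf-8:", text)
        except UnicodeEncodeError:
            # U+0199 does not exist in cp1252, so reject the utf-8 reading
            print("rejected: decodes to characters outside the local charset")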
