Thursday, March 19, 2009

Don't use TStringList for machine-readable text

TStringList is one of the most used classes in Delphi. It is very convenient for storing strings for the user (TMemo.Lines), storing parameters for components (TIBDatabase.Params), objects and many other items.

However, there are several problems with it. It's slow when sorting data (it uses the Win32 API for comparing strings), but the dangerous part is that sorting and indexing are localized. This means that this code fails on my computer, which uses Danish regional settings, but works on an American PC:

sl:=TStringList.Create;
try
  sl.Sorted:=True;
  sl.Add ('AA');
  sl.Add ('AB');
  // Fails under Danish regional settings, where 'AA' sorts after 'AB'
  Assert (sl.Strings[0]='AA');
finally
  sl.Free;
end;

The reason is simple. This is the Danish alphabet:

ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅ

By tradition, the last letter Å can also be written AA, and you can see both spellings mixed on the homepage of the city of Århus. The correct sort order in Danish is therefore:

Aachen
Aalto
Berlin
Copenhagen
Dresden
Essen
Frederikshavn
Aabenraa
Aalborg
Aarhus

In the first two words, AA means A and then A. In the last three words, AA means Å, which is the last letter in the alphabet. However, Windows doesn't know when AA means Å and when it means A A, so it always assumes that AA means Å, and always puts AA last.
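
The difference can be seen directly by comparing the two strings with the RTL's own functions. This is only a minimal sketch: CompareStr compares character codes and gives the same answer on every machine, while AnsiCompareStr goes through the Windows locale-aware comparison, which is the kind of comparison TStringList relies on. The program name is made up for the illustration, and the second result depends on the regional settings of the PC that runs it.

```pascal
program CompareDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

begin
  // Ordinal comparison: negative on every machine, because 'AA' < 'AB'
  // when comparing character codes.
  Writeln ('CompareStr:     ', CompareStr ('AA', 'AB'));

  // Locale-aware comparison: the sign depends on the user's regional
  // settings, so no fixed result can be promised here.
  Writeln ('AnsiCompareStr: ', AnsiCompareStr ('AA', 'AB'));
end.
```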

Let's assume that you want to use a TStringList to store codes in a specific order, such as ATC codes. The first codes are:

A01AA01 Sodium fluoride
A01AA02 Sodium monofluorophosphate
A01AA03 Olaflur
A01AA04 Stannous fluoride
A01AA30 Combinations
A01AA51 Sodium fluoride, combinations
A01AB02 Hydrogen peroxide
A01AB03 Chlorhexidine
A01AB04 Amphotericin B
A01AB05 Polynoxylin

This is how TStringList (and Windows) sorts a selection of ATC codes under Danish regional settings:

A01AB04 Amphotericin B
A01AC02 Dexamethasone
A01AA30 Combinations
A02AB03 Aluminium phosphate
A02BA05 Niperotidine
A02AA05 Magnesium silicate

If you want to avoid that, then don't use TStringList.
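
If replacing TStringList everywhere is not practical, one possible workaround is a descendant that overrides the virtual CompareStrings method with the ordinal CompareStr. This is a sketch under assumptions, not a tested drop-in: TOrdinalStringList is a hypothetical class name, not part of the RTL, and the override makes the list ignore the CaseSensitive property, because CompareStr is always case sensitive.

```pascal
program OrdinalSortDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

type
  // Hypothetical helper class (not part of the RTL): a string list
  // whose sorting and searching do not depend on the user's locale.
  TOrdinalStringList = class(TStringList)
  protected
    function CompareStrings (const S1, S2: string): Integer; override;
  end;

function TOrdinalStringList.CompareStrings (const S1, S2: string): Integer;
begin
  // CompareStr compares character codes, so 'AA' always sorts before 'AB'.
  Result:=CompareStr (S1, S2);
end;

var
  sl: TOrdinalStringList;
begin
  sl:=TOrdinalStringList.Create;
  try
    sl.Sorted:=True;
    sl.Add ('A01AB02 Hydrogen peroxide');
    sl.Add ('A01AA01 Sodium fluoride');
    // Holds regardless of regional settings, because the comparison is ordinal.
    Assert (sl.Strings[0]='A01AA01 Sodium fluoride');
    Write (sl.Text);
  finally
    sl.Free;
  end;
end.
```

Overriding CompareStrings rather than calling CustomSort matters here: a sorted list also uses CompareStrings when it searches, so sorting and lookups stay consistent with each other.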
