Collation

Compare text for sorting.

Enumerations

enum unistrength: Collation comparison levels.

enum uniweighting: Collation weighting algorithm.

Functions

unistat uni_collate(const void *s1, unisize s1_len, uniattr s1_attr, const void *s2, unisize s2_len, uniattr s2_attr, uniweighting weighting, unistrength strength, int32_t *result): Compare strings for sorting.

unistat uni_sortkeymk(const void *text, unisize text_len, uniattr text_attr, uniweighting weighting, unistrength strength, uint16_t *sortkey, size_t *sortkey_cap): Make sort key.

unistat uni_sortkeycmp(const uint16_t *sk1, size_t sk1_len, const uint16_t *sk2, size_t sk2_len, int32_t *result): Compare sort keys.

Discussion

Collation is the process of determining the sorting order of strings. Collation varies according to language and culture: Germans, French, and Swedes sort the same characters differently. Unicorn is not an internationalization library; therefore, it does not contain language-specific rules for sorting. Instead, it uses the Default Unicode Collation Element Table (DUCET), which is a data table specifying the “default” collation order for all Unicode characters. The goal of the DUCET is to provide a reasonable default ordering for all scripts, regardless of language.

To address the complexities of sorting, a multilevel comparison algorithm is employed. In comparing two strings, the most important feature is the identity of the base letters, for example, the difference between an A and a B. Accent differences are typically ignored if the base letters differ. Case differences (uppercase versus lowercase) are typically ignored if the base letters or their accents differ. The number of levels considered in a comparison is known as the strength and is indicated programmatically by the elements of the unistrength enumeration.
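The level-by-level precedence described above can be illustrated with a toy comparison. This is only a sketch under stated assumptions: the toy_weight type, the toy_compare function, and the weight values are hypothetical and are not part of Unicorn, and the real algorithm derives its weights from the DUCET and handles many cases this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical three-level weights for one character. Real weights come
 * from the DUCET; any values used with this type are made up. */
typedef struct {
    uint16_t primary;   /* base letter */
    uint16_t secondary; /* accent */
    uint16_t tertiary;  /* case */
} toy_weight;

/* Select the weight for a zero-based level (0 = primary). */
static uint16_t toy_weight_at(const toy_weight *w, int level)
{
    switch (level) {
    case 0:  return w->primary;
    case 1:  return w->secondary;
    default: return w->tertiary;
    }
}

/* Compare two weight sequences using the first `levels` levels (1 to 3).
 * Returns -1, 0, or 1. A level is consulted only when every lower level
 * compared equal across the whole string. */
static int toy_compare(const toy_weight *a, size_t a_len,
                       const toy_weight *b, size_t b_len, int levels)
{
    assert(levels >= 1 && levels <= 3);
    size_t n = a_len < b_len ? a_len : b_len;
    for (int level = 0; level < levels; level++) {
        for (size_t i = 0; i < n; i++) {
            uint16_t wa = toy_weight_at(&a[i], level);
            uint16_t wb = toy_weight_at(&b[i], level);
            if (wa != wb)
                return wa < wb ? -1 : 1;
        }
        /* In this toy, a shorter string that is a prefix sorts first. */
        if (a_len != b_len)
            return a_len < b_len ? -1 : 1;
    }
    return 0;
}
```

With one level, two sequences that differ only in accent or case compare equal; adding a second level distinguishes accents, and a third distinguishes case. A difference in base letters anywhere in the string decides the result before any accent or case difference is examined.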

Sort Keys

Collation is one of the most performance-critical features in a system. Consider the number of comparison operations involved in sorting or searching large databases. When the same strings are compared multiple times, it’s recommended to generate and cache a sort key for each one. A sort key is an array of unsigned 16-bit integers that is generated from a single string in combination with the collation settings. Sort keys must be generated with the same settings for their relative order to make sense. Comparing two sort keys produces a less than, greater than, or equal to result. This result can be used with a sorting algorithm, like merge sort, to sort a collection of strings.

Sort keys are generated with uni_sortkeymk. This function accepts a string and collation settings and produces a sort key of 16-bit integers.

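const char *string = "The quick brown fox jumps over the lazy dog.";
// Sized generously; the length actually needed depends on the string and settings.
uint16_t sortkey[64];
size_t sortkey_len = 64; // capacity of sortkey, in 16-bit units
// Per the signature above: weighting, then strength, then the output buffer.
uni_sortkeymk(string, -1, UNI_UTF8, UNI_SHIFTED, UNI_PRIMARY, sortkey, &sortkey_len);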

Once two sort keys are generated, they can be compared with uni_sortkeycmp.
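To show how cached sort keys fit into a sorting routine, the following is a minimal sketch that sorts records by key with qsort. The keyed_string type and the toy_sortkey_compare function are hypothetical: the comparison is an element-wise stand-in used so the example is self-contained, and it is not claimed to be how uni_sortkeycmp is implemented. In real code the keys would come from uni_sortkeymk and the comparison from uni_sortkeycmp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record pairing a string with its cached sort key. */
typedef struct {
    const char *text;
    const uint16_t *key;
    size_t key_len;
} keyed_string;

/* Stand-in comparison: element by element, with a shorter key that is a
 * prefix of the other ordered first. Returns -1, 0, or 1. */
static int toy_sortkey_compare(const uint16_t *k1, size_t n1,
                               const uint16_t *k2, size_t n2)
{
    size_t n = n1 < n2 ? n1 : n2;
    for (size_t i = 0; i < n; i++) {
        if (k1[i] != k2[i])
            return k1[i] < k2[i] ? -1 : 1;
    }
    return (n1 > n2) - (n1 < n2);
}

/* qsort callback: order records by their cached keys only. */
static int keyed_string_compare(const void *lhs, const void *rhs)
{
    const keyed_string *a = lhs;
    const keyed_string *b = rhs;
    return toy_sortkey_compare(a->key, a->key_len, b->key, b->key_len);
}

/* Sort records in place; the strings themselves are never re-examined. */
static void sort_by_key(keyed_string *items, size_t count)
{
    assert(items != NULL || count == 0);
    if (count > 1)
        qsort(items, count, sizeof items[0], keyed_string_compare);
}
```

Because each key is generated once and reused for every comparison, the cost of deriving collation weights is paid once per string rather than once per comparison.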

In the case of comparing one-off pairs of strings, generating a sort key makes less sense. For these cases, a one-off comparison can be made using uni_collate. This function is conceptually like the strcoll function provided by the C standard library.

Normalization

The Unicode Collation Algorithm provides a complete, unambiguous, specified ordering for all characters. Canonical decomposition is performed as part of the algorithm; therefore, it does not matter whether the input strings are normalized.

Version Compatibility

The relationship between two strings (and between their sort keys) is stable between versions of Unicode; however, the 16-bit values of a sort key may change. If sort keys are retained in persistent storage, it is recommended to store the Unicode version they were generated against. If the current version of the standard does not match the stored version, then all sort keys must be regenerated. The version of the Unicode Standard can be obtained with uni_getucdversion.
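One way to apply this recommendation is to store a version triple next to each persisted key and check it before the key is used. The sketch below assumes a hypothetical ucd_version type and stored_sortkey record, neither of which is part of Unicorn. The signature of uni_getucdversion is not given in this section, so the caller is assumed to obtain the current version from it and pass it in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical version triple; not a Unicorn type. */
typedef struct {
    int major;
    int minor;
    int patch;
} ucd_version;

/* Hypothetical record persisted alongside each sort key. */
typedef struct {
    ucd_version generated_with; /* Unicode version the key was built against */
    size_t key_len;             /* number of 16-bit units in key */
    const uint16_t *key;
} stored_sortkey;

/* Returns true when the stored key was generated against a different
 * Unicode version than `current` and therefore must be regenerated. */
static bool stored_sortkey_is_stale(const stored_sortkey *rec, ucd_version current)
{
    assert(rec != NULL);
    return rec->generated_with.major != current.major
        || rec->generated_with.minor != current.minor
        || rec->generated_with.patch != current.patch;
}
```

Any mismatch is treated as stale, including a stored version newer than the current one, since keys from different versions are not comparable with each other.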
