Code Examples 🔗

This section contains code examples that demonstrate how to perform common Unicode operations with Unicorn.

Unicorn has a single header file, unicorn.h, which you must include to access its definitions. The examples omit this include and the main function for brevity.

Text Encodings 🔗

The Unicode Standard defines several encoding forms for representing code points in memory. The most common encoding forms are UTF-8, UTF-16, and UTF-32. Unicorn defines several functions for decoding and converting between these encoding forms which are demonstrated below.

Decoding Text 🔗

This example decodes the Unicode scalar values of a null terminated UTF-8 encoded string. The string is declared with the u8 string literal syntax introduced in C11. The UNI_UTF8 constant is passed to uni_next to indicate the input string is encoded as UTF-8.

UTF-16 and UTF-32 are processed identically, except you’d use the u or U string literal syntax to declare the string as UTF-16 or UTF-32, respectively, and you’d pass UNI_UTF16 or UNI_UTF32. By default, Unicorn assumes native byte order, but you can pass UNI_BIG or UNI_LITTLE to specify big or little endian.

const char str[] = u8"I 🕵️."; // I spy
unisize i = 0;
for (;;) {
    unichar cp = 0x0;
    unistat r = uni_next(str, -1, UNI_UTF8, &i, &cp);
    if (r == UNI_DONE) {
        break;
    } else if (r == UNI_BAD_ENCODING) {
        // malformed character
    } else {
        printf("U+%04X\n", cp); // print scalar
    }
}

Decoding Text in Reverse 🔗

This example decodes the Unicode scalar values in reverse starting from the last character of the string.

const char str[] = u8"I 🕵️."; // I spy
unisize i = strlen(str);
for (;;) {
    unichar cp;
    unistat r = uni_prev(str, -1, UNI_UTF8, &i, &cp);
    if (r == UNI_DONE) {
        break;
    } else if (r == UNI_BAD_ENCODING) {
        // malformed character
    } else {
        printf("U+%04X\n", cp); // print scalar
    }
}
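
Reverse decoding depends on finding where the previous character starts. In UTF-8 every continuation byte has the bit pattern 10xxxxxx, so one way to do this is to step backwards until a byte that is not a continuation byte is reached. The sketch below illustrates that idea in plain C. It is not Unicorn's implementation, the helper name utf8_prev_start is hypothetical, and it assumes the input is well-formed UTF-8.

```c
#include <stddef.h>

// Hypothetical helper (not part of Unicorn): given a byte offset `i` that
// sits on a character boundary, returns the offset at which the previous
// character starts. Assumes `s` is well-formed UTF-8.
static size_t utf8_prev_start(const char *s, size_t i)
{
    if (i == 0) {
        return 0; // already at the start of the string
    }
    do {
        i--; // step back one byte
    } while (i > 0 && ((unsigned char)s[i] & 0xC0) == 0x80); // skip 10xxxxxx
    return i;
}
```

For the seven bytes of "I \xF0\x9F\x95\xB5." (a simplified variant of the string above, without the variation selector), starting at offset 7 yields 6, then 2, then 1, then 0.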

Encoding Scalar Values 🔗

Unicode scalar values can be encoded into any character encoding form. This example encodes HAMBURGER 🍔 (U+1F354) as UTF-8.

The minimum size of the output buffer depends on which encoding form is being targeted. Here, the output buffer size is 4 because that’s the longest code unit sequence for a Unicode scalar value in the UTF-8 encoding form. The table here lists the minimum buffer sizes for each encoding.

uint8_t dest[4] = {0};
unisize dest_len = 4;
uni_encode(U'🍔', dest, &dest_len, UNI_UTF8);
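
To make the four-byte bound concrete, the following sketch encodes a scalar value as UTF-8 by hand. It is a standalone illustration in plain C, not Unicorn's implementation. The helper name utf8_encode_sketch is hypothetical, and the code assumes its input is a valid Unicode scalar value (not a surrogate and no larger than U+10FFFF).

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper (not part of Unicorn): writes the UTF-8 code units
// for the scalar value `cp` into `out` and returns how many were written.
// Assumes `cp` is a valid Unicode scalar value.
static size_t utf8_encode_sketch(uint32_t cp, uint8_t out[4])
{
    if (cp <= 0x7F) { // 1 code unit: 0xxxxxxx
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp <= 0x7FF) { // 2 code units: 110xxxxx 10xxxxxx
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) { // 3 code units: 1110xxxx 10xxxxxx 10xxxxxx
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    // 4 code units: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}
```

For U+1F354 this returns 4 and produces the bytes F0 9F 8D 94. No scalar value needs more than four code units in UTF-8.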

Converting Encoding Forms 🔗

You can convert between encoding forms using the uni_convert function. The “input” string is specified by the first three arguments and the “destination” by the last three. In this example, the input string is null terminated, so a length of -1 is given. The destination buffer will be null terminated because the UNI_NULIFY flag is given. The re-encoded text can be iterated as shown in this example.

const char src[] = u8"👨🏻‍🚀🧑🏼‍🚀 landed on the 🌕 in 1969.";
uint16_t dest[64] = {0};
unisize dest_len = 64;
uni_convert(src, -1, UNI_UTF8, dest, &dest_len, UNI_UTF16 | UNI_NULIFY);

If the size of the destination buffer is unknown, then you can compute it by calling uni_convert with a NULL destination buffer argument and zero destination buffer length. The implementation will write the number of code units needed to the dest_len argument. Once the size is known, you can dynamically allocate a buffer large enough to accommodate the encoded text.

// Call uni_convert() once to compute the number of code units needed.
const char src[] = u8"👨🏻‍🚀🧑🏼‍🚀 landed on the 🌕 in 1969.";
unisize dest_len = 0;
uni_convert(src, -1, UNI_UTF8, NULL, &dest_len, UNI_UTF16 | UNI_NULIFY);

// Allocate a sufficiently-sized buffer.
uint16_t *dest = calloc(dest_len, sizeof(dest[0]));

// Call uni_convert() again with the sufficiently-sized buffer.
uni_convert(src, -1, UNI_UTF8, dest, &dest_len, UNI_UTF16 | UNI_NULIFY);
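
The destination length usually differs from the source length because each encoding form spends a different number of code units on the same scalar value. As a rough illustration of what a measuring pass has to work out, the sketch below counts the UTF-16 code units that a UTF-8 string would occupy. It is independent of Unicorn, the helper name utf16_units_needed is hypothetical, it assumes well-formed UTF-8 input, and its count excludes any null terminator.

```c
#include <stddef.h>

// Hypothetical helper (not part of Unicorn): counts the UTF-16 code units
// needed to represent the null terminated, well-formed UTF-8 string `utf8`.
// The null terminator is not counted.
static size_t utf16_units_needed(const char *utf8)
{
    size_t units = 0;
    for (const unsigned char *p = (const unsigned char *)utf8; *p != 0; p++) {
        if ((*p & 0xC0) == 0x80) {
            continue; // continuation byte: belongs to a character already counted
        }
        // A four-byte UTF-8 sequence (lead byte 0xF0 or above) encodes a
        // supplementary scalar value, which takes a surrogate pair in UTF-16.
        units += (*p >= 0xF0) ? 2 : 1;
    }
    return units;
}
```

For example, the nine bytes of "🌕 1969" in UTF-8 correspond to seven UTF-16 code units: two for the moon and one each for the space and the four digits.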

Normalization 🔗

Unicode normalization transforms a string into a form for testing equivalence. Unicorn supports canonical equivalence testing. Strings are said to be canonically equivalent if they have the same visual appearance (i.e. the same grapheme clusters).

Canonical Equivalence 🔗

Strings can be compared for canonical equivalence with uni_normcmp. In the following code snippet is_equal is set to true or false depending on if s1 has the same grapheme clusters as s2. In this example, they do have the same grapheme clusters so the implementation will set is_equal to true.

const char *s1 = u8"å";       // precomposed 'a' with ring above
const char *s2 = u8"a\u030A"; // decomposed 'a' with ring above
bool is_equal = false;
uni_normcmp(s1, -1, UNI_UTF8, s2, -1, UNI_UTF8, &is_equal);

The implementation of uni_normcmp incrementally normalizes the input strings and compares them. If you intend to compare the same strings multiple times, then it’s recommended to normalize the input strings first with uni_norm and then use memcmp for a faster comparison. See Normalize Your Strings for details.
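
A plain memcmp is only meaningful after normalization because canonically equivalent strings can have different byte sequences. The sketch below uses only the C standard library, with no Unicorn calls, and spells out the UTF-8 bytes of the two strings from the example above. The helper name bytes_equal is hypothetical.

```c
#include <stdbool.h>
#include <string.h>

// Hypothetical helper: true only if the two null terminated strings are
// byte-for-byte identical.
static bool bytes_equal(const char *a, const char *b)
{
    size_t n = strlen(a);
    return n == strlen(b) && memcmp(a, b, n) == 0;
}

// U+00E5 (precomposed 'a' with ring above) encoded as UTF-8: 2 bytes.
static const char precomposed[] = "\xC3\xA5";

// U+0061 U+030A (decomposed 'a' with ring above) encoded as UTF-8: 3 bytes.
static const char decomposed[] = "a\xCC\x8A";
```

Here bytes_equal(precomposed, decomposed) is false even though the two strings are canonically equivalent, which is why both inputs should be brought to the same normalization form before comparing bytes.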

Normalizing Text 🔗

Strings can be normalized to either Normalization Form C or D with uni_norm. The following code snippet normalizes a null terminated UTF-8 string to Normalization Form D.

const char *src = u8"Åström";
char dest[16] = {0};
unisize dest_len = 16;
uni_norm(UNI_NFD,                                 // normalization form
         src, -1, UNI_UTF8,                       // input
         dest, &dest_len, UNI_UTF8 | UNI_NULIFY); // output

If the size of the destination buffer is not known ahead of time, then compute it by calling uni_norm with a NULL destination buffer argument and zero destination buffer length. The implementation will write the number of code units needed to the dest_len argument. Once the size is known, you can dynamically allocate a buffer large enough to accommodate the normalized text.

// Call uni_norm() once to compute the number of code units needed.
const char *src = u8"Åström";
unisize dest_len = 0;
uni_norm(UNI_NFD, src, -1, UNI_UTF8, NULL, &dest_len, UNI_UTF8 | UNI_NULIFY);

// Allocate a sufficiently-sized buffer.
char *dest = malloc(dest_len);

// Call uni_norm() again with the sufficiently-sized buffer.
uni_norm(UNI_NFD, src, -1, UNI_UTF8, dest, &dest_len, UNI_UTF8 | UNI_NULIFY);

Case Mapping 🔗

Case mapping transforms characters from one case to another.

Case Conversion 🔗

Case conversion transforms text into a particular cased form for display to an end-user. The following code snippet demonstrates how to transform text to titlecase.

const char *src = "The quick brown fox jumps over the lazy dog.";
char dest[64] = {0};
unisize dest_len = 64;
uni_caseconv(UNI_TITLE, src, -1, UNI_UTF8, dest, &dest_len, UNI_UTF8 | UNI_NULIFY);

Case conversion and folding can change the length of the string. If the new size is not known ahead of time, then compute it by calling uni_caseconv with a NULL destination buffer and zero destination buffer length.

// Call uni_caseconv() once to compute the number of code units needed.
const char *src = "The quick brown fox jumps over the lazy dog.";
unisize dest_len = 0;
uni_caseconv(UNI_TITLE, src, -1, UNI_UTF8, NULL, &dest_len, UNI_UTF8 | UNI_NULIFY);

// Allocate a sufficiently-sized buffer.
char *dest = malloc(dest_len);

// Call uni_caseconv() again with the sufficiently-sized buffer.
uni_caseconv(UNI_TITLE, src, -1, UNI_UTF8, dest, &dest_len, UNI_UTF8 | UNI_NULIFY);
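
As a concrete illustration of why the length can change, the sketch below writes out two full uppercase mappings from the Unicode character data as UTF-8 byte strings and measures the difference in bytes. It uses only the C standard library and makes no Unicorn calls. The helper name byte_length_change is hypothetical.

```c
#include <string.h>

// U+FB01 LATIN SMALL LIGATURE FI (3 bytes in UTF-8) and its full
// uppercase mapping, the two letters "FI" (2 bytes).
static const char fi_ligature[] = "\xEF\xAC\x81";
static const char fi_upper[] = "FI";

// U+0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE (2 bytes) and its
// full uppercase mapping U+02BC U+004E (3 bytes).
static const char n_apostrophe[] = "\xC5\x89";
static const char n_apostrophe_upper[] = "\xCA\xBC" "N";

// Hypothetical helper: how many bytes longer (positive) or shorter
// (negative) `after` is compared with `before`.
static long byte_length_change(const char *before, const char *after)
{
    return (long)strlen(after) - (long)strlen(before);
}
```

The first mapping shrinks the text by one byte and the second grows it by one byte, so a destination buffer sized to match the source is not guaranteed to be large enough.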

Caseless Matching 🔗

Perform caseless matching with uni_casefoldcmp. Caseless matching checks for either default caseless equivalence or canonical caseless equivalence, depending on whether the first argument is UNI_DEFAULT or UNI_CANONICAL.

const char *s1 = u8"å";      // precomposed 'a' with ring above
const char *s2 = u8"A\u030A"; // decomposed 'A' with ring above
bool is_equal = false;
uni_casefoldcmp(UNI_CANONICAL, s1, -1, UNI_UTF8, s2, -1, UNI_UTF8, &is_equal);

In this example, s1 is canonically compared with s2, and the implementation sets is_equal to true because the two strings are a canonical caseless match. If UNI_DEFAULT had been used instead of UNI_CANONICAL, then is_equal would be false because precomposed and decomposed characters are not binary equivalent.

Text Segmentation 🔗

Text segmentation is the process of determining the boundaries of text elements. Unicorn supports determining the boundaries of extended grapheme clusters, words, and sentences.

Boundaries are found with the uni_nextbrk and uni_prevbrk functions. The text element to segment on is specified by one of the following constants: UNI_GRAPHEME, UNI_WORD, or UNI_SENTENCE.

Iterate Graphemes 🔗

The following code snippet demonstrates how to iterate the extended grapheme clusters of a string. More specifically, it prints the break positions of the graphemes.

const char *string = u8"Hi, 世界";
unisize index = 0;
while (uni_nextbrk(UNI_GRAPHEME, string, -1, UNI_UTF8, &index) == UNI_OK) {
    printf("%d\n", index); // prints '1', '2', '3', '4', '7', '10'
}

Iterate Graphemes in Reverse 🔗

This example demonstrates how to iterate the extended grapheme clusters of a string starting from the end of a string. More specifically, it prints the break positions of the graphemes in reverse.

const char *string = u8"Hi, 世界";
unisize index = strlen(string);
while (uni_prevbrk(UNI_GRAPHEME, string, -1, UNI_UTF8, &index) == UNI_OK) {
    printf("%d\n", index); // prints '7', '4', '3', '2', '1', '0'
}

Iterate Words 🔗

Iterating words is identical to iterating graphemes. The only difference is UNI_GRAPHEME becomes UNI_WORD.

const char *string = u8"Hello, 世界";
unisize index = 0;
while (uni_nextbrk(UNI_WORD, string, -1, UNI_UTF8, &index) == UNI_OK) {
    printf("%d\n", index); // prints '5', '6', '7', '10', '13'
}

Iterate Sentences 🔗

Iterating sentences is identical to iterating words. The only difference is UNI_WORD becomes UNI_SENTENCE.

const char *string = u8"Hello, 世界. こんにちは, World!";
unisize index = 0;
while (uni_nextbrk(UNI_SENTENCE, string, -1, UNI_UTF8, &index) == UNI_OK) {
    printf("%d\n", index); // prints '15', '38'
}

Collation 🔗

Collation determines the sorting order of strings.

Comparing Strings for Sorting 🔗

The following code snippet compares strings s1 and s2 with uni_collate. The implementation of uni_collate will write either -1, 0, or 1 to result depending on if s1 < s2, s1 = s2, or s1 > s2. This value can then be used with a sorting algorithm, like merge sort, to sort a collection of strings.

const char *s1 = "zebra";
const char *s2 = "zoo";
int result = 0;
uni_collate(s1, -1, UNI_UTF8,
            s2, -1, UNI_UTF8,
            UNI_NON_IGNORABLE, UNI_TERTIARY, &result);
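
The -1, 0, or 1 written to result maps directly onto the comparator contract of the standard qsort function. The sketch below shows that adapter shape. So that it compiles without Unicorn, it uses a stand-in named collate_stub (hypothetical) that mimics the out-parameter style with strcmp. That stand-in orders by byte value, not by the Unicode Collation Algorithm, and in real code its body would be a call to uni_collate like the one above. The names compare_strings and sort_strings are also hypothetical.

```c
#include <stdlib.h>
#include <string.h>

// Stand-in for a collation call (hypothetical, not part of Unicorn):
// writes -1, 0, or 1 to *result. It uses strcmp, which is byte order
// and NOT linguistic collation.
static void collate_stub(const char *s1, const char *s2, int *result)
{
    int c = strcmp(s1, s2);
    *result = (c > 0) - (c < 0); // clamp to -1, 0, or 1
}

// qsort comparator over an array of `const char *` elements.
static int compare_strings(const void *a, const void *b)
{
    const char *s1 = *(const char *const *)a;
    const char *s2 = *(const char *const *)b;
    int result = 0;
    collate_stub(s1, s2, &result);
    return result;
}

// Sorts `count` strings in place using the comparator above.
static void sort_strings(const char **strings, size_t count)
{
    qsort(strings, count, sizeof(strings[0]), compare_strings);
}
```

Sorting the array {"zoo", "apple", "zebra"} with sort_strings leaves it in the order "apple", "zebra", "zoo".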

Constructing and Comparing Sort Keys 🔗

When comparing strings multiple times, it’s recommended to generate and cache a sort key. A sort key is an array of unsigned 16-bit integers that can be cheaply compared against another sort key.

const char *s = "Hi";
uint16_t sortkey[16] = {0};
size_t sortkey_len = 16;
uni_sortkeymk(s, -1, UNI_UTF8, UNI_SHIFTED, UNI_PRIMARY, sortkey, &sortkey_len);

Once two sort keys are generated, they can be compared with uni_sortkeycmp. The following code snippet assumes sortkey1 and sortkey2 were generated with uni_sortkeymk as demonstrated in the previous snippet. The implementation of uni_sortkeycmp will write either -1, 0, or 1 to result depending on if sortkey1 < sortkey2, sortkey1 = sortkey2, or sortkey1 > sortkey2. It’s like uni_collate except, because the sort keys are prebuilt, it’s much faster.

int result = 0;
uni_sortkeycmp(sortkey1, sortkey1_len, sortkey2, sortkey2_len, &result);
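
To give a feel for why comparing prebuilt keys is cheap, the sketch below compares two arrays of 16-bit integers element by element, with the shorter array ordering first when one is a prefix of the other. This is an assumption about how such a comparison could work, shown for illustration only. It is not necessarily what uni_sortkeycmp does internally, and the helper name sortkey_compare_sketch is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper (not part of Unicorn): lexicographic comparison of
// two arrays of unsigned 16-bit integers. Returns -1, 0, or 1.
static int sortkey_compare_sketch(const uint16_t *k1, size_t n1,
                                  const uint16_t *k2, size_t n2)
{
    size_t n = (n1 < n2) ? n1 : n2;
    for (size_t i = 0; i < n; i++) {
        if (k1[i] != k2[i]) {
            return (k1[i] < k2[i]) ? -1 : 1; // first difference decides
        }
    }
    if (n1 == n2) {
        return 0; // same length and same contents
    }
    return (n1 < n2) ? -1 : 1; // a proper prefix orders first
}
```

The loop touches each element at most once and does no table lookups or decoding, which is the kind of work a prebuilt key saves relative to collating the original strings on every comparison.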

Short String (De)Compression 🔗

Short string compression and decompression are performed by calling the uni_compress and uni_decompress functions, respectively. The input to the compressors is Unicode encoded text and the output is a sequence of bytes that represent the compressed text.

Compress Text 🔗

The following code snippet compresses UTF-16 encoded text.

The uni_compress function accepts Unicode text (e.g. UTF-8, 16, or 32) and produces a sequence of bytes representing the compression of the original text. The input encoding does not affect the compressed representation.

// Uncompressed text.
const uint16_t in[] = u"こんにちは世界！";

// Buffer to store the compressed text.
uint8_t buf[64] = {0};
size_t buflen = sizeof(buf);
uni_compress(in, sizeof(in)/sizeof(in[0]), UNI_UTF16, buf, &buflen);

Decompress Text 🔗

Text can be decompressed with the uni_decompress function as demonstrated in the snippet below. Note that the snippet references the buf and buflen variables from the previous example.

char text[64] = {0};
unisize textlen = sizeof(text);
uni_decompress(buf, buflen, text, &textlen, UNI_UTF8);

Observe how the text is decompressed as UTF-8 despite originally being encoded as UTF-16. The input and output encodings do not need to match because the compressed stream of bytes is neither UTF-8, 16, nor 32 but rather its own compressed byte encoding. This differs from general purpose compressors which do not understand the Unicode charset and always decompress to the same encoding as the original text.
