Code Examples
This section contains code examples that demonstrate how to perform common Unicode operations with Unicorn.
Unicorn has a single header file, unicorn.h, which you must include to access its definitions. The inclusion of this header and the main function are omitted from the examples for brevity.
Text Encodings
The Unicode Standard defines several encoding forms for representing code points in memory. The most common encoding forms are UTF-8, UTF-16, and UTF-32. Unicorn defines several functions for decoding and converting between these encoding forms which are demonstrated below.
Decoding Text
This example decodes the Unicode scalar values of a null-terminated UTF-8 encoded string. The string is declared with the u8 string literal syntax introduced in C11. The UNI_UTF8 constant is passed to uni_next to indicate that the input string is encoded as UTF-8.
UTF-16 and UTF-32 are processed identically, except you’d use the u or U string literal syntax to declare the string as UTF-16 or UTF-32, respectively, and you’d pass UNI_UTF16 or UNI_UTF32. By default, Unicorn assumes native byte order, but you can pass UNI_BIG or UNI_LITTLE to specify big or little endian.
const char str[] = u8"I 🕵️."; // I spy
unisize i = 0;
for (;;) {
    unichar cp = 0x0;
    unistat r = uni_next(str, -1, UNI_UTF8, &i, &cp);
    if (r == UNI_DONE) {
        break;
    } else if (r == UNI_BAD_ENCODING) {
        // malformed character
    } else {
        printf("U+%04X\n", (unsigned)cp); // print scalar
    }
}
Decoding Text in Reverse
This example decodes the Unicode scalar values in reverse, starting from the last character of the string.
const char str[] = u8"I 🕵️."; // I spy
unisize i = strlen(str);
for (;;) {
    unichar cp = 0x0;
    unistat r = uni_prev(str, -1, UNI_UTF8, &i, &cp);
    if (r == UNI_DONE) {
        break;
    } else if (r == UNI_BAD_ENCODING) {
        // malformed character
    } else {
        printf("U+%04X\n", (unsigned)cp); // print scalar
    }
}
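Stepping backward through UTF-8 works because continuation bytes always have the bit pattern 10xxxxxx. The sketch below is a standalone illustration, not Unicorn's implementation of uni_prev. The helper name utf8_prev_start is hypothetical, and it assumes well-formed input, so it performs none of the validation that uni_prev reports through UNI_BAD_ENCODING.

```c
#include <stddef.h>

/* Hypothetical helper (not part of Unicorn): returns the byte index at which
 * the code point ending just before byte index i begins. Assumes s holds
 * well-formed UTF-8 and i does not exceed its length. Returns 0 when i is 0. */
size_t utf8_prev_start(const char *s, size_t i)
{
    if (i == 0) {
        return 0;
    }
    do {
        i--; /* step back one byte */
    } while (i > 0 && ((unsigned char)s[i] & 0xC0) == 0x80); /* skip 10xxxxxx */
    return i;
}
```

Applied to the ten-byte string from the example above, starting at index 10, repeated calls would visit the byte indices 9, 6, 2, 1, and 0.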
Encoding Scalar Values
Unicode scalar values can be encoded into any character encoding form. This example encodes HAMBURGER 🍔 (U+1F354) as UTF-8.
The minimum size of the output buffer depends on which encoding form is being targeted. Here, the output buffer size is 4 because that’s the longest code unit sequence for a Unicode scalar value in the UTF-8 encoding form. The table here lists the minimum buffer sizes for each encoding.
uint8_t dest[4] = {0};
unisize dest_len = 4;
uni_encode(U'🍔', dest, &dest_len, UNI_UTF8);
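To see why four bytes is the upper bound for UTF-8, it can help to look at the bit layout directly. The following is a minimal hand-written encoder, offered only as an illustration of the encoding form. It is not Unicorn's implementation, and the function name utf8_encode is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative encoder (not part of Unicorn). Writes the UTF-8 form of the
 * scalar value cp to out, which must have room for 4 bytes. Returns the
 * number of bytes written, or 0 if cp is not a Unicode scalar value. */
size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0; /* out of range or a surrogate */
    }
    if (cp < 0x80) { /* 0xxxxxxx */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) { /* 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) { /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}
```

For U+1F354 this produces the four bytes F0 9F 8D 94. No scalar value needs more than four, which is why the destination buffer in the example above has four elements.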
Converting Encoding Forms
You can convert between encoding forms using the uni_convert function. The “input” string is specified by the first three arguments and the “destination” by the last three. In this example, the input string is null terminated, so a length of -1 is given. The destination buffer will be null terminated because the UNI_NULIFY flag is given. The re-encoded text can be iterated as shown in the Decoding Text example above.
const char src[] = u8"👨🏻‍🚀🧑🏼‍🚀 landed on the 🌕 in 1969.";
uint16_t dest[64] = {0};
unisize dest_len = 64;
uni_convert(src, -1, UNI_UTF8, dest, &dest_len, UNI_UTF16 | UNI_NULIFY);
If the size of the destination buffer is unknown, you can compute it by calling uni_convert with a NULL destination buffer argument and zero destination buffer length. The implementation will write the number of code units needed to the dest_len argument. Once the size is known, you can dynamically allocate a buffer large enough to accommodate the encoded text.
// Call uni_convert() once to compute the number of code units needed.
const char src[] = u8"👨🏻‍🚀🧑🏼‍🚀 landed on the 🌕 in 1969.";
uint16_t *dest = NULL;
unisize dest_len = 0;
uni_convert(src, -1, UNI_UTF8, NULL, &dest_len, UNI_UTF16 | UNI_NULIFY);
// Allocate a sufficiently-sized buffer.
dest = calloc(dest_len, sizeof(dest[0]));
// Call uni_convert() again with the sufficiently-sized buffer.
uni_convert(src, -1, UNI_UTF8, dest, &dest_len, UNI_UTF16 | UNI_NULIFY);
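The length written to dest_len is counted in code units of the destination encoding, so it generally differs from the length of the source. As a rough illustration of why, the sketch below encodes a single scalar value as UTF-16 by hand. It is not Unicorn's implementation, and the function name utf16_encode is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative encoder (not part of Unicorn). Writes the UTF-16 form of the
 * scalar value cp to out, which must have room for 2 code units. Returns the
 * number of code units written, or 0 if cp is not a Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0; /* out of range or a surrogate */
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp; /* one code unit */
        return 1;
    }
    cp -= 0x10000; /* 20 bits remain, split across a surrogate pair */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

U+1F354, for instance, occupies four code units in UTF-8 but only two in UTF-16 (D83C DF54), while U+4E16 occupies three in UTF-8 and one in UTF-16.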
Normalization
Unicode normalization transforms a string into a form for testing equivalence. Unicorn supports canonical equivalence testing. Strings are said to be canonically equivalent if they have the same visual appearance (i.e. the same extended grapheme clusters).
Canonical Equivalence
Strings can be compared for canonical equivalence with uni_normcmp. In the following code snippet, is_equal is set to true or false depending on whether s1 has the same grapheme clusters as s2. In this example they do, so the implementation sets is_equal to true.
const char *s1 = u8"å"; // precomposed 'a' with ring above
const char *s2 = u8"a\u030A"; // decomposed 'a' with ring above
bool is_equal = false;
uni_normcmp(s1, -1, UNI_UTF8, s2, -1, UNI_UTF8, &is_equal);
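A plain byte comparison cannot answer this question, because the two strings are stored as different byte sequences. The short sketch below uses only the C standard library to show the mismatch. The names precomposed, decomposed, and bytes_equal are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <string.h>

/* The same two strings as above, written as explicit UTF-8 bytes. */
const char precomposed[] = "\xC3\xA5";  /* U+00E5 */
const char decomposed[]  = "a\xCC\x8A"; /* U+0061 U+030A */

/* Hypothetical helper: true only if both strings are byte-for-byte identical. */
bool bytes_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Here bytes_equal(precomposed, decomposed) is false, and the strings are not even the same length (two bytes versus three), even though a canonical comparison treats them as equivalent.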
Normalizing Text
Strings can be normalized to either Normalization Form C or D with uni_norm. The following code snippet normalizes a null-terminated UTF-8 string to Normalization Form D.
const char *src = u8"Åström";
char dest[16] = {0};
unisize dest_len = 16;
uni_norm(UNI_NFD, // normalization form
src, -1, UNI_UTF8, // input
dest, &dest_len, UNI_UTF8 | UNI_NULIFY); // output
If the size of the destination buffer is not known ahead of time, compute it by calling uni_norm with a NULL destination buffer argument and zero destination buffer length. The implementation will write the number of code units needed to the dest_len argument. Once the size is known, you can dynamically allocate a buffer large enough to accommodate the normalized text.
// Call uni_norm() once to compute the number of code units needed.
const char *src = u8"Åström";
char *dest = NULL;
unisize dest_len = 0;
uni_norm(UNI_NFD, src, -1, UNI_UTF8, NULL, &dest_len, UNI_UTF8 | UNI_NULIFY);
// Allocate a sufficiently-sized buffer.
dest = malloc(dest_len);
// Call uni_norm() again with the sufficiently-sized buffer.
uni_norm(UNI_NFD, src, -1, UNI_UTF8, dest, &dest_len, UNI_UTF8 | UNI_NULIFY);
Case Mapping
Case mapping transforms characters from one case to another.
Case Conversion
Case conversion transforms text into a particular cased form for display to an end user. The following code snippet demonstrates how to transform text to titlecase.
const char *src = "The quick brown fox jumps over the lazy dog.";
char dest[64] = {0};
unisize dest_len = 64;
uni_caseconv(UNI_TITLE, src, -1, UNI_UTF8, dest, &dest_len, UNI_UTF8 | UNI_NULIFY);
Case conversion and folding can change the length of the string. If the new size is not known ahead of time, compute it by calling uni_caseconv with a NULL destination buffer and zero destination buffer length.
// Call uni_caseconv() once to compute the number of code units needed.
const char *src = "The quick brown fox jumps over the lazy dog.";
char *dest = NULL;
unisize dest_len = 0;
uni_caseconv(UNI_TITLE, src, -1, UNI_UTF8, NULL, &dest_len, UNI_UTF8 | UNI_NULIFY);
// Allocate a sufficiently-sized buffer.
dest = malloc(dest_len);
// Call uni_caseconv() again with the sufficiently-sized buffer.
uni_caseconv(UNI_TITLE, src, -1, UNI_UTF8, dest, &dest_len, UNI_UTF8 | UNI_NULIFY);
Caseless Matching
Perform caseless matching with uni_casefoldcmp. Caseless matching checks for either default caseless equivalence or canonical caseless equivalence, depending on whether the first argument is UNI_DEFAULT or UNI_CANONICAL.
const char *s1 = u8"å"; // precomposed 'a' with ring above
const char *s2 = u8"A\u030A"; // decomposed 'A' with ring above
bool is_equal = false;
uni_casefoldcmp(UNI_CANONICAL, s1, -1, UNI_UTF8, s2, -1, UNI_UTF8, &is_equal);
In this example, s1 is compared with s2 for canonical caseless equivalence, and the implementation sets is_equal to true because the two strings are a canonical caseless match. If UNI_DEFAULT had been used instead of UNI_CANONICAL, is_equal would be false because precomposed and decomposed characters are not binary equivalent.
Text Segmentation
Text segmentation is the process of determining the boundaries of text elements. Unicorn supports determining the boundaries of extended grapheme clusters, words, and sentences.
Boundaries are found with the uni_nextbrk and uni_prevbrk functions. The text element to segment on is specified by one of the following constants: UNI_GRAPHEME, UNI_WORD, or UNI_SENTENCE.
Iterate Graphemes
The following code snippet demonstrates how to iterate the extended grapheme clusters of a string. More specifically, it prints the break positions of the graphemes.
const char *string = u8"Hi, 世界";
unisize index = 0;
while (uni_nextbrk(UNI_GRAPHEME, string, -1, UNI_UTF8, &index) == UNI_OK) {
    printf("%d\n", (int)index); // prints '1', '2', '3', '4', '7', '10'
}
Iterate Graphemes in Reverse
This example demonstrates how to iterate the extended grapheme clusters of a string starting from the end of the string. More specifically, it prints the break positions of the graphemes in reverse.
const char *string = u8"Hi, 世界";
unisize index = strlen(string);
while (uni_prevbrk(UNI_GRAPHEME, string, -1, UNI_UTF8, &index) == UNI_OK) {
    printf("%d\n", (int)index); // prints '7', '4', '3', '2', '1', '0'
}
Iterate Words
Iterating words is identical to iterating graphemes. The only difference is UNI_GRAPHEME becomes UNI_WORD.
const char *string = u8"Hello, 世界";
unisize index = 0;
while (uni_nextbrk(UNI_WORD, string, -1, UNI_UTF8, &index) == UNI_OK) {
    printf("%d\n", (int)index); // prints '5', '6', '7', '10', '13'
}
Iterate Sentences
Iterating sentences is identical to iterating words. The only difference is UNI_WORD becomes UNI_SENTENCE.
const char *string = u8"Hello, 世界. こんにちは, World!";
unisize index = 0;
while (uni_nextbrk(UNI_SENTENCE, string, -1, UNI_UTF8, &index) == UNI_OK) {
    printf("%d\n", (int)index); // prints '15', '38'
}
Collation
Collation determines the sorting order of strings.
Comparing Strings for Sorting
The following code snippet compares strings s1 and s2 with uni_collate. The implementation of uni_collate writes -1, 0, or 1 to result depending on whether s1 < s2, s1 = s2, or s1 > s2. This value can then be used with a sorting algorithm, like merge sort, to sort a collection of strings.
const char *s1 = "zebra";
const char *s2 = "zoo";
int result = 0;
uni_collate(s1, -1, UNI_UTF8,
s2, -1, UNI_UTF8,
UNI_NON_IGNORABLE, UNI_TERTIARY, &result);
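One way to feed such a three-way result into a sort is through the standard library's qsort. The sketch below shows only the plumbing. The function compare_text is a hypothetical stand-in that uses strcmp so the example compiles without Unicorn. In real code its body would call uni_collate as shown above and return the value written to result. The names compare_entries and sort_strings are likewise hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the comparison step. It returns -1, 0, or 1.
 * With Unicorn, this is where uni_collate would be called instead. */
int compare_text(const char *s1, const char *s2)
{
    int r = strcmp(s1, s2);
    return (r > 0) - (r < 0);
}

/* qsort passes pointers to the array elements, which are themselves
 * pointers to strings, so each argument is dereferenced once. */
int compare_entries(const void *a, const void *b)
{
    const char *s1 = *(const char *const *)a;
    const char *s2 = *(const char *const *)b;
    return compare_text(s1, s2);
}

/* Sorts an array of count string pointers in place. */
void sort_strings(const char **strings, size_t count)
{
    qsort(strings, count, sizeof(strings[0]), compare_entries);
}
```

Because every comparison repeats the collation work, large collections are usually better served by the cached sort keys described in the next section.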
Constructing and Comparing Sort Keys
When comparing strings multiple times, it’s recommended to generate and cache a sort key. A sort key is an array of unsigned 16-bit integers that can be cheaply compared against another sort key.
const char *s = "Hi";
uint16_t sortkey[16] = {0};
size_t sortkey_len = 16;
uni_sortkeymk(s, -1, UNI_UTF8, UNI_SHIFTED, UNI_PRIMARY, sortkey, &sortkey_len);
Once two sort keys are generated, they can be compared with uni_sortkeycmp. The following code snippet assumes sortkey1 and sortkey2 were generated with uni_sortkeymk as demonstrated in the previous snippet. The implementation of uni_sortkeycmp writes -1, 0, or 1 to result depending on whether sortkey1 < sortkey2, sortkey1 = sortkey2, or sortkey1 > sortkey2. It’s like uni_collate except, because the sort keys are prebuilt, it’s much faster.
int result = 0;
uni_sortkeycmp(sortkey1, sortkey1_len, sortkey2, sortkey2_len, &result);
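To give a feel for why comparing prebuilt keys is cheap, the sketch below compares two arrays of 16-bit integers element by element. It assumes the keys are ordered by a plain binary comparison, which is an assumption about the general technique rather than a description of Unicorn's internals. The function name sortkey_compare is hypothetical; use uni_sortkeycmp in real code.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative comparison (not part of Unicorn). Compares two arrays of
 * 16-bit weights lexicographically and returns -1, 0, or 1. When one key is
 * a prefix of the other, the shorter key orders first. */
int sortkey_compare(const uint16_t *k1, size_t n1,
                    const uint16_t *k2, size_t n2)
{
    size_t n = (n1 < n2) ? n1 : n2;
    for (size_t i = 0; i < n; i++) {
        if (k1[i] != k2[i]) {
            return (k1[i] < k2[i]) ? -1 : 1; /* first difference decides */
        }
    }
    if (n1 == n2) {
        return 0;
    }
    return (n1 < n2) ? -1 : 1;
}
```

The loop touches each element at most once and does no table lookups, in contrast to a full collation of the original strings.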
Short String (De)Compression
Short string compression and decompression are performed by calling the uni_compress and uni_decompress functions, respectively. The input to the compressors is Unicode encoded text and the output is a sequence of bytes that represent the compressed text.
Compress Text
The following code snippet compresses UTF-16 encoded text.
The uni_compress function accepts Unicode text (e.g. UTF-8, 16, or 32) and produces a sequence of bytes representing the compression of the original text. The input encoding does not affect the compressed representation.
// Uncompressed text.
const uint16_t in[] = u"こんにちは世界！";
// Buffer to store the compressed text.
uint8_t buf[64] = {0};
size_t buflen = sizeof(buf);
uni_compress(in, sizeof(in)/sizeof(in[0]), UNI_UTF16, buf, &buflen);
Decompress Text
Text can be decompressed with the uni_decompress function as demonstrated in the snippet below. Note that the snippet references the buf and buflen variables from the previous example.
char text[64] = {0};
unisize textlen = sizeof(text);
uni_decompress(buf, buflen, text, &textlen, UNI_UTF8);
Observe how the text is decompressed as UTF-8 despite originally being encoded as UTF-16. The input and output encodings do not need to match because the compressed stream of bytes is neither UTF-8, 16, nor 32 but rather its own compressed byte encoding. This differs from general purpose compressors which do not understand the Unicode charset and always decompress to the same encoding as the original text.