Function

uni_normcmp

Compare strings for canonical equivalence.

Since v1.0
unistat uni_normcmp(
    const void *s1, unisize s1_len, uniattr s1_attr,
    const void *s2, unisize s2_len, uniattr s2_attr,
    bool *result);

Parameters 🔗

s1 in

First source text.

s1_len in

Number of code units in s1 or -1 if s1 is null-terminated.

s1_attr in

Attributes of s1.

s2 in

Second source text.

s2_len in

Number of code units in s2 or -1 if s2 is null-terminated.

s2_attr in

Attributes of s2.

result out

Set to true if s1 and s2 are canonically equivalent; else false.

Return Value 🔗

UNI_OK

If the strings were compared successfully.

UNI_BAD_OPERATION

If s1 or s2 is null, if s1_len or s2_len is negative (other than -1, which denotes a null-terminated string), or if result is null.

UNI_BAD_ENCODING

If s1 or s2 is not well-formed (checks are omitted if the corresponding uniattr has UNI_TRUST).

UNI_NO_MEMORY

If dynamic memory allocation failed.

Discussion 🔗

This function checks whether s1 and s2 are canonically equivalent. That is, it checks whether both strings encode the same sequence of user-perceived characters (graphemes), regardless of how those characters are composed. The behavior of this function is identical to calling uni_norm with UNI_NFD on both strings and then comparing the resulting code points.

The implementation is optimized to normalize the strings incrementally while simultaneously comparing them. This approach is more efficient when it's unknown whether the input strings are normalized. If it's known in advance that both strings are in the same normalization form, then they can be compared directly with memcmp or strcmp.

The implementation is designed to be fast and to avoid dynamic memory allocation whenever possible. Typically, allocation is only triggered by unnaturally long combining character sequences, such as Zalgo text; real-world text rarely triggers it.

Examples 🔗

This example compares two strings for canonical equivalence. Conceptually, the implementation normalizes both strings, performs the comparison, and reports the result. This approach is recommended when strings are compared for one-off equality. If strings are compared repeatedly, then it’s recommended to normalize them with uni_norm and cache the result for the comparisons.

#include <unicorn.h>
#include <stdio.h>
#include <stdbool.h>

int main(void)
{
    const char *s1 = u8"ma\u0301scara"; // 'a' + U+0301 = á (decomposed)
    const char *s2 = u8"m\u00E1scara";  //       U+00E1 = á (precomposed)
    bool is_equal;

    if (uni_normcmp(s1, -1, UNI_UTF8,
                    s2, -1, UNI_UTF8, &is_equal) != UNI_OK)
    {
        puts("failed to normalize and compare strings");
        return 1;
    }

    printf("%s\n", is_equal ? "equal" : "not equal");
    return 0;
}