Unicorn: Embeddable Unicode® Algorithms

Essential Algorithms

Unicorn implements the most essential Unicode algorithms:

Normalization (NFC, NFD)
Case conversion
Case folding
Collation (via the DUCET)
Grapheme, word, and sentence segmentation
BOCU-1 short string compression
UTF-8, UTF-16, and UTF-32 decoders, encoders, and validators

Fully Customizable

Unicorn is fully customizable. You can choose which Unicode algorithms and character properties to include or exclude. Learn more here.

Ultra Portable

Unicorn does not require an FPU or 64-bit integers. It is written in C99 and requires only a few features from libc, which are listed in the following table.

Header	Types	Macros	Functions
stdint.h	`int8_t`, `int16_t`, `int32_t`, `uint8_t`, `uint16_t`, `uint32_t`
string.h			`memcpy`, `memset`, `memcmp`
stddef.h	`size_t`	`NULL`
stdbool.h		`bool`, `true`, `false`
assert.h		`assert`

MISRA C:2012 Compliant

Unicorn honors all Mandatory, most Required, and most Advisory rules defined by MISRA C:2012. Deviations are documented here. You are encouraged to audit Unicorn and verify that its level of conformance meets your requirements.

Thread Safe

Unicorn is thread-safe except for the following caveats:

Functions that allocate memory are only as thread-safe as the allocator itself.
The configuration API is not thread-safe. In typical usage, however, it’s only invoked at application startup, and only if the default configuration is unsatisfactory.

Atomic Operations

All operations in Unicorn are atomic: either an operation completes in full or it has no effect at all. This guarantees that errors, such as out-of-memory errors, never corrupt internal state. It also means that if an error occurs, you can recover (for example, by freeing memory after an out-of-memory error) and retry the operation.
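One common way to get this all-or-nothing behavior is to build the new state off to the side and commit it only after every fallible step has succeeded. The sketch below illustrates that pattern with a hypothetical growable buffer. The names `struct buffer` and `buffer_append` are invented for illustration; they are not part of Unicorn's API and are not taken from its source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer (illustration only, not Unicorn's API). */
struct buffer {
    unsigned char *data;
    size_t length;
};

/* Appends `count` bytes to `buf`.
   On failure, returns false and leaves `buf` exactly as it was. */
static bool buffer_append(struct buffer *buf, const unsigned char *bytes, size_t count)
{
    assert(buf != NULL);

    if (count == 0) {
        return true; /* nothing to do */
    }
    if (count > (size_t)-1 - buf->length) {
        return false; /* new size would overflow size_t */
    }

    /* Build the new state off to the side. */
    unsigned char *grown = malloc(buf->length + count);
    if (grown == NULL) {
        return false; /* out of memory: nothing has been modified yet */
    }
    if (buf->length > 0) {
        memcpy(grown, buf->data, buf->length);
    }
    memcpy(grown + buf->length, bytes, count);

    /* Commit: reached only when every fallible step has succeeded. */
    free(buf->data);
    buf->data = grown;
    buf->length += count;
    return true;
}
```

Because a failed call changes nothing, a caller can release memory elsewhere and call `buffer_append` again with the same arguments.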

Extensively Tested

100% branch test coverage
Official Unicode conformance tests
Manually written tests
Out-of-memory tests
Fuzz tests
Static analysis
Valgrind analysis
Code sanitizers (UBSAN, ASAN, and MSAN)
Extensive use of assert() and run-time checks

Out-of-Memory Tests

Each out-of-memory condition is tested by running the test case in a loop with a custom allocator that fails on the Nth allocation, where N counts upward from zero. The loop ends when the implementation no longer returns an out-of-memory error, meaning every out-of-memory path has been exercised. Code coverage is then used to verify that all branches are taken.
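The loop described above can be sketched as follows. This is a minimal illustration of the technique, not Unicorn's actual test harness. The allocator `test_malloc`, the operation under test `duplicate_twice`, and the driver `run_oom_test` are all hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical test allocator: fails on the allocation numbered `fail_at`
   (counting from zero) and otherwise forwards to malloc. */
static size_t alloc_count;
static size_t fail_at;

static void *test_malloc(size_t size)
{
    if (alloc_count++ == fail_at) {
        return NULL; /* simulated out-of-memory */
    }
    return malloc(size);
}

/* Hypothetical operation under test: makes two allocations.
   On failure it frees what it acquired and leaves the outputs untouched. */
static bool duplicate_twice(const char *text, char **first, char **second)
{
    size_t size = strlen(text) + 1;
    char *a = test_malloc(size);
    if (a == NULL) {
        return false;
    }
    char *b = test_malloc(size);
    if (b == NULL) {
        free(a);
        return false;
    }
    memcpy(a, text, size);
    memcpy(b, text, size);
    *first = a;
    *second = b;
    return true;
}

/* Moves the failure point upward from zero until the operation succeeds.
   Returns the number of distinct out-of-memory paths that were exercised. */
static size_t run_oom_test(void)
{
    size_t n = 0;
    for (;;) {
        char *first = NULL;
        char *second = NULL;
        alloc_count = 0;
        fail_at = n;
        if (duplicate_twice("unicorn", &first, &second)) {
            assert(strcmp(first, "unicorn") == 0);
            assert(strcmp(second, "unicorn") == 0);
            free(first);
            free(second);
            return n;
        }
        /* A failed attempt must not leave partial results behind. */
        assert(first == NULL && second == NULL);
        n++;
    }
}
```

In this sketch the operation makes two allocations, so the loop sees two failures (N = 0 and N = 1) before the run at N = 2 succeeds.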

Feature Combination Testing

Unicorn is highly configurable. The test suite explores combinations of Unicode features to verify correctness.

Encoding Compatible

All functions that operate on text can accept UTF-8, UTF-16, UTF-32, or Unicode scalar values. UTF-16 and UTF-32 are supported in big endian, little endian, and native byte orders.

The implementation performs runtime safety checks by default to guard against malformed or maliciously encoded text. If you know the text is well-formed, you can skip these checks to reduce processing time.
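To make the trade-off concrete, the sketch below shows what a UTF-8 well-formedness check typically involves and how a caller-supplied flag can bypass it. It is an illustration under assumptions, not Unicorn's implementation. The functions `utf8_is_valid` and `utf8_count` and the `trusted` parameter are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if bytes[0..length) is well-formed UTF-8: no stray or missing
   continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF. */
static bool utf8_is_valid(const uint8_t *bytes, size_t length)
{
    size_t i = 0;
    while (i < length) {
        uint8_t lead = bytes[i];
        size_t need;   /* continuation bytes expected after the lead byte */
        uint32_t min;  /* smallest scalar value allowed for this length */
        uint32_t cp;   /* scalar value being decoded */

        if (lead < 0x80u) {
            i++; /* ASCII */
            continue;
        } else if ((lead & 0xE0u) == 0xC0u) {
            need = 1; min = 0x80u; cp = lead & 0x1Fu;
        } else if ((lead & 0xF0u) == 0xE0u) {
            need = 2; min = 0x800u; cp = lead & 0x0Fu;
        } else if ((lead & 0xF8u) == 0xF0u) {
            need = 3; min = 0x10000u; cp = lead & 0x07u;
        } else {
            return false; /* stray continuation byte or invalid lead byte */
        }

        if (need > length - i - 1) {
            return false; /* sequence is truncated */
        }
        for (size_t k = 1; k <= need; k++) {
            uint8_t b = bytes[i + k];
            if ((b & 0xC0u) != 0x80u) {
                return false; /* not a continuation byte */
            }
            cp = (cp << 6) | (uint32_t)(b & 0x3Fu);
        }
        if (cp < min || cp > 0x10FFFFu || (cp >= 0xD800u && cp <= 0xDFFFu)) {
            return false; /* overlong, out of range, or surrogate */
        }
        i += need + 1;
    }
    return true;
}

/* Counts the scalar values in a UTF-8 string. When `trusted` is true the
   validation pass is skipped, so the caller must guarantee well-formed input. */
static bool utf8_count(const uint8_t *bytes, size_t length, bool trusted, size_t *count)
{
    assert(count != NULL);
    if (!trusted && !utf8_is_valid(bytes, length)) {
        return false;
    }
    size_t n = 0;
    for (size_t i = 0; i < length; i++) {
        if ((bytes[i] & 0xC0u) != 0x80u) {
            n++; /* every non-continuation byte starts one scalar value */
        }
    }
    *count = n;
    return true;
}
```

Skipping the check saves a full pass over the input, but in this sketch a trusted call on malformed input silently returns a meaningless count, which is why validating by default is the safer choice.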