Feature Customization
Unicorn makes it easy to customize which Unicode algorithms and character properties it's built with. Removing unused features can dramatically reduce the size of the compiled code. This is especially important for resource-constrained systems, like embedded systems and IoT devices.
Unicorn is customized by modifying a JSON file named features.json found in its source distribution. This file defines the features Unicorn is built with. By default, it is configured to enable all features of the library. You are encouraged to remove the features you don't need.
The logic to parse the JSON file and generate the amalgamation is handled by the generate.pyz Python script. Running this script might take a few seconds depending on your hardware and requires up to 256 MB of free RAM.
The remainder of this document explores the schema for features.json.
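For a quick orientation, here is a hypothetical features.json that builds Unicorn with only grapheme cluster segmentation and the UTF-8 encoding form; every property used here is described in the sections that follow:

{
    "version": "1.0",
    "algorithms": {
        "segmentation": ["grapheme"]
    },
    "encodingForms": ["utf-8"]
}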
Root Interface
The Root interface describes the top-level JSON object.
interface Root {
    version: string;
    endian?: Endian;
    hasStandardAllocators?: boolean;
    characterStorage?: string;
    optimizeFor?: OptimizationPriority;
    stackBufferSize?: number;
    excludeCharacterBlocks?: Block[];
    algorithms?: Algorithms;
    characterProperties?: CharacterProperty[];
    encodingForms?: EncodingForm[];
}
version
The schema version. This must be '1.0'.
endian
This option refers to the endianness of the target hardware, which is not necessarily the same as that of the hardware Unicorn is compiled on. For example, a developer might use a little-endian Windows system to cross-compile for a big-endian embedded architecture; in that case, this property's value would be 'big'.
When the value is 'native', the endianness of the target hardware is assumed to be identical to that of the hardware Unicorn is compiled on.
The default value is 'native'.
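For example, the cross-compilation scenario described above, where the target hardware is big-endian, would be expressed as:

{
    "version": "1.0",
    "endian": "big"
}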
hasStandardAllocators
Indicates if the libc runtime implements the 'realloc' and 'free' functions and provides their prototypes via the 'stdlib.h' header file.
Unicorn does not require usage of the C standard memory allocation routines. These routines may not be available on resource-constrained devices; if they are unavailable, then this property should be set to 'false'. Regardless, users always have the option to implement their own custom memory routines.
The default value is 'true'.
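A minimal sketch of a configuration for a freestanding target whose libc lacks 'realloc' and 'free':

{
    "version": "1.0",
    "hasStandardAllocators": false
}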
characterStorage
The underlying storage type of unichar. The integer type can be changed to whatever the target hardware supports. Changing the storage type can help simplify integration with existing software. Unicorn does not care whether the type is signed or unsigned, but it must be large enough to accommodate the entire Unicode code space.
The default value is 'uint32_t'.
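As an illustration, a build might store unichar in a plain unsigned long. The choice of 'unsigned long' here is an assumption for the example; any integer type wide enough for all Unicode code points (at least 21 bits) satisfies the requirement above.

{
    "version": "1.0",
    "characterStorage": "unsigned long"
}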
optimizeFor
Controls how Unicorn optimizes the data structures it generates.
The default value is 'speed'.
stackBufferSize
Controls the number of code points that the stack buffer can accommodate.
Unicorn strives to reduce the need for dynamic memory allocation by using fixed-size, stack-allocated storage buffers where possible. Dynamic memory allocation is only performed if the stack buffer is full.
If this value is too large, then the program may fail to compile or crash on systems with limited stack memory. If this value is too small, then dynamic memory allocation becomes more likely.
Regardless of how large the stack buffer is, it's always possible to construct text that forces dynamic memory allocation. For example, text with unnaturally long combining character sequences could trigger it, although it's rare for real-world text to be formatted this way.
The default value is '32'.
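For instance, a hypothetical configuration that halves the default buffer for a stack-constrained target (the value 16 is an arbitrary illustration):

{
    "version": "1.0",
    "stackBufferSize": 16
}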
excludeCharacterBlocks
List of Unicode blocks, specified as glob patterns, whose characters will be excluded from the build. Excluding characters can dramatically reduce the size of the compiled library.
Excluded characters will be treated identically to unassigned characters by the Unicode algorithms. Likewise, their properties will be identical to those of an unassigned character.
The default value is '[]'.
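As an example, a hypothetical build that excludes every CJK and Hangul block (glob patterns match block name families such as CJK Unified Ideographs and Hangul Syllables) could use:

{
    "version": "1.0",
    "excludeCharacterBlocks": ["CJK*", "Hangul*"]
}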
algorithms
Describes which Unicode algorithms Unicorn is built with. See the Algorithms interface for details.
The default value is '{}'.
characterProperties
Describes which Unicode character properties Unicorn is built with.
The default value is '[]'.
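For example, a hypothetical build that only needs category and whitespace queries might request (using names from the CharacterProperty list later in this document):

{
    "version": "1.0",
    "characterProperties": ["General_Category", "White_Space"]
}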
encodingForms
List of Unicode encoding forms to include. The Unicode scalar value is always included.
The default value is '[]'.
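For example, a configuration that includes the UTF-8 and UTF-16 encoding forms (the Unicode scalar value is always available regardless) would be:

{
    "version": "1.0",
    "encodingForms": ["utf-8", "utf-16"]
}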
Algorithms Interface
The Algorithms interface describes which Unicode algorithms Unicorn is built with.
interface Algorithms {
    normalization?: NormalizationForm[];
    normalizationQuickCheck?: boolean;
    caseConversion?: CaseConvert[];
    caseFolding?: CaseFold[];
    segmentation?: Segmentation[];
    compression?: boolean;
    collation?: boolean;
    encodingConvert?: boolean;
}
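To make this concrete, a hypothetical configuration enabling NFC normalization with the quick check, lowercase and uppercase conversion, and word segmentation might look like:

{
    "version": "1.0",
    "algorithms": {
        "normalization": ["nfc"],
        "normalizationQuickCheck": true,
        "caseConversion": ["lower", "upper"],
        "segmentation": ["word"]
    }
}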
normalization
Toggles support for the Normalization API.
The default value is '[]'.
normalizationQuickCheck
Toggles support for the normalization quick check routine.
The default value is 'false'.
caseConversion
Toggles support for the Case Convert API.
The default value is '[]'.
caseFolding
Toggles support for the Case Fold API.
The default value is '[]'.
segmentation
Toggles support for the Text Segmentation API.
The default value is '[]'.
compression
Toggles support for the Compression API.
The default value is 'false'.
collation
Toggles support for the Collation API.
The default value is 'false'.
encodingConvert
Toggles support for encoding forms conversion routines.
The default value is 'false'.
Types
The JSON interfaces make use of various types which are defined in this section.
Block
A glob pattern for one or more Unicode character blocks. For example, the glob pattern Latin* will match Latin-1 Supplement and Latin Extended-A, but not Basic Latin.
Glob patterns are case-insensitive. That means both Basic Latin and basic latin are treated identically.
type Block = string;
NormalizationForm
Corresponds to a Unicode normalization form.
type NormalizationForm = "nfd" | "nfc";
Segmentation
Corresponds to a text boundary.
type Segmentation = "grapheme" | "word" | "sentence";
EncodingForm
Corresponds to a Unicode character encoding form.
Enabling either the UTF-16 or UTF-32 encoding form implicitly enables both its big- and little-endian variants. Observe that the Unicode scalar value is not a possible value, as it's always implicitly enabled.
type EncodingForm = "utf-8" | "utf-16" | "utf-32";
CaseConvert
Corresponds to a casing form.
type CaseConvert = "lower" | "upper" | "title";
CaseFold
Corresponds to a case fold form.
type CaseFold = "default" | "canonical";
CharacterProperty
A Unicode character property.
type CharacterProperty = string;
Where the character property string is one of the following:
- Alphabetic
- Canonical_Combining_Class
- Dash
- Diacritic
- Extender
- General_Category
- Hex_Digit
- Ideographic
- Lowercase
- Math
- Noncharacter_Code_Point
- Numeric_Value
- Quotation_Mark
- Simple_Lowercase_Mapping
- Simple_Titlecase_Mapping
- Simple_Uppercase_Mapping
- Terminal_Punctuation
- Unified_Ideograph
- Uppercase
- White_Space
Endian
The endianness of the target architecture. Defaults to 'native', which indicates the target architecture is identical to the architecture Unicorn is built on. The other options are 'little' for little endian and 'big' for big endian.
This option is provided for users cross-compiling Unicorn for an architecture with a different endianness than the architecture it's being compiled on.
type Endian = "little" | "big" | "native";
OptimizationPriority
Whether Unicorn generates C source code optimized for speed or size.
type OptimizationPriority = "speed" | "size";
Implicit Features
Certain Unicode algorithms depend on other algorithms and character properties. For example, the canonical composition algorithm depends upon the canonical decomposition algorithm. Unicorn will implicitly enable dependent features as needed. Similarly, if an algorithm depends on a character property that the user has not explicitly enabled, that property will be enabled implicitly.
To illustrate, given the following JSON, the Python script would implicitly add 'nfd' to the normalization array because it understands that NFC depends upon NFD. It would also implicitly add any character properties that the algorithms need.
{
    "version": "1.0",
    "algorithms": {
        "normalization": [
            "nfc"
        ]
    }
}
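Conceptually, after implicit feature resolution, the configuration above behaves as if it had been written as follows, along with whatever character properties the normalization algorithms require internally (for instance, Canonical_Combining_Class):

{
    "version": "1.0",
    "algorithms": {
        "normalization": ["nfc", "nfd"]
    }
}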