Feature Customization

Unicorn makes it easy to customize which Unicode algorithms and character properties it's built with. Removing unused features can dramatically reduce the size of the compiled code. This is especially important for resource-constrained systems, such as embedded systems and IoT devices.

Unicorn is customized by modifying a JSON file named features.json found in its source distribution. This file defines the features Unicorn is built with. By default, it is configured to enable all features of the library. You are encouraged to remove the features you don't need.

The logic to parse the JSON file and generate the amalgamation is handled by the generate.pyz Python script. Running this Python script might take a few seconds depending on your hardware and requires up to 256 MB of free RAM.

The remainder of this document explores the schema for features.json.

Root Interface

The Root interface describes the top-level JSON object.

interface Root {
    version: string;
    endian?: Endian;
    hasStandardAllocators?: boolean;
    characterStorage?: string;
    optimizeFor?: OptimizationPriority;
    stackBufferSize?: number;
    excludeCharacterBlocks?: Block[];
    algorithms?: Algorithms;
    characterProperties?: CharacterProperty[];
    encodingForms?: EncodingForm[];
}
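
As a rough illustration of the Root object, the following Python sketch assembles a small configuration and serializes it as JSON text. The feature selection is arbitrary and the helper name to_features_json is hypothetical; in practice you would edit the features.json file shipped in the source distribution.

```python
import json

# A hypothetical, trimmed-down configuration chosen only for illustration.
# "version" is the only key the Root interface marks as required.
features = {
    "version": "1.0",
    "endian": "native",
    "hasStandardAllocators": True,
    "characterStorage": "uint32_t",
    "optimizeFor": "size",
    "stackBufferSize": 32,
    "excludeCharacterBlocks": [],
    "algorithms": {"normalization": ["nfc"]},
    "characterProperties": [],
    "encodingForms": ["utf-8"],
}


def to_features_json(config: dict) -> str:
    """Serialize the configuration as indented JSON text."""
    return json.dumps(config, indent=4)


if __name__ == "__main__":
    print(to_features_json(features))
```

Python's True is written out as the JSON literal true, so the result can be saved directly as a features.json file.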

version

The schema version. This must be '1.0'.

endian

This option refers to the endianness of the target hardware. This is not necessarily the same as the hardware Unicorn is compiled on. Example: a developer uses a little endian Windows system to cross-compile for a big endian embedded architecture. In this example, this property value would be 'big'.

When the value is 'native', the endianness of the target hardware is assumed to be identical to that of the hardware Unicorn is compiled on.

The default value is 'native'.
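
The meaning of the three values can be sketched in Python as follows. This only illustrates the semantics described above; the function name resolve_endian is hypothetical, and how Unicorn itself resolves 'native' is not stated here. The sketch assumes the machine running it is the machine Unicorn is compiled on.

```python
import sys


def resolve_endian(value: str = "native", host_byteorder: str = sys.byteorder) -> str:
    """Return the concrete byte order implied by an 'endian' setting."""
    if value == "native":
        # 'native' means: same byte order as the build machine.
        return host_byteorder
    if value in ("little", "big"):
        # An explicit value describes the target, regardless of the build machine.
        return value
    raise ValueError(f"unsupported endian value: {value!r}")
```

In the cross-compilation example above, resolve_endian("big") yields "big" even when the build machine is little endian.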

hasStandardAllocators

Indicates if the libc runtime implements the 'realloc' and 'free' functions and provides their prototypes via the 'stdlib.h' header file.

Unicorn does not require usage of the C standard memory allocation routines. These routines may not be available for resource constrained devices. If they are unavailable, then this property should be set to 'false'. Regardless, users always have the option to implement their own via custom memory routines.

The default value is 'true'.

characterStorage

The underlying storage type of unichar. The integer type can be changed to whatever the target hardware supports. Changing the storage type can help simplify integration with existing software. Unicorn does not care whether the type is signed or unsigned, but it must be large enough to accommodate the entire Unicode code space.

The default value is 'uint32_t'.
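
The size requirement can be checked with a little arithmetic: the largest Unicode code point is U+10FFFF, which needs 21 value bits. The Python sketch below works this out for an integer type of a given width; the function name is hypothetical and the check is only an illustration, not something the generator is known to perform.

```python
MAX_CODE_POINT = 0x10FFFF  # largest code point in the Unicode code space


def can_store_code_points(bit_width: int, signed: bool) -> bool:
    """Check whether an integer type of this width can hold every code point."""
    # A signed type spends one bit on the sign, leaving bit_width - 1 value bits.
    value_bits = bit_width - 1 if signed else bit_width
    return MAX_CODE_POINT < (1 << value_bits)
```

Under this check both a signed and an unsigned 32-bit type qualify, while a 16-bit type of either signedness does not.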

optimizeFor

Controls how Unicorn optimizes the data structures it generates.

The default value is 'speed'.

stackBufferSize

Controls the number of code points that the stack buffer can accommodate.

Unicorn strives to reduce the need for dynamic memory allocation by using fixed-size, stack-allocated storage buffers where possible. Dynamic memory allocation is only performed if the stack buffer is full.

If this value is too large, the program may fail to compile or may crash on systems with limited stack memory. If it is too small, dynamic memory allocation becomes more likely.

Regardless of how large the stack buffer is, it's always possible to construct text that forces dynamic memory allocation. For example, text with unnaturally long combining character sequences could trigger it, although it's rare for real-world text to be formatted this way.

The default value is '32'.
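
To get a feel for the trade-off, the stack cost of one buffer can be estimated with simple arithmetic. The sketch below assumes that the buffer is an array with one unichar element per code point and that unichar occupies 4 bytes (as with 'uint32_t'); neither the buffer layout nor the number of buffers live at once is specified here, so treat the result as a rough lower bound.

```python
def stack_buffer_bytes(stack_buffer_size: int = 32, unichar_bytes: int = 4) -> int:
    """Estimate the bytes used by one stack buffer (assumed layout, see above)."""
    return stack_buffer_size * unichar_bytes
```

With the defaults this gives 32 * 4 = 128 bytes per buffer.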

excludeCharacterBlocks

List of Unicode blocks, specified as glob patterns, whose characters will be excluded from the build. Excluding characters can dramatically reduce the size of the compiled library.

Excluded characters will be treated identically to unassigned characters by the Unicode algorithms. Likewise, their properties will be identical to those of an unassigned character.

The default value is '[]'.

algorithms

Describes which Unicode algorithms Unicorn is built with. See the Algorithms interface for details.

The default value is '{}'.

characterProperties

Describes which Unicode character properties Unicorn is built with.

The default value is '[]'.

encodingForms

List of Unicode encoding forms to include. The Unicode scalar value is always included.

The default value is '[]'.

Algorithms Interface

The Algorithms interface describes which Unicode algorithms Unicorn is built with.

interface Algorithms {
    normalization?: NormalizationForm[];
    normalizationQuickCheck?: boolean;
    caseConversion?: CaseConvert[];
    caseFolding?: CaseFold[];
    segmentation?: Segmentation[];
    compression?: boolean;
    collation?: boolean;
    encodingConvert?: boolean;
}

normalization

Toggles support for the Normalization API.

The default value is '[]'.

normalizationQuickCheck

Toggles support for the normalization quick check routine.

The default value is 'false'.

caseConversion

Toggles support for the Case Convert API.

The default value is '[]'.

caseFolding

Toggles support for the Case Fold API.

The default value is '[]'.

segmentation

Toggles support for the Text Segmentation API.

The default value is '[]'.

compression

Toggles support for the Compression API.

The default value is 'false'.

collation

Toggles support for the Collation API.

The default value is 'false'.

encodingConvert

Toggles support for the encoding form conversion routines.

The default value is 'false'.

Types

The JSON interfaces make use of various types which are defined in this section.

Block

A glob pattern for one or more Unicode character blocks. For example, the glob pattern Latin* will match Latin-1 Supplement and Latin Extended-A, but not Basic Latin.

Glob patterns are case-insensitive. That means Basic Latin and basic latin are treated identically.

type Block = string;
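
The matching behavior described above can be approximated in Python with the standard fnmatch module. This is a sketch of the semantics only: the helper names are hypothetical, and the exact glob dialect Unicorn accepts (for instance, whether ? and bracket expressions are supported) is not specified here.

```python
from fnmatch import fnmatchcase


def block_matches(pattern: str, block_name: str) -> bool:
    """Case-insensitive glob match of a pattern against a Unicode block name."""
    # Lowercase both sides so the comparison ignores case on every platform.
    return fnmatchcase(block_name.lower(), pattern.lower())


def excluded_blocks(patterns: list, block_names: list) -> list:
    """Return the block names matched by at least one pattern."""
    return [
        name
        for name in block_names
        if any(block_matches(pattern, name) for pattern in patterns)
    ]
```

For the pattern Latin*, this selects Latin-1 Supplement and Latin Extended-A and leaves Basic Latin alone, because the pattern is anchored at the start of the name.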

NormalizationForm

Corresponds to the normalization forms.

type NormalizationForm = "nfd" | "nfc";

Segmentation

Corresponds to a text boundary.

type Segmentation = "grapheme" | "word" | "sentence";

EncodingForm

Corresponds to a Unicode character encoding form.

Enabling either the UTF-16 or UTF-32 encoding form implicitly enables both big and little endian variants. Observe that the Unicode scalar value is not a possible value as it's always implicitly enabled.

type EncodingForm = "utf-8" | "utf-16" | "utf-32";

CaseConvert

Corresponds to a casing form.

type CaseConvert = "lower" | "upper" | "title";

CaseFold

Corresponds to a case fold form.

type CaseFold = "default" | "canonical";

CharacterProperty

A Unicode character property.

type CharacterProperty = string;

Where the character property string is one of the following:

Endian

The endianness of the target architecture. Defaults to "native", which indicates that the target architecture has the same endianness as the architecture Unicorn is built on. The other options are "little" for little endian and "big" for big endian.

This option is provided for users cross-compiling Unicorn for an architecture with a different endianness than the architecture it's being compiled on.

type Endian = "little" | "big" | "native";

OptimizationPriority

Whether Unicorn generates C source code optimized for speed or size.

type OptimizationPriority = "speed" | "size";
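
The enumerated types in this section lend themselves to a quick sanity check before running the generator. The Python sketch below transcribes the allowed values from the type definitions and reports anything unexpected. The function name find_invalid_values is hypothetical, and because this document does not say whether the generator treats values case-sensitively, the sketch compares case-insensitively as an assumption.

```python
# Allowed string values, transcribed from the type definitions in this section.
NORMALIZATION_FORMS = {"nfd", "nfc"}
SEGMENTATIONS = {"grapheme", "word", "sentence"}
ENCODING_FORMS = {"utf-8", "utf-16", "utf-32"}
CASE_CONVERSIONS = {"lower", "upper", "title"}
CASE_FOLDS = {"default", "canonical"}
ENDIANS = {"little", "big", "native"}
OPTIMIZATION_PRIORITIES = {"speed", "size"}


def find_invalid_values(config: dict) -> list:
    """Return a message for every enumerated value outside its allowed set."""
    algorithms = config.get("algorithms", {})
    # (key path, values found in the config, allowed set)
    checks = [
        ("algorithms.normalization", algorithms.get("normalization", []), NORMALIZATION_FORMS),
        ("algorithms.segmentation", algorithms.get("segmentation", []), SEGMENTATIONS),
        ("algorithms.caseConversion", algorithms.get("caseConversion", []), CASE_CONVERSIONS),
        ("algorithms.caseFolding", algorithms.get("caseFolding", []), CASE_FOLDS),
        ("encodingForms", config.get("encodingForms", []), ENCODING_FORMS),
        # Scalar options are wrapped in a list so one loop handles everything.
        ("endian", [config.get("endian", "native")], ENDIANS),
        ("optimizeFor", [config.get("optimizeFor", "speed")], OPTIMIZATION_PRIORITIES),
    ]
    problems = []
    for key, values, allowed in checks:
        for value in values:
            # Case-insensitive comparison is an assumption, not documented behavior.
            if str(value).lower() not in allowed:
                problems.append(f"{key}: unexpected value {value!r}")
    return problems
```

An empty list means every enumerated value was recognized; CharacterProperty and Block strings are not checked because their valid values are not enumerated here.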

Implicit Features

Certain Unicode algorithms depend on other algorithms and character properties. For example, the canonical composition algorithm depends upon the canonical decomposition algorithm. Unicorn will implicitly enable dependent features as needed. Similarly, if an algorithm depends on a character property that the user has not explicitly enabled, that property will be implicitly enabled.

To illustrate, given the following JSON, the Python script would implicitly add NFD to the normalization array because it understands that NFC depends upon it. It would also implicitly add any character properties that the algorithms need.

{
    "version": "1.0",
    "algorithms": {
        "normalization": [
            "nfc"
        ]
    }
}
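
The expansion described above can be sketched in Python as a small dependency walk. The dependency table and the function name are hypothetical; only the edge from NFC to NFD is taken from the text, and the real generator script may resolve dependencies differently.

```python
# Hypothetical dependency table: only the NFC -> NFD edge comes from the text.
NORMALIZATION_DEPENDENCIES = {"nfc": ["nfd"]}


def expand_normalization(forms: list) -> list:
    """Return the requested forms plus every form they depend on."""
    result = []
    # Lowercase the input so "NFC" and "nfc" are handled alike (an assumption).
    pending = [form.lower() for form in forms]
    while pending:
        form = pending.pop(0)
        if form not in result:
            result.append(form)
            # Queue the dependencies so they are expanded in turn.
            pending.extend(NORMALIZATION_DEPENDENCIES.get(form, []))
    return result
```

Requesting only NFC therefore yields both NFC and NFD, while requesting only NFD leaves the list unchanged.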