Text Encodings
Manipulate encoding forms.
Functions 🔗
Decode a scalar value.
Decode the previous scalar value.
Encode a scalar value.
Convert encoding forms.
Validate text.
Discussion 🔗
The Unicode Standard defines plain text as a sequence of Unicode scalar values. A Unicode scalar value is any code point excluding the surrogate code points, which are reserved exclusively for the UTF-16 encoding form. A code point is a 21-bit integer in the Unicode code space used to identify a character. The Unicode code space is the range of integers from which code points can be allocated: it begins at 0 and runs up to and including 1,114,111 (0x10FFFF). The Unicode Standard version 16.0 defines 155,063 encoded characters.
In Unicode, a character is what the machine considers to be a character, which is not necessarily what you, the human, perceive as one. For example, LINE FEED (U+000A) is a control character denoting a line boundary. A human would never think of it as a “character,” but the machine does. What humans perceive as a character is called a grapheme. A grapheme can be composed of one or more Unicode characters, and the segmentation of characters into graphemes can be performed with the segmentation interface.
Encoding Forms 🔗
An encoding form defines how a code point is represented in memory. The Unicode Standard defines three encoding forms for encoding Unicode characters: UTF-8, UTF-16, and UTF-32. An encoding form represents a code point as a sequence of one or more code units, where a code unit is the smallest unit of storage in that form: 8 bits in UTF-8, 16 bits in UTF-16, and 32 bits in UTF-32. For example, the code point U+1F6F8 is represented by four code units in UTF-8, two code units in UTF-16, and one code unit in UTF-32.
UTF-32 is the odd encoding form out because it doesn’t really “encode” anything: its value is always identical to the value of the code point. It might be tempting to think of UTF-32 and code points as synonymous, but that would be semantically incorrect: a code point is a 21-bit integer in the Unicode code space, whereas UTF-32 is a storage convention that says “use a 32-bit integer as the storage for the 21-bit code point.”
Encodings in Practice 🔗
Unicorn implements the function uni_convert for converting between Unicode encoding forms, and the functions uni_next, uni_prev, and uni_encode for decoding and encoding Unicode scalar values.
Support for each encoding form is enabled individually in the JSON configuration file, where the string values “UTF-8”, “UTF-16”, and “UTF-32” correspond to the UNI_UTF8, UNI_UTF16, and UNI_UTF32 constants, respectively.
{
    "encodingForms": [
        "UTF-8",
        "UTF-16",
        "UTF-32"
    ]
}