Character Properties

Query character properties.

Enumerations

enum unigc: General category.

enum unibp: Binary properties.

Functions

unigc uni_gc(unichar c): General category.

uint8_t uni_ccc(unichar c): Canonical combining class.

bool uni_is(unichar c, unibp p): Binary property value.

const char *uni_numval(unichar c): Numeric property.

unichar uni_tolower(unichar c): Simple lower case mapping.

unichar uni_totitle(unichar c): Simple title case mapping.

unichar uni_toupper(unichar c): Simple upper case mapping.

Discussion

The Unicode Character Database defines a large repertoire of character properties. Most character properties apply only to specific applications, e.g. text shaping or rendering. Others are informational, for example a character’s name or the version of the Unicode Standard in which it was introduced. Still others are relevant only when implementing particular Unicode algorithms. The properties supported by Unicorn are those relevant to parsing plain text.

Character properties have a value associated with them. For example, there are binary character properties which have the boolean values true and false, enumeration character properties that have one of a fixed set of values, and string character properties which map to one or more code points. Most Unicode character properties are binary properties.

All character properties have a default value. The default value is the value a character property takes for an unassigned code point. For example, the default value of a binary Unicode character property is always false.
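The default-value rule can be illustrated with a minimal sketch of a binary property lookup. This is not how Unicorn stores its data; the names `cp_range`, `has_binary_property`, and `sketch_is_white_space` are hypothetical, and `uint32_t` stands in for a code point type. The table lists the White_Space ranges as understood from the Unicode Character Database and should be checked against the Unicode version being targeted. Any code point absent from the table, including every unassigned one, falls through to the default value of false.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical inclusive range of code points that have a binary property. */
struct cp_range {
    uint32_t first;
    uint32_t last;
};

/* White_Space ranges, sorted by code point (assumption: verify against
 * PropList.txt for the Unicode version you target). */
static const struct cp_range white_space_ranges[] = {
    {0x0009, 0x000D},
    {0x0020, 0x0020},
    {0x0085, 0x0085},
    {0x00A0, 0x00A0},
    {0x1680, 0x1680},
    {0x2000, 0x200A},
    {0x2028, 0x2029},
    {0x202F, 0x202F},
    {0x205F, 0x205F},
    {0x3000, 0x3000},
};

/* Binary search over sorted, non-overlapping ranges. */
static bool has_binary_property(const struct cp_range *ranges, size_t count,
                                uint32_t cp)
{
    size_t lo = 0;
    size_t hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < ranges[mid].first) {
            hi = mid;
        } else if (cp > ranges[mid].last) {
            lo = mid + 1;
        } else {
            return true;
        }
    }
    /* Not listed: the property takes its default value, which is false
     * for every binary property. This covers unassigned code points. */
    return false;
}

/* Hypothetical wrapper; not part of the Unicorn API. */
bool sketch_is_white_space(uint32_t cp)
{
    size_t count = sizeof white_space_ranges / sizeof white_space_ranges[0];
    return has_binary_property(white_space_ranges, count, cp);
}
```

With this table, `sketch_is_white_space(0x2003)` (EM SPACE) is true, while a code point that is not listed, such as the letter U+0041 or a code point with no assigned character, yields false without needing any explicit entry.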

Property Stability

Updates to character properties in the Unicode Character Database may be required for any of three reasons:

To cover new characters added to the standard.
To add new character properties to the standard.
To change the assigned values for a property for some characters already in the standard.

While the Unicode Consortium tries to keep the values of all character properties as stable as possible between versions, occasionally circumstances may arise which require changing them.

Character Classification

The C standard library includes the ctype.h header, which provides character classification functions. These functions operate on byte-oriented character encodings, not on Unicode characters. For compatibility with Unicode, this subsection defines classification functions that operate on individual code points and are equivalent to their ctype.h counterparts.

The following table shows the recommended character classification functions that are compatible with their C/POSIX definitions. These mappings conform to the Standard Recommendations in Annex C: Compatibility Properties of UTS #18 Unicode Regular Expressions.

Note that the standard mappings do not strictly conform to POSIX in all cases. For example, POSIX does not allow more than 20 characters to be categorized as digits, whereas there are more than 20 digit characters in Unicode. This is reflected in the POSIX-conforming definitions of isdigit and isxdigit, which are restricted to ASCII. Another difference is ispunct: POSIX defines it to include symbols, which UTS #18 does not recommend.

A blank cell in the 'POSIX Conforming' column means the definition in the 'Standard' column applies unchanged; a non-blank cell replaces the 'Standard' definition.

Property	Standard	POSIX Conforming
isalpha	UNI_ALPHABETIC
islower	UNI_LOWERCASE
isupper	UNI_UPPERCASE
ispunct	UNI_*_PUNCTUATION	UNI_*_PUNCTUATION UNI_*_SYMBOL not isalpha
isdigit	UNI_DECIMAL_NUMBER	`[0-9]`
isxdigit	UNI_DECIMAL_NUMBER UNI_HEX_DIGIT	`[0-9 A-F a-f]`
isalnum	isalpha isdigit
isspace	UNI_WHITE_SPACE
isblank	UNI_SPACE_SEPARATOR `U+0009`
iscntrl	UNI_CONTROL
isgraph	not UNI_CONTROL not UNI_SURROGATE not UNI_UNASSIGNED not isspace
isprint	(isgraph or isblank) and not iscntrl
tolower	uni_tolower
toupper	uni_toupper
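The two 'POSIX Conforming' entries that are plain ASCII ranges need no Unicode data at all, so they can be sketched directly. The function names `posix_isdigit` and `posix_isxdigit` are hypothetical and not part of Unicorn, and `uint32_t` is used here as a stand-in for a code point type.

```c
#include <stdbool.h>
#include <stdint.h>

/* POSIX-conforming isdigit: only the ten ASCII digits [0-9]. */
bool posix_isdigit(uint32_t cp)
{
    return cp >= 0x30 && cp <= 0x39;
}

/* POSIX-conforming isxdigit: [0-9 A-F a-f], ASCII only. */
bool posix_isxdigit(uint32_t cp)
{
    return posix_isdigit(cp)
        || (cp >= 0x41 && cp <= 0x46)   /* 'A'..'F' */
        || (cp >= 0x61 && cp <= 0x66);  /* 'a'..'f' */
}
```

Under these definitions U+0663 ARABIC-INDIC DIGIT THREE is not a digit and U+FF21 FULLWIDTH LATIN CAPITAL LETTER A is not a hex digit, even though the 'Standard' column would accept both through UNI_DECIMAL_NUMBER and UNI_HEX_DIGIT respectively.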

The isgraph property is meant to include characters that have a visible representation when rendered on a screen or printed. However, Unicode also includes characters with general category Cf, which have no glyph of their own. They are included because they serve as formatting or control characters (e.g. zero-width joiners, non-joiners, bidirectional markers) that affect the layout or presentation of adjacent characters. For example, the zero-width joiner (U+200D) and zero-width non-joiner (U+200C) affect how adjacent characters combine in scripts like Arabic or Devanagari. Including these characters in isgraph acknowledges their contextual significance.
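How the derived rows of the table compose, and why Cf characters end up in isgraph, can be sketched with a toy classifier. Everything below is an assumption made for illustration: `toy_gc`, `toy_gc_of`, and the `toy_is*` functions are hypothetical, they model only a handful of code points, and they do not use or reproduce Unicorn's `unigc` values or data. In real code the general category and the White_Space property would come from the library.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy general categories; hypothetical names, not Unicorn's unigc values. */
enum toy_gc {
    TOY_CONTROL,         /* Cc */
    TOY_FORMAT,          /* Cf */
    TOY_SURROGATE,       /* Cs */
    TOY_UNASSIGNED,      /* Cn */
    TOY_SPACE_SEPARATOR, /* Zs */
    TOY_OTHER            /* everything this sketch does not model */
};

/* Toy general category lookup covering only a few code points. */
static enum toy_gc toy_gc_of(uint32_t cp)
{
    if (cp <= 0x1F || (cp >= 0x7F && cp <= 0x9F)) return TOY_CONTROL;
    if (cp == 0x20 || cp == 0xA0) return TOY_SPACE_SEPARATOR;
    if (cp == 0x200C || cp == 0x200D) return TOY_FORMAT;
    if (cp >= 0xD800 && cp <= 0xDFFF) return TOY_SURROGATE;
    if (cp == 0x0378) return TOY_UNASSIGNED;
    return TOY_OTHER;
}

/* Toy White_Space test limited to the Latin-1 range. */
static bool toy_isspace(uint32_t cp)
{
    return (cp >= 0x09 && cp <= 0x0D) || cp == 0x20 || cp == 0x85 || cp == 0xA0;
}

/* isblank: Space_Separator plus U+0009. */
bool toy_isblank(uint32_t cp)
{
    return toy_gc_of(cp) == TOY_SPACE_SEPARATOR || cp == 0x09;
}

/* iscntrl: general category Control. */
bool toy_iscntrl(uint32_t cp)
{
    return toy_gc_of(cp) == TOY_CONTROL;
}

/* isgraph: not Control, not Surrogate, not Unassigned, not isspace.
 * Format (Cf) is not excluded, so U+200C and U+200D qualify. */
bool toy_isgraph(uint32_t cp)
{
    enum toy_gc gc = toy_gc_of(cp);
    return gc != TOY_CONTROL
        && gc != TOY_SURROGATE
        && gc != TOY_UNASSIGNED
        && !toy_isspace(cp);
}

/* isprint: (isgraph or isblank) and not iscntrl. */
bool toy_isprint(uint32_t cp)
{
    return (toy_isgraph(cp) || toy_isblank(cp)) && !toy_iscntrl(cp);
}
```

In this sketch `toy_isgraph(0x200D)` is true because nothing in the definition excludes Cf, U+0020 is printable but not graphic, and U+0009 is blank yet not printable because it is also a control character.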
