Character Properties

Query character properties.

Enumerations

enum unigc

General category.

enum unibp

Binary properties.

Functions

General category.

uint8_t 

Canonical combining class.

bool 
unichar c, unibp p)

Binary property value.

const char *

Numeric property.

Simple lower case mapping.

Simple title case mapping.

Simple upper case mapping.

Discussion

The Unicode Character Database defines a large repertoire of character properties. Most character properties are relevant only to specific applications, such as text shaping or rendering. Some are informational, for example a character's name or the version of the Unicode Standard in which it was introduced. Others are relevant only when implementing particular Unicode algorithms. The properties supported by Unicorn are those that are relevant when parsing plain text.

Character properties have a value associated with them. For example, binary character properties have the Boolean values true and false, enumeration character properties take one of a fixed set of values, and string character properties map to one or more code points. Most Unicode character properties are binary properties.

All character properties have a default value. The default value is the value a character property takes for an unassigned code point. For example, the default value of a binary Unicode character property is always false.
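The default-value rule can be illustrated with a minimal sketch. The names demo_range, demo_alphabetic_ranges, and demo_is_alphabetic are hypothetical and are not part of Unicorn, and the range table is a hand-picked fragment rather than real Unicode Character Database data. The point is only that a code point absent from the table, such as an unassigned one, falls through to the default value false.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical inclusive range of code points that have the property. */
struct demo_range {
    uint32_t first;
    uint32_t last;
};

/* Hypothetical, heavily truncated table for one binary property.
 * A real table would be generated from the Unicode Character Database. */
static const struct demo_range demo_alphabetic_ranges[] = {
    { 0x0041, 0x005A }, /* LATIN CAPITAL LETTER A..Z */
    { 0x0061, 0x007A }, /* LATIN SMALL LETTER A..Z */
    { 0x03B1, 0x03C9 }, /* GREEK SMALL LETTER ALPHA..OMEGA */
};

/* Returns the value of the binary property for cp. Code points that are
 * not listed, including unassigned ones, get the default value: false. */
bool demo_is_alphabetic(uint32_t cp)
{
    size_t n = sizeof demo_alphabetic_ranges / sizeof demo_alphabetic_ranges[0];
    for (size_t i = 0; i < n; i++) {
        if (cp >= demo_alphabetic_ranges[i].first &&
            cp <= demo_alphabetic_ranges[i].last) {
            return true;
        }
    }
    return false; /* default value of a binary property */
}
```

With this sketch, demo_is_alphabetic(0x03B1) is true, while demo_is_alphabetic(0x0378) is false. U+0378 is assumed here to be unassigned, which holds for the Unicode versions current at the time of writing.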

Property Stability

Updates to character properties in the Unicode Character Database may be required for any of three reasons:

  • To cover new characters added to the standard.
  • To add new character properties to the standard.
  • To change the assigned values for a property for some characters already in the standard.

While the Unicode Consortium tries to keep the values of all character properties as stable as possible between versions, occasionally circumstances may arise which require changing them.

Character Classification

The C standard library includes the ctype.h header, which provides character classification functions. These functions operate on byte-oriented character encodings, not on Unicode characters. For compatibility with Unicode, this subsection defines classification functions that operate on individual code points and are equivalent to their ctype.h counterparts.
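The byte-oriented nature of ctype.h can be seen with standard C alone. The sketch below uses a hypothetical helper, count_alpha_bytes, which is not part of Unicorn, to count how many bytes of a string isalpha accepts. It assumes the program is running in the default "C" locale.

```c
#include <ctype.h>
#include <stddef.h>

/* Counts the bytes of a NUL-terminated string that isalpha() accepts.
 * The cast to unsigned char is required: passing a negative char value
 * to a ctype.h function is undefined behavior. */
size_t count_alpha_bytes(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (isalpha((unsigned char)*s)) {
            count++;
        }
    }
    return count;
}
```

In the "C" locale, count_alpha_bytes("e") returns 1, but count_alpha_bytes("\xC3\xA9") returns 0. The second string is the UTF-8 encoding of U+00E9 (é): the code point is alphabetic, yet neither of its two bytes is a letter on its own, so a byte-level classifier cannot see it.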

The following table shows the recommended character classification functions that are compatible with their C/POSIX definitions. These mappings conform to the Standard Recommendations in Annex C: Compatibility Properties of UTS #18 Unicode Regular Expressions.

Note that the standard mappings do not strictly conform to POSIX in all cases. For example, POSIX does not allow more than 20 characters to be categorized as digits, whereas there are more than 20 digit characters in Unicode. This is reflected in the POSIX definitions of isdigit and isxdigit. Another difference is ispunct: POSIX defines it to include symbols, which UTS #18 does not recommend.

A blank cell in the 'POSIX Conforming' column means the POSIX-conforming definition is the same as the one in the 'Standard' column. A non-blank cell replaces the 'Standard' definition.

Property   Standard                                POSIX Conforming
isalpha    UNI_ALPHABETIC
islower    UNI_LOWERCASE
isupper    UNI_UPPERCASE
ispunct    UNI_*_PUNCTUATION                       UNI_*_PUNCTUATION
                                                   UNI_*_SYMBOL
                                                   not alpha
isdigit    UNI_DECIMAL_NUMBER                      [0-9]
isxdigit   UNI_DECIMAL_NUMBER                      [0-9 A-F a-f]
           UNI_HEX_DIGIT
isalnum    isalpha
           isdigit
isspace    UNI_WHITE_SPACE
isblank    UNI_SPACE_SEPARATOR
           U+0009
iscntrl    UNI_CONTROL
isgraph    not UNI_CONTROL
           not UNI_SURROGATE
           not UNI_UNASSIGNED
           not isspace
isprint    (isgraph or isblank) and not iscntrl
tolower    uni_tolower
toupper    uni_toupper
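As a rough sketch of how the derived rows of the table compose, the following builds isalnum, isblank, isgraph, and isprint from simpler predicates. Every demo_ name is hypothetical and is not Unicorn's API. The lookups demo_category, demo_isalpha, and demo_isspace stand in for real property queries and cover only ASCII plus a few hand-picked code points, so they are wrong for most real text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical general categories; only those the derived rows need. */
enum demo_gc {
    DEMO_GC_CONTROL,
    DEMO_GC_SURROGATE,
    DEMO_GC_UNASSIGNED,
    DEMO_GC_SPACE_SEPARATOR,
    DEMO_GC_DECIMAL_NUMBER,
    DEMO_GC_OTHER_ASSIGNED /* every other assigned category, lumped together */
};

/* Hypothetical stand-in for a general category lookup. Anything it does
 * not recognize is reported as unassigned. */
enum demo_gc demo_category(uint32_t cp)
{
    if (cp <= 0x1F || (cp >= 0x7F && cp <= 0x9F)) return DEMO_GC_CONTROL;
    if (cp == 0x20 || cp == 0x3000) return DEMO_GC_SPACE_SEPARATOR;
    if ((cp >= 0x30 && cp <= 0x39) || (cp >= 0x0660 && cp <= 0x0669))
        return DEMO_GC_DECIMAL_NUMBER; /* ASCII and Arabic-Indic digits */
    if (cp >= 0xD800 && cp <= 0xDFFF) return DEMO_GC_SURROGATE;
    if (cp <= 0x7E || cp == 0x200D) return DEMO_GC_OTHER_ASSIGNED;
    return DEMO_GC_UNASSIGNED;
}

/* Stand-in for the Alphabetic property: ASCII letters only. */
bool demo_isalpha(uint32_t cp)
{
    return (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z');
}

/* Stand-in for the White_Space property: a small subset. */
bool demo_isspace(uint32_t cp)
{
    return (cp >= 0x09 && cp <= 0x0D) || cp == 0x20 || cp == 0x85 || cp == 0x3000;
}

/* 'Standard' column: any decimal number. */
bool demo_isdigit(uint32_t cp)
{
    return demo_category(cp) == DEMO_GC_DECIMAL_NUMBER;
}

/* 'POSIX Conforming' column: ASCII digits only. */
bool demo_isdigit_posix(uint32_t cp)
{
    return cp >= '0' && cp <= '9';
}

bool demo_isalnum(uint32_t cp)
{
    return demo_isalpha(cp) || demo_isdigit(cp);
}

bool demo_isblank(uint32_t cp)
{
    return demo_category(cp) == DEMO_GC_SPACE_SEPARATOR || cp == 0x09;
}

bool demo_iscntrl(uint32_t cp)
{
    return demo_category(cp) == DEMO_GC_CONTROL;
}

bool demo_isgraph(uint32_t cp)
{
    enum demo_gc gc = demo_category(cp);
    return gc != DEMO_GC_CONTROL && gc != DEMO_GC_SURROGATE &&
           gc != DEMO_GC_UNASSIGNED && !demo_isspace(cp);
}

bool demo_isprint(uint32_t cp)
{
    return (demo_isgraph(cp) || demo_isblank(cp)) && !demo_iscntrl(cp);
}
```

With these stand-ins, U+0663 ARABIC-INDIC DIGIT THREE satisfies demo_isdigit but not demo_isdigit_posix, and U+0009 is blank but not printable because it is also a control character.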

The isgraph property is meant to include characters that have a visible representation when rendered on a screen or printed. However, its definition also admits characters with general category Cf, which have no glyph representation. They are included because they serve as formatting or control characters (e.g. zero-width joiners, non-joiners, bidirectional markers) that affect the layout or presentation of adjacent characters. For example, the zero-width joiner (U+200D) and the zero-width non-joiner (U+200C) affect how adjacent characters combine in scripts such as Arabic or Devanagari. Their inclusion acknowledges this contextual significance.