Character Properties
Query character properties.
Enumerations
- enum unigc
General category.
- enum unibp
Binary properties.
Functions
General category.
Canonical combining class.
Binary property value.
Numeric property.
Simple lower case mapping.
Simple title case mapping.
Simple upper case mapping.
Discussion
The Unicode Character Database defines a large repertoire of character properties. Most character properties are only applicable to specific applications, e.g. text shaping or rendering. Other properties are informational, for example a character's name or the version of the Unicode Standard in which it was introduced. Still others are only relevant when implementing various Unicode algorithms. The properties supported by Unicorn are those that are relevant when parsing plain text.
Character properties have a value associated with them. For example, there are binary character properties, which have the boolean values true and false; enumeration character properties, which have one of a fixed set of values; and string character properties, which map to one or more code points. Most Unicode character properties are binary properties.
All character properties have a default value. The default value is the value a character property takes for an unassigned code point. For example, the default value of a binary Unicode character property is always false.
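The default-value rule can be illustrated with a small range-table lookup. The following is a minimal sketch and not Unicorn's implementation: the function name sketch_is_white_space and the table are hypothetical, and the White_Space ranges were transcribed by hand, so they should be checked against the Unicode Character Database before being relied on. Any code point that is not listed, including every unassigned code point, falls through to the default value false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of code points for which the property is true. */
struct cp_range {
    uint32_t first;
    uint32_t last;
};

/* Hand-transcribed White_Space ranges, sorted by code point (assumption:
 * verify against PropList.txt for the Unicode version in use). */
static const struct cp_range white_space_ranges[] = {
    {0x0009, 0x000D}, {0x0020, 0x0020}, {0x0085, 0x0085},
    {0x00A0, 0x00A0}, {0x1680, 0x1680}, {0x2000, 0x200A},
    {0x2028, 0x2029}, {0x202F, 0x202F}, {0x205F, 0x205F},
    {0x3000, 0x3000},
};

/* Hypothetical helper, not part of Unicorn: binary search over the table. */
bool sketch_is_white_space(uint32_t cp)
{
    assert(cp <= 0x10FFFF); /* must lie inside the Unicode code space */

    size_t lo = 0;
    size_t hi = sizeof white_space_ranges / sizeof white_space_ranges[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < white_space_ranges[mid].first) {
            hi = mid;
        } else if (cp > white_space_ranges[mid].last) {
            lo = mid + 1;
        } else {
            return true; /* cp lies inside a listed range */
        }
    }
    /* Not listed: the default value of a binary property is false.
     * This branch also covers unassigned code points. */
    return false;
}
```

Storing only the ranges where the property is true keeps the table small, because the default value never needs to be stored.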
Property Stability
Updates to character properties in the Unicode Character Database may be required for any of three reasons:
- To cover new characters added to the standard.
- To add new character properties to the standard.
- To change the assigned values for a property for some characters already in the standard.
While the Unicode Consortium tries to keep the values of all character properties as stable as possible between versions, occasionally circumstances may arise which require changing them.
Character Classification
The C standard library includes the ctype.h header, which provides character classification functions. These functions operate on byte-oriented character encodings, not on Unicode characters. For compatibility with Unicode, this subsection defines classification functions that operate on individual code points and are equivalent to their ctype.h counterparts.
The following table shows the recommended character classification functions that are compatible with their C/POSIX definitions. These mappings conform to the Standard Recommendations in Annex C: Compatibility Properties of UTS #18 Unicode Regular Expressions.
Note that the standard mappings do not strictly conform to POSIX in all cases. For example, POSIX does not allow more than 20 characters to be categorized as digits, whereas there are more than 20 digit characters in Unicode. This is reflected in the POSIX definitions of isdigit and isxdigit. Another difference is the ispunct function: POSIX defines ispunct to include symbols, which UTS #18 does not recommend.
A blank cell in the 'POSIX Conforming' column inherits its value from the 'Standard' column; a non-blank cell overrides it.
Property | Standard | POSIX Conforming |
---|---|---|
isalpha | UNI_ALPHABETIC | |
islower | UNI_LOWERCASE | |
isupper | UNI_UPPERCASE | |
ispunct | UNI_*_PUNCTUATION | UNI_*_PUNCTUATION UNI_*_SYMBOL not alpha |
isdigit | UNI_DECIMAL_NUMBER | [0-9] |
isxdigit | UNI_DECIMAL_NUMBER UNI_HEX_DIGIT | [0-9 A-F a-f] |
isalnum | isalpha isdigit | |
isspace | UNI_WHITE_SPACE | |
isblank | UNI_SPACE_SEPARATOR U+0009 | |
iscntrl | UNI_CONTROL | |
isgraph | not UNI_CONTROL not UNI_SURROGATE not UNI_UNASSIGNED not isspace | |
isprint | (isgraph or isblank) and not iscntrl | |
tolower | uni_tolower | |
toupper | uni_toupper | |
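
The 'POSIX Conforming' entries for isdigit and isxdigit reduce to ASCII range checks on a code point. The following is a minimal sketch of those two rows; the names posix_isdigit and posix_isxdigit are hypothetical and are not part of Unicorn.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: POSIX-conforming isdigit, i.e. [0-9] only. */
bool posix_isdigit(uint32_t cp)
{
    assert(cp <= 0x10FFFF); /* must lie inside the Unicode code space */
    return cp >= 0x30 && cp <= 0x39;
}

/* Hypothetical helper: POSIX-conforming isxdigit, i.e. [0-9 A-F a-f]. */
bool posix_isxdigit(uint32_t cp)
{
    return posix_isdigit(cp)
        || (cp >= 0x41 && cp <= 0x46)   /* 'A'..'F' */
        || (cp >= 0x61 && cp <= 0x66);  /* 'a'..'f' */
}
```

Under these definitions U+0663 ARABIC-INDIC DIGIT THREE is not a digit and U+FF21 FULLWIDTH LATIN CAPITAL LETTER A is not a hexadecimal digit, although the 'Standard' column would accept both.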
The isgraph property is meant to include characters that have a visible representation when rendered on a screen or printed. However, Unicode also includes characters with general category Cf, which have no glyph representation. They are included in isgraph because they serve as formatting or control characters (e.g. zero-width joiners, zero-width non-joiners, bidirectional markers) that affect the layout or presentation of adjacent characters. For example, the zero-width joiner (U+200D) and the zero-width non-joiner (U+200C) affect how adjacent characters combine in scripts like Arabic or Devanagari. The inclusion of these characters acknowledges their contextual significance.
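The derived rows for isblank, isgraph and isprint can be read as boolean combinations of base properties. The following is a minimal sketch of that composition, assuming the base property values are supplied by the caller. The struct cp_props and the sketch_* functions are hypothetical and not part of Unicorn, and the mapping of each flag to a general category or binary property is one reading of the table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bundle of base property values for one code point.
 * In practice these would come from a Unicode property lookup. */
struct cp_props {
    uint32_t cp;
    bool control;         /* general category Cc */
    bool surrogate;       /* general category Cs */
    bool unassigned;      /* general category Cn */
    bool space_separator; /* general category Zs */
    bool white_space;     /* White_Space binary property (isspace) */
};

/* isblank: space separators plus U+0009 CHARACTER TABULATION. */
bool sketch_isblank(const struct cp_props *p)
{
    assert(p != NULL);
    return p->space_separator || p->cp == 0x0009;
}

/* isgraph: everything except controls, surrogates, unassigned code
 * points and white space. Format characters (Cf) therefore pass. */
bool sketch_isgraph(const struct cp_props *p)
{
    assert(p != NULL);
    return !p->control && !p->surrogate && !p->unassigned && !p->white_space;
}

/* isprint: (isgraph or isblank) and not iscntrl. */
bool sketch_isprint(const struct cp_props *p)
{
    return (sketch_isgraph(p) || sketch_isblank(p)) && !p->control;
}
```

With these definitions a zero-width joiner (U+200D, general category Cf) satisfies both isgraph and isprint, a space (U+0020) satisfies isprint but not isgraph, and a tab (U+0009) satisfies isblank but not isprint because it is a control character.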