charisma.h
Unicode character decoder and encoder.
Decoders
Decode UTF-8.
Decode UTF-16 (native byte order).
Decode UTF-16 (big endian).
Decode UTF-16 (little endian).
Decode UTF-32 (native byte order).
Decode UTF-32 (big endian).
Decode UTF-32 (little endian).
Encoders
Encode to UTF-8.
Encode to UTF-16 (native byte order).
Encode to UTF-16 (big endian).
Encode to UTF-16 (little endian).
Encode to UTF-32 (native byte order).
Encode to UTF-32 (big endian).
Encode to UTF-32 (little endian).
Types
- typedef uint32_t uchar
Unicode scalar value.
Discussion
Charisma is a Unicode® character decoder and encoder library written in C99 with no dependencies. It provides functions for decoding and encoding characters safely in UTF-8, UTF-16, and UTF-32 (big or little endian byte order). It can recover from malformed characters, allowing decoding to continue.
Charisma conforms to the MISRA C:2012 coding standard.
Decoding functions
The utf*_decode() functions accept four arguments: (1) a pointer to a Unicode character encoded string; (2) the length of the string in code units or -1 if the string is null terminated; (3) a code unit index to an encoded character in the string; (4) a pointer to memory where the decoded Unicode scalar value will be written.
These functions return an integer 'n' which is one of three possible values: (1) n > 0, where 'n' is the number of code units in the encoded character; (2) n = 0, if the code unit index is at the end of the string; (3) n < 0, if a malformed character is found.
The encoding of the string is specified by the prefix of the function, e.g. "utf8_" indicates a UTF-8 encoded string, "utf32be_" indicates a UTF-32 big endian string. Functions without an explicit endianness in their name assume native byte order.
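The three-way return value described above lends itself to a simple decoding loop. The following is a minimal sketch, not Charisma's implementation: sketch_utf8_decode() is a hypothetical stand-in written only to follow the contract described in this section, and its exact parameter types (int length, int index passed by value) are assumptions. How far to advance after a malformed character is not stated above, so the loop in count_scalars() (also hypothetical) assumes that skipping one code unit is acceptable.

```c
#include <stdint.h>

typedef uint32_t uchar; /* Unicode scalar value, as in the Types section */

/* Hypothetical stand-in following the documented contract:
   returns n > 0 (code units consumed), 0 (end of string), or n < 0 (malformed).
   The signature is an assumption, not Charisma's actual prototype. */
static int sketch_utf8_decode(const uint8_t *text, int length, int index, uchar *out)
{
    /* End of string: explicit length reached, or null terminator when length is -1. */
    if ((length >= 0) ? (index >= length) : (text[index] == 0u)) {
        return 0;
    }

    const uint8_t lead = text[index];
    int need;  /* total code units in this sequence */
    uchar cp;  /* value accumulated so far */
    uchar min; /* smallest value allowed for this length (rejects overlong forms) */

    if (lead < 0x80u)                 { need = 1; cp = lead;         min = 0x0u; }
    else if ((lead & 0xE0u) == 0xC0u) { need = 2; cp = lead & 0x1Fu; min = 0x80u; }
    else if ((lead & 0xF0u) == 0xE0u) { need = 3; cp = lead & 0x0Fu; min = 0x800u; }
    else if ((lead & 0xF8u) == 0xF0u) { need = 4; cp = lead & 0x07u; min = 0x10000u; }
    else { return -1; } /* stray continuation byte or invalid lead byte */

    for (int i = 1; i < need; i++) {
        if ((length >= 0) && ((index + i) >= length)) {
            return -1; /* sequence truncated by the end of the string */
        }
        const uint8_t b = text[index + i];
        if ((b & 0xC0u) != 0x80u) {
            return -1; /* not a continuation byte (this also stops at a null terminator) */
        }
        cp = (cp << 6) | (uchar)(b & 0x3Fu);
    }

    /* Reject overlong forms, values above U+10FFFF, and surrogates. */
    if ((cp < min) || (cp > 0x10FFFFu) || ((cp >= 0xD800u) && (cp <= 0xDFFFu))) {
        return -1;
    }
    *out = cp;
    return need;
}

/* Walks the string and counts well-formed scalar values.
   *malformed receives the number of malformed code units that were skipped. */
static int count_scalars(const uint8_t *text, int length, int *malformed)
{
    int index = 0;
    int count = 0;
    *malformed = 0;
    for (;;) {
        uchar cp = 0;
        const int n = sketch_utf8_decode(text, length, index, &cp);
        if (n == 0) {
            break;        /* end of string */
        }
        if (n < 0) {
            (*malformed)++;
            index += 1;   /* assumed recovery policy: skip one code unit */
            continue;
        }
        count++;
        index += n;       /* advance past the decoded character */
    }
    return count;
}
```

Passing -1 as the length makes the loop stop at the null terminator; passing an explicit length makes it stop at that code unit index, so a sequence cut off by the length is reported as malformed rather than read past the end.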
Encoding functions
The utf*_encode() functions accept a Unicode scalar value, encode it in the associated Unicode encoding form, and write the result to the buffer pointed to by the second argument. The number of code units written is returned. If the input character is not a valid Unicode scalar value, then -1 is returned. Note that a null terminator is never written to the buffer.
The encoding of the buffer is specified by the prefix of the function, e.g. "utf8_" indicates a UTF-8 encoded buffer, "utf32be_" indicates a UTF-32 big endian buffer. Functions without an explicit endianness in their name assume native byte order.
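Because the encoder never writes a null terminator, the caller has to track the write position and terminate the buffer itself. The following is a minimal sketch of that pattern for UTF-8. sketch_utf8_encode() and encode_all() are hypothetical names written to follow the contract described above; they are not Charisma's functions, and the parameter types are assumptions.

```c
#include <stdint.h>

typedef uint32_t uchar; /* Unicode scalar value, as in the Types section */

/* Hypothetical stand-in following the documented contract: writes the UTF-8
   form of cp to buf, returns the number of code units written, or -1 if cp
   is not a Unicode scalar value. No null terminator is written. */
static int sketch_utf8_encode(uchar cp, uint8_t *buf)
{
    /* Surrogates and values above U+10FFFF are not scalar values. */
    if ((cp > 0x10FFFFu) || ((cp >= 0xD800u) && (cp <= 0xDFFFu))) {
        return -1;
    }
    if (cp < 0x80u) {
        buf[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800u) {
        buf[0] = (uint8_t)(0xC0u | (cp >> 6));
        buf[1] = (uint8_t)(0x80u | (cp & 0x3Fu));
        return 2;
    }
    if (cp < 0x10000u) {
        buf[0] = (uint8_t)(0xE0u | (cp >> 12));
        buf[1] = (uint8_t)(0x80u | ((cp >> 6) & 0x3Fu));
        buf[2] = (uint8_t)(0x80u | (cp & 0x3Fu));
        return 3;
    }
    buf[0] = (uint8_t)(0xF0u | (cp >> 18));
    buf[1] = (uint8_t)(0x80u | ((cp >> 12) & 0x3Fu));
    buf[2] = (uint8_t)(0x80u | ((cp >> 6) & 0x3Fu));
    buf[3] = (uint8_t)(0x80u | (cp & 0x3Fu));
    return 4;
}

/* Encodes count scalar values into out and appends the null terminator itself.
   out must have room for 4 * count + 1 bytes (the UTF-8 worst case).
   Returns the number of bytes written, excluding the terminator, or -1 if
   any input is not a Unicode scalar value. */
static int encode_all(const uchar *src, int count, uint8_t *out)
{
    int pos = 0;
    for (int i = 0; i < count; i++) {
        const int n = sketch_utf8_encode(src[i], &out[pos]);
        if (n < 0) {
            return -1;
        }
        pos += n; /* advance by the number of code units written */
    }
    out[pos] = 0u; /* the encoder never terminates the buffer; the caller does */
    return pos;
}
```

Sizing the buffer for the worst case (four code units per scalar value in UTF-8, plus one for the terminator) keeps the loop free of bounds checks; a caller with a tighter buffer would need to check the remaining space before each call.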