charisma.h

Unicode character decoder and encoder.

Decoders

int32_t utf8_decode( const uint8_t *text, int32_t length, int32_t *index, uchar *c): Decode UTF-8.

int32_t utf16_decode( const uint16_t *text, int32_t length, int32_t *index, uchar *c): Decode UTF-16 (native byte order).

int32_t utf16be_decode( const uint16_t *text, int32_t length, int32_t *index, uchar *c): Decode UTF-16 (big endian).

int32_t utf16le_decode( const uint16_t *text, int32_t length, int32_t *index, uchar *c): Decode UTF-16 (little endian).

int32_t utf32_decode( const uint32_t *text, int32_t length, int32_t *index, uchar *c): Decode UTF-32 (native byte order).

int32_t utf32be_decode( const uint32_t *text, int32_t length, int32_t *index, uchar *c): Decode UTF-32 (big endian).

int32_t utf32le_decode( const uint32_t *text, int32_t length, int32_t *index, uchar *c): Decode UTF-32 (little endian).

Encoders

int32_t utf8_encode( uchar c, uint8_t *buf): Encode to UTF-8.

int32_t utf16_encode( uchar c, uint16_t *buf): Encode to UTF-16 (native byte order).

int32_t utf16be_encode( uchar c, uint16_t *buf): Encode to UTF-16 (big endian).

int32_t utf16le_encode( uchar c, uint16_t *buf): Encode to UTF-16 (little endian).

int32_t utf32_encode( uchar c, uint32_t *buf): Encode to UTF-32 (native byte order).

int32_t utf32be_encode( uchar c, uint32_t *buf): Encode to UTF-32 (big endian).

int32_t utf32le_encode( uchar c, uint32_t *buf): Encode to UTF-32 (little endian).

Types

typedef uint32_t uchar: Unicode scalar value.
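
As a small illustration of what "Unicode scalar value" means for this type: a scalar value is any code point from U+0000 to U+10FFFF, excluding the surrogate range U+D800 to U+DFFF. The helper name sketch_is_scalar_value() is hypothetical and is not part of charisma.h.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t uchar; /* mirrors the typedef documented above */

/* Hypothetical helper, not a Charisma function: true when 'c' is a
 * Unicode scalar value (in range and not a UTF-16 surrogate). */
bool sketch_is_scalar_value(uchar c)
{
    bool in_range = (c <= 0x10FFFFu);
    bool is_surrogate = (c >= 0xD800u) && (c <= 0xDFFFu);
    return in_range && !is_surrogate;
}
```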

Discussion

Charisma is a Unicode® character decoder and encoder library written in C99 with no dependencies. It provides functions for decoding and encoding characters safely in UTF-8, UTF-16, and UTF-32 (big or little endian byte order). It can recover from malformed characters, allowing decoding to continue.

Charisma conforms to the MISRA C:2012 coding standard.

Decoding functions

The utf*_decode() functions accept four arguments: (1) a pointer to a Unicode-encoded string; (2) the length of the string in code units, or -1 if the string is null-terminated; (3) a pointer to the code unit index of an encoded character in the string; (4) a pointer to memory where the decoded Unicode scalar value will be written.

These functions return an integer 'n' that falls into one of three cases: (1) n > 0, where 'n' is the number of code units in the encoded character; (2) n = 0, if the code unit index is at the end of the string; (3) n < 0, if a malformed character is found.
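
The three return cases can be illustrated with a minimal sketch. sketch_utf8_decode() and sketch_count() are hypothetical stand-ins written for this example; they are not Charisma's implementation. The sketch also assumes that the function advances *index past whatever it consumed, including after a malformed sequence. The description above does not confirm that behaviour, so check it against the library before relying on it.

```c
#include <stdint.h>

typedef uint32_t uchar; /* mirrors the typedef documented in Types */

/* Hypothetical stand-in with the same argument shape as utf8_decode().
 * ASSUMPTION: *index is advanced past the code units that were consumed. */
int32_t sketch_utf8_decode(const uint8_t *text, int32_t length,
                           int32_t *index, uchar *c)
{
    int32_t i = *index;
    int32_t need; /* number of continuation bytes expected */
    uchar cp;     /* value accumulated so far */
    uchar min;    /* smallest value allowed for this sequence length */
    int32_t k;

    /* Case n = 0: explicit length reached, or null terminator found. */
    if (((length >= 0) && (i >= length)) || ((length < 0) && (text[i] == 0u))) {
        return 0;
    }

    uint8_t b0 = text[i];
    if (b0 < 0x80u)                { need = 0; cp = b0;         min = 0u; }
    else if ((b0 & 0xE0u) == 0xC0u) { need = 1; cp = b0 & 0x1Fu; min = 0x80u; }
    else if ((b0 & 0xF0u) == 0xE0u) { need = 2; cp = b0 & 0x0Fu; min = 0x800u; }
    else if ((b0 & 0xF8u) == 0xF0u) { need = 3; cp = b0 & 0x07u; min = 0x10000u; }
    else {
        *index = i + 1; /* invalid lead byte: skip it */
        return -1;
    }

    for (k = 1; k <= need; k++) {
        int32_t j = i + k;
        int at_end = (length >= 0) ? (j >= length) : (text[j] == 0u);
        if (at_end || ((text[j] & 0xC0u) != 0x80u)) {
            *index = j; /* resume at the unit that broke the sequence */
            return -1;
        }
        cp = (cp << 6) | (uchar)(text[j] & 0x3Fu);
    }

    *index = i + need + 1;
    /* Reject overlong forms, surrogates, and values above U+10FFFF. */
    if ((cp < min) || (cp > 0x10FFFFu) || ((cp >= 0xD800u) && (cp <= 0xDFFFu))) {
        return -1;
    }
    *c = cp;
    return need + 1; /* case n > 0: code units in this character */
}

/* Walk a string, counting well-formed and malformed characters. */
void sketch_count(const uint8_t *text, int32_t length,
                  int32_t *good, int32_t *bad)
{
    int32_t index = 0;
    *good = 0;
    *bad = 0;
    for (;;) {
        uchar c = 0u;
        int32_t n = sketch_utf8_decode(text, length, &index, &c);
        if (n == 0) {
            break;      /* end of string */
        } else if (n > 0) {
            (*good)++;  /* 'c' holds a decoded scalar value */
        } else {
            (*bad)++;   /* malformed; the loop simply continues */
        }
    }
}
```

The loop in sketch_count() shows the intended calling pattern under that assumption: the caller never adjusts the index itself and only inspects the sign of the result.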

The encoding of the string is specified by the prefix of the function, e.g. "utf8_" indicates a UTF-8 encoded string and "utf32be_" indicates a UTF-32 big endian string. Functions without an explicit byte order in their name assume native byte order.
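
To make the byte-order distinction concrete, the following sketch reads one stored 16-bit code unit as big endian or as little endian, independent of the host's native order. The helper names are hypothetical and this is only one plausible reading of what the "be" and "le" variants must do internally; Charisma's actual approach is not described here.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: interpret the two bytes of a stored code unit
 * as big endian (most significant byte first). */
uint16_t sketch_load_u16be(const uint16_t *unit)
{
    uint8_t bytes[2];
    memcpy(bytes, unit, sizeof bytes); /* inspect the bytes as stored */
    return (uint16_t)(((uint16_t)bytes[0] << 8) | bytes[1]);
}

/* Hypothetical helper: interpret the same two bytes as little endian
 * (least significant byte first). */
uint16_t sketch_load_u16le(const uint16_t *unit)
{
    uint8_t bytes[2];
    memcpy(bytes, unit, sizeof bytes);
    return (uint16_t)(((uint16_t)bytes[1] << 8) | bytes[0]);
}
```

On any given host, exactly one of the two helpers returns the same value as reading the uint16_t directly, which is the case the functions without "be" or "le" in their name correspond to.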

Encoding functions

The utf*_encode() functions accept a Unicode scalar value, encode it in the associated Unicode encoding form, and write the result to the buffer pointed to by the second argument. The number of code units written is returned. If the input character is not a Unicode scalar value, then -1 is returned. Note that a null terminator is never written to the buffer.
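
The contract above can be sketched for UTF-8 as follows. sketch_utf8_encode() and sketch_append() are hypothetical stand-ins for illustration, not Charisma's implementation. The sketch assumes the caller supplies room for the longest possible character (four code units in UTF-8) and adds the null terminator itself, since the encoder never writes one.

```c
#include <stdint.h>

typedef uint32_t uchar; /* mirrors the typedef documented in Types */

/* Hypothetical stand-in with the same argument shape as utf8_encode().
 * Returns the number of code units written, or -1 if 'c' is not a
 * Unicode scalar value. Never writes a null terminator. */
int32_t sketch_utf8_encode(uchar c, uint8_t *buf)
{
    int32_t n;
    if ((c > 0x10FFFFu) || ((c >= 0xD800u) && (c <= 0xDFFFu))) {
        n = -1; /* out of range or surrogate */
    } else if (c < 0x80u) {
        buf[0] = (uint8_t)c;
        n = 1;
    } else if (c < 0x800u) {
        buf[0] = (uint8_t)(0xC0u | (c >> 6));
        buf[1] = (uint8_t)(0x80u | (c & 0x3Fu));
        n = 2;
    } else if (c < 0x10000u) {
        buf[0] = (uint8_t)(0xE0u | (c >> 12));
        buf[1] = (uint8_t)(0x80u | ((c >> 6) & 0x3Fu));
        buf[2] = (uint8_t)(0x80u | (c & 0x3Fu));
        n = 3;
    } else {
        buf[0] = (uint8_t)(0xF0u | (c >> 18));
        buf[1] = (uint8_t)(0x80u | ((c >> 12) & 0x3Fu));
        buf[2] = (uint8_t)(0x80u | ((c >> 6) & 0x3Fu));
        buf[3] = (uint8_t)(0x80u | (c & 0x3Fu));
        n = 4;
    }
    return n;
}

/* Append one character to a caller-owned buffer and terminate it.
 * Returns the new number of used code units, or -1 on failure. */
int32_t sketch_append(uint8_t *out, int32_t used, int32_t capacity, uchar c)
{
    int32_t n;
    if ((capacity - used) < 5) {
        return -1; /* need up to 4 code units plus the terminator */
    }
    n = sketch_utf8_encode(c, &out[used]);
    if (n < 0) {
        return -1; /* not a scalar value; buffer left unchanged */
    }
    out[used + n] = 0u; /* the caller, not the encoder, terminates */
    return used + n;
}
```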

The encoding of the buffer is specified by the prefix of the function, e.g. "utf8_" indicates a UTF-8 encoded buffer and "utf32be_" indicates a UTF-32 big endian buffer. Functions without an explicit byte order in their name assume native byte order.
