charisma.h

Unicode character decoder and encoder.

Decoders

int32_t utf8_decode(const uint8_t *text, int32_t length, int32_t *index, uchar *c)

Decode UTF-8.

int32_t utf16_decode(const uint16_t *text, int32_t length, int32_t *index, uchar *c)

Decode UTF-16 (native byte order).

int32_t utf16be_decode(const uint16_t *text, int32_t length, int32_t *index, uchar *c)

Decode UTF-16 (big endian).

int32_t utf16le_decode(const uint16_t *text, int32_t length, int32_t *index, uchar *c)

Decode UTF-16 (little endian).

int32_t utf32_decode(const uint32_t *text, int32_t length, int32_t *index, uchar *c)

Decode UTF-32 (native byte order).

int32_t utf32be_decode(const uint32_t *text, int32_t length, int32_t *index, uchar *c)

Decode UTF-32 (big endian).

int32_t utf32le_decode(const uint32_t *text, int32_t length, int32_t *index, uchar *c)

Decode UTF-32 (little endian).

Encoders

int32_t utf8_encode(uchar c, uint8_t *buf)

Encode to UTF-8.

int32_t utf16_encode(uchar c, uint16_t *buf)

Encode to UTF-16 (native byte order).

int32_t utf16be_encode(uchar c, uint16_t *buf)

Encode to UTF-16 (big endian).

int32_t utf16le_encode(uchar c, uint16_t *buf)

Encode to UTF-16 (little endian).

int32_t utf32_encode(uchar c, uint32_t *buf)

Encode to UTF-32 (native byte order).

int32_t utf32be_encode(uchar c, uint32_t *buf)

Encode to UTF-32 (big endian).

int32_t utf32le_encode(uchar c, uint32_t *buf)

Encode to UTF-32 (little endian).

Types

typedef uint32_t uchar

Unicode scalar value.
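
For readers unfamiliar with the term, a Unicode scalar value is any code point from U+0000 to U+10FFFF except the surrogate range U+D800 to U+DFFF. The check can be sketched as below; `is_scalar_value` is a hypothetical helper written for illustration and is not part of charisma.h.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t uchar; /* mirrors the typedef in charisma.h */

/* Hypothetical helper (not a Charisma function): returns true when c is a
   Unicode scalar value, i.e. within U+0000..U+10FFFF and not a surrogate. */
static bool is_scalar_value(uchar c)
{
    const bool in_range = (c <= 0x10FFFFu);
    const bool is_surrogate = (c >= 0xD800u) && (c <= 0xDFFFu);
    return in_range && !is_surrogate;
}
```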

Discussion

Charisma is a Unicode® character decoder and encoder library written in C99 with no dependencies. It provides functions for decoding and encoding characters safely in UTF-8, UTF-16, and UTF-32 (big or little endian byte order). It can recover from malformed characters, allowing decoding to continue.

Charisma conforms to the MISRA C:2012 coding standard.

Decoding functions

The utf*_decode() functions accept four arguments: (1) a pointer to an encoded Unicode string; (2) the length of the string in code units, or -1 if the string is null terminated; (3) a pointer to the code unit index of an encoded character in the string; (4) a pointer to memory where the decoded Unicode scalar value will be written.

These functions return an integer 'n' which is one of three possible values: (1) n > 0, where 'n' is the number of code units in the encoded character; (2) n = 0, if the code unit index is at the end of the string; (3) n < 0, if a malformed character is found.

The encoding of the string is specified by the prefix of the function, e.g. "utf8_" indicates a UTF-8 encoded string, "utf32be_" indicates a UTF-32 big endian string. Functions without an explicit endianness in their name assume native byte order.
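
The return-value contract described above lends itself to a simple decoding loop. The sketch below does not link against Charisma: `sketch_utf8_decode` is a hypothetical stand-in with the same signature and return convention, and `count_characters` is an illustrative caller. One assumption is made that this page does not state: the stand-in advances `*index` past the code units it consumed, skipping a single byte on malformed input. Check the library's actual behavior before relying on that.

```c
#include <stdint.h>

typedef uint32_t uchar; /* mirrors the typedef in charisma.h */

/* Hypothetical stand-in for a utf8 decode function (not Charisma's code).
   Returns n > 0 (code units consumed), 0 (end of string), or -1 (malformed).
   Assumption: *index is advanced on success and by one byte on malformed input. */
static int32_t sketch_utf8_decode(const uint8_t *text, int32_t length, int32_t *index, uchar *c)
{
    /* Smallest value allowed for each sequence length; rejects overlong forms. */
    static const uchar min_value[5] = { 0u, 0u, 0x80u, 0x800u, 0x10000u };
    const int32_t i = *index;
    int32_t n = 0;
    uchar value = 0u;

    /* End of string: explicit length reached, or null terminator when length is -1. */
    if (((length >= 0) && (i >= length)) || ((length < 0) && (text[i] == 0u))) {
        return 0;
    }

    const uint8_t lead = text[i];
    if (lead < 0x80u)                 { n = 1; value = lead; }
    else if ((lead & 0xE0u) == 0xC0u) { n = 2; value = lead & 0x1Fu; }
    else if ((lead & 0xF0u) == 0xE0u) { n = 3; value = lead & 0x0Fu; }
    else if ((lead & 0xF8u) == 0xF0u) { n = 4; value = lead & 0x07u; }
    else                              { n = 0; } /* stray continuation or invalid lead byte */

    int32_t ok = (n > 0);
    for (int32_t k = 1; ok && (k < n); k++) {
        if ((length >= 0) && ((i + k) >= length)) {
            ok = 0; /* sequence truncated by the end of the string */
        } else if ((text[i + k] & 0xC0u) != 0x80u) {
            ok = 0; /* not a continuation byte (this also catches a null terminator) */
        } else {
            value = (value << 6) | (uchar)(text[i + k] & 0x3Fu);
        }
    }

    /* Reject overlong encodings, surrogates, and values above U+10FFFF. */
    if (ok && ((value < min_value[n]) || (value > 0x10FFFFu) ||
               ((value >= 0xD800u) && (value <= 0xDFFFu)))) {
        ok = 0;
    }
    if (!ok) {
        *index = i + 1; /* skip one byte so the caller can resynchronize */
        return -1;
    }
    *c = value;
    *index = i + n;
    return n;
}

/* Illustrative caller: counts well-formed and malformed characters
   in a null-terminated UTF-8 string (length passed as -1). */
static void count_characters(const uint8_t *text, int32_t *good, int32_t *bad)
{
    int32_t index = 0;
    *good = 0;
    *bad = 0;
    for (;;) {
        uchar c = 0u;
        const int32_t n = sketch_utf8_decode(text, -1, &index, &c);
        if (n == 0) {
            break;      /* end of string */
        }
        if (n > 0) {
            (*good)++;  /* c now holds the decoded scalar value */
        } else {
            (*bad)++;   /* malformed character: keep decoding */
        }
    }
}
```

The loop shows all three outcomes being handled: a positive result yields a scalar value, zero ends the loop, and a negative result is counted while decoding continues.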

Encoding functions

The utf*_encode() functions accept a Unicode scalar value, encode it in the associated Unicode encoding form, and write the result to the buffer pointed to by the second argument. The number of code units written is returned. If the input character is not a valid Unicode scalar value, then -1 is returned. Note that a null terminator is never written to the buffer.

The encoding of the buffer is specified by the prefix of the function, e.g. "utf8_" indicates a UTF-8 encoded string, "utf32be_" indicates a UTF-32 big endian string. Functions without an explicit endianness in their name assume native byte order.
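
Because the encoders never write a null terminator, a caller building a C string has to add one. The sketch below illustrates that pattern without linking against Charisma: `sketch_utf8_encode` is a hypothetical stand-in that follows the documented contract (code units written, or -1 for a non-scalar input), and `append_char` is an illustrative helper, not a library function.

```c
#include <stdint.h>

typedef uint32_t uchar; /* mirrors the typedef in charisma.h */

/* Hypothetical stand-in for a utf8 encode function (not Charisma's code).
   Writes 1 to 4 code units to buf and returns the count, or -1 if c is
   not a Unicode scalar value. No null terminator is written. */
static int32_t sketch_utf8_encode(uchar c, uint8_t *buf)
{
    if ((c > 0x10FFFFu) || ((c >= 0xD800u) && (c <= 0xDFFFu))) {
        return -1; /* out of range or a surrogate */
    }
    if (c < 0x80u) {
        buf[0] = (uint8_t)c;
        return 1;
    }
    if (c < 0x800u) {
        buf[0] = (uint8_t)(0xC0u | (c >> 6));
        buf[1] = (uint8_t)(0x80u | (c & 0x3Fu));
        return 2;
    }
    if (c < 0x10000u) {
        buf[0] = (uint8_t)(0xE0u | (c >> 12));
        buf[1] = (uint8_t)(0x80u | ((c >> 6) & 0x3Fu));
        buf[2] = (uint8_t)(0x80u | (c & 0x3Fu));
        return 3;
    }
    buf[0] = (uint8_t)(0xF0u | (c >> 18));
    buf[1] = (uint8_t)(0x80u | ((c >> 12) & 0x3Fu));
    buf[2] = (uint8_t)(0x80u | ((c >> 6) & 0x3Fu));
    buf[3] = (uint8_t)(0x80u | (c & 0x3Fu));
    return 4;
}

/* Illustrative helper: appends c to a null-terminated UTF-8 buffer that
   currently holds `used` bytes and has room for `capacity` bytes in total.
   Returns the new length, or -1 if c is invalid or does not fit. */
static int32_t append_char(uint8_t *out, int32_t used, int32_t capacity, uchar c)
{
    uint8_t tmp[4]; /* a UTF-8 encoded character needs at most 4 code units */
    const int32_t n = sketch_utf8_encode(c, tmp);
    if ((n < 0) || ((used + n + 1) > capacity)) {
        return -1;
    }
    for (int32_t k = 0; k < n; k++) {
        out[used + k] = tmp[k];
    }
    out[used + n] = 0u; /* the encoder does not terminate, so do it here */
    return used + n;
}
```

Encoding into a small temporary buffer first keeps the destination untouched when the character is rejected or does not fit.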