Compression

Compress and decompress short strings.

Functions

const void *text, unisize text_len, uniattr text_attr, uint8_t *buffer, size_t *buffer_length)

Compress text.

const uint8_t *buffer, size_t buffer_length, void *text, unisize *text_len, uniattr text_attr)

Decompress text.

Discussion

Lossless data compression algorithms, like LZW, perform compression by deduplicating repetitions within a data stream. These algorithms are general purpose and are designed to operate on “bags of bytes” without any prior knowledge or assumptions about the structure or semantics of the data.

Unicode Technical Report #34 defines two algorithms designed specifically for the compression of Unicode-encoded text: the Standard Compression Scheme for Unicode (SCSU) and Binary Ordered Compression for Unicode (BOCU). These algorithms leverage knowledge of the Unicode character set and beat general-purpose compression algorithms on shorter strings. Both are intended for short to medium-length Unicode strings (roughly several hundred characters). Once strings become longer and include many repetitions, a general-purpose compressor is preferable.

Support for compression must be enabled in the JSON configuration file.

{
    "algorithms": {
        "compression": true
    }
}

About BOCU-1

Unicorn implements the BOCU-1 algorithm. BOCU-1 is deterministic: the same input text is encoded identically by all encoders, which makes it suitable for interchange. It is also a MIME-compatible Unicode compression scheme, meaning it can be used directly in emails and similar protocols. It is well suited to transmitting short messages over a network or writing them to persistent storage.

With BOCU-1, text in languages with Latin-based scripts takes about the same amount of space as with UTF-8, while text in all other languages takes about 25%-60% less space. Compared to UTF-16, text in languages with small character repertoires takes approximately half as much space in BOCU-1. For large character sets, i.e. Chinese/Japanese/Korean, text is about the same size in UTF-16 and BOCU-1.

BOCU-1 is an IANA registered charset and has its own BOM: 0xFB 0xEE 0x28. Unicorn’s implementation does not prepend the BOM and its decompressor will ignore it.
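Because the compressor does not emit the BOM, an application that stores or forwards BOCU-1 data may want to detect the signature itself, for example to tell whether data produced elsewhere carries it. The following is a minimal sketch of such a check using the byte values given above. The helper `bocu1_bom_length` is hypothetical and is not part of Unicorn.

```c
#include <stddef.h>
#include <stdint.h>

/* BOCU-1 signature bytes as stated above: 0xFB 0xEE 0x28. */
static const uint8_t BOCU1_BOM[3] = { 0xFB, 0xEE, 0x28 };

/* Hypothetical helper (not a Unicorn function): returns the number of
   leading bytes occupied by the BOCU-1 signature, which is either 3
   (signature present) or 0 (signature absent or buffer too short). */
static size_t bocu1_bom_length(const uint8_t *buffer, size_t buffer_length)
{
    if (buffer == NULL || buffer_length < sizeof BOCU1_BOM) {
        return 0;
    }
    for (size_t i = 0; i < sizeof BOCU1_BOM; i++) {
        if (buffer[i] != BOCU1_BOM[i]) {
            return 0;
        }
    }
    return sizeof BOCU1_BOM;
}
```

A caller can add the returned length to the buffer pointer and subtract it from the buffer length to obtain the payload without the signature. This is not needed before handing data to Unicorn's decompressor, which ignores the BOM on its own.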

The BOCU-1 algorithm is documented in Unicode Technical Note #6.