Segmentation

Text segmentation.

Enumerations

enum unibreak

Detectable text elements.

Functions

unibreak boundary, const void *text, unisize text_len, uniattr text_attr, unisize *index)

Compute next boundary.

unibreak boundary, const void *text, unisize text_len, uniattr text_attr, unisize *index)

Compute preceding boundary.
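The index parameter in the signatures above is a pointer, which suggests an in/out cursor that each call repositions. The following is a minimal sketch of that pattern for stepping backward, under stated assumptions: `toy_prev_code_point` is a hypothetical helper, not a Unicorn function, and it moves to the preceding UTF-8 code point rather than to a grapheme, word, or sentence boundary. It also assumes well-formed UTF-8 input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, NOT part of Unicorn. Moves *index back to the first
 * byte of the preceding UTF-8 code point. Assumes well-formed UTF-8.
 * Returns false (leaving *index unchanged) when *index is already 0. */
bool toy_prev_code_point(const unsigned char *text, size_t len, size_t *index)
{
    assert(text != NULL && index != NULL && *index <= len);

    size_t i = *index;
    if (i == 0) {
        return false;
    }

    /* Step back one byte, then keep going while the byte is a UTF-8
     * continuation byte (bit pattern 10xxxxxx). */
    do {
        i--;
    } while (i > 0 && (text[i] & 0xC0) == 0x80);

    *index = i;
    return true;
}
```

Starting with the index equal to the text length and calling the helper until it returns false visits every code point start from last to first. A real segmenter applies richer rules at each step, but the cursor handling has the same shape.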

Discussion

A string of Unicode-encoded text often needs to be broken into text elements programmatically. Common examples of text elements include user-perceived characters, words, and sentences. The point where one text element ends and the next begins is called a boundary, and the process of determining boundaries is called segmentation.
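To make the terms concrete, here is a deliberately simplified sketch of segmentation. It treats a boundary as any transition between a run of white space and a run of other bytes. This is not the algorithm Unicorn implements, and `toy_next_boundary` and `toy_count_segments` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, NOT part of Unicorn. Advances *index to the next
 * boundary, where a boundary is a transition between white space and
 * non-white-space bytes, or the end of the text.
 * Returns false when *index is already at the end of the text. */
bool toy_next_boundary(const char *text, size_t len, size_t *index)
{
    assert(text != NULL && index != NULL);

    size_t i = *index;
    if (i >= len) {
        return false;
    }

    /* Remember which kind of run we start in, then skip to its end. */
    bool in_space = isspace((unsigned char)text[i]) != 0;
    do {
        i++;
    } while (i < len && (isspace((unsigned char)text[i]) != 0) == in_space);

    *index = i;
    return true;
}

/* Counts the segments found by walking from boundary to boundary. */
size_t toy_count_segments(const char *text, size_t len)
{
    size_t index = 0;
    size_t count = 0;
    while (toy_next_boundary(text, len, &index)) {
        count++;
    }
    return count;
}
```

For the 13-byte text "the quick fox", the first call moves the index from 0 to 3, and the full walk yields five segments: three words and the two spaces between them.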

Unicorn supports grapheme, word, and sentence segmentation. Each of these text elements is identified by a constant of the unibreak enumeration. The documentation associated with each constant defines how to enable it in the JSON configuration.

The algorithms for word and sentence segmentation are intended for languages that use white space to delimit words. Thai, Lao, Khmer, Myanmar, and languages written in ideographic scripts, such as Japanese and Chinese, do not typically use spaces between words and require language-specific break rules. Unicorn is not an internationalization library and therefore does not include rules specific to these languages.
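The limitation can be illustrated with a small sketch. It assumes the crudest possible rule, splitting on the ASCII space character only, which is far simpler than the rules Unicorn applies. The function name `toy_count_space_delimited` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, NOT part of Unicorn. Counts runs of bytes that are
 * separated by the ASCII space character (0x20). */
size_t toy_count_space_delimited(const char *text, size_t len)
{
    assert(text != NULL);

    size_t count = 0;
    int in_word = 0;
    for (size_t i = 0; i < len; i++) {
        if (text[i] == ' ') {
            in_word = 0;          /* a space ends the current run */
        } else if (!in_word) {
            in_word = 1;          /* first byte of a new run */
            count++;
        }
    }
    return count;
}
```

With this rule, the 11-byte text "hello world" yields two runs, while the 6-byte UTF-8 encoding of a two-character Chinese word such as "\xe4\xb8\x96\xe7\x95\x8c" yields a single run no matter how many words it contains. Finding word breaks in such text needs dictionary-based or other language-specific rules.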