Segmentation

Text segmentation.

Enumerations

enum unibreak

Detectable text elements.

Functions

unibreak boundary, const void *text, unisize text_len, uniattr text_attr, unisize *index)

Compute next boundary.

unibreak boundary, const void *text, unisize text_len, uniattr text_attr, unisize *index)

Compute preceding boundary.

Discussion

A string of Unicode-encoded text often needs to be broken into text elements programmatically. Common examples of text elements include user-perceived characters, words, and sentences. The positions where these text elements begin and end are called boundaries, and the process of determining them is called segmentation.
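The idea of walking a string from one boundary to the next can be sketched in a few lines of C. This is a simplified illustration, not Unicorn's algorithm: the helper next_simple_boundary is hypothetical, handles ASCII only, and treats each maximal run of spaces or non-spaces as one text element. It borrows only the in/out index style of the functions listed above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (NOT part of Unicorn): advance *index to the next
 * boundary in ASCII text. A text element here is a maximal run of spaces
 * or a maximal run of non-space bytes. Returns false, leaving *index
 * unchanged, when *index is already at or past the end of the text. */
bool next_simple_boundary(const char *text, size_t text_len, size_t *index)
{
    size_t i = *index;
    if (i >= text_len) {
        return false;
    }

    /* Remember which kind of run starts at the current position. */
    bool in_space = (text[i] == ' ');

    /* Consume bytes until the kind of run changes or the text ends. */
    while (i < text_len && (text[i] == ' ') == in_space) {
        i++;
    }

    *index = i;
    return true;
}
```

Called repeatedly on "The quick fox" with the index starting at 0, this sketch stops at offsets 3, 4, 9, 10, and 13, then reports that no further boundary exists. Real segmentation rules are considerably more involved, since they must account for combining marks, punctuation, and script-specific behavior.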

The precise determination of text elements varies according to orthographic conventions for a given script or language. The goal of matching user perceptions cannot always be met exactly because the text alone does not always contain enough information to unambiguously decide boundaries. For example, the period character (U+002E FULL STOP) is used ambiguously, sometimes for end-of-sentence purposes, sometimes for abbreviations, and sometimes for numbers. In most cases, however, programmatic text boundaries can match user perceptions quite closely, although sometimes the best that can be done is to not surprise the user.
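The ambiguity of the period can be made concrete with a deliberately naive rule. The sketch below is an assumption-laden illustration, not how Unicorn decides sentence boundaries: the hypothetical function count_naive_sentences ends a sentence at every '.' that is followed by a space or by the end of the text.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical naive rule (NOT Unicorn's algorithm): count a sentence end
 * at every '.' that is followed by a space or by the end of the text. */
size_t count_naive_sentences(const char *text)
{
    size_t count = 0;
    size_t len = strlen(text);

    for (size_t i = 0; i < len; i++) {
        if (text[i] == '.' && (i + 1 == len || text[i + 1] == ' ')) {
            count++;
        }
    }
    return count;
}
```

For the input "Dr. Smith paid 3.5 dollars. He left." this rule reports three sentences where a reader sees two. The decimal point in "3.5" is skipped because a digit follows it, but the abbreviation "Dr." is indistinguishable from a sentence end under this rule. Telling them apart requires information the text alone does not carry, such as a list of known abbreviations.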

Unicorn supports grapheme, word, and sentence segmentation. These text elements are identified by the constants of the unibreak enumeration.

The algorithms for word and sentence segmentation are intended for languages that use white space to delimit words. Thai, Lao, Khmer, Myanmar, and ideographic scripts such as Japanese and Chinese do not typically use spaces between words and require language-specific break rules. Unicorn is not an internationalization library and therefore does not include rules specific to these languages.