Constant

UNI_WORD

Word breaks.

Since v1.0
enum unibreak {
    UNI_WORD,
}

Discussion

Word boundaries are used in a number of different contexts. The most familiar ones are selection (double-click mouse selection or “move to next word” control-arrow keys) and “whole word search” for search and replace. They are also used in database queries and regular expressions, to determine whether elements are within a certain number of words of one another.

The default algorithm for detecting word boundaries is primarily intended for languages that use white space to delimit words. Some scripts, such as Thai, Lao, Khmer, and Myanmar, do not use spaces between words, and ideographic scripts such as Japanese and Chinese are even more complex. It is therefore not possible to provide a single default algorithm that correctly detects word boundaries across all languages. Languages like these require special handling by a more sophisticated word break detection algorithm that understands the rules of the language.
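To make the limitation concrete, the following is a minimal sketch of a purely whitespace-driven segmenter. It is not the library's algorithm and not the Unicode default rules. The function names `naive_next_word_break` and `naive_count_words` are hypothetical, and the code assumes byte-oriented text in the default "C" locale.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of the library.
   Returns the index of the next boundary after `index`, where a boundary is
   any switch between whitespace and non-whitespace bytes (or end of text). */
size_t naive_next_word_break(const char *text, size_t len, size_t index)
{
    if (index >= len)
        return len;

    /* Classify the run that starts at `index`. */
    int in_space = isspace((unsigned char)text[index]) != 0;

    /* Advance until the classification changes or the text ends. */
    size_t i = index + 1;
    while (i < len && (isspace((unsigned char)text[i]) != 0) == in_space)
        i++;
    return i;
}

/* Hypothetical helper: counts the segments that hold non-whitespace bytes. */
size_t naive_count_words(const char *text, size_t len)
{
    size_t count = 0;
    size_t index = 0;
    while (index < len) {
        int is_word = !isspace((unsigned char)text[index]);
        index = naive_next_word_break(text, len, index);
        if (is_word)
            count++;
    }
    return count;
}
```

For space-delimited input such as "hello big world" this reports three words. Any run of text without spaces, including a multi-word Thai phrase, comes back as a single segment, which is why such scripts need language-aware handling.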

Support for word break detection can be enabled in the JSON configuration file as shown below:

{
    "algorithms": {
        "segmentation": [
            "word"
        ]
    }
}