pycantonese.segment

pycantonese.segment(unsegmented, cls=None)

Segment the unsegmented input.

Word segmentation uses a longest string matching approach, trained on (i) the HKCanCor corpus included in this library and (ii) the rime-cantonese data. With the default segmenter, the segmented output contains no words longer than five characters.

New in version 2.4.0.

Changed in version 3.0.0: Added the keyword argument cls to allow a customized segmenter.

Parameters
unsegmented: str

Unsegmented input.

cls: Segmenter, optional

A custom Segmenter object that sets the maximum word length and the words to allow or disallow. If not provided, a default segmenter is used, with a maximum word length of 5.

Returns
list[str]

Examples

>>> from pycantonese import segment
>>> segment("廣東話容唔容易學?")  # "Is Cantonese easy to learn?"
['廣東話', '容', '唔容易', '學', '?']
>>>
>>> # Customizing the segmentation behavior.
>>> from pycantonese.word_segmentation import Segmenter
>>> segmenter = Segmenter(allow={"容唔容易"})
>>> segment("廣東話容唔容易學?", cls=segmenter)
['廣東話', '容唔容易', '學', '?']
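
The longest string matching approach described above can be illustrated with a minimal sketch. This is not the library's implementation: `longest_match_segment` and `toy_vocab` are hypothetical names, and the toy vocabulary stands in for the word list derived from HKCanCor and rime-cantonese. The sketch assumes a greedy left-to-right scan that falls back to a single character when no known word matches.

```python
def longest_match_segment(text, vocab, max_word_length=5):
    """Greedy left-to-right longest string matching (illustrative only)."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink until a known word is found.
        for length in range(min(max_word_length, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # Assumption: a single character is always accepted as a fallback.
            if length == 1 or candidate in vocab:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy vocabulary, not the library's actual word list.
toy_vocab = {"廣東話", "唔容易", "容易"}

print(longest_match_segment("廣東話容唔容易學?", toy_vocab))
# ['廣東話', '容', '唔容易', '學', '?']

# Adding a longer word to the vocabulary lets it win the match.
print(longest_match_segment("廣東話容唔容易學?", toy_vocab | {"容唔容易"}))
# ['廣東話', '容唔容易', '學', '?']
```

In this sketch, adding a word to the vocabulary plays the role of `allow`, and lowering `max_word_length` caps how long any matched word can be.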