pycantonese.segment

pycantonese.segment(unsegmented: str, cls: Optional[pycantonese.word_segmentation.Segmenter] = None) → List[str]

Segment the unsegmented input.

The word segmentation model uses a longest string matching approach, trained on (i) the HKCanCor corpus included in this library and (ii) the rime-cantonese data. With the default segmenter, the segmented output contains no word longer than five characters.
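To illustrate the general idea of longest string matching, here is a minimal, self-contained sketch. It is not the library's implementation: the function name `longest_match_segment`, the greedy left-to-right strategy, and the two-word `toy_vocab` are all assumptions made for illustration, standing in for the vocabulary that the real model derives from HKCanCor and rime-cantonese.

```python
from typing import List, Set


def longest_match_segment(
    text: str, vocab: Set[str], max_word_length: int = 5
) -> List[str]:
    """Greedy left-to-right longest string matching (illustrative sketch only)."""
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        longest = min(max_word_length, len(text) - i)
        for length in range(longest, 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted, so the scan always advances.
            if length == 1 or candidate in vocab:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy vocabulary, not the data shipped with pycantonese.
toy_vocab = {"廣東話", "容易"}
result = longest_match_segment("廣東話容唔容易學？", toy_vocab)
print(result)
```

With this toy vocabulary the sketch yields the same split as the first example in the Examples section; adding "容唔容易" to the vocabulary makes the matcher prefer that longer word, which is roughly the effect a custom Segmenter's allow set is described as having.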

Parameters

unsegmented: str: Unsegmented input.
cls: Segmenter, optional: A custom Segmenter instance that sets the maximum word length (default = 5) and the words to allow or disallow. If not provided, a default segmenter with a maximum word length of 5 is used.

Returns

List[str]

Examples

>>> segment("廣東話容唔容易學？")  # "Is Cantonese easy to learn?"
['廣東話', '容', '唔', '容易', '學', '？']
>>>
>>> # Customizing the segmentation behavior.
>>> from pycantonese.word_segmentation import Segmenter
>>> segmenter = Segmenter(allow={"容唔容易"})
>>> segment("廣東話容唔容易學？", cls=segmenter)
['廣東話', '容唔容易', '學', '？']