pycantonese.segment

pycantonese.segment(unsegmented: str, cls: Optional[pycantonese.word_segmentation.Segmenter] = None) -> List[str]

Segment the unsegmented input.

Word segmentation uses longest string matching, with a model trained on (i) the HKCanCor corpus included in this library and (ii) the rime-cantonese data. With the default segmenter, the segmented output contains no word longer than five characters.
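To make the longest string matching idea concrete, here is a minimal sketch of a greedy, left-to-right matcher. It is not the library's implementation: the function name `longest_match_segment`, the `toy_vocab` set, and the single-character fallback are all assumptions chosen for illustration, and the real model derives its vocabulary from the training data rather than from a hand-written set.

```python
from typing import List, Set


def longest_match_segment(
    text: str, vocab: Set[str], max_word_length: int = 5
) -> List[str]:
    """Greedy left-to-right longest string matching (illustrative only)."""
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink until a known word is found.
        for length in range(min(max_word_length, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # Assumption: an unknown single character is emitted as its own word.
            if length == 1 or candidate in vocab:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy vocabulary; the real model learns its words from corpora.
toy_vocab = {"廣東話", "容易"}
result = longest_match_segment("廣東話容唔容易學?", toy_vocab)
print(result)
```

In this sketch, adding "容唔容易" to the vocabulary makes the matcher prefer that four-character word over the shorter pieces, which is the kind of effect an allow-list is meant to have.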

Parameters
unsegmented: str

Unsegmented input.

cls: Segmenter, optional

A custom Segmenter instance that sets the maximum word length and the words to allow or disallow. If not provided, a default segmenter with a maximum word length of 5 is used.

Returns
List[str]

Examples

>>> segment("廣東話容唔容易學?")  # "Is Cantonese easy to learn?"
['廣東話', '容', '唔', '容易', '學', '?']
>>>
>>> # Customizing the segmentation behavior.
>>> from pycantonese.word_segmentation import Segmenter
>>> segmenter = Segmenter(allow={"容唔容易"})
>>> segment("廣東話容唔容易學?", cls=segmenter)
['廣東話', '容唔容易', '學', '?']