pycantonese.segment
- pycantonese.segment(unsegmented: str, cls: Optional[pycantonese.word_segmentation.Segmenter] = None) -> List[str]
Segment the unsegmented input.
The word segmentation model uses a longest string matching approach, trained on (i) the HKCanCor corpus included in this library and (ii) the rime-cantonese data. By default, the segmented sentence contains no word longer than five characters.
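As a rough illustration of longest string matching in general, the sketch below scans the input left to right and, at each position, takes the longest candidate (up to a maximum length) found in a known vocabulary, falling back to a single character. This is not PyCantonese's actual implementation: the function name `longest_match_segment` and the tiny `toy_vocab` set are hypothetical and exist only for this example, whereas the real model derives its vocabulary from the training data described above.

```python
from typing import List, Set


def longest_match_segment(
    text: str, vocab: Set[str], max_word_length: int = 5
) -> List[str]:
    """Greedy left-to-right longest string matching (illustrative sketch only)."""
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a known word is found.
        for length in range(min(max_word_length, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback.
            if length == 1 or candidate in vocab:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy vocabulary; not the vocabulary PyCantonese actually uses.
toy_vocab = {"廣東話", "容易"}
print(longest_match_segment("廣東話容唔容易學?", toy_vocab))
```

In this toy setting, adding "容唔容易" to the vocabulary makes the matcher keep that string as a single word, which is similar in spirit to passing allowed words to a custom Segmenter.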
- Parameters
- unsegmented: str
Unsegmented input.
- cls: Segmenter, optional
A custom Segmenter instance for setting the maximum word length and the words to allow or disallow. If not provided, a default segmenter is used, with a maximum word length of 5.
- Returns
- List[str]
Examples
>>> from pycantonese import segment
>>> segment("廣東話容唔容易學?")  # "Is Cantonese easy to learn?"
['廣東話', '容', '唔', '容易', '學', '?']
>>>
>>> # Customizing the segmentation behavior.
>>> from pycantonese.word_segmentation import Segmenter
>>> segmenter = Segmenter(allow={"容唔容易"})
>>> segment("廣東話容唔容易學?", cls=segmenter)
['廣東話', '容唔容易', '學', '?']