pycantonese.segment
pycantonese.segment(unsegmented, cls=None)
Segment the unsegmented input.
The word segmentation model uses the longest string matching approach and is trained on (i) the HKCanCor corpus included in this library and (ii) the rime-cantonese data. With the default segmenter, the segmented sentence does not contain words longer than five characters.
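The longest string matching idea can be illustrated with a minimal sketch. This is not the library's implementation: the function name `longest_match_segment` and the two-word `toy_vocabulary` are hypothetical, made up for illustration, whereas the real model draws its vocabulary from the training data named above. The sketch assumes a greedy left-to-right scan that falls back to a single character when no vocabulary word matches.

```python
def longest_match_segment(text, vocabulary, max_word_length=5):
    """Greedy left-to-right longest string matching over a fixed vocabulary."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for length in range(min(max_word_length, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as the fallback.
            if length == 1 or candidate in vocabulary:
                words.append(candidate)
                i += length
                break
    return words


# Toy vocabulary, hypothetical and for illustration only
# (not the library's trained data).
toy_vocabulary = {"廣東話", "容易"}

print(longest_match_segment("廣東話容唔容易學?", toy_vocabulary))
# ['廣東話', '容', '唔', '容易', '學', '?']
```

In this sketch, adding "容唔容易" to the vocabulary makes the scan pick it up as one four-character word, which loosely mirrors what allowing a word in a custom segmenter is meant to achieve.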
New in version 2.4.0.
Changed in version 3.0.0: Added the keyword argument cls to allow a customized segmenter.
- Parameters
- unsegmented: str
Unsegmented input.
- cls: Segmenter, optional
A custom Segmenter instance for setting the maximum word length (default: 5) and the words to allow or disallow. If not provided, a default segmenter with a maximum word length of 5 is used.
- Returns
- list[str]
Examples
>>> from pycantonese import segment
>>> segment("廣東話容唔容易學?")  # "Is Cantonese easy to learn?"
['廣東話', '容', '唔', '容易', '學', '?']
>>>
>>> # Customizing the segmentation behavior.
>>> from pycantonese.word_segmentation import Segmenter
>>> segmenter = Segmenter(allow={"容唔容易"})
>>> segment("廣東話容唔容易學?", cls=segmenter)
['廣東話', '容唔容易', '學', '?']