Quickstart

After you have downloaded and installed PyCantonese (see Download and Install), import the package pycantonese in your Python interpreter:

>>> import pycantonese

No errors? Great! Now you’re ready to proceed.

Caution

PyCantonese uses parallelized code to speed things up. For Windows users, you may need to put your code under the if __name__ == "__main__": idiom in a script to avoid running into an error. For reference, please see the “safe importing of main module” section for parallelization from the official Python documentation.

Word segmentation

>>> pycantonese.segment("廣東話好難學？")  # Is Cantonese difficult to learn?
['廣東話', '好', '難', '學', '？']

Conversion from Cantonese characters to Jyutping

>>> pycantonese.characters_to_jyutping('香港人講廣東話')  # Hongkongers speak Cantonese
[('香港人', 'hoeng1gong2jan4'), ('講', 'gong2'), ('廣東話', 'gwong2dung1waa2')]

Finding all verbs in the HKCanCor corpus

In this example, we search for the regular expression '^V' for all words whose part-of-speech tag begins with “V” in the original HKCanCor annotations:

>>> corpus = pycantonese.hkcancor() # get HKCanCor
>>> all_verbs = corpus.search(pos='^V')
>>> len(all_verbs)  # number of all verbs
29726
>>> all_verbs[:10]  # print 10 results
[Token(word='去', pos='V', jyutping='heoi3', mor=None, gloss=None, gra=None),
 Token(word='去', pos='V', jyutping='heoi3', mor=None, gloss=None, gra=None),
 Token(word='旅行', pos='VN', jyutping='leoi5hang4', mor=None, gloss=None, gra=None),
 Token(word='有冇', pos='V1', jyutping='jau5mou5', mor=None, gloss=None, gra=None),
 Token(word='要', pos='VU', jyutping='jiu3', mor=None, gloss=None, gra=None),
 Token(word='有得', pos='VU', jyutping='jau5dak1', mor=None, gloss=None, gra=None),
 Token(word='冇得', pos='VU', jyutping='mou5dak1', mor=None, gloss=None, gra=None),
 Token(word='去', pos='V', jyutping='heoi3', mor=None, gloss=None, gra=None),
 Token(word='係', pos='V', jyutping='hai6', mor=None, gloss=None, gra=None),
 Token(word='係', pos='V', jyutping='hai6', mor=None, gloss=None, gra=None)]

Parsing Jyutping for the onset, nucleus, coda, and tone

>>> pycantonese.parse_jyutping('gwong2dung1waa2')  # 廣東話
[Jyutping(onset='gw', nucleus='o', coda='ng', tone='2'),
 Jyutping(onset='d', nucleus='u', coda='ng', tone='1'),
 Jyutping(onset='w', nucleus='aa', coda='', tone='2')]

Looking for more functionality? Please use the documentation menu on the left.