PyCantonese: Cantonese Linguistics and NLP in Python

PyCantonese is a Python library for Cantonese linguistics and natural language processing (NLP). Currently implemented features (more to come!):

  • Accessing and searching corpus data

  • Parsing and conversion tools for Jyutping romanization

  • Stop words

  • Word segmentation

  • Part-of-speech tagging

Quick Examples

With PyCantonese imported:

>>> import pycantonese as pc
  1. Word segmentation

>>> pc.segment("廣東話好難學?")  # Is Cantonese difficult to learn?
['廣東話', '好', '難', '學', '?']
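
For readers curious how dictionary-based segmentation can produce output like the above, here is a minimal sketch of greedy longest-match segmentation. It is an illustration only and is not taken from PyCantonese's source: the function name `longest_match_segment`, the `max_word_length` parameter, and the one-word toy lexicon are all assumptions made for this example.

```python
def longest_match_segment(text, lexicon, max_word_length=5):
    """Greedy left-to-right segmentation against a word list (illustrative)."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for length in range(min(max_word_length, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy lexicon; a real one would hold many thousands of words.
toy_lexicon = {"廣東話"}
segmented = longest_match_segment("廣東話好難學?", toy_lexicon)
print(segmented)
```

With this toy lexicon, only 廣東話 is kept as a multi-character word; every other character falls back to a one-character segment.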
  2. Conversion from Cantonese characters to Jyutping

>>> pc.characters_to_jyutping('香港人講廣東話')  # Hongkongers speak Cantonese
[('香港人', 'hoeng1gong2jan4'), ('講', 'gong2'), ('廣東話', 'gwong2dung1waa2')]
  3. Finding all verbs in the HKCanCor corpus

    In this example, we use the regular expression '^V' to find all words whose part-of-speech tag begins with “V” in the original HKCanCor annotations:

>>> corpus = pc.hkcancor()  # get HKCanCor
>>> all_verbs = corpus.search(pos='^V')
>>> len(all_verbs)  # number of all verbs
29012
>>> from pprint import pprint
>>> pprint(all_verbs[:10])  # print 10 results
[('去', 'V', 'heoi3', ''),
 ('去', 'V', 'heoi3', ''),
 ('旅行', 'VN', 'leoi5hang4', ''),
 ('有冇', 'V1', 'jau5mou5', ''),
 ('要', 'VU', 'jiu3', ''),
 ('有得', 'VU', 'jau5dak1', ''),
 ('冇得', 'VU', 'mou5dak1', ''),
 ('去', 'V', 'heoi3', ''),
 ('係', 'V', 'hai6', ''),
 ('係', 'V', 'hai6', '')]
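
To make the role of the '^V' pattern concrete, the sketch below filters a small list of tagged words with Python's standard `re` module. It is an assumption-laden illustration, not the library's search code: `search_by_pos` is a hypothetical helper, the verb rows are copied from the output above, and the two non-verb rows (tags 'N' and 'A') are made up for contrast.

```python
import re


def search_by_pos(tagged_words, pos_pattern):
    """Keep entries whose tag (second field) matches the regular expression."""
    pattern = re.compile(pos_pattern)
    # re.search scans the tag, so '^V' anchors the match at the tag's start.
    return [entry for entry in tagged_words if pattern.search(entry[1])]


sample = [
    ('去', 'V', 'heoi3', ''),
    ('旅行', 'VN', 'leoi5hang4', ''),
    ('廣東話', 'N', 'gwong2dung1waa2', ''),  # hypothetical row and tag
    ('要', 'VU', 'jiu3', ''),
    ('好', 'A', 'hou2', ''),  # hypothetical row and tag
]

verbs = search_by_pos(sample, '^V')
print(len(verbs))
```

Because the pattern is anchored with '^', tags such as 'V', 'VN', and 'VU' match, while a tag that merely contains a "V" later in the string would not.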
  4. Parsing Jyutping for (onset, nucleus, coda, tone)

>>> pc.parse_jyutping('gwong2dung1waa2')  # 廣東話
[('gw', 'o', 'ng', '2'), ('d', 'u', 'ng', '1'), ('w', 'aa', '', '2')]
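
As a rough illustration of what splitting a syllable into (onset, nucleus, coda, tone) involves, here is a regex-based sketch. It is a simplified assumption rather than PyCantonese's parser: the helper name `parse_jyutping_sketch` is hypothetical, and syllabic nasals such as "m4" and "ng5" are deliberately not handled.

```python
import re

# Simplified Jyutping syllable: optional onset, nucleus, optional coda, tone.
SYLLABLE = re.compile(
    r"(gw|kw|ng|[bpmfdtnlgkhwzcsj])?"  # onset (two-letter onsets tried first)
    r"(aa|oe|eo|yu|[aeiou])"           # nucleus
    r"(ng|[ptkmniu])?"                 # coda
    r"([1-6])"                         # tone
)


def parse_jyutping_sketch(jyutping):
    """Split a Jyutping string into (onset, nucleus, coda, tone) tuples."""
    # Each syllable is a run of letters followed by one tone digit.
    syllables = re.findall(r"[a-z]+[1-6]", jyutping)
    if "".join(syllables) != jyutping:
        raise ValueError(f"cannot split into syllables: {jyutping!r}")
    parsed = []
    for syllable in syllables:
        match = SYLLABLE.fullmatch(syllable)
        if match is None:
            raise ValueError(f"unsupported syllable: {syllable!r}")
        onset, nucleus, coda, tone = match.groups()
        # Missing onset or coda is reported as an empty string.
        parsed.append((onset or "", nucleus, coda or "", tone))
    return parsed


result = parse_jyutping_sketch("gwong2dung1waa2")
print(result)
```

Listing the two-letter onsets and codas before the single-letter character classes keeps "gw" and "ng" from being split into separate letters.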

Download and Install

PyCantonese requires Python 3.6 or above. To download and install the most recent stable version:

$ pip install --upgrade pycantonese

To test your installation in the Python interpreter:

>>> import pycantonese as pc
>>> pc.__version__  # show version number
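
If the import fails, a quick environment check can help narrow down the cause. The following sketch uses only the standard library; the helper name `check_environment` is an assumption for this example, and the minimum version is taken from the requirement stated above.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 6)  # minimum version stated in this section


def check_environment(package_name="pycantonese"):
    """Return (python_ok, package_found) without importing the package."""
    python_ok = sys.version_info >= MIN_PYTHON
    # find_spec returns None when a top-level package cannot be located.
    package_found = importlib.util.find_spec(package_name) is not None
    return python_ok, package_found


python_ok, package_found = check_environment()
print("Python version OK:", python_ok)
print("pycantonese found:", package_found)
```

If the second value is False, the package is not visible to the interpreter that ran the check, which often means pip installed it into a different Python environment.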

How to Cite

PyCantonese is authored and maintained by Jackson L. Lee.

A talk introducing PyCantonese:

Lee, Jackson L. 2015. PyCantonese: Cantonese linguistic research in the age of big data. Talk at the Childhood Bilingualism Research Centre, Chinese University of Hong Kong. September 15, 2015. Notes+slides

License

MIT License. Please see LICENSE.txt in the GitHub source code for details.

The HKCanCor dataset included in PyCantonese is substantially modified from its source in terms of format. The original dataset has a CC BY license. Please see pycantonese/data/hkcancor/README.md in the GitHub source code for details.

The rime-cantonese data (release 2020.09.09) is incorporated into PyCantonese for word segmentation and characters-to-Jyutping conversion. This data has a CC BY 4.0 license. Please see pycantonese/data/rime_cantonese/README.md in the GitHub source code for details.

Acknowledgments

Individuals who have contributed feedback, bug reports, etc. (in alphabetical order of last names if known):

  • @cathug

  • Litong Chen

  • @g-traveller

  • Rachel Han

  • Ryan Lai

  • Charles Lam

  • Hill Ma

  • @richielo

  • @rylanchiu

  • Stephan Stiller

  • Tsz-Him Tsui

Logo design by albino.snowman (Instagram handle).