PyCantonese: Cantonese Linguistics and NLP in Python
PyCantonese is a Python library for Cantonese linguistics and natural language processing (NLP). Currently implemented features (more to come!):
Accessing and searching corpus data
Parsing and conversion tools for Jyutping romanization
Parsing Cantonese text
Stop words
Word segmentation
Part-of-speech tagging
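As a rough illustration of the features listed above, the sketch below calls a few of the library's top-level functions (`segment`, `pos_tag`, `characters_to_jyutping`, `parse_jyutping`, `stop_words`). The sample sentence is an arbitrary choice, and the helper name `feature_tour` is hypothetical rather than part of PyCantonese. Exact return values depend on the installed version, so no expected output is shown. The helper returns `None` instead of failing when the library is not installed.

```python
import importlib.util


def feature_tour(text="香港人講廣東話"):
    """Hypothetical helper: try a few PyCantonese features on `text`.

    Returns None if pycantonese is not installed.
    """
    if importlib.util.find_spec("pycantonese") is None:
        return None

    import pycantonese

    words = pycantonese.segment(text)  # word segmentation
    return {
        "words": words,
        # part-of-speech tagging of the segmented words
        "pos_tags": pycantonese.pos_tag(words),
        # Chinese characters to Jyutping romanization
        "jyutping": pycantonese.characters_to_jyutping(text),
        # parsing a Jyutping string into its syllables
        "parsed": pycantonese.parse_jyutping("gwong2dung1waa2"),
        # size of the built-in stop word set
        "n_stop_words": len(pycantonese.stop_words()),
    }


tour = feature_tour()
if tour is None:
    print("pycantonese is not installed")
else:
    for name, value in tour.items():
        print(name, value)
```

Corpus access (for example through `pycantonese.hkcancor()`) is left out here to keep the sketch short.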
Download and Install
To download and install the most recent stable version:
$ pip install --upgrade pycantonese
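One possible way to confirm that the install worked is to ask Python's standard `importlib.metadata` module which version it can see. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package="pycantonese"):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(version or "pycantonese is not installed")
```

If this prints a version number, the package is visible to the Python interpreter that ran the snippet.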
Ready for more? Check out the Quickstart page.
Consulting
If your team would like professional assistance in using PyCantonese, freelance consulting and training services are available for both academic and commercial groups. Please email Jackson L. Lee.
Support
If you have found PyCantonese useful and would like to offer support, buying me a coffee would go a long way!
Links
Source code: https://github.com/jacksonllee/pycantonese
Bug tracker: https://github.com/jacksonllee/pycantonese/issues
How to Cite
PyCantonese is authored and maintained by Jackson L. Lee.
Lee, Jackson L., Litong Chen, Charles Lam, Chaak Ming Lau, and Tsz-Him Tsui. 2022. PyCantonese: Cantonese Linguistics and NLP in Python. Proceedings of the 13th Language Resources and Evaluation Conference.
@inproceedings{lee-etal-2022-pycantonese,
    title = "PyCantonese: Cantonese Linguistics and NLP in Python",
    author = "Lee, Jackson L. and
      Chen, Litong and
      Lam, Charles and
      Lau, Chaak Ming and
      Tsui, Tsz-Him",
    booktitle = "Proceedings of The 13th Language Resources and Evaluation Conference",
    month = june,
    year = "2022",
    publisher = "European Language Resources Association",
    language = "English",
}
License
MIT License. Please see LICENSE.txt in the GitHub source code for details.
The HKCanCor dataset included in PyCantonese is substantially modified from its source in terms of format. The original dataset has a CC BY license. Please see pycantonese/data/hkcancor/README.md in the GitHub source code for details.
The rime-cantonese data (release 2021.05.16) is incorporated into PyCantonese for word segmentation and characters-to-Jyutping conversion. This data has a CC BY 4.0 license. Please see pycantonese/data/rime_cantonese/README.md in the GitHub source code for details.
Logo
The PyCantonese logo is the Chinese character 粵 meaning Cantonese, with artistic design by albino.snowman (Instagram handle).
Acknowledgments
Wonderful resources with a permissive license that have been incorporated into PyCantonese:
HKCanCor
rime-cantonese
Individuals who have contributed feedback, bug reports, etc. (in alphabetical order of last names):
@cathug
Jenny Chim
@g-traveller
Rachel Han
Ryan Lai
Hill Ma
@richielo
@rylanchiu
Stephan Stiller
Robin Yuen
Table of Contents
- Quickstart
- Corpus Data
- Corpus Reader Methods
- Corpus Search Queries
- Parsing Cantonese Text
- Jyutping Romanization
- Stop Words
- Word Segmentation
- Part-of-Speech Tagging
- API Reference
- Changelog
  - [Unreleased]
  - [3.4.0] - 2021-12-28
  - [3.3.1] - 2021-05-14
  - [3.3.0] - 2021-05-14
  - [3.2.4] - 2021-05-07
  - [3.2.3] - 2021-04-12
  - [3.2.2] - 2021-03-23
  - [3.2.1] - 2021-03-21
  - [3.2.0] - 2021-03-20
  - [3.1.1] - 2021-03-18
  - [3.1.0] - 2021-02-21
  - [3.0.0] - 2020-10-25
  - [2.4.1] - 2020-10-10
  - [2.4.0] - 2020-10-10
  - [2.3.0] - 2020-07-24
  - [2.2.0] - 2018-06-30
  - [2.1.0] - 2018-06-11
  - [2.0.0] - 2016-02-06
  - [1.0] - 2015-09-06
  - [1.0dev] - 2015-09-02
  - [0.2.1] - 2015-01-25
  - [0.2] - 2015-01-22
  - [0.1] - 2014-12-17
- Archives