PyCantonese: Cantonese linguistic research in the age of big data

Talk by Jackson Lee on 2015-09-15 at the Childhood Bilingualism Research Centre, the Chinese University of Hong Kong. [Slides] [Handout]

This page provides additional notes, some of which are in response to the questions and comments from the audience.

[Page created on 2015-09-22]


I mentioned in the talk that the CHAT transcription format as used in the CHILDES project should be preferred for various reasons. One of them is that there is an associated XML format for CHAT data files, and NLTK has the CHILDESCorpusReader class for reading CHILDES data in the XML format. I didn't have time during the talk to show how this works and how PyCantonese can interface with it to handle something Cantonese-specific. Here is a quick demonstration.

In [1]:
# Make sure you have NLTK installed.

import nltk
In [2]:
# The YipMatthews XML data are from here:
# Unzip it and put the entire "YipMatthews/" folder
# inside "corpora/childes/data-xml/Biling/"
# (usually in the "nltk_data/" folder on Linux or Mac OS)

# Let's say we are interested in Alicia's Cantonese data...

from nltk.corpus.reader import CHILDESCorpusReader
corpus_root = 'corpora/childes/data-xml/Biling/'
alicia_can = CHILDESCorpusReader(corpus_root, 'YipMatthews/Can/AliciaCan/.*.xml')
In [3]:
# Check out what the data in "alicia_can" look like.
# We print the first 5 utterances in two different ways
# (including part-of-speech tags in both cases):
# 1. Chinese characters
for sent in alicia_can.tagged_sents()[:5]:
    for x in sent:
        print(x, end=" ")
    print()  # one line per utterance
('你', 'pro') ('著', 'v') ('住', 'asp') ('條', 'cl') ('裙裙', 'n') ('呀', 'sfp') ('幾', 'adv') ('靚', 'adj') ('噃', 'sfp') 
('係', 'v') ('呀', 'sfp') 
('裙裙', 'n') ('邊個', 'wh') ('買', 'v') ('架', 'sfp') 
('hm4', 'co') ('裙裙', 'n') ('婆婆', 'n') 
('係', 'v') ('咪', 'n') ('婆婆', 'n') ('買', 'v') ('俾', 'prep') ('你', 'pro') ('架', 'sfp') 
In [4]:
# 2. Jyutping (note the "stem" parameter)
for sent in alicia_can.tagged_sents(stem=True)[:5]:
    for x in sent:
        print(x, end=" ")
    print()  # one line per utterance
('nei5', 'pro') ('zoek3', 'v') ('zyu6', 'asp') ('tiu4', 'cl') ('kwan4-DIM', 'n') ('aa3', 'sfp') ('gei2', 'adv') ('leng3', 'adj') ('bo3', 'sfp') 
('hai6', 'v') ('aa3', 'sfp') 
('kwan4-DIM', 'n') ('bin1go3', 'wh') ('maai5', 'v') ('gaa3', 'sfp') 
('hm4', 'co') ('kwan4-DIM', 'n') ('po4-DIM', 'n') 
('hai6', 'v') ('maik1', 'n') ('po4-DIM', 'n') ('maai5', 'v') ('bei2', 'prep') ('nei5', 'pro') ('gaa3', 'sfp') 
In [5]:
# Now that we can access the Jyutping romanizations of the individual words,
# we can use the "jyutping" function in PyCantonese to parse Jyutping
# (if you are interested in phonological development in Cantonese-English
# bilinguals, for example).

import pycantonese as pc

sent = alicia_can.tagged_sents(stem=True)[0]  # just use the first sentence for demo
for word, tag in sent:
    if "-" in word:
        # get rid of things like "-DIM" as seen above
        word = word.split("-")[0]
    print(pc.jyutping(word))
[('n', 'e', 'i', '5')]
[('z', 'oe', 'k', '3')]
[('z', 'yu', '', '6')]
[('t', 'i', 'u', '4')]
[('kw', 'a', 'n', '4')]
[('', 'aa', '', '3')]
[('g', 'e', 'i', '2')]
[('l', 'e', 'ng', '3')]
[('b', 'o', '', '3')]

Bug fix regarding the Yale romanization of Jyutping "eu"

A bug regarding the Yale romanization of Jyutping "eu" (as in deu6 'throw') was identified (thanks to Stephan Stiller). It has now been fixed in the current development version, 1.1-alpha.1, which can be obtained from the GitHub source. (Fixing this bug led to the discovery of another bug, which has also been fixed.) Example:

In [6]:
import pycantonese as pc
pc.__version__ # make sure you are using version 1.1-alpha.1 (or any future versions) for this fix
In [7]:
pc.yale("deu6laap6saap3") # 'throw-away trash'
['dewh', 'laahp', 'saap']

The next stable release (version 1.1), together with other new features (and bug fixes, if any), is scheduled for late December 2015.

How is X transcribed or annotated in a given corpus? Can we make changes if the way X is handled is undesirable for some reason?

The way something is handled and presented in a particular corpus is entirely up to the creator of the corpus. As an interface tool, PyCantonese aims to provide flexible means of accessing Cantonese corpus data, not to modify the original data in an unrecoverable way. In other words, an important principle is that PyCantonese must be able to present the original corpus data as intended by the corpus creator, without making implicit judgments or changes.

That said, PyCantonese is also an annotation tool. If a corpus is publicly available and incorporated into PyCantonese (currently, there's only one: HKCanCor), then it is entirely possible to add additional tiers depending on your research needs and interests.

Both hard-wired and dynamic annotations of corpus data through PyCantonese are anticipated; some of the ongoing work was mentioned in the talk. An example of hard-wired annotations is an additional tier of part-of-speech tags in terms of the Google universal tagset by Petrov et al. 2011; this would facilitate work that involves multiple corpora. As for dynamic annotations, these are classifiers that add annotations based on training data. On the to-do list is conversion between Chinese characters and Jyutping (e.g., if you have a Cantonese corpus transcribed in Chinese characters only, you may want to add a tier of Jyutping romanization -- a non-trivial task!).
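To see why character-to-Jyutping conversion is non-trivial, consider a toy sketch (the mini-dictionary and readings below are illustrative assumptions, not part of PyCantonese): many characters have more than one reading, so a naive per-character lookup is inherently ambiguous.

```python
# Toy illustration of why character-to-Jyutping conversion is non-trivial:
# a single character can have multiple readings, so naive lookup cannot
# decide without context. (Mini-dictionary is illustrative only.)

TOY_DICT = {
    "行": ["hang4", "hong4"],  # 'walk' vs. 'row/firm' -- context-dependent
    "你": ["nei5"],
    "好": ["hou2"],
}

def naive_jyutping(chars):
    """Return all candidate romanizations for each character."""
    return [TOY_DICT.get(c, ["?"]) for c in chars]

print(naive_jyutping("你行"))  # second position has two candidates
```

A real converter would need context (e.g., word segmentation or a statistical model) to pick among the candidates, which is exactly what makes this a classifier-style, dynamic-annotation task.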

How can one search for X?

Whether a search for X can be done is highly dependent on whether X is already annotated in some way in the corpus dataset.

If X is directly annotated in the corpus data, lucky you! Simply write the code needed to extract whatever you need. For HKCanCor incorporated into PyCantonese, what is directly annotated and searchable includes Jyutping, Chinese characters, and part-of-speech tags. Please see the examples here.
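Schematically, such a search amounts to filtering the annotated (word, tag) pairs. The sketch below uses a hand-made sample in the style of the tagged output above, not the actual PyCantonese API:

```python
# Sketch: searching tagged corpus data for all words carrying a given
# part-of-speech tag. The tagged words below are a hand-made sample
# following the tag labels seen earlier, not real corpus data.

tagged_words = [
    ("你", "pro"), ("著", "v"), ("住", "asp"),
    ("條", "cl"), ("裙裙", "n"), ("呀", "sfp"),
]

def search_by_tag(tagged, target_tag):
    """Return all words whose part-of-speech tag equals target_tag."""
    return [word for word, tag in tagged if tag == target_tag]

print(search_by_tag(tagged_words, "n"))  # all nouns in the sample
```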

But what if what you want to search for is not directly annotated in the corpus data?

The answer depends on whether what you want is indirectly encoded in the dataset in some way. If so, then it is a matter of coming up with the appropriate code to get whatever you want. An example from the talk is on extracting verb+noun word pairs.
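The verb+noun case can be sketched as a scan over adjacent (word, tag) pairs. The tagged sentences below are hand-made samples (tag labels follow the convention seen in the output above), not actual corpus data:

```python
# Sketch: extracting verb+noun word pairs from tagged sentences by
# scanning adjacent (word, tag) pairs. Hand-made sample data.

sents = [
    [("邊個", "wh"), ("買", "v"), ("裙裙", "n")],
    [("婆婆", "n"), ("買", "v"), ("嘢", "n"), ("呀", "sfp")],
]

def verb_noun_pairs(tagged_sents):
    """Collect (verb, noun) pairs where a noun immediately follows a verb."""
    pairs = []
    for sent in tagged_sents:
        for (w1, t1), (w2, t2) in zip(sent, sent[1:]):
            if t1 == "v" and t2 == "n":
                pairs.append((w1, w2))
    return pairs

print(verb_noun_pairs(sents))
```

A real study would of course refine this (e.g., allowing intervening aspect markers or classifiers between the verb and the noun), but the basic idea is the same: the pairs are only indirectly encoded, and a few lines of code recover them.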

The most challenging situation is when what you want is not even indirectly available from the corpus data. For instance, HKCanCor currently does not come with phrase- or clause-level parsing, which means there is no obvious and straightforward way to extract NPs, VPs, subjects, or any other constituents and structures above the word level.

The last paragraph is meant to encourage rather than disappoint: there are lots of opportunities for strongly empirically based Cantonese linguistic research!

Technical support etc.

The full documentation of PyCantonese provides details of functions and features in the latest stable release.

If you have any questions/comments/bug reports/feature requests, need help with anything, or would like to bring up anything regarding PyCantonese, please do not hesitate to contact Jackson Lee at

The technically minded are encouraged to use the GitHub page to raise issues etc.

Which Python programming environment is recommended? (= How is this page created?)

To write quick code snippets (like what I did during the talk to demonstrate how things work), the IPython Notebook is highly recommended. It is freely available, handles Unicode well (important for dealing with Chinese characters), works with Python 3.x (required by PyCantonese), and has lots of other handy features that make your life as a linguist/computer programmer easier. This page itself was written in an IPython Notebook; as you can see, code snippets and explanatory notes and comments can be interleaved in an organized way.