Corpus Reader Methods

After you have created a corpus reader (see Corpus Data), the headers, transcriptions, and annotations are all accessible through the methods of the corpus reader object.

Let’s say we have a corpus reader with the built-in HKCanCor data.

>>> import pycantonese
>>> corpus = pycantonese.hkcancor()

Headers

A CHAT data file typically has metadata at the top of the file, in lines that begin with @. The metadata includes the participants' demographics (age, gender, etc.), the date of recording, and the languages used in the data.

Specifically for HKCanCor, the participants in all the data files are anonymous. In PyCantonese's rendition of HKCanCor, their names are simply placeholders such as A, B, etc., and their corresponding three-letter codes are XXA, XXB, etc. In contrast, many CHILDES and TalkBank datasets have their participants identified. By convention, the target child's code is CHI, the mother's is MOT, and the father's is FAT.

Since PyCantonese uses PyLangAcq to parse CHAT data files, the way in which header information is accessed is identical between the two packages. Please see PyLangAcq’s documentation on headers.

To see how a header from HKCanCor translates to its representation in PyCantonese, here is the header from FC-001_v2.cha, the first (by filename) of the 58 CHAT files:

@UTF8
@Begin
@Languages: yue , eng
@Participants:      XXA A Adult , XXB B Adult
@ID:        yue , eng|HKCanCor|XXA|34;|female|||Adult||origin:HK|
@ID:        yue , eng|HKCanCor|XXB|37;|female|||Adult||origin:HK|
@Date:      30-APR-1997
@Tape Number:       001

In this example, the recording session was between two female Hong Kong speakers (ages 34 and 37), recorded on April 30th, 1997. The languages in this data file are Cantonese and English (in that order of usage frequency; the ordering in yue , eng is meaningful).

Through the corpus reader object corpus we've just created, we can see the same information by calling the method headers() (which returns a list of dicts; [0] gets the first dict, corresponding to FC-001_v2.cha):

>>> corpus.headers()[0]
{'UTF8': '',
 'Languages': ['yue', 'eng'],
 'Participants': {'XXA': {'name': 'A',
                          'language': 'yue , eng',
                          'corpus': 'HKCanCor',
                          'age': '34;',
                          'sex': 'female',
                          'group': '',
                          'ses': '',
                          'role': 'Adult',
                          'education': '',
                          'custom': 'origin:HK'},
                  'XXB': {'name': 'B',
                          'language': 'yue , eng',
                          'corpus': 'HKCanCor',
                          'age': '37;',
                          'sex': 'female',
                          'group': '',
                          'ses': '',
                          'role': 'Adult',
                          'education': '',
                          'custom': 'origin:HK'}},
 'Date': {datetime.date(1997, 4, 30)},
 'Tape Number': '001'}
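
Since headers() returns ordinary Python dicts, individual fields can be pulled out with standard indexing. For example, here are speaker XXA's age and the file's languages, taken from the header shown above:

>>> header = corpus.headers()[0]
>>> header['Participants']['XXA']['age']
'34;'
>>> header['Languages']
['yue', 'eng']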

Here are the currently implemented methods for header information:

ages([participant, months])

Return the ages of the given participant in the data.

dates_of_recording([by_files])

Return the dates of recording.

headers()

Return the headers.

languages([by_files])

Return the languages in the data.

participants([by_files])

Return the participants (e.g., CHI, MOT).

Where appropriate, these header information methods, as well as the data access methods introduced below, accept optional arguments to (i) control the output data structure (by_utterances and by_files) and (ii) filter by participant (participants and exclude). The methods' hyperlinks point to more detailed documentation.
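
For instance, here is a minimal sketch of these optional arguments (outputs omitted; each method's documentation lists exactly which keyword arguments it accepts, and words() and tokens() are introduced below):

>>> langs_by_file = corpus.languages(by_files=True)     # languages reported per CHAT file
>>> xxa_words = corpus.words(participants='XXA')        # words from speaker XXA only
>>> tokens_by_utt = corpus.tokens(by_utterances=True)   # one list of tokens per utterance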

Transcriptions and Annotations

A PyCantonese corpus reader is an instance of the CHATReader class. While this class inherits its CHAT handling capabilities from the underlying PyLangAcq package, CHATReader adds functionality for Cantonese-specific elements, particularly Jyutping romanization and Chinese characters.

CHATReader has convenience methods to give you an overview of the data in the reader.

info([verbose])

Print a summary of this Reader's data.

head([n, participants, exclude])

Return the first several utterances.

tail([n, participants, exclude])

Return the last several utterances.

>>> corpus.info()
58 files
16162 utterances
153654 words
      Utterance Count    Word Count  File Path
--  -----------------  ------------  --------------
#1                245          1998  FC-001_v2.cha
#2                134          2581  FC-005a_v2.cha
#3                172          2721  FC-005b_v2.cha
#4                111          1208  FC-009b_v.cha
#5                139          1308  FC-011_v.cha
...
(set `verbose` to True for all the files)
>>> corpus.head()
*XXA:  喂      遲     啲     去       唔    去       旅行           啊     ?
%mor:  E|wai3  A|ci4  U|di1  V|heoi3  D|m4  V|heoi3  VN|leoi5hang4  Y|aa3  ?

*XXA:  你      老公         有冇         平       機票        啊     ?
%mor:  R|nei5  N|lou5gung1  V1|jau5mou5  A|peng4  N|gei1piu3  Y|aa3  ?

*XXB:  平       機票        要       淡季           先      有得         平       𡃉      喎     .
%mor:  A|peng4  N|gei1piu3  VU|jiu3  AN|daam6gwai3  D|sin1  VU|jau5dak1  A|peng4  Y|gaa3  Y|wo3  .

*XXB:  而家       旺       -   .
%mor:  T|ji4gaa1  A|wong6  -|  .

*XXA:  冇得         去       嗱      .
%mor:  VU|mou5dak1  V|heoi3  Y|laa4  .

Here are the major CHATReader methods for accessing data at different levels of the data structure:

words([participants, exclude, ...])

Return the words.

tokens([participants, exclude, ...])

Return the tokens.

utterances([participants, exclude, by_files])

Return the utterances.

Words are plain text strings. Think of tokens as words with annotations attached (part-of-speech tags, morphological information, etc.). An utterance is a list of tokens plus associated information (the participant who produced the utterance, time markers if there are associated audio-visual materials, etc.).

>>> corpus.words()[:10]
['喂', '遲', '啲', '去', '唔', '去', '旅行', '啊', '?', '你']
>>>
>>> corpus.tokens()[:10]
[Token(word='喂', pos='E', jyutping='wai3', mor=None, gloss=None, gra=None),
 Token(word='遲', pos='A', jyutping='ci4', mor=None, gloss=None, gra=None),
 Token(word='啲', pos='U', jyutping='di1', mor=None, gloss=None, gra=None),
 Token(word='去', pos='V', jyutping='heoi3', mor=None, gloss=None, gra=None),
 Token(word='唔', pos='D', jyutping='m4', mor=None, gloss=None, gra=None),
 Token(word='去', pos='V', jyutping='heoi3', mor=None, gloss=None, gra=None),
 Token(word='旅行', pos='VN', jyutping='leoi5hang4', mor=None, gloss=None, gra=None),
 Token(word='啊', pos='Y', jyutping='aa3', mor=None, gloss=None, gra=None),
 Token(word='?', pos='?', jyutping=None, mor=None, gloss=None, gra=None),
 Token(word='你', pos='R', jyutping='nei5', mor=None, gloss=None, gra=None)]
>>>
>>> corpus.utterances()[:1]
[Utterance(participant='XXA',
           tokens=[Token(word='喂', pos='E', jyutping='wai3', mor=None, gloss=None, gra=None),
                   Token(word='遲', pos='A', jyutping='ci4', mor=None, gloss=None, gra=None),
                   Token(word='啲', pos='U', jyutping='di1', mor=None, gloss=None, gra=None),
                   Token(word='去', pos='V', jyutping='heoi3', mor=None, gloss=None, gra=None),
                   Token(word='唔', pos='D', jyutping='m4', mor=None, gloss=None, gra=None),
                   Token(word='去', pos='V', jyutping='heoi3', mor=None, gloss=None, gra=None),
                   Token(word='旅行', pos='VN', jyutping='leoi5hang4', mor=None, gloss=None, gra=None),
                   Token(word='啊', pos='Y', jyutping='aa3', mor=None, gloss=None, gra=None),
                   Token(word='?', pos='?', jyutping=None, mor=None, gloss=None, gra=None)],
           time_marks=None,
           tiers={'XXA': '喂 遲 啲 去 唔 去 旅行 啊 ?',
                  '%mor': 'e|wai3 a|ci4 u|di1 v|heoi3 d|m4 v|heoi3 vn|leoi5hang4 y|aa3 ?'})]
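
As the representations above suggest, the fields of Utterance and Token objects are accessible as attributes. For example, here are the participant and the raw %mor tier of this first utterance (values taken from the output above):

>>> utterance = corpus.utterances()[0]
>>> utterance.participant
'XXA'
>>> utterance.tiers['%mor']
'e|wai3 a|ci4 u|di1 v|heoi3 d|m4 v|heoi3 vn|leoi5hang4 y|aa3 ?'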

PyCantonese has an augmented representation of tokens, where Jyutping romanization and glosses have their own dedicated attributes.

Jyutping Romanization

Tokens, as annotated words, are instances of the Token class. A Token instance has the PyCantonese-specific attribute jyutping to accommodate Jyutping romanization.

To illustrate, below is the first utterance in FC-001_v2.cha, where Jyutping romanization is found in the %mor tier:

*XXA:       喂 遲 啲 去 唔 去 旅行 啊 ?
%mor:       e|wai3 a|ci4 u|di1 v|heoi3 d|m4 v|heoi3 vn|leoi5hang4 y|aa3 ?

Here are the corresponding tokens from PyCantonese, where the data in CHAT format has been parsed into Token objects, with the attribute jyutping storing Jyutping romanization:

>>> some_tokens = corpus.tokens(by_utterances=True)[0]
>>> some_tokens
[Token(word='喂', pos='E', jyutping='wai3', mor=None, gloss=None, gra=None),
 Token(word='遲', pos='A', jyutping='ci4', mor=None, gloss=None, gra=None),
 Token(word='啲', pos='U', jyutping='di1', mor=None, gloss=None, gra=None),
 Token(word='去', pos='V', jyutping='heoi3', mor=None, gloss=None, gra=None),
 Token(word='唔', pos='D', jyutping='m4', mor=None, gloss=None, gra=None),
 Token(word='去', pos='V', jyutping='heoi3', mor=None, gloss=None, gra=None),
 Token(word='旅行', pos='VN', jyutping='leoi5hang4', mor=None, gloss=None, gra=None),
 Token(word='啊', pos='Y', jyutping='aa3', mor=None, gloss=None, gra=None),
 Token(word='?', pos='?', jyutping=None, mor=None, gloss=None, gra=None)]
>>> for token in some_tokens:
...     print(token.jyutping)
...
wai3
ci4
di1
heoi3
m4
heoi3
leoi5hang4
aa3
None

Given the ubiquity of Jyutping in the study of Cantonese, the jyutping() method is also defined for convenience:

>>> corpus.jyutping(by_utterances=True)[0]
['wai3', 'ci4', 'di1', 'heoi3', 'm4', 'heoi3', 'leoi5hang4', 'aa3', None]

For further processing of Jyutping romanization, please see the Jyutping Romanization page.

Chinese Characters

Corpus data in the CHAT format is word-segmented, and the same word segmentation is preserved in the output of the CHATReader methods words(), tokens(), and utterances(). For Cantonese data, a (segmented) word can be, say, 廣東話 (“Cantonese”), which consists of three Chinese characters. To work with data at the character level, characters() is available:

>>> corpus.characters(by_utterances=True)[0]
['喂', '遲', '啲', '去', '唔', '去', '旅', '行', '啊', '?']

If you have your own Cantonese data in Chinese characters, PyCantonese provides tools for word segmentation and part-of-speech tagging.
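
As a brief sketch, the top-level functions segment() and pos_tag() handle these two tasks (outputs omitted here; see their dedicated documentation pages for the available options and output formats):

>>> words = pycantonese.segment("廣東話好難學")   # a list of segmented words
>>> tagged = pycantonese.pos_tag(words)            # a list of (word, POS tag) pairs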

Word Frequencies and Ngrams

For word counts in various flavors, use the methods word_frequencies() and word_ngrams():

>>> word_freq = corpus.word_frequencies()  # A collections.Counter object
>>> word_freq.most_common(10)
[('.', 13251),
 (',', 9282),
 ('係', 5019),
 ('啊', 4110),
 ('?', 2911),
 ('我', 2755),
 ('噉', 2741),
 ('呢', 2734),
 ('你', 2570),
 ('佢', 2259)]
>>>
>>> trigrams = corpus.word_ngrams(3)  # A collections.Counter object
>>> trigrams.most_common(10)
[(('係', '啊', '.'), 527),
 ((',', '誒', ','), 520),
 (('呢', ',', '就'), 219),
 (('係', '啊', ','), 209),
 (('係', '囖', '.'), 202),
 (('吖', '嗎', '.'), 202),
 (('𡃉', '喎', '.'), 186),
 (('𠺢', '嗎', '.'), 167),
 (('係', '喇', '.'), 140),
 (('係', '喇', ','), 134)]