Corpus reader methods

The sections The representation of “words” and A note on the access methods provide background information for how to use the methods listed in Metadata methods and Data methods.

Warning

If you are running a script on Windows, be sure to put all your code under the scope of if __name__ == '__main__':. (PyCantonese uses the multiprocessing module to read data files.)

The representation of “words”

The representation of “words” in PyCantonese comes in two flavors (similar to NLTK):

  1. The “simple” representation as a string, which is what appears as a word token (in Chinese characters) in a transcription line starting with * in the CHAT transcript.

  2. The “tagged” representation as a tuple of (word, pos, jyutping, rel), which contains information from the transcription line and its %-tiers:

    word (str) – Chinese character(s)

    pos (str) – part-of-speech tag from %mor (All PoS tags are rendered in uppercase.)

    jyutping (str) – Jyutping romanization from %mor (plus inflectional information, if any)

    rel – dependency and grammatical relation from %gra (no known datasets have made use of %gra yet)

To illustrate, let us consider the following CHAT utterance with its %mor tier:

*XXA:       喂 遲 啲 去 唔 去 旅行 啊 ?
%mor:       e|wai3 a|ci4 u|di1 v|heoi3 d|m4 v|heoi3 vn|leoi5hang4 y|aa3 ?

The list of “simple” words from this utterance is the list of word token strings:

['喂', '遲', '啲', '去', '唔', '去', '旅行', '啊', '?']

The list of “tagged” words from this utterance is a list of 4-tuples:

[('喂', 'E', 'wai3', ''),
 ('遲', 'A', 'ci4', ''),
 ('啲', 'U', 'di1', ''),
 ('去', 'V', 'heoi3', ''),
 ('唔', 'D', 'm4', ''),
 ('去', 'V', 'heoi3', ''),
 ('旅行', 'VN', 'leoi5hang4', ''),
 ('啊', 'Y', 'aa3', ''),
 ('?', '?', '', '')]

The distinction of “simple” versus “tagged” words is reflected in the data access methods listed in Data methods below.
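
As a rough illustration of how the two representations relate, the sketch below derives both lists from the utterance and %mor tier shown above. It is not PyCantonese's actual parser: the helper names simple_words and tagged_words_from_mor are hypothetical, and the sketch assumes that the transcription line and the %mor tier align token by token.

```python
# Hypothetical helpers, not part of PyCantonese: a sketch of how the
# "simple" and "tagged" representations relate to the CHAT tiers.
UTTERANCE = "喂 遲 啲 去 唔 去 旅行 啊 ?"
MOR = "e|wai3 a|ci4 u|di1 v|heoi3 d|m4 v|heoi3 vn|leoi5hang4 y|aa3 ?"


def simple_words(utterance):
    """Split a transcription line into word token strings."""
    return utterance.split()


def tagged_words_from_mor(utterance, mor):
    """Pair each word with its %mor item as (word, pos, jyutping, rel)."""
    tagged = []
    for word, item in zip(utterance.split(), mor.split()):
        if "|" in item:
            pos, jyutping = item.split("|", 1)
            # PoS tags are rendered in uppercase; rel is empty without %gra.
            tagged.append((word, pos.upper(), jyutping, ""))
        else:
            # Punctuation has no "pos|jyutping" structure on the %mor tier.
            tagged.append((word, item, "", ""))
    return tagged


print(simple_words(UTTERANCE))
print(tagged_words_from_mor(UTTERANCE, MOR)[:2])
```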

A note on the access methods

>>> import pycantonese as pc
>>> corpus = pc.hkcancor()

A corpus object, such as corpus as shown just above, has an array of methods X(). An example is number_of_files():

>>> corpus.number_of_files()
58

Many of these methods together with their documentation notes are programmatically inherited from the library PyLangAcq for language acquisition research. A few remarks here are necessary to avoid confusion.

Many methods have the optional parameter participant, which may safely be ignored in PyCantonese. The parameter participant specifies which participant(s) are of interest. This is important in the context of language acquisition: 'CHI' for the target child, 'MOT' for the mother, and so forth. In the CHAT format of HKCanCor that PyCantonese includes, the participants are rendered as codes such as 'XXA', 'XXB', etc., based on the original HKCanCor files. When participant is not specified, all participants are automatically included.

Another optional parameter of interest is by_files. Typically, a corpus comes in the form of multiple CHAT files. If a method X() has by_files, this parameter is set to be False by default, so that X() returns whatever it is for all the files without the file structure. If you are interested in results for individual files, set by_files to be True and the return object is dict(absolute-path filename: X() for that file) instead.
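
The by_files convention can be pictured with a toy stand-in. The sketch below does not use PyCantonese; WORDS_BY_FILE, its made-up paths, and the function words are hypothetical, and pooling the files in sorted filename order is an assumption made only for this illustration.

```python
# Toy per-file data with made-up paths; a real reader builds this from CHAT files.
WORDS_BY_FILE = {
    "/corpus/a.cha": ["喂", "去"],
    "/corpus/b.cha": ["旅行", "啊"],
}


def words(by_files=False):
    """Mimic the by_files convention for a hypothetical words() method."""
    if by_files:
        # One entry per file, keyed by absolute-path filename.
        return {filename: list(ws) for filename, ws in WORDS_BY_FILE.items()}
    # Otherwise pool everything, discarding the file structure.
    return [w for filename in sorted(WORDS_BY_FILE) for w in WORDS_BY_FILE[filename]]


print(words())                                # one flat list for all files
print(words(by_files=True)["/corpus/b.cha"])  # results for a single file
```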

Metadata methods

filenames([sorted_by_age]) Return the set of absolute-path filenames, or a list sorted by the target child’s age if sorted_by_age is True.
find_filename(file_basename) Return the absolute-path filename of file_basename.
number_of_files() Return the number of files.
number_of_utterances([participant, by_files]) Return the number of utterances for participant in all files.

Data methods

utterances([participant, clean, by_files]) Return a list of (participant, utterance) pairs from all files.
words([participant, by_files]) Return a list of words by participant in all files.
tagged_words([participant, by_files]) Return a list of tagged words by participant in all files.
sents([participant, by_files]) Return a list of sents by participant in all files.
tagged_sents([participant, by_files]) Return a list of tagged sents by participant in all files.
jyutpings([participant, by_files]) Return a list of jyutping strings by participant in all files.
jyutping_sents([participant, by_files]) Return a list of sents of jyutping strings by participant in all files.
characters([participant, by_files]) Return a list of Chinese characters by participant in all files.
character_sents([participant, by_files]) Return a list of sents of Chinese characters by participant in all files.
part_of_speech_tags([participant, by_files]) Return the part-of-speech tags in the data for participant.
word_frequency([participant, keep_case, ...]) Return a Counter of word frequencies for participant in all files.
word_ngrams(n[, participant, keep_case, ...]) Return a Counter of word n-grams by participant in all files.
search([onset, nucleus, coda, tone, ...]) Search for the specified element(s).
update(reader) Combine the current CHAT Reader instance with reader.
add(*filenames) Add one or multiple CHAT files to the current reader by filenames.
remove(*filenames) Remove one or multiple CHAT files from the current reader by filenames.
clear() Clear everything and reset as an empty Reader instance.

Full reader API

class pycantonese.corpus.CantoneseCHATReader(*filenames, encoding='utf8')

Bases: pylangacq.chat.Reader

A class for reading Cantonese CHAT corpus files.

add(*filenames)

Add one or multiple CHAT files to the current reader by filenames. filenames may take glob patterns with wildcards * and ?.

age(participant='CHI', month=False)

Return a dict mapping an absolute-path filename to the participant’s age in the form of (years, months, days).

Parameters:
  • participant – The specified participant; defaults to 'CHI'
  • month – If True (default: False), return a float as age in months.
Return type:

dict(str: tuple(int, int, int)) or dict(str: float)
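
To make the two return shapes concrete, here is a minimal sketch of converting a (years, months, days) tuple into a float number of months. The helper age_in_months is hypothetical, and treating a month as 30 days is an assumption of this sketch rather than a documented detail of the library.

```python
def age_in_months(age):
    """Convert (years, months, days) to months, assuming 30-day months."""
    years, months, days = age
    return years * 12 + months + days / 30


print(age_in_months((1, 6, 15)))  # 18.5
```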

character_sents(participant='**ALL**', by_files=False)

Return a list of sents of Chinese characters by participant in all files.

Parameters:
  • participant – Specify the participant(s); defaults to all participants.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

list(list(str)), or dict(str: list(list(str)))

characters(participant='**ALL**', by_files=False)

Return a list of Chinese characters by participant in all files.

Parameters:
  • participant – Specify the participant(s); defaults to all participants.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

list(str), or dict(str: list(str))

clear()

Clear everything and reset as an empty Reader instance.

date_of_birth()

Return a dict mapping an absolute-path filename to the date-of-birth dict for that file.

Return type: dict(str: dict(str: tuple(int, int, int)))

date_of_recording()

Return a dict mapping an absolute-path filename to the date of recording in the form of (year, month, day).

Return type: dict(str: tuple(int, int, int))

filenames(sorted_by_age=False)

Return the set of absolute-path filenames, or a list sorted by the target child’s age if sorted_by_age is True.

Return type: set(str) or list(str)

find_filename(file_basename)

Return the absolute-path filename of file_basename.

Parameters: file_basename – CHAT file basename such as eve01.cha

headers()

Return a dict mapping an absolute-path filename to the headers of that file.

Return type: dict(str: dict)

index_to_tiers()

Return a dict mapping an absolute-path filename to the file’s index_to_tiers dict.

Return type: dict(str: dict)

jyutping_sents(participant='**ALL**', by_files=False)

Return a list of sents of jyutping strings by participant in all files.

Parameters:
  • participant – Specify the participant(s); defaults to all participants.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

list(list(str)), or dict(str: list(list(str)))

jyutpings(participant='**ALL**', by_files=False)

Return a list of jyutping strings by participant in all files.

Parameters:
  • participant – Specify the participant(s); defaults to all participants.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

list(str), or dict(str: list(str))

languages()

Return a dict mapping an absolute-path filename to the list of languages used.

Return type: dict(str: list(str))

number_of_files()

Return the number of files.

number_of_utterances(participant='**ALL**', by_files=False)

Return the number of utterances for participant in all files.

Parameters:
  • participant – The participant(s) of interest; defaults to all participants. Set it to 'CHI' for the target child only, for example. To select multiple participants, pass a collection such as {'CHI', 'MOT'}. Under the hood, this parameter performs regular expression matching (so passing 'CHI' to this parameter is an exact match for the participant code 'CHI', for instance). For child-directed speech (i.e., targeting all participants except 'CHI'), use '^(?!.*CHI).*$'.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

int, or dict(str: int)
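
The matching behaviour described for participant can be sketched with the standard re module. This is only an approximation of the documented behaviour, not the library's code: the helper select_participants is hypothetical, and the use of re.fullmatch to obtain exact matches is an assumption.

```python
import re


def select_participants(codes, participant):
    """Return the codes matched by one pattern or a collection of patterns."""
    patterns = [participant] if isinstance(participant, str) else list(participant)
    return {
        code
        for code in codes
        if any(re.fullmatch(pattern, code) for pattern in patterns)
    }


codes = {"CHI", "MOT", "FAT"}
print(sorted(select_participants(codes, "CHI")))             # ['CHI']
print(sorted(select_participants(codes, {"CHI", "MOT"})))    # ['CHI', 'MOT']
print(sorted(select_participants(codes, r"^(?!.*CHI).*$")))  # ['FAT', 'MOT']
```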

part_of_speech_tags(participant='**ALL**', by_files=False)

Return the part-of-speech tags in the data for participant.

Parameters:
  • participant – The participant(s) of interest; defaults to all participants. Set it to 'CHI' for the target child only, for example. To select multiple participants, pass a collection such as {'CHI', 'MOT'}. Under the hood, this parameter performs regular expression matching (so passing 'CHI' to this parameter is an exact match for the participant code 'CHI', for instance). For child-directed speech (i.e., targeting all participants except 'CHI'), use '^(?!.*CHI).*$'.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

set or dict(str: set)

participant_codes(by_files=False)

Return the participant codes (e.g., {'CHI', 'MOT'}) from all files.

Parameters: by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type: set(str), or dict(str: set(str))

participants()

Return a dict mapping an absolute-path filename to the file’s participant info dict.

Return type: dict(str: dict)

remove(*filenames)

Remove one or multiple CHAT files from the current reader by filenames. filenames may take glob patterns with wildcards * and ?.
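
For readers unfamiliar with glob wildcards, the sketch below shows how * and ? select filenames, using the standard fnmatch module. The paths are made up and match_filenames is a hypothetical helper; how the reader itself expands patterns is not shown here.

```python
from fnmatch import fnmatchcase


def match_filenames(filenames, pattern):
    """Keep the filenames matching a glob pattern (* = any run, ? = one char)."""
    return [name for name in filenames if fnmatchcase(name, pattern)]


names = ["/data/eve01.cha", "/data/eve02.cha", "/data/adam10.cha"]
print(match_filenames(names, "/data/eve0?.cha"))  # the two eve files
print(match_filenames(names, "*.cha"))            # all three files
```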

search(onset=None, nucleus=None, coda=None, tone=None, initial=None, final=None, jyutping=None, character=None, pos=None, word_range=(0, 0), sent_range=(0, 0), tagged=True, sents=False, participant='**ALL**', by_files=False)

Search for the specified element(s).

Jyutping elements

Parameters are onset, nucleus, coda, tone, initial, final, jyutping. If jyutping is used, none of the other Jyutping elements can be. If final is used, neither nucleus nor coda can be. onset and initial cannot conflict, unless one or both of them are None. Regular expression matching applies to onset, nucleus, coda, tone, and initial.
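
To clarify how the Jyutping elements relate to one another, the following sketch splits a single syllable into onset, nucleus, coda, and tone, and derives initial (the same as the onset) and final (nucleus plus coda). It is a simplified regular expression written for this illustration, not the parser PyCantonese uses; JYUTPING_SYLLABLE and parse_syllable are hypothetical names.

```python
import re

# Simplified pattern for one Jyutping syllable (an assumption of this sketch).
JYUTPING_SYLLABLE = re.compile(
    r"(?P<onset>gw|kw|ng|[bpmfdtnlgkhwzcsj])?"
    r"(?P<nucleus>aa|oe|eo|yu|[aeiou]|ng|m)"
    r"(?P<coda>ng|[ptkmniu])?"
    r"(?P<tone>[1-6])"
)


def parse_syllable(syllable):
    """Split one Jyutping syllable into its elements."""
    match = JYUTPING_SYLLABLE.fullmatch(syllable)
    if match is None:
        raise ValueError(f"not a single Jyutping syllable: {syllable!r}")
    # Absent optional elements (onset, coda) become empty strings.
    parts = {name: value or "" for name, value in match.groupdict().items()}
    parts["initial"] = parts["onset"]
    parts["final"] = parts["nucleus"] + parts["coda"]
    return parts


print(parse_syllable("heoi3"))
print(parse_syllable("m4"))
```

Because final is nucleus plus coda, and initial is another name for the onset, specifying both members of such a pair would be redundant or contradictory, which is why the parameters are mutually constrained.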

Chinese character

Parameter: character (only one is allowed)

Part-of-speech tag

Parameter: pos

Regular expression matching applies.

Word or sentence range

word_range: specify the span of words to the left and right of a match word; defaults to (0, 0).

sent_range: specify the span of sents preceding and following the sent containing a match word; defaults to (0, 0).

If sent_range is used, word_range is ignored.
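
A small sketch may help picture word_range. It assumes the pair means (words to the left, words to the right) of the match and that the window is clipped at the sentence boundaries; both points are assumptions of this illustration, and word_window is a hypothetical helper rather than part of the library.

```python
def word_window(words, index, word_range=(0, 0)):
    """Return the match at words[index] plus its left/right context."""
    left, right = word_range
    start = max(0, index - left)  # do not run past the start of the sent
    return words[start:index + right + 1]


sent = ["喂", "遲", "啲", "去", "唔", "去", "旅行", "啊", "?"]
print(word_window(sent, 6))          # ['旅行']
print(word_window(sent, 6, (1, 2)))  # ['去', '旅行', '啊', '?']
```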

Output formatting

If sents is True (default: False), sents containing a match word are returned; otherwise just the matching word.

If tagged is True (the default), words are tagged in the form of (word, pos, jyutping, rel); otherwise just word token strings.

by_files: If False (the default), the return object is a list encompassing search results for all files. If True, the return object is dict(absolute-path filename: list of search results for that file) instead.

Others

participant: specify the participant(s) (default: all participants).

sents(participant='**ALL**', by_files=False)

Return a list of sents by participant in all files.

Parameters:
  • participant – The participant(s) of interest; defaults to all participants. Set it to 'CHI' for the target child only, for example. To select multiple participants, pass a collection such as {'CHI', 'MOT'}. Under the hood, this parameter performs regular expression matching (so passing 'CHI' to this parameter is an exact match for the participant code 'CHI', for instance). For child-directed speech (i.e., targeting all participants except 'CHI'), use '^(?!.*CHI).*$'.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

list(list(str)), or dict(str: list(list(str)))

tagged_sents(participant='**ALL**', by_files=False)

Return a list of tagged sents by participant in all files.

Parameters:
  • participant – The participant(s) of interest; defaults to all participants. Set it to 'CHI' for the target child only, for example. To select multiple participants, pass a collection such as {'CHI', 'MOT'}. Under the hood, this parameter performs regular expression matching (so passing 'CHI' to this parameter is an exact match for the participant code 'CHI', for instance). For child-directed speech (i.e., targeting all participants except 'CHI'), use '^(?!.*CHI).*$'.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

list(list(tuple)), or dict(str: list(list(tuple)))

tagged_words(participant='**ALL**', by_files=False)

Return a list of tagged words by participant in all files.

Parameters:
  • participant – The participant(s) of interest; defaults to all participants. Set it to 'CHI' for the target child only, for example. To select multiple participants, pass a collection such as {'CHI', 'MOT'}. Under the hood, this parameter performs regular expression matching (so passing 'CHI' to this parameter is an exact match for the participant code 'CHI', for instance). For child-directed speech (i.e., targeting all participants except 'CHI'), use '^(?!.*CHI).*$'.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

list(tuple), or dict(str: list(tuple))

update(reader)

Combine the current CHAT Reader instance with reader.

Parameters: reader – a Reader instance

utterances(participant='**ALL**', clean=True, by_files=False)

Return a list of (participant, utterance) pairs from all files.

Parameters:
  • participant – The participant(s) of interest; defaults to all participants. Set it to 'CHI' for the target child only, for example. To select multiple participants, pass a collection such as {'CHI', 'MOT'}. Under the hood, this parameter performs regular expression matching (so passing 'CHI' to this parameter is an exact match for the participant code 'CHI', for instance). For child-directed speech (i.e., targeting all participants except 'CHI'), use '^(?!.*CHI).*$'.
  • clean – Whether to filter away the CHAT annotations in the utterance; defaults to True.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

list(tuple(str, str)), or dict(str: list(tuple(str, str)))

word_frequency(participant='**ALL**', keep_case=True, by_files=False)

Return a Counter of word frequencies for participant in all files.

Parameters:
  • participant – The participant(s) of interest; defaults to all participants. Set it to 'CHI' for the target child only, for example. To select multiple participants, pass a collection such as {'CHI', 'MOT'}. Under the hood, this parameter performs regular expression matching (so passing 'CHI' to this parameter is an exact match for the participant code 'CHI', for instance). For child-directed speech (i.e., targeting all participants except 'CHI'), use '^(?!.*CHI).*$'.
  • keep_case – If keep_case is True (the default), case distinctions are kept and word tokens like “the” and “The” are treated as distinct types. If keep_case is False, all case distinctions are collapsed, with all word tokens forced to be in lowercase.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

Counter, or dict(str: Counter)
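
The effect of keep_case can be sketched with collections.Counter from the standard library. The function count_words below is a hypothetical stand-in operating on a plain list of tokens, not the reader method itself.

```python
from collections import Counter


def count_words(words, keep_case=True):
    """Count word tokens, optionally collapsing case distinctions."""
    if not keep_case:
        words = [word.lower() for word in words]
    return Counter(words)


tokens = ["The", "the", "OK", "ok", "去", "去"]
print(count_words(tokens))                   # "The" and "the" counted apart
print(count_words(tokens, keep_case=False))  # "the": 2, "ok": 2, "去": 2
```

Case folding leaves Chinese characters untouched, so keep_case only matters for tokens written in a cased script.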

word_ngrams(n, participant='**ALL**', keep_case=True, by_files=False)

Return a Counter of word n-grams by participant in all files.

Parameters:
  • participant – The participant(s) of interest; defaults to all participants. Set it to 'CHI' for the target child only, for example. To select multiple participants, pass a collection such as {'CHI', 'MOT'}. Under the hood, this parameter performs regular expression matching (so passing 'CHI' to this parameter is an exact match for the participant code 'CHI', for instance). For child-directed speech (i.e., targeting all participants except 'CHI'), use '^(?!.*CHI).*$'.
  • keep_case – If keep_case is True (the default), case distinctions are kept and word tokens like “the” and “The” are treated as distinct types. If keep_case is False, all case distinctions are collapsed, with all word tokens forced to be in lowercase.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

Counter, or dict(str: Counter)
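
As an illustration of what a Counter of word n-grams looks like, the sketch below counts n-grams as tuples of n consecutive words. The helper count_ngrams is hypothetical, and the choice not to let n-grams span sentence boundaries is an assumption of this sketch.

```python
from collections import Counter


def count_ngrams(sents, n, keep_case=True):
    """Count n-grams (as tuples of words) within each sent."""
    counts = Counter()
    for sent in sents:
        if not keep_case:
            sent = [word.lower() for word in sent]
        # Slide a window of width n across the sent.
        for i in range(len(sent) - n + 1):
            counts[tuple(sent[i:i + n])] += 1
    return counts


sents = [["去", "唔", "去", "旅行"], ["去", "唔", "去"]]
bigrams = count_ngrams(sents, 2)
print(bigrams.most_common(2))
```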

words(participant='**ALL**', by_files=False)

Return a list of words by participant in all files.

Parameters:
  • participant – The participant(s) of interest; defaults to all participants. Set it to 'CHI' for the target child only, for example. To select multiple participants, pass a collection such as {'CHI', 'MOT'}. Under the hood, this parameter performs regular expression matching (so passing 'CHI' to this parameter is an exact match for the participant code 'CHI', for instance). For child-directed speech (i.e., targeting all participants except 'CHI'), use '^(?!.*CHI).*$'.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

list(str), or dict(str: list(str))