Corpus reader methods

Two sections, The representation of “words” and A note on the access methods, provide background on how to use the methods listed under Metadata methods and Data methods.

The representation of “words”

The representation of “words” in PyCantonese comes in two flavors (similar to NLTK):

  1. The “simple” representation as a string, which is what appears as a word token (in Chinese characters) in a transcription line starting with * in the CHAT transcript.

  2. The “tagged” representation as a tuple of (word, pos, jyutping, rel), which contains information from the transcription line and its %-tiers:

    word (str) – Chinese character(s)

    pos (str) – part-of-speech tag from %mor (All PoS tags are rendered in uppercase.)

    jyutping (str) – Jyutping romanization from %mor (plus inflectional information, if any)

    rel – dependency and grammatical relation from %gra (no known datasets have made use of %gra yet)

To illustrate, let us consider the following CHAT utterance with its %mor tier:

*XXA:       喂 遲 啲 去 唔 去 旅行 啊 ?
%mor:       e|wai3 a|ci4 u|di1 v|heoi3 d|m4 v|heoi3 vn|leoi5hang4 y|aa3 ?

The list of “simple” words from this utterance is the list of word token strings:

['喂', '遲', '啲', '去', '唔', '去', '旅行', '啊', '?']

The list of “tagged” words from this utterance is a list of 4-tuples:

[('喂', 'E', 'wai3', ''),
 ('遲', 'A', 'ci4', ''),
 ('啲', 'U', 'di1', ''),
 ('去', 'V', 'heoi3', ''),
 ('唔', 'D', 'm4', ''),
 ('去', 'V', 'heoi3', ''),
 ('旅行', 'VN', 'leoi5hang4', ''),
 ('啊', 'Y', 'aa3', ''),
 ('?', '?', '', '')]

The distinction of “simple” versus “tagged” words is reflected in the data access methods listed in Data methods below.
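
As a rough illustration of how the two representations relate, the sketch below pairs each word token of the example utterance with its %mor item to build the 4-tuples shown above. The helper name to_tagged_words is hypothetical, and this is not how PyCantonese parses CHAT data internally; it only assumes that the two tiers align one-to-one and that each non-punctuation %mor item has the form pos|jyutping.

```python
def to_tagged_words(utterance, mor):
    """Pair word tokens with %mor items (hypothetical helper, not library code)."""
    tagged = []
    # Assumption: the transcription line and the %mor tier align one-to-one.
    for word, item in zip(utterance.split(), mor.split()):
        if "|" in item:
            pos, jyutping = item.split("|", 1)
            # PoS tags are rendered in uppercase; rel stays empty without %gra.
            tagged.append((word, pos.upper(), jyutping, ""))
        else:
            # Punctuation has no pos|jyutping structure in %mor.
            tagged.append((word, item, "", ""))
    return tagged


utterance = "喂 遲 啲 去 唔 去 旅行 啊 ?"
mor = "e|wai3 a|ci4 u|di1 v|heoi3 d|m4 v|heoi3 vn|leoi5hang4 y|aa3 ?"

simple_words = utterance.split()               # the "simple" representation
tagged_words = to_tagged_words(utterance, mor)  # the "tagged" representation
print(simple_words)
print(tagged_words)
```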

A note on the access methods

>>> import pycantonese as pc
>>> corpus = pc.hkcancor()

A corpus object, such as corpus just above, has a range of methods of the form X(). One example is number_of_files():

>>> corpus.number_of_files()
58

Many of these methods together with their documentation notes are programmatically inherited from the library PyLangAcq for language acquisition research. A few remarks here are necessary to avoid confusion.

Many methods have the optional parameter participant, which may safely be ignored in PyCantonese. The parameter participant specifies which participant(s) are of interest. This is important in the context of language acquisition: 'CHI' for the target child, 'MOT' for the mother, and so forth. In the CHAT format of HKCanCor that PyCantonese includes, the participants are rendered as codes such as 'XXA', 'XXB', etc., based on the original HKCanCor files. When participant is not specified, all participants are automatically included.

Another optional parameter of interest is by_files. Typically, a corpus comes in the form of multiple CHAT files. If a method X() has by_files, this parameter is False by default, so that X() returns its result for all files combined, without the per-file structure. If you are interested in results for individual files, set by_files to True; the return object is then dict(absolute-path filename: X() for that file) instead.
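
The effect of participant and by_files on the shape of the return value can be sketched with a toy stand-in. Everything here is hypothetical (the function toy_words, the file paths, and the data); it does not call PyCantonese and only mimics the behavior described above.

```python
# Toy data: filename -> list of (participant, words) utterances.
# The paths and contents are made up for illustration.
toy_corpus = {
    "/path/to/a.cha": [("XXA", ["喂", "?"]), ("XXB", ["去", "啊"])],
    "/path/to/b.cha": [("XXA", ["旅行"])],
}


def toy_words(corpus, participant=None, by_files=False):
    """Hypothetical stand-in mimicking the participant/by_files behavior."""
    per_file = {
        filename: [
            word
            for speaker, words in utterances
            # participant=None means all participants are included.
            if participant is None or speaker == participant
            for word in words
        ]
        for filename, utterances in corpus.items()
    }
    if by_files:
        # One entry per file, keyed by filename.
        return per_file
    # Without by_files, flatten the results across all files.
    return [word for words in per_file.values() for word in words]


print(toy_words(toy_corpus))
print(toy_words(toy_corpus, participant="XXA", by_files=True))
```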

Metadata methods

filenames([sorted_by_age]) Return the set of absolute-path filenames.
find_filename
number_of_files() Return the number of files.
number_of_utterances(*args, **kwargs)

Data methods

utterances(*args, **kwargs)
words(*args, **kwargs)
tagged_words(*args, **kwargs)
sents(*args, **kwargs)
tagged_sents(*args, **kwargs)
jyutpings([participant, exclude, by_files]) Return a list of jyutping strings by participant in all files.
jyutping_sents([participant, exclude, by_files]) Return a list of sents of jyutping strings by participant in all files.
characters([participant, exclude, by_files]) Return a list of Chinese characters by participant in all files.
character_sents([participant, exclude, by_files]) Return a list of sents of Chinese characters by participant in all files.
part_of_speech_tags(*args, **kwargs)
word_frequency(*args, **kwargs)
word_ngrams(*args, **kwargs)
search([onset, nucleus, coda, tone, …]) Search for the specified element(s).
update(reader) Combine the current CHAT Reader instance with reader.
add(*filenames) Add one or more CHAT filenames to the current reader.
remove(*filenames) Remove one or more CHAT filenames from the current reader.
clear() Clear everything and reset as an empty Reader instance.

Full reader API

class pycantonese.corpus.CantoneseCHATReader(*filenames, **kwargs)

Bases: pylangacq.chat.Reader

A class for reading Cantonese CHAT corpus files.

IPSyn(participant='CHI')

Return a map from a file path to the file’s IPSyn.

IPSyn = index of productive syntax

participant : str, optional
The specified participant (default to 'CHI').

dict(str: int)

MLU(participant='CHI')

Return a map from a file path to the file’s MLU by morphemes.

MLU = mean length of utterance. This method is identical to MLUm.

participant : str, optional
The specified participant (default to 'CHI').

dict(str: float)

MLUm(participant='CHI')

Return a map from a file path to the file’s MLU by morphemes.

MLU = mean length of utterance. This method is identical to MLU.

participant : str, optional
The specified participant (default to 'CHI').

dict(str: float)

MLUw(participant='CHI')

Return a map from a file path to the file’s MLU by words.

MLU = mean length of utterance.

participant : str, optional
The specified participant (default to 'CHI').

dict(str: float)
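
For orientation, MLU by words is the total number of words divided by the number of utterances. The sketch below computes that for one file's worth of sents; the function name mean_length_by_words is hypothetical, and the library's own computation may differ in detail (for example, in which tokens it counts).

```python
def mean_length_by_words(sents):
    """Mean number of words per utterance (illustrative, not library code)."""
    if not sents:
        # Avoid dividing by zero when there are no utterances.
        return 0.0
    return sum(len(sent) for sent in sents) / len(sents)


# Two made-up utterances with 4 and 2 words respectively.
sents = [["喂", "去", "唔", "去"], ["去", "啊"]]
mluw = mean_length_by_words(sents)
print(mluw)
```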

TTR(participant='CHI')

Return a map from a file path to the file’s TTR.

TTR = type-token ratio

participant : str, optional
The specified participant (default to 'CHI').

dict(str: float)
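
The type-token ratio is the number of distinct words (types) divided by the total number of words (tokens). A minimal sketch follows, using a hypothetical helper type_token_ratio; how the library selects or normalizes the words it counts is not covered here.

```python
def type_token_ratio(words):
    """Number of distinct words divided by total words (illustrative only)."""
    if not words:
        # Avoid dividing by zero for an empty word list.
        return 0.0
    return len(set(words)) / len(words)


# Four tokens, three distinct types.
words = ["去", "唔", "去", "啊"]
ttr = type_token_ratio(words)
print(ttr)
```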

abspath(basename)

Return the absolute path of basename.

basename : str
The basename (e.g., “foobar.cha”) of the desired data file.

str

add(*filenames)

Add one or more CHAT filenames to the current reader.

*filenames
Filenames may take glob patterns with wildcards * and ?.

age(participant='CHI', months=False)

Return a map from a file path to the participant’s age.

The age is in the form of (years, months, days).

participant : str, optional
The specified participant (default to 'CHI').
months : bool, optional
If True, age is in months.

dict(str: tuple(int, int, int)) or dict(str: float)
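
To show how the two return forms relate, the sketch below converts a (years, months, days) tuple to a single number of months. The helper age_in_months is hypothetical, and the 30-day month is an assumption made for this illustration; the exact conversion used by the library is not stated in this document.

```python
def age_in_months(age, days_per_month=30):
    """Convert (years, months, days) to months.

    Assumption: a month is treated as 30 days; the library may differ.
    """
    years, months, days = age
    return years * 12 + months + days / days_per_month


# A made-up age of 2 years, 6 months, 15 days.
example_age = (2, 6, 15)
print(age_in_months(example_age))
```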

character_sents(participant=None, exclude=None, by_files=False)

Return a list of sents of Chinese characters by participant in all files.

Parameters:
  • participant – Specify the participant(s); defaults to all participants.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

list(list(str)), or dict(str: list(list(str)))

characters(participant=None, exclude=None, by_files=False)

Return a list of Chinese characters by participant in all files.

Parameters:
  • participant – Specify the participant(s); defaults to all participants.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

list(str), or dict(str: list(str))

clear()

Clear everything and reset as an empty Reader instance.

date_of_birth()

Return a map from a file path to the date of birth.

dict(str: dict(str: tuple(int, int, int)))

dates_of_recording()

Return a map from a file path to the date of recording.

The date of recording is in the form of (year, month, day).

dict(str: list(tuple(int, int, int)))

filenames(sorted_by_age=False)

Return the set of absolute-path filenames.

sorted_by_age : bool, optional
Whether to return the filenames as a list sorted by the target child’s age.

set of str or list of str

headers()

Return a dict mapping a file path to the headers of that file.

dict(str: dict)

index_to_tiers()

Return a dict mapping a file path to the file’s index_to_tiers dict.

dict(str: dict)

jyutping_sents(participant=None, exclude=None, by_files=False)

Return a list of sents of jyutping strings by participant in all files.

Parameters:
  • participant – Specify the participant(s); defaults to all participants.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

list(list(str)), or dict(str: list(list(str)))

jyutpings(participant=None, exclude=None, by_files=False)

Return a list of jyutping strings by participant in all files.

Parameters:
  • participant – Specify the participant(s); defaults to all participants.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

list(str), or dict(str: list(str))

languages()

Return a map from a file path to the languages used.

dict(str: list(str))

number_of_files()

Return the number of files.

int

participants()

Return a dict mapping a file path to the file’s participant info.

dict(str: dict)

remove(*filenames)

Remove one or more CHAT filenames from the current reader.

*filenames
Filenames may take glob patterns with wildcards * and ?.

search(onset=None, nucleus=None, coda=None, tone=None, initial=None, final=None, jyutping=None, character=None, pos=None, word_range=(0, 0), sent_range=(0, 0), tagged=True, sents=False, participant=None, exclude=None, by_files=False)

Search for the specified element(s).

Jyutping elements

Parameters are onset, nucleus, coda, tone, initial, final, jyutping. If jyutping is used, none of the other Jyutping elements can be. If final is used, neither nucleus nor coda can be. onset and initial cannot conflict, unless one or both of them are None. Regular expression matching applies to onset, nucleus, coda, tone, and initial.
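
The constraints on combining Jyutping parameters can be restated as a small validation routine. This is only a sketch of the rules as written above: the function check_jyutping_args is hypothetical, the error messages are invented, and the library's actual checks and exception types may differ.

```python
def check_jyutping_args(onset=None, nucleus=None, coda=None, tone=None,
                        initial=None, final=None, jyutping=None):
    """Raise ValueError for parameter combinations the rules above disallow."""
    others = (onset, nucleus, coda, tone, initial, final)
    # Rule 1: jyutping excludes every other Jyutping element.
    if jyutping is not None and any(value is not None for value in others):
        raise ValueError("jyutping cannot be combined with other elements")
    # Rule 2: final excludes nucleus and coda.
    if final is not None and (nucleus is not None or coda is not None):
        raise ValueError("final cannot be combined with nucleus or coda")
    # Rule 3: onset and initial must agree when both are given.
    if onset is not None and initial is not None and onset != initial:
        raise ValueError("onset and initial conflict")


check_jyutping_args(onset="n", tone="2")   # allowed combination
try:
    check_jyutping_args(jyutping="heoi3", tone="3")
except ValueError as error:
    print("rejected:", error)
```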

Chinese character

Parameter: character (only one is allowed)

Part-of-speech tag

Parameter: pos

Regular expression matching applies.

Word or sentence range

word_range: specify the span of words to the left and right of a match word; defaults to (0, 0).

sent_range: specify the span of sents preceding and following the sent containing a match word; defaults to (0, 0).

If sent_range is used, word_range is ignored.
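
One way to picture word_range is as a window around the match word. The sketch below slices such a window out of a list of word tokens; context_window is a hypothetical helper, and it assumes the two numbers in word_range count words to the left and to the right of the match respectively, clipped at the utterance boundaries.

```python
def context_window(words, index, word_range=(0, 0)):
    """Return the match word plus its neighbors (illustrative, not library code)."""
    left, right = word_range
    # Clip at the start of the list; slicing clips at the end automatically.
    start = max(0, index - left)
    return words[start:index + right + 1]


words = ["喂", "遲", "啲", "去", "唔", "去", "旅行", "啊", "?"]
match_index = words.index("旅行")

print(context_window(words, match_index))                     # match word only
print(context_window(words, match_index, word_range=(2, 1)))  # 2 left, 1 right
```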

Output formatting

If sents is True (default: False), the sents containing a match word are returned; otherwise just the match word is returned.

If tagged is True (the default), words are tagged in the form of (word, pos, jyutping, rel); otherwise just word token strings.

by_files: If False (the default), the return object is a list encompassing search results for all files. If True, the return object is dict(absolute-path filename: list of search results for that file) instead.

Others

participant: specify the participant(s) (default: all participants).

update(reader)

Combine the current CHAT Reader instance with reader.

reader : Reader