Corpus Reader Methods

The Representation of “Words”

The representation of “words” in PyCantonese comes in two flavors (similar to NLTK):

  1. The “simple” representation as a string, which is what appears as a word token (in Chinese characters) in a transcription line starting with * in the CHAT transcript.

  2. The “tagged” representation as a tuple of (word, pos, jyutping, rel), which contains information from the transcription line and its %-tiers:

    word (str) – Chinese character(s)

    pos (str) – part-of-speech tag from %mor (All PoS tags are rendered in uppercase.)

    jyutping (str) – Jyutping romanization from %mor (plus inflectional information, if any)

    rel – dependency and grammatical relation from %gra (no known datasets have made use of %gra yet)

To illustrate, let us consider the following CHAT utterance with its %mor tier:

*XXA:       喂 遲 啲 去 唔 去 旅行 啊 ?
%mor:       e|wai3 a|ci4 u|di1 v|heoi3 d|m4 v|heoi3 vn|leoi5hang4 y|aa3 ?

The “simple” words from this utterance form a list of word token strings:

['喂', '遲', '啲', '去', '唔', '去', '旅行', '啊', '?']

The “tagged” words from this utterance form a list of 4-tuples:

[('喂', 'E', 'wai3', ''),
 ('遲', 'A', 'ci4', ''),
 ('啲', 'U', 'di1', ''),
 ('去', 'V', 'heoi3', ''),
 ('唔', 'D', 'm4', ''),
 ('去', 'V', 'heoi3', ''),
 ('旅行', 'VN', 'leoi5hang4', ''),
 ('啊', 'Y', 'aa3', ''),
 ('?', '?', '', '')]
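
The relationship between the two tiers and the tagged representation can be sketched in plain Python. This is only an illustration under simplifying assumptions (whitespace-separated tokens, exactly one %mor item per word, no %gra tier); it is not PyCantonese's actual parser, and the function name tag_words is hypothetical.

```python
def tag_words(utterance, mor):
    """Pair each word token with its %mor item as (word, pos, jyutping, rel)."""
    tagged = []
    for word, item in zip(utterance.split(), mor.split()):
        # A %mor item looks like "pos|jyutping"; punctuation has no "|",
        # in which case partition() leaves the jyutping part empty.
        pos, _, jyutping = item.partition("|")
        # PoS tags are rendered in uppercase; rel stays empty without %gra.
        tagged.append((word, pos.upper(), jyutping, ""))
    return tagged


utterance = "喂 遲 啲 去 唔 去 旅行 啊 ?"
mor = "e|wai3 a|ci4 u|di1 v|heoi3 d|m4 v|heoi3 vn|leoi5hang4 y|aa3 ?"

tagged = tag_words(utterance, mor)
# The "simple" representation is just the first element of each tuple.
simple = [word for word, _, _, _ in tagged]
```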

A Note on the Access Methods

>>> import pycantonese as pc
>>> corpus = pc.hkcancor()

A corpus object, such as corpus above, provides an array of methods, written generically as X() in the remarks below. An example is number_of_files():

>>> corpus.number_of_files()
58

Many of these methods together with their documentation notes are programmatically inherited from the library PyLangAcq for language acquisition research. A few remarks here are necessary to avoid confusion.

Many methods have the optional parameter participant, which may safely be ignored in PyCantonese. This parameter specifies which participant(s) are of interest, which matters in language acquisition research: 'CHI' for the target child, 'MOT' for the mother, and so forth. In the CHAT format of HKCanCor that PyCantonese includes, the participants are rendered as codes such as 'XXA', 'XXB', etc., based on the original HKCanCor files. When participant is not specified, all participants are automatically included.

Another optional parameter of interest is by_files. A corpus typically comes as multiple CHAT files. If a method X() accepts by_files, the parameter defaults to False, so that X() returns the result for all files combined, without the file structure. To get results for individual files, set by_files to True; the return object is then dict(absolute-path filename: X() for that file).
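
The effect of participant and by_files can be mimicked with a toy data structure. The sketch below is not the library's implementation: the file paths, the toy_corpus layout, and this stand-alone words() function are all made up for illustration.

```python
# Toy data: file path -> list of (participant, words) utterances.
toy_corpus = {
    "/data/a.cha": [("XXA", ["喂", "?"]), ("XXB", ["去", "啊"])],
    "/data/b.cha": [("XXA", ["旅行"])],
}


def words(corpus, participant=None, by_files=False):
    """Return words for all files, or a dict keyed by file path if by_files."""
    per_file = {
        path: [
            word
            for who, utterance_words in utterances
            if participant is None or who == participant
            for word in utterance_words
        ]
        for path, utterances in corpus.items()
    }
    if by_files:
        return per_file
    # Flatten across files, dropping the file structure.
    return [word for path in per_file for word in per_file[path]]


all_words = words(toy_corpus)
xxa_words = words(toy_corpus, participant="XXA")
words_by_file = words(toy_corpus, by_files=True)
```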

Full Reader API

class pycantonese.corpus.CantoneseCHATReader(*filenames, **kwargs)

A reader for Cantonese CHAT corpus files.

Note

Some of the methods are inherited from the parent class pylangacq.chat.Reader for language acquisition, which may or may not be applicable to your use case.

Methods

IPSyn([participant])

Return a map from a file path to the file’s IPSyn.

MLU([participant])

Return a map from a file path to the file’s MLU by morphemes.

MLUm([participant])

Return a map from a file path to the file’s MLU by morphemes.

MLUw([participant])

Return a map from a file path to the file’s MLU by words.

TTR([participant])

Return a map from a file path to the file’s TTR.

abspath(basename)

Return the absolute path of basename.

add(*filenames)

Add one or more CHAT filenames to the current reader.

age([participant, months])

Return a map from a file path to the participant’s age.

character_sents([participant, exclude, by_files])

Return the data as sentences of individual Cantonese characters.

characters([participant, exclude, by_files])

Return the data split in individual Cantonese characters.

clear()

Clear everything and reset as an empty Reader instance.

concordance(search_item[, participant, …])

Return a list of utterances with search_item for participant.

date_of_birth()

Return a map from a file path to the date of birth.

dates_of_recording()

Return a map from a file path to the date of recording.

filenames([sorted_by_age])

Return the set of absolute-path filenames.

from_chat_files(*filenames, **kwargs)

Create a Reader object with CHAT data files.

from_chat_str(chat_str[, encoding])

Create a Reader object with CHAT data as a string.

headers()

Return a dict mapping a file path to the headers of that file.

index_to_tiers()

Return a dict mapping a file path to the file’s index_to_tiers dict.

jyutping_sents([participant, exclude, by_files])

Return the sentences in Jyutping romanization.

jyutpings([participant, exclude, by_files])

Return the words in Jyutping romanization.

languages()

Return a map from a file path to the languages used.

number_of_files()

Return the number of files.

number_of_utterances([participant, exclude, …])

Return the number of utterances for participant in all files.

part_of_speech_tags([participant, exclude, …])

Return the part-of-speech tags in the data for participant.

participant_codes([by_files])

Return the participant codes (e.g., {'CHI', 'MOT'}).

participants()

Return a dict mapping a file path to the file’s participant info.

remove(*filenames)

Remove one or more CHAT filenames from the current reader.

search(*[, onset, nucleus, coda, tone, …])

Search the data for the given criteria.

sents([participant, exclude, by_files])

Return a list of sents by participant in all files.

tagged_sents([participant, exclude, by_files])

Return a list of tagged sents by participant in all files.

tagged_words([participant, exclude, by_files])

Return a list of tagged words by participant in all files.

update(reader)

Combine the current CHAT Reader instance with reader.

utterances([participant, exclude, clean, …])

Return a list of (participant, utterance) pairs from all files.

word_frequency([participant, exclude, …])

Return a word frequency counter for participant in all files.

word_ngrams(n[, participant, exclude, …])

Return a word n-gram counter by participant in all files.

words([participant, exclude, by_files])

Return a list of words by participant in all files.

IPSyn(participant='CHI')

Return a map from a file path to the file’s IPSyn.

IPSyn = index of productive syntax

Parameters
participant : str, optional

The specified participant (default to 'CHI').

Returns
dict(str: int)
MLU(participant='CHI')

Return a map from a file path to the file’s MLU by morphemes.

MLU = mean length of utterance. This method is identical to MLUm.

Parameters
participant : str, optional

The specified participant (default to 'CHI').

Returns
dict(str: float)
MLUm(participant='CHI')

Return a map from a file path to the file’s MLU by morphemes.

MLU = mean length of utterance. This method is identical to MLU.

Parameters
participant : str, optional

The specified participant (default to 'CHI').

Returns
dict(str: float)
MLUw(participant='CHI')

Return a map from a file path to the file’s MLU by words.

MLU = mean length of utterance.

Parameters
participant : str, optional

The specified participant (default to 'CHI').

Returns
dict(str: float)
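
As a reminder of what the measure means, MLU by words is the total number of word tokens divided by the number of utterances. The sketch below computes it for one file's worth of toy sentences. The helper name mlu_by_words is hypothetical, and any filtering the library applies (for instance, to punctuation or unintelligible tokens) is not modeled here.

```python
def mlu_by_words(sents):
    """Mean length of utterance in words; 0.0 when there are no utterances."""
    if not sents:
        return 0.0
    return sum(len(sent) for sent in sents) / len(sents)


toy_sents = [["去", "唔", "去"], ["旅行"], ["喂", "啊"]]
mlu = mlu_by_words(toy_sents)  # (3 + 1 + 2) / 3
```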
TTR(participant='CHI')

Return a map from a file path to the file’s TTR.

TTR = type-token ratio

Parameters
participant : str, optional

The specified participant (default to 'CHI').

Returns
dict(str: float)
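
The type-token ratio is the number of distinct word types divided by the number of word tokens. A minimal sketch, with a hypothetical helper name and without whatever token filtering the library may perform:

```python
def type_token_ratio(words):
    """Number of distinct words divided by the total number of words."""
    if not words:
        return 0.0
    return len(set(words)) / len(words)


ttr = type_token_ratio(["去", "唔", "去", "旅行"])  # 3 types / 4 tokens
```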
abspath(basename)

Return the absolute path of basename.

Parameters
basename : str

The basename (e.g., “foobar.cha”) of the desired data file.

Returns
str
add(*filenames)

Add one or more CHAT filenames to the current reader.

Parameters
*filenames

Filenames may take glob patterns with wildcards * and ?.

age(participant='CHI', months=False)

Return a map from a file path to the participant’s age.

The age is in the form of (years, months, days).

Parameters
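participant : str, optional

The specified participant (default to 'CHI').

months : bool, optional

If True (default: False), the age is given as a number of months instead of a (years, months, days) tuple.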

Returns
dict(str: tuple(int, int, int)) or dict(str: float)
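
The two return shapes are related by a simple conversion. The sketch below assumes a 30-day month when turning days into a fraction of a month; that convention and the helper name age_in_months are assumptions for illustration, so check the library's own result if the exact value matters.

```python
def age_in_months(age, days_per_month=30):
    """Convert a (years, months, days) tuple to a float number of months."""
    years, months, days = age
    return years * 12 + months + days / days_per_month


months = age_in_months((1, 6, 15))  # 12 + 6 + 15/30
```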
character_sents(participant=None, exclude=None, by_files=False)

Return the data as sentences of individual Cantonese characters.

Parameters
participant : str or iterable[str], optional

One or more participants to include the data for. If unspecified, all participants are included.

exclude : str or iterable[str], optional

One or more participants to exclude the data for. If unspecified, no participants are excluded.

by_files : bool, optional

If True (default: False), return data organized by the individual file paths.

Returns
list[list[str]], or dict[str, list[list[str]]] if by_files is True
characters(participant=None, exclude=None, by_files=False)

Return the data split in individual Cantonese characters.

Parameters
participant : str or iterable[str], optional

One or more participants to include the data for. If unspecified, all participants are included.

exclude : str or iterable[str], optional

One or more participants to exclude the data for. If unspecified, no participants are excluded.

by_files : bool, optional

If True (default: False), return data organized by the individual file paths.

Returns
list[str], or dict[str, list[str]] if by_files is True
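
The difference between words and characters can be seen by splitting segmented words into their individual characters. This is a rough sketch with hypothetical helper names; how the library treats punctuation or non-Chinese tokens is not modeled.

```python
def split_characters(words):
    """Flatten a list of segmented words into individual characters."""
    return [char for word in words for char in word]


def split_character_sents(sents):
    """Apply the same split to each sentence, keeping sentence boundaries."""
    return [split_characters(sent) for sent in sents]


chars = split_characters(["去", "旅行", "啊"])
char_sents = split_character_sents([["去", "旅行"], ["旅行", "啊"]])
```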
clear()

Clear everything and reset as an empty Reader instance.

concordance(search_item, participant=None, exclude=None, match_entire_word=True, lemma=False, by_files=False)

Return a list of utterances with search_item for participant.

All strings are aligned for search_item by space padding to create the word concordance effect.

Parameters
search_item : str

Word or lemma to search for.

match_entire_word : bool, optional

If False (default: True), substring matching is performed.

lemma : bool, optional

If True (default: False), search_item refers to the lemma (from “mor” in the tagged word) instead.

participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns
list, or dict(str: list)
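
The space-padding idea behind a concordance can be sketched as follows: the left context of every hit is right-justified to a common width so that the search item lines up in one column. This is an illustration only, not the library's code. The helper name align_concordance is hypothetical, whole-word matching is assumed, and Jyutping tokens are used as toy data so that the alignment is visible in a monospaced font.

```python
def align_concordance(utterances, search_item):
    """Return one padded line per hit, with search_item in the same column."""
    hits = []
    for words in utterances:
        for i, word in enumerate(words):
            if word == search_item:
                left = " ".join(words[:i])
                right = " ".join(words[i + 1:])
                hits.append((left, word, right))
    # Pad every left context to the width of the longest one.
    width = max((len(left) for left, _, _ in hits), default=0)
    return [
        f"{left.rjust(width)} {word} {right}".rstrip()
        for left, word, right in hits
    ]


toy_utterances = [
    ["wai3", "ci4", "di1", "heoi3"],
    ["heoi3", "leoi5hang4", "aa3"],
]
lines = align_concordance(toy_utterances, "heoi3")
```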
date_of_birth()

Return a map from a file path to the date of birth.

Returns
dict(str: dict(str: tuple(int, int, int)))
dates_of_recording()

Return a map from a file path to the date of recording.

The date of recording is in the form of (year, month, day).

Returns
dict(str: list(tuple(int, int, int)))
filenames(sorted_by_age=False)

Return the set of absolute-path filenames.

Parameters
sorted_by_age : bool, optional

Whether to return the filenames as a list sorted by the target child’s age.

Returns
set of str or list of str
classmethod from_chat_files(*filenames, **kwargs)

Create a Reader object with CHAT data files.

Parameters
filenames : str or iterable of str, optional

One or more filenames. A filename may match exactly one CHAT file (e.g., 'eve01.cha') or match multiple files by glob patterns (e.g., 'eve*.cha', for 'eve01.cha', 'eve02.cha', etc.). * matches any number (including zero) of characters, while ? matches exactly one character. A filename can be either an absolute or relative path. If no filenames are provided, an empty Reader instance is created.

kwargs

Only the keyword encoding is recognized, which defaults to ‘utf8’. (New in version 0.9)

Returns
Reader

Notes

Because CHAT data most likely comes as files on disk, an equivalent library top-level function pylangacq.read_chat is defined for convenience.
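
The wildcard semantics described for filenames (* for any number of characters, ? for exactly one) are the same as those of Python's standard fnmatch module, which makes for a quick way to check a pattern before passing it to a reader. The file names below are made up, and this sketch does not claim that the reader uses fnmatch internally.

```python
from fnmatch import fnmatchcase

names = ["eve01.cha", "eve02.cha", "eve10.cha", "adam01.cha"]

# "*" matches any number of characters, including zero.
star_matches = [name for name in names if fnmatchcase(name, "eve*.cha")]

# "?" matches exactly one character.
question_matches = [name for name in names if fnmatchcase(name, "eve0?.cha")]
```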

classmethod from_chat_str(chat_str, encoding='utf8')

Create a Reader object with CHAT data as a string.

Parameters
chat_str : str

CHAT data as an in-memory string. It would be what a single CHAT data file contains.

encoding

Encoding of the CHAT data (default: 'utf8').

Returns
Reader
headers()

Return a dict mapping a file path to the headers of that file.

Returns
dict(str: dict)
index_to_tiers()

Return a dict mapping a file path to the file’s index_to_tiers dict.

Returns
dict(str: dict)
jyutping_sents(participant=None, exclude=None, by_files=False)

Return the sentences in Jyutping romanization.

Parameters
participant : str or iterable[str], optional

One or more participants to include the data for. If unspecified, all participants are included.

exclude : str or iterable[str], optional

One or more participants to exclude the data for. If unspecified, no participants are excluded.

by_files : bool, optional

If True (default: False), return data organized by the individual file paths.

Returns
list[list[str]], or dict[str, list[list[str]]] if by_files is True
jyutpings(participant=None, exclude=None, by_files=False)

Return the words in Jyutping romanization.

Parameters
participant : str or iterable[str], optional

One or more participants to include the data for. If unspecified, all participants are included.

exclude : str or iterable[str], optional

One or more participants to exclude the data for. If unspecified, no participants are excluded.

by_files : bool, optional

If True (default: False), return data organized by the individual file paths.

Returns
list[str], or dict[str, list[str]] if by_files is True
languages()

Return a map from a file path to the languages used.

Returns
dict(str: list(str))
number_of_files()

Return the number of files.

Returns
int
number_of_utterances(participant=None, exclude=None, by_files=False)

Return the number of utterances for participant in all files.

Parameters
participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns
int or dict(str: int)
part_of_speech_tags(participant=None, exclude=None, by_files=False)

Return the part-of-speech tags in the data for participant.

Parameters
participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns
set or dict(str: set)
participant_codes(by_files=False)

Return the participant codes (e.g., {'CHI', 'MOT'}).

Parameters
by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns
set(str) or dict(str: set(str))
participants()

Return a dict mapping a file path to the file’s participant info.

Returns
dict(str: dict)
remove(*filenames)

Remove one or more CHAT filenames from the current reader.

Parameters
*filenames

Filenames may take glob patterns with wildcards * and ?.

search(*, onset=None, nucleus=None, coda=None, tone=None, initial=None, final=None, jyutping=None, character=None, pos=None, word_range=(0, 0), sent_range=(0, 0), tagged=True, sents=False, participant=None, exclude=None, by_files=False)

Search the data for the given criteria.

For examples, please see https://pycantonese.org/searches.html.

Parameters
onset : str, optional

Onset to search for. A regex is supported.

nucleus : str, optional

Nucleus to search for. A regex is supported.

coda : str, optional

Coda to search for. A regex is supported.

tone : str, optional

Tone to search for. A regex is supported.

initial : str, optional

Initial to search for. A regex is supported. An initial, a term more prevalent in traditional Chinese phonology, is the equivalent of an onset.

final : str, optional

Final to search for. A final, a term more prevalent in traditional Chinese phonology, is the equivalent of a nucleus plus a coda.

jyutping : str, optional

Jyutping romanization of one Cantonese character to search for. If the romanization contains more than one character, a ValueError is raised.

character : str, optional

One or more Cantonese characters (within a segmented word) to search for.

pos : str, optional

A part-of-speech tag to search for. A regex is supported.

word_range : tuple[int, int], optional

Span of words to the left and right of a matching word to include in the output. The default is (0, 0) to disable a range. If sent_range is used, word_range is ignored.

sent_range : tuple[int, int], optional

Span of sentences before and after a sentence containing a matching word to include in the output. The default is (0, 0) to disable a range. If sent_range is used, word_range is ignored.

tagged : bool, optional

If True (the default), words in the output are in the tagged form. Otherwise just word token strings are returned.

sents : bool, optional

If True (default is False), sentences containing matching words are returned. Otherwise, only matching words are returned.

participant : str or iterable[str], optional

One or more participants to include in the search. If unspecified, all participants are included.

exclude : str or iterable[str], optional

One or more participants to exclude in the search. If unspecified, no participants are excluded.

by_files : bool, optional

If True (default: False), return data organized by the individual file paths.

Returns
list
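
To make the phonological criteria concrete, the sketch below splits a single Jyutping syllable into onset, nucleus, coda, and tone with a simplified regular expression, derives initial (same as the onset) and final (nucleus plus coda), and checks regex criteria against those parts. It is an approximation written for this page: the pattern, the helper names parse_syllable and syllable_matches, and the whole-part matching behavior are assumptions, not the library's parser or search logic.

```python
import re

# Simplified Jyutping syllable pattern (an assumption, not exhaustive).
SYLLABLE = re.compile(
    r"^(?P<onset>ng|gw|kw|[bpmfdtnlgkhwzcsj])?"
    r"(?P<nucleus>aa|yu|oe|eo|[aeiou]|m|ng)"
    r"(?P<coda>ng|[ptkmniu])?"
    r"(?P<tone>[1-6])$"
)


def parse_syllable(syllable):
    """Split one Jyutping syllable into its parts; missing parts are ''."""
    match = SYLLABLE.match(syllable)
    if match is None:
        raise ValueError(f"cannot parse Jyutping syllable: {syllable!r}")
    parts = {name: value or "" for name, value in match.groupdict().items()}
    parts["initial"] = parts["onset"]
    parts["final"] = parts["nucleus"] + parts["coda"]
    return parts


def syllable_matches(syllable, **criteria):
    """True if every given part fully matches its regex criterion."""
    parts = parse_syllable(syllable)
    return all(
        re.fullmatch(pattern, parts[name]) is not None
        for name, pattern in criteria.items()
    )


heoi = parse_syllable("heoi3")
```

For instance, syllable_matches("hang4", onset="h", tone="[1-4]") holds under this sketch, while syllable_matches("wai3", coda="[ptk]") does not.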
sents(participant=None, exclude=None, by_files=False)

Return a list of sents by participant in all files.

Parameters
participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns
list(list(str)) or dict(str: list(list(str)))
tagged_sents(participant=None, exclude=None, by_files=False)

Return a list of tagged sents by participant in all files.

Parameters
participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns
list(list(tuple)) or dict(str: list(list(tuple)))
tagged_words(participant=None, exclude=None, by_files=False)

Return a list of tagged words by participant in all files.

Parameters
participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns
list(tuple) or dict(str: list(tuple))
update(reader)

Combine the current CHAT Reader instance with reader.

Parameters
reader : Reader
utterances(participant=None, exclude=None, clean=True, by_files=False)

Return a list of (participant, utterance) pairs from all files.

Parameters
clean : bool, optional

Whether to filter away the CHAT annotations in the utterance.

participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns
list(str) or dict(str: list(str))
word_frequency(participant=None, exclude=None, keep_case=True, by_files=False)

Return a word frequency counter for participant in all files.

Parameters
participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

keep_case : bool, optional

If True (the default), case distinctions are kept, e.g., word tokens like “the” and “The” are treated as distinct. If False, all word tokens are forced to be in lowercase.

Returns
Counter, or dict(str: Counter)
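
The returned object behaves like a collections.Counter keyed by word. The sketch below reproduces the keep_case behavior on a toy word list; the stand-alone function is a hypothetical illustration rather than the library's code.

```python
from collections import Counter


def count_words(words, keep_case=True):
    """Count word tokens, optionally folding everything to lowercase."""
    if not keep_case:
        words = [word.lower() for word in words]
    return Counter(words)


toy_words = ["The", "the", "去", "去"]
with_case = count_words(toy_words)
without_case = count_words(toy_words, keep_case=False)
```

Since Chinese characters have no case distinction, keep_case only affects tokens written in scripts that do, such as code-switched English words.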
word_ngrams(n, participant=None, exclude=None, keep_case=True, by_files=False)

Return a word n-gram counter by participant in all files.

Parameters
participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

keep_case : bool, optional

If True (the default), case distinctions are kept, e.g., word tokens like “the” and “The” are treated as distinct. If False, all word tokens are forced to be in lowercase.

Returns
Counter, or dict(str: Counter)
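
A word n-gram counter can be pictured as a Counter whose keys are tuples of n consecutive words. The sketch below assumes that n-grams are collected within each sentence and never span a sentence boundary; that assumption and the helper name count_ngrams are for illustration only.

```python
from collections import Counter


def count_ngrams(sents, n):
    """Count tuples of n consecutive words within each sentence."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    counts = Counter()
    for sent in sents:
        for i in range(len(sent) - n + 1):
            counts[tuple(sent[i:i + n])] += 1
    return counts


toy_sents = [["去", "唔", "去"], ["唔", "去"]]
bigrams = count_ngrams(toy_sents, 2)
```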
words(participant=None, exclude=None, by_files=False)

Return a list of words by participant in all files.

Parameters
participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns
list(str) or dict(str: list(str))