pycantonese.corpus.CantoneseCHATReader

class pycantonese.corpus.CantoneseCHATReader(*filenames, **kwargs)[source]

A reader for Cantonese CHAT corpus files.

Note

Some of the methods are inherited from the parent class pylangacq.chat.Reader, which is designed for language acquisition data; those methods may or may not be applicable to your use case.
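
The following is a minimal usage sketch. It assumes the pycantonese 2.x API, in which the built-in Hong Kong Cantonese Corpus is exposed as pycantonese.hkcancor() and returns a CantoneseCHATReader; the calls below use only methods listed on this page.

>>> import pycantonese
>>> # Assumption: hkcancor() returns a CantoneseCHATReader over the
>>> # built-in Hong Kong Cantonese Corpus (HKCanCor).
>>> corpus = pycantonese.hkcancor()
>>> corpus.number_of_files()
>>> sorted(corpus.participant_codes())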

Methods

IPSyn([participant])

Return a map from a file path to the file’s IPSyn (Index of Productive Syntax).

MLU([participant])

Return a map from a file path to the file’s MLU (mean length of utterance) by morphemes.

MLUm([participant])

Return a map from a file path to the file’s MLU by morphemes.

MLUw([participant])

Return a map from a file path to the file’s MLU by words.

TTR([participant])

Return a map from a file path to the file’s TTR (type-token ratio).

abspath(basename)

Return the absolute path of basename.

add(*filenames)

Add one or more CHAT filenames to the current reader.

age([participant, months])

Return a map from a file path to the participant’s age.

character_sents([participant, exclude, by_files])

Return the data as sentences of individual Cantonese characters.

characters([participant, exclude, by_files])

Return the data split into individual Cantonese characters.

clear()

Clear everything and reset as an empty Reader instance.

concordance(search_item[, participant, …])

Return a list of utterances with search_item for participant.

date_of_birth()

Return a map from a file path to the date of birth.

dates_of_recording()

Return a map from a file path to the date of recording.

filenames([sorted_by_age])

Return the set of absolute-path filenames.

from_chat_files(*filenames, **kwargs)

Create a Reader object with CHAT data files.

from_chat_str(chat_str[, encoding])

Create a Reader object with CHAT data as a string.

headers()

Return a dict mapping a file path to the headers of that file.

index_to_tiers()

Return a dict mapping a file path to the file’s index_to_tiers dict.

jyutping_sents([participant, exclude, by_files])

Return the sentences in Jyutping romanization.

jyutpings([participant, exclude, by_files])

Return the words in Jyutping romanization.

languages()

Return a map from a file path to the languages used.

number_of_files()

Return the number of files.

number_of_utterances([participant, exclude, …])

Return the number of utterances for participant in all files.

part_of_speech_tags([participant, exclude, …])

Return the part-of-speech tags in the data for participant.

participant_codes([by_files])

Return the participant codes (e.g., {'CHI', 'MOT'}).

participants()

Return a dict mapping a file path to the file’s participant info.

remove(*filenames)

Remove one or more CHAT filenames from the current reader.

search(*[, onset, nucleus, coda, tone, …])

Search the data for the given criteria.

sents([participant, exclude, by_files])

Return a list of sents by participant in all files.

tagged_sents([participant, exclude, by_files])

Return a list of tagged sents by participant in all files.

tagged_words([participant, exclude, by_files])

Return a list of tagged words by participant in all files.

update(reader)

Combine the current CHAT Reader instance with reader.

utterances([participant, exclude, clean, …])

Return a list of (participant, utterance) pairs from all files.

word_frequency([participant, exclude, …])

Return a word frequency counter for participant in all files.

word_ngrams(n[, participant, exclude, …])

Return a word n-gram counter by participant in all files.

    participant : str or iterable of str, optional
        Participants of interest. If unspecified or None, all participants are included.
    exclude : str or iterable of str, optional
        Participants to exclude. If unspecified or None, no participants are excluded.
    by_files : bool, optional
        If True, return a dict mapping each absolute-path filename to the result for that file, instead of a single result for all files together.
    keep_case : bool, optional
        If True (the default), case distinctions are kept, e.g., word tokens like “the” and “The” are treated as distinct. If False, all word tokens are lowercased.

words([participant, exclude, by_files])

Return a list of words by participant in all files.
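
The sketch below strings together several of the methods listed above. It assumes a reader created as in the earlier example, that word_frequency() and word_ngrams() return collections.Counter objects (as their summaries suggest), and that search() accepts the keyword-only arguments shown in its signature; the concrete search values are illustrative placeholders only.

>>> corpus = pycantonese.hkcancor()      # assumption: built-in HKCanCor reader
>>> tokens = corpus.words()              # flat list of word tokens
>>> sents = corpus.jyutping_sents()      # sentences in Jyutping romanization
>>> freq = corpus.word_frequency()       # word frequency counter
>>> freq.most_common(5)
>>> bigrams = corpus.word_ngrams(2)      # word bigram counter
>>> # Keyword-only search criteria, per the search() signature above;
>>> # onset 'b' and tone '4' are placeholder values.
>>> hits = corpus.search(onset='b', tone='4')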

__init__(*filenames, **kwargs)[source]

Initialize a reader for Cantonese CHAT corpus files.

Parameters
*filenames : iterable of str

File paths to Cantonese CHAT data files. Glob filename matching is supported.

**kwargs

Keyword arguments. Currently, only the encoding keyword argument is supported (default: ‘utf8’).
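
As a sketch of instantiating the class directly (the glob pattern below is a hypothetical placeholder path; encoding is the only keyword argument documented above):

>>> from pycantonese.corpus import CantoneseCHATReader
>>> # 'data/*.cha' is a hypothetical glob pattern; glob matching is supported.
>>> reader = CantoneseCHATReader('data/*.cha', encoding='utf8')
>>> reader.number_of_files()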
