pycantonese.corpus.CantoneseCHATReader
class pycantonese.corpus.CantoneseCHATReader(*filenames, **kwargs)
A reader for Cantonese CHAT corpus files.
Note
Some of the methods are inherited from the parent class pylangacq.chat.Reader for language acquisition, which may or may not be applicable to your use case.
Methods
IPSyn([participant])  Return a map from a file path to the file’s IPSyn.
MLU([participant])  Return a map from a file path to the file’s MLU by morphemes.
MLUm([participant])  Return a map from a file path to the file’s MLU by morphemes.
MLUw([participant])  Return a map from a file path to the file’s MLU by words.
TTR([participant])  Return a map from a file path to the file’s TTR.
abspath(basename)  Return the absolute path of basename.
add(*filenames)  Add one or more CHAT filenames to the current reader.
age([participant, months])  Return a map from a file path to the participant’s age.
character_sents([participant, exclude, by_files])  Return the data as sentences of individual Cantonese characters.
characters([participant, exclude, by_files])  Return the data split into individual Cantonese characters.
clear()  Clear everything and reset as an empty Reader instance.
concordance(search_item[, participant, …])  Return a list of utterances with search_item for participant.
Return a map from a file path to the date of birth.
Return a map from a file path to the date of recording.
filenames([sorted_by_age])  Return the set of absolute-path filenames.
from_chat_files(*filenames, **kwargs)  Create a Reader object with CHAT data files.
from_chat_str(chat_str[, encoding])  Create a Reader object with CHAT data as a string.
headers()  Return a dict mapping a file path to the headers of that file.
Return a dict mapping a file path to the file’s index_to_tiers dict.
jyutping_sents([participant, exclude, by_files])  Return the sentences in Jyutping romanization.
jyutpings([participant, exclude, by_files])  Return the words in Jyutping romanization.
Return a map from a file path to the languages used.
Return the number of files.
number_of_utterances([participant, exclude, …])  Return the number of utterances for participant in all files.
part_of_speech_tags([participant, exclude, …])  Return the part-of-speech tags in the data for participant.
participant_codes([by_files])  Return the participant codes (e.g., {'CHI', 'MOT'}).
Return a dict mapping a file path to the file’s participant info.
remove(*filenames)  Remove one or more CHAT filenames from the current reader.
search(*[, onset, nucleus, coda, tone, …])  Search the data for the given criteria.
sents([participant, exclude, by_files])  Return a list of sents by participant in all files.
tagged_sents([participant, exclude, by_files])  Return a list of tagged sents by participant in all files.
tagged_words([participant, exclude, by_files])  Return a list of tagged words by participant in all files.
update(reader)  Combine the current CHAT Reader instance with reader.
utterances([participant, exclude, clean, …])  Return a list of (participant, utterance) pairs from all files.
word_frequency([participant, exclude, …])  Return a word frequency counter for participant in all files.
word_ngrams(n[, participant, exclude, …])  Return a word n-gram counter by participant in all files.
    participant : str or iterable of str, optional
        Participants of interest. If unspecified or None, all participants are included.
    exclude : str or iterable of str, optional
        Participants to exclude. If unspecified or None, no participants are excluded.
    by_files : bool, optional
        If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.
    keep_case : bool, optional
        If True (the default), case distinctions are kept, e.g., word tokens like “the” and “The” are treated as distinct. If False, all word tokens are forced to be in lowercase.
words([participant, exclude, by_files])  Return a list of words by participant in all files.
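The by_files and keep_case options described for word_ngrams can be illustrated with a small standard-library sketch. This is not pycantonese's or pylangacq's implementation: the function sketch_word_ngrams, its sents_by_file input, and the file paths are hypothetical, and the choice that n-grams do not cross sentence boundaries is an assumption made for this example only.

```python
from collections import Counter


def sketch_word_ngrams(sents_by_file, n, keep_case=True, by_files=False):
    """Count word n-grams in tokenized sentences (illustrative only).

    sents_by_file is a hypothetical dict mapping a file path to a list of
    sentences, where each sentence is a list of word tokens.
    """
    per_file = {}
    for path, sents in sents_by_file.items():
        counter = Counter()
        for sent in sents:
            # keep_case=False folds every token to lowercase before counting.
            tokens = sent if keep_case else [w.lower() for w in sent]
            # Assumption: the window never crosses a sentence boundary.
            for i in range(len(tokens) - n + 1):
                counter[tuple(tokens[i:i + n])] += 1
        per_file[path] = counter
    if by_files:
        # One counter per file, keyed by file path.
        return per_file
    # Otherwise merge the per-file counters into a single counter.
    total = Counter()
    for counter in per_file.values():
        total.update(counter)
    return total


# Hypothetical toy data standing in for two CHAT files.
data = {
    "/corpus/a.cha": [["The", "dog", "barks"], ["the", "dog", "sleeps"]],
    "/corpus/b.cha": [["the", "dog", "barks"]],
}
merged = sketch_word_ngrams(data, 2, keep_case=False)
split = sketch_word_ngrams(data, 2, keep_case=True, by_files=True)
```

With keep_case=False the bigrams ("The", "dog") and ("the", "dog") collapse into one key, while by_files=True keeps a separate counter for each file path.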
__init__(*filenames, **kwargs)
Initialize a reader for Cantonese CHAT corpus files.
Parameters
*filenames : iterable of str
    File paths to Cantonese CHAT data files. Glob filename matching is supported.
**kwargs
    Keyword arguments passed to CantoneseCHATReader. Currently, only the encoding kwarg is supported (default: 'utf8').
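As a rough illustration of what glob filename matching means for the *filenames argument, the sketch below expands patterns into a set of absolute paths using only the standard library. The helper expand_filenames and the temporary files are hypothetical; how CantoneseCHATReader resolves patterns internally is not shown in this page, so this is an assumption about the general idea rather than the library's code.

```python
import glob
import os
import tempfile


def expand_filenames(*filenames):
    """Expand glob patterns into a set of absolute paths (illustrative only)."""
    paths = set()
    for pattern in filenames:
        # glob.glob returns every existing path that matches the pattern.
        for match in glob.glob(pattern):
            paths.add(os.path.abspath(match))
    return paths


# Build a throwaway directory with two .cha files and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.cha", "b.cha", "notes.txt"):
        with open(os.path.join(tmp, name), "w", encoding="utf8") as f:
            f.write("@UTF8\n")
    found = expand_filenames(os.path.join(tmp, "*.cha"))
    basenames = sorted(os.path.basename(p) for p in found)
```

Only the files matching the *.cha pattern are collected, and each is stored as an absolute path, which mirrors the absolute-path filenames that the reader's methods report.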