API Reference

Corpus Data

`read_chat`(path[, match, exclude, encoding])	Read Cantonese CHAT data files.
`hkcancor`()	Create a corpus object for the Hong Kong Cantonese Corpus.
`CHATReader`()	A reader for Cantonese CHAT corpus files.
`CHATReader.search`(*[, onset, nucleus, coda, ...])	Search the data for the given criteria.

Jyutping Romanization

`characters_to_jyutping`(chars[, segmenter])	Convert Cantonese characters into Jyutping romanization.
`parse_jyutping`(jp_str)	Parse Jyutping romanization into onset, nucleus, coda, and tone.
`jyutping_to_yale`(jp_str[, as_list])	Convert Jyutping romanization into Yale romanization.
`jyutping_to_tipa`(jp_str)	Convert Jyutping romanization into LaTeX TIPA.

Natural Language Processing

`stop_words`([add, remove])	Return Cantonese stop words.
`parse_text`(data, *[, segment_kwargs, ...])	Parse raw Cantonese text.
`segment`(unsegmented[, cls])	Segment the unsegmented input.
`word_segmentation.Segmenter`(*[, ...])	A customizable word segmentation model.
`pos_tag`(words[, tagset])	Tag the words for their parts of speech.
`pos_tagging.hkcancor_to_ud`([tag])	Map a part-of-speech tag from HKCanCor to Universal Dependencies.

`CHATReader`

class pycantonese.CHATReader

A reader for Cantonese CHAT corpus files.

Note

Some of the methods are inherited from the parent class Reader for language acquisition, which may or may not be applicable to your use case.

Methods

`ages`([participant, months])	Return the ages of the given participant in the data.
`append`(reader)	Append data from another reader.
`append_left`(reader)	Left-append data from another reader.
`characters`([participants, exclude, ...])	Return the data in individual Chinese characters.
`clear`()	Remove all data from this reader.
`dates_of_recording`([by_files])	Return the dates of recording.
`extend`(readers)	Extend data from other readers.
`extend_left`(readers)	Left-extend data from other readers.
`file_paths`()	Return the file paths.
`filter`([match, exclude])	Return a new reader filtered by file paths.
`from_dir`(path[, match, exclude, extension, ...])	Instantiate a reader from a local directory with CHAT data files.
`from_files`(paths[, match, exclude, ...])	Instantiate a reader from local CHAT data files.
`from_strs`(strs[, ids, parallel])	Instantiate a reader from in-memory CHAT data strings.
`from_zip`(path[, match, exclude, extension, ...])	Instantiate a reader from a local or remote ZIP file.
`head`([n, participants, exclude])	Return the first several utterances.
`headers`()	Return the headers.
`info`([verbose])	Print a summary of this Reader's data.
`ipsyn`()	(Not implemented - the upstream `ipsyn` method works for English only.)
`jyutping`([participants, exclude, ...])	Return the data in Jyutping romanization.
`languages`([by_files])	Return the languages in the data.
`mlu`([participant])	Return the mean lengths of utterance (MLU).
`mlum`([participant])	Return the mean lengths of utterance by morphemes.
`mluw`([participant, exclude_switch])	Return the mean lengths of utterance by words.
`n_files`()	Return the number of files.
`participants`([by_files])	Return the participants (e.g., CHI, MOT).
`pop`()	Drop the last data file from the reader and return it as a reader.
`pop_left`()	Drop the first data file from the reader and return it as a reader.
`search`(*[, onset, nucleus, coda, tone, ...])	Search the data for the given criteria.
`sents`([participants, exclude, by_files])	Return the sents.
`tagged_sents`([participants, exclude, by_files])	Return the tagged sents.
`tagged_words`([participants, exclude, by_files])	Return the tagged words.
`tail`([n, participants, exclude])	Return the last several utterances.
`to_chat`(path[, is_dir, filenames, tabular, ...])	Export to CHAT data files.
`to_strs`([tabular])	Yield CHAT data strings.
`tokens`([participants, exclude, ...])	Return the tokens.
`ttr`([keep_case, participant])	Return the type-token ratios (TTR).
`utterances`([participants, exclude, by_files])	Return the utterances.
`word_frequencies`([keep_case, participants, ...])	Return word frequencies.
`word_ngrams`(n[, keep_case, participants, ...])	Return word ngrams.
`words`([participants, exclude, ...])	Return the words.

character_sents
jyutping_sents
jyutpings

ages(participant='CHI', months=False) → Union[List[Tuple[int, int, int]], List[float]]

Return the ages of the given participant in the data.

Parameters

participant (str, optional): Participant of interest, which defaults to the typical use case of "CHI" for the target child.
months (bool, optional): If False (the default), age is represented as a tuple of (years, months, days), e.g., “1;06.00” in CHAT becomes (1, 6, 0). If True, age is a float for the number of months, e.g., “1;06.00” in CHAT becomes 18.0 for 18 months.

Returns

List[Tuple[int, int, int]] if months is False, otherwise List[float]
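The two age representations can be related with a short sketch. It does not call the library: `parse_chat_age` and `age_in_months` are hypothetical helpers, and counting a month as 30 days for the fractional part is an assumption rather than documented behavior.

```python
import re
from typing import Tuple


def parse_chat_age(age: str) -> Tuple[int, int, int]:
    """Parse a CHAT age string such as "1;06.00" into (years, months, days)."""
    match = re.fullmatch(r"(\d+);(\d+)\.(\d+)", age)
    if match is None:
        raise ValueError(f"unrecognized CHAT age: {age!r}")
    years, months, days = (int(part) for part in match.groups())
    return years, months, days


def age_in_months(age: Tuple[int, int, int]) -> float:
    """Convert (years, months, days) to months (assumes 30 days per month)."""
    years, months, days = age
    return years * 12 + months + days / 30


age = parse_chat_age("1;06.00")
print(age)                 # (1, 6, 0)
print(age_in_months(age))  # 18.0
```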

append(reader: pylangacq.chat.Reader) → None

Append data from another reader.

New data is appended as-is with no filtering of any sort, even for files whose file paths duplicate those already in the current reader.

Parameters

reader (Reader): A reader from which to append data

append_left(reader: pylangacq.chat.Reader) → None

Left-append data from another reader.

New data is appended as-is with no filtering of any sort, even for files whose file paths duplicate those already in the current reader.

Parameters

reader (Reader): A reader from which to left-append data
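The "no filtering" behavior of append() and append_left() can be pictured with a double-ended queue of file identifiers. This is only a model of the documented semantics, not the library's internals; the file names are hypothetical, and the left-append case uses a single-file reader so that no assumption about multi-file ordering is needed.

```python
from collections import deque

# Hypothetical file identifiers held by the current reader and another reader.
current = deque(["a.cha", "b.cha"])
other = ["b.cha", "c.cha"]

# Models append(reader): new files go to the right end, duplicates are kept.
current.extend(other)
print(list(current))  # ['a.cha', 'b.cha', 'b.cha', 'c.cha']

# Models append_left(reader) for a reader holding one file.
current.appendleft("z.cha")
print(list(current))  # ['z.cha', 'a.cha', 'b.cha', 'b.cha', 'c.cha']
```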

characters(participants=None, exclude=None, by_utterances=False, by_files=False) → Union[List[str], List[List[str]], List[List[List[str]]]]

Return the data in individual Chinese characters.

Parameters

participants (str or iterable of str, optional): Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.
exclude (str or iterable of str, optional): Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.
by_utterances (bool, optional): If True, the resulting objects are wrapped as a list at the utterance level. If False (the default), there is no utterance-level nesting.
by_files (bool, optional): If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), the per-file distinction is collapsed into a single result.

Returns

List[List[List[str]]] if both by_utterances and by_files are True
List[List[str]] if by_utterances is True and by_files is False
List[List[str]] if by_utterances is False and by_files is True
List[str] if both by_utterances and by_files are False
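The four return shapes can be illustrated with plain nested lists. The sketch below is a model of the nesting only and does not use the library; `DATA` and `arrange` are hypothetical names.

```python
from typing import List

# Hypothetical toy data: files -> utterances -> characters.
DATA: List[List[List[str]]] = [
    [["你", "好"], ["早", "晨"]],  # file 1, two utterances
    [["多", "謝"]],                # file 2, one utterance
]


def arrange(data, by_utterances=False, by_files=False):
    """Flatten whichever nesting levels were not requested."""
    if by_utterances and by_files:
        return data
    if by_utterances:
        # Drop the file level, keep the utterance level.
        return [utt for file_ in data for utt in file_]
    if by_files:
        # Keep the file level, drop the utterance level.
        return [[ch for utt in file_ for ch in utt] for file_ in data]
    return [ch for file_ in data for utt in file_ for ch in utt]


print(arrange(DATA))
# ['你', '好', '早', '晨', '多', '謝']
print(arrange(DATA, by_files=True))
# [['你', '好', '早', '晨'], ['多', '謝']]
print(arrange(DATA, by_utterances=True))
# [['你', '好'], ['早', '晨'], ['多', '謝']]
```

Note that the two List[List[str]] cases have the same type but different meanings: the inner lists are utterances in one case and whole files in the other.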

clear() → None: Remove all data from this reader.

dates_of_recording(by_files=False) → Union[Set[datetime.date], List[Set[datetime.date]]]

Return the dates of recording.

Parameters

by_files (bool, optional): If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), the per-file distinction is collapsed into a single result.

Returns

Set[datetime.date] if by_files is False,
otherwise List[Set[datetime.date]]

extend(readers: Iterable[pylangacq.chat.Reader]) → None

Extend data from other readers.

New data is appended as-is with no filtering of any sort, even for files whose file paths duplicate those already in the current reader.

Parameters

readers (Iterable[Reader]): Readers from which to extend data

extend_left(readers: Iterable[pylangacq.chat.Reader]) → None

Left-extend data from other readers.

New data is appended as-is with no filtering of any sort, even for files whose file paths duplicate those already in the current reader.

Parameters

readers (Iterable[Reader]): Readers from which to left-extend data

file_paths() → List[str]

Return the file paths.

If the data comes from in-memory strings, then the “file paths” are the identifiers supplied when the reader was created, or random UUID strings if none were supplied.

Returns

List[str]

filter(match: str = None, exclude: str = None) → pylangacq.chat.Reader

Return a new reader filtered by file paths.

Parameters

match (str, optional): If provided, only the file paths that match this string (by regular expression matching) are read and parsed. For example, to work with the American English dataset Brown (containing data for the children Adam, Eve, and Sarah), you can pass in "Eve" here to only handle the data for Eve, since the unzipped Brown data from CHILDES has a directory structure of Brown/Eve/xxx.cha for Eve’s data. If this parameter is not specified or None is passed in (the default), such file path filtering does not apply.
exclude (str, optional): If provided, the file paths that match this string (by regular expression matching) are excluded from reading and parsing.

Returns

pylangacq.Reader

Raises

TypeError: If neither match nor exclude is specified.
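The path filtering described above can be sketched in plain Python. This is not the library's code: `filter_paths` is a hypothetical helper, the file names are made up, and the use of `re.search` (a match anywhere in the path) is an assumption suggested by the "Eve" example.

```python
import re
from typing import List, Optional


def filter_paths(
    paths: List[str],
    match: Optional[str] = None,
    exclude: Optional[str] = None,
) -> List[str]:
    """Keep paths that satisfy `match` and do not satisfy `exclude`."""
    if match is None and exclude is None:
        raise TypeError("At least one of match and exclude must be specified.")
    kept = []
    for path in paths:
        if match is not None and not re.search(match, path):
            continue
        if exclude is not None and re.search(exclude, path):
            continue
        kept.append(path)
    return kept


paths = [
    "Brown/Adam/010100.cha",
    "Brown/Eve/010600a.cha",
    "Brown/Sarah/020305.cha",
]
print(filter_paths(paths, match="Eve"))
# ['Brown/Eve/010600a.cha']
print(filter_paths(paths, exclude="Adam|Sarah"))
# ['Brown/Eve/010600a.cha']
```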

classmethod from_dir(path: str, match: str = None, exclude: str = None, extension: str = '.cha', encoding: str = 'utf-8', parallel: bool = True) → pylangacq.chat.Reader

Instantiate a reader from a local directory with CHAT data files.

Parameters

path (str): Local directory that contains CHAT data files. Files are searched for recursively under this directory, and those that satisfy match and extension are parsed and handled by the reader.
match (str, optional): If provided, only the file paths that match this string (by regular expression matching) are read and parsed. For example, to work with the American English dataset Brown (containing data for the children Adam, Eve, and Sarah), you can pass in "Eve" here to only handle the data for Eve, since the unzipped Brown data from CHILDES has a directory structure of Brown/Eve/xxx.cha for Eve’s data. If this parameter is not specified or None is passed in (the default), such file path filtering does not apply.
exclude (str, optional): If provided, the file paths that match this string (by regular expression matching) are excluded from reading and parsing.
extension (str, optional): File extension for CHAT data files. The default value is ".cha".
encoding (str, optional): Text encoding to parse the CHAT data. The default value is "utf-8" for Unicode UTF-8.
parallel (bool, optional): If True (the default), CHAT reading and parsing is parallelized for speed-up, because in most cases multiple CHAT data files and/or strings are being handled. Under certain circumstances (e.g., your application is already parallelized and further parallelization from within PyLangAcq might be undesirable), consider setting this parameter to False.

Returns

pylangacq.Reader

classmethod from_files(paths: List[str], match: str = None, exclude: str = None, encoding: str = 'utf-8', parallel: bool = True) → pylangacq.chat.Reader

Instantiate a reader from local CHAT data files.

Parameters

paths (List[str]): List of local file paths of the CHAT data. The ordering of the paths determines that of the parsed CHAT data in the resulting reader.
match (str, optional): If provided, only the file paths that match this string (by regular expression matching) are read and parsed. For example, to work with the American English dataset Brown (containing data for the children Adam, Eve, and Sarah), you can pass in "Eve" here to only handle the data for Eve, since the unzipped Brown data from CHILDES has a directory structure of Brown/Eve/xxx.cha for Eve’s data. If this parameter is not specified or None is passed in (the default), such file path filtering does not apply.
exclude (str, optional): If provided, the file paths that match this string (by regular expression matching) are excluded from reading and parsing.
encoding (str, optional): Text encoding to parse the CHAT data. The default value is "utf-8" for Unicode UTF-8.
parallel (bool, optional): If True (the default), CHAT reading and parsing is parallelized for speed-up, because in most cases multiple CHAT data files and/or strings are being handled. Under certain circumstances (e.g., your application is already parallelized and further parallelization from within PyLangAcq might be undesirable), consider setting this parameter to False.

Returns

pylangacq.Reader

classmethod from_strs(strs: List[str], ids: List[str] = None, parallel: bool = True) → pylangacq.chat.Reader

Instantiate a reader from in-memory CHAT data strings.

Parameters

strs (List[str]): List of CHAT data strings. The ordering of the strings determines that of the parsed CHAT data in the resulting reader.
ids (List[str], optional): List of identifiers. If not provided, random UUID strings are used. When file paths are referred to in other parts of this package, they mean these identifiers if you have instantiated the reader by this method.
parallel (bool, optional): If True (the default), CHAT reading and parsing is parallelized for speed-up, because in most cases multiple CHAT data files and/or strings are being handled. Under certain circumstances (e.g., your application is already parallelized and further parallelization from within PyLangAcq might be undesirable), consider setting this parameter to False.

Returns

pylangacq.Reader
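How identifiers stand in for file paths can be sketched as follows. `resolve_ids` is a hypothetical helper, not part of the library; the choice of `uuid.uuid4` and the length check are assumptions made for this illustration.

```python
import uuid
from typing import List, Optional


def resolve_ids(strs: List[str], ids: Optional[List[str]] = None) -> List[str]:
    """Return one identifier per CHAT string, generating UUIDs if needed."""
    if ids is None:
        # Assumed: one random UUID per in-memory CHAT string.
        return [str(uuid.uuid4()) for _ in strs]
    if len(ids) != len(strs):
        # Assumed check: every string needs exactly one identifier.
        raise ValueError("strs and ids must have the same length")
    return list(ids)


chat_strs = ["*CHI:\t你好 .", "*MOT:\t早晨 ."]
print(resolve_ids(chat_strs, ids=["session-1", "session-2"]))
# ['session-1', 'session-2']
print(len(resolve_ids(chat_strs)))
# 2
```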

classmethod from_zip(path: str, match: str = None, exclude: str = None, extension: str = '.cha', encoding: str = 'utf-8', parallel: bool = True, use_cached: bool = True, session: requests.Session = None) → pylangacq.chat.Reader

Instantiate a reader from a local or remote ZIP file.

If the input data is a remote ZIP file and you expect to call this method with the same path multiple times, the use_cached parameter controls whether previously downloaded data cached on disk is reused instead of being downloaded again. Because the upstream CHILDES / TalkBank data is updated in minor ways from time to time, pass use_cached=False when you need a fresh copy.

Parameters

path (str): Either a local file path or a URL (one that begins with "https://" or "http://") for a ZIP file containing CHAT data files. For instance, you can provide either a local path to a ZIP file downloaded from CHILDES, or simply a URL such as "https://childes.talkbank.org/data/Eng-NA/Brown.zip".
match (str, optional): If provided, only the file paths that match this string (by regular expression matching) are read and parsed. For example, to work with the American English dataset Brown (containing data for the children Adam, Eve, and Sarah), you can pass in "Eve" here to only handle the data for Eve, since the unzipped Brown data from CHILDES has a directory structure of Brown/Eve/xxx.cha for Eve’s data. If this parameter is not specified or None is passed in (the default), such file path filtering does not apply.
exclude (str, optional): If provided, the file paths that match this string (by regular expression matching) are excluded from reading and parsing.
extension (str, optional): File extension for CHAT data files. The default value is ".cha".
encoding (str, optional): Text encoding to parse the CHAT data. The default value is "utf-8" for Unicode UTF-8.
parallel (bool, optional): If True (the default), CHAT reading and parsing is parallelized for speed-up, because in most cases multiple CHAT data files and/or strings are being handled. Under certain circumstances (e.g., your application is already parallelized and further parallelization from within PyLangAcq might be undesirable), consider setting this parameter to False.
use_cached (bool, optional): If True (the default), and if the path is a URL for a remote ZIP archive, then CHAT reading attempts to use the previously downloaded data cached on disk. This setting allows you to call this function with the same URL repeatedly without hitting the CHILDES / TalkBank server more than once for the same data. Pass in False to force a new download; the upstream CHILDES / TalkBank data is updated in minor ways from time to time, e.g., for CHAT format, header/metadata information, updated annotations. See also the helper functions: pylangacq.chat.cached_data_info(), pylangacq.chat.remove_cached_data().
session (requests.Session, optional): If the path is a URL for a remote ZIP archive, data downloading is done with reasonable settings of retries and timeout by default, in order to be robust against intermittent network issues. If necessary, pass in your own instance of requests.Session to customize.

Returns

pylangacq.Reader
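The local-versus-remote distinction for path rests on the URL prefix. A minimal sketch of that check, with `is_remote_zip` as a hypothetical helper name (the library's own check may differ in detail):

```python
def is_remote_zip(path: str) -> bool:
    """Treat a path as remote if it begins with "https://" or "http://"."""
    return path.startswith(("https://", "http://"))


print(is_remote_zip("https://childes.talkbank.org/data/Eng-NA/Brown.zip"))  # True
print(is_remote_zip("data/Brown.zip"))                                      # False
```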

head(n: int = 5, participants=None, exclude=None)

Return the first several utterances.

Parameters

n (int, optional): The number of utterances to return. The default is 5.
participants (str or iterable of str, optional): Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.
exclude (str or iterable of str, optional): Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.

Returns

list of utterances

headers() → List[Dict]

Return the headers.

Returns

List[Dict]

info(verbose=False) → None

Print a summary of this Reader’s data.

Parameters

verbose (bool, optional): If True (default is False), show the details of all the files.

ipsyn()[source]: (Not implemented - the upstream ipsyn method works for English only.)

jyutping(participants=None, exclude=None, by_utterances=False, by_files=False) → Union[List[str], List[List[str]], List[List[List[str]]]]

Return the data in Jyutping romanization.

Parameters

participants (str or iterable of str, optional): Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.
exclude (str or iterable of str, optional): Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.
by_utterances (bool, optional): If True, the resulting objects are wrapped as a list at the utterance level. If False (the default), there is no utterance-level nesting.
by_files (bool, optional): If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), the per-file distinction is collapsed into a single result.

Returns

List[List[List[str]]] if both by_utterances and by_files are True
List[List[str]] if by_utterances is True and by_files is False
List[List[str]] if by_utterances is False and by_files is True
List[str] if both by_utterances and by_files are False

languages(by_files=False) → Union[Set[str], List[List[str]]]

Return the languages in the data.

Parameters

by_files (bool, optional): If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), the per-file distinction is collapsed into a single result.

Returns

Set[str] if by_files is False, otherwise List[List[str]]: When by_files is True, the ordering of languages given by the list indicates language dominance. Such ordering would not make sense when by_files is False, in which case the returned object is a set instead of a list.
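Why the collapsed result is a set can be seen in a small model. The language codes below are hypothetical sample values and `collapse_languages` is not a library function.

```python
from typing import List, Set

# Hypothetical per-file results: order within each list indicates dominance.
languages_by_file: List[List[str]] = [["yue", "eng"], ["eng", "yue"], ["yue"]]


def collapse_languages(by_file: List[List[str]]) -> Set[str]:
    """Merge per-file language lists; ordering is meaningless across files."""
    return {language for languages in by_file for language in languages}


print(collapse_languages(languages_by_file) == {"yue", "eng"})  # True
```

The first two files disagree about which language is dominant, so no single ordering could be reported once files are merged.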

mlu(participant='CHI') → List[float]

Return the mean lengths of utterance (MLU).

This method is equivalent to mlum().

Parameters

participant (str, optional): Participant of interest, which defaults to the typical use case of "CHI" for the target child.

Returns

List[float]

mlum(participant='CHI') → List[float]

Return the mean lengths of utterance by morphemes.

Parameters

participant (str, optional): Participant of interest, which defaults to the typical use case of "CHI" for the target child.

Returns

List[float]

mluw(participant='CHI', exclude_switch: bool = False) → List[float]

Return the mean lengths of utterance by words.

Parameters

participant (str, optional): Participant of interest, which defaults to the typical use case of "CHI" for the target child.
exclude_switch (bool, optional): If True, words with the suffix “@s” for switching to another language (not uncommon in code-mixing or multilingual acquisition) are excluded. The default is False.

Returns

List[float]
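The effect of exclude_switch on a word-based mean length can be sketched for a single file. This is a simplified model, not the library's computation: `mean_words_per_utterance` is hypothetical, the sample utterances are invented, and detecting switched words with `str.endswith("@s")` is an assumption based on the parameter description.

```python
from typing import List


def mean_words_per_utterance(
    utterances: List[List[str]], exclude_switch: bool = False
) -> float:
    """Average number of words per utterance for one file."""
    if exclude_switch:
        # Assumed: a code-switched word is marked by a trailing "@s".
        utterances = [
            [word for word in utt if not word.endswith("@s")]
            for utt in utterances
        ]
    if not utterances:
        return 0.0
    return sum(len(utt) for utt in utterances) / len(utterances)


utterances = [["我", "食", "apple@s"], ["好", "味", "呀"]]
print(mean_words_per_utterance(utterances))                       # 3.0
print(mean_words_per_utterance(utterances, exclude_switch=True))  # 2.5
```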

n_files() → int: Return the number of files.

participants(by_files=False) → Union[Set[str], List[Set[str]]]

Return the participants (e.g., CHI, MOT).

Parameters

by_files (bool, optional): If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), the per-file distinction is collapsed into a single result.

Returns

Set[str] if by_files is False, otherwise List[Set[str]]

pop() → pylangacq.chat.Reader

Drop the last data file from the reader and return it as a reader.

Returns

pylangacq.Reader

pop_left() → pylangacq.chat.Reader

Drop the first data file from the reader and return it as a reader.

Returns

pylangacq.Reader

search(*, onset=None, nucleus=None, coda=None, tone=None, initial=None, final=None, jyutping=None, character=None, pos=None, word_range=(0, 0), utterance_range=(0, 0), sent_range=(0, 0), by_tokens=True, by_utterances=False, tagged=None, sents=None, participants=None, exclude=None, by_files=False)

Search the data for the given criteria.

For examples, please see https://pycantonese.org/searches.html.

Parameters

onset (str, optional): Onset to search for. A regex is supported.
nucleus (str, optional): Nucleus to search for. A regex is supported.
coda (str, optional): Coda to search for. A regex is supported.
tone (str, optional): Tone to search for. A regex is supported.
initial (str, optional): Initial to search for. A regex is supported. An initial, a term more prevalent in traditional Chinese phonology, is the equivalent of an onset.
final (str, optional): Final to search for. A final, a term more prevalent in traditional Chinese phonology, is the equivalent of a nucleus plus a coda.
jyutping (str, optional): Jyutping romanization of one Cantonese character to search for. If the romanization contains more than one character, a ValueError is raised.
character (str, optional): One or more Cantonese characters (within a segmented word) to search for.
pos (str, optional): A part-of-speech tag to search for. A regex is supported.
word_range (Tuple[int, int], optional): Span of words to the left and right of a matching word to include in the output. The default is (0, 0) to disable a range. If utterance_range is used, word_range is ignored.
utterance_range (Tuple[int, int], optional): Span of utterances before and after an utterance containing a matching word to include in the output. If set to (0, 0) (the default), no utterance range output is generated. If utterance_range is used, word_range is ignored.
sent_range (Tuple[int, int], optional): [Deprecated; please use utterance_range instead]
by_tokens (bool, optional): If True (the default), words in the output are in the token form (i.e., with Jyutping and part-of-speech tags). Otherwise just words as text strings are returned.
by_utterances (bool, optional): If True (default is False), utterances containing matching words are returned. Otherwise, only matching words are returned.
tagged (bool, optional): [Deprecated; please use by_tokens instead]
sents (bool, optional): [Deprecated; please use by_utterances instead]
participants (str or iterable[str], optional): One or more participants to include in the search. If unspecified, all participants are included.
exclude (str or iterable[str], optional): One or more participants to exclude in the search. If unspecified, no participants are excluded.
by_files (bool, optional): If True (default: False), return data organized by the individual file paths.

Returns

list
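The meaning of word_range can be illustrated with a window around a matching word. The helper `window` and the sample utterance are hypothetical, and clipping the window at the edges of the utterance is an assumption of this sketch rather than documented behavior.

```python
from typing import List, Tuple


def window(
    words: List[str], index: int, word_range: Tuple[int, int] = (0, 0)
) -> List[str]:
    """Return the word at `index` plus its left and right context."""
    left, right = word_range
    start = max(0, index - left)  # do not run past the start of the utterance
    return words[start : index + right + 1]


words = ["我", "今日", "去", "學校", "返學"]
print(window(words, 2))          # ['去']
print(window(words, 2, (1, 2)))  # ['今日', '去', '學校', '返學']
print(window(words, 0, (2, 1)))  # ['我', '今日']
```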

sents(participants=None, exclude=None, by_files=False) → Union[List[List[str]], List[List[List[str]]]]

Return the sents.

Deprecated since version 0.13.0: Please use words() with by_utterances=True instead.

Parameters

participants (str or iterable of str, optional): Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.
exclude (str or iterable of str, optional): Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.
by_files (bool, optional): If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), the per-file distinction is collapsed into a single result.

Returns

List[List[str]] if by_files is False, otherwise List[List[List[str]]]

tagged_sents(participants=None, exclude=None, by_files=False) → Union[List[List[pylangacq.objects.Token]], List[List[List[pylangacq.objects.Token]]]]

Return the tagged sents.

Deprecated since version 0.13.0: Please use tokens() with by_utterances=True instead.

Parameters

participants (str or iterable of str, optional): Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.
exclude (str or iterable of str, optional): Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.
by_files (bool, optional): If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), the per-file distinction is collapsed into a single result.

Returns

List[List[Token]] if by_files is False,
otherwise List[List[List[Token]]]

tagged_words(participants=None, exclude=None, by_files=False) → Union[List[pylangacq.objects.Token], List[List[pylangacq.objects.Token]]]

Return the tagged words.

Deprecated since version 0.13.0: Please use tokens() with by_utterances=False instead.

Parameters

participants (str or iterable of str, optional): Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.
exclude (str or iterable of str, optional): Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.
by_files (bool, optional): If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), the per-file distinction is collapsed into a single result.

Returns

List[Token] if by_files is False, otherwise List[List[Token]]

tail(n: int = 5, participants=None, exclude=None)

Return the last several utterances.

Parameters

n (int, optional): The number of utterances to return. The default is 5.
participants (str or iterable of str, optional): Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.
exclude (str or iterable of str, optional): Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.

Returns

list of utterances

to_chat(path: str, is_dir: bool = False, filenames: Optional[Iterable[str]] = None, tabular: bool = True, encoding: str = 'utf-8') → None

Export to CHAT data files.

Parameters

path (str): The path to a file where you want to output the CHAT data, e.g., “data.cha”, “foo/bar/data.cha”.
is_dir (bool, optional): If True (default is False), then path is interpreted as a directory instead. The CHAT data is written to possibly multiple files under this directory. The number of files depends on how this reader object was created and can be checked by calling n_files().
filenames (Iterable[str], optional): Used only when is_dir is True. These are the filenames of the CHAT files to write. If None or not given, {0001.cha, 0002.cha, …} are used.
tabular (bool, optional): If True (the default), adjust spacing such that the three tiers of the utterance, %mor, and %gra are aligned in a tabular form. Note that such alignment would drop annotations (e.g., pauses) on the main utterance tier.
encoding (str, optional): Text encoding to output the CHAT data as. The default value is "utf-8" for Unicode UTF-8.

Raises

ValueError

If you attempt to output data to a single local file, but the CHAT data in this reader appears to be organized in multiple files.
If you attempt to output data to a directory while providing your own filenames, but the number of your filenames doesn’t match the number of CHAT files in this reader object.
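The default filenames and the filename-count check can be sketched as below. `resolve_filenames` is a hypothetical helper written for illustration; the error message is not the library's.

```python
from typing import Iterable, List, Optional


def resolve_filenames(
    n_files: int, filenames: Optional[Iterable[str]] = None
) -> List[str]:
    """Pick output filenames for a directory export of `n_files` CHAT files."""
    if filenames is None:
        # Default pattern from the parameter description: 0001.cha, 0002.cha, ...
        return [f"{i:04d}.cha" for i in range(1, n_files + 1)]
    names = list(filenames)
    if len(names) != n_files:
        raise ValueError(
            f"got {len(names)} filenames for {n_files} CHAT files"
        )
    return names


print(resolve_filenames(3))
# ['0001.cha', '0002.cha', '0003.cha']
print(resolve_filenames(2, ["eve1.cha", "eve2.cha"]))
# ['eve1.cha', 'eve2.cha']
```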

to_strs(tabular: bool = True) → Generator[str, None, None]

Yield CHAT data strings.

Note

The header information may not be completely reproduced in the output CHAT strings. Known issues all have to do with a header field used multiple times in the original CHAT data. For Date, only the first date of recording is retained in the output string. For all other multiply used header fields (e.g., Tape Location, Time Duration), only the last value in a given CHAT file is retained. Note that ID for participant information is not affected.

Parameters

tabular (bool, optional): If True (the default), adjust spacing so that the three tiers (the utterance, %mor, and %gra) are aligned in tabular form. Note that this alignment drops annotations (e.g., pauses) on the main utterance tier.

Yields

str: CHAT data string for one file.
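
The header behavior described in the note can be modeled as follows. This is a sketch of the documented outcome only, not of how the reader stores headers internally; `collapse_headers` is a hypothetical function and the sample header values are invented for illustration.

```python
from typing import Dict, Iterable, Tuple


def collapse_headers(headers: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Hypothetical model of the known issue: one value survives per repeated field."""
    kept: Dict[str, str] = {}
    for field, value in headers:
        if field == "Date" and field in kept:
            # For Date, the first value is the one retained.
            continue
        # For every other repeated field, the last value is the one retained.
        kept[field] = value
    return kept


# Invented sample headers, in the order they would appear in one CHAT file.
headers = [
    ("Date", "01-JAN-2000"),
    ("Date", "02-JAN-2000"),
    ("Tape Location", "side A"),
    ("Tape Location", "side B"),
]
print(collapse_headers(headers))
```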

tokens(participants=None, exclude=None, by_utterances=False, by_files=False) → Union[List[pylangacq.objects.Token], List[List[pylangacq.objects.Token]], List[List[List[pylangacq.objects.Token]]]]

Return the tokens.

Parameters

participants (str or iterable of str, optional): Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.
exclude (str or iterable of str, optional): Participants to exclude. You may pass in a string (e.g., "CHI" to leave out the target child and keep child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.
by_utterances (bool, optional): If True, the results are grouped into one list per utterance. If False (the default), there is no utterance-level grouping.
by_files (bool, optional): If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return a single result with the per-file distinction collapsed.

Returns

List[List[List[Token]]] if both by_utterances and by_files are True
List[List[Token]] if by_utterances is True and by_files is False
List[List[Token]] if by_utterances is False and by_files is True
List[Token] if both by_utterances and by_files are False

ttr(keep_case=True, participant='CHI') → List[float]

Return the type-token ratios (TTR).

Parameters

keep_case (bool, optional): If True (the default), case distinctions are kept, e.g., word tokens like “the” and “The” are treated as distinct. If False, all word tokens are forced to be in lowercase as a preprocessing step. CHAT data from CHILDES intentionally does not follow the orthographic convention of capitalizing the first letter of a sentence in the transcriptions (as would have been done in many European languages), and so leaving keep_case as True is appropriate in most cases.
participant (str, optional): Participant of interest, which defaults to the typical use case of "CHI" for the target child.

Returns

List[float]
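
As a reminder of what is being computed, a type-token ratio is the number of distinct word forms divided by the total number of word tokens. The sketch below works on a plain list of words and shows the effect of keep_case; `type_token_ratio` is a hypothetical helper, not a pycantonese function, and its handling of empty input is an assumption.

```python
from typing import List


def type_token_ratio(words: List[str], keep_case: bool = True) -> float:
    """Distinct word forms divided by total word tokens."""
    if not words:
        # Assumption for this sketch only: an empty input is rejected.
        raise ValueError("need at least one word")
    if not keep_case:
        # Lowercasing merges forms such as "the" and "The".
        words = [w.lower() for w in words]
    return len(set(words)) / len(words)


sample = ["the", "The", "dog", "the"]
print(type_token_ratio(sample))                   # 3 types / 4 tokens
print(type_token_ratio(sample, keep_case=False))  # 2 types / 4 tokens
```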

utterances(participants=None, exclude=None, by_files=False) → Union[List[pylangacq.objects.Utterance], List[List[pylangacq.objects.Utterance]]]

Return the utterances.

Parameters

participants (str or iterable of str, optional): Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.
exclude (str or iterable of str, optional): Participants to exclude. You may pass in a string (e.g., "CHI" to leave out the target child and keep child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.
by_files (bool, optional): If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return a single result with the per-file distinction collapsed.

Returns

List[Utterance] if by_files is False, otherwise List[List[Utterance]]
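
The interplay of participants and exclude can be pictured as a set operation over the participant codes found in the data. The following is a minimal sketch, assuming that codes are plain strings; `select_participants` is a hypothetical helper, and the exception type raised when both arguments are given is an assumption, since the documentation only states that they cannot be combined.

```python
from typing import Iterable, Optional, Set, Union

Spec = Optional[Union[str, Iterable[str]]]


def _as_set(spec: Union[str, Iterable[str]]) -> Set[str]:
    # A bare string names one participant; anything else is treated as a collection.
    return {spec} if isinstance(spec, str) else set(spec)


def select_participants(available: Set[str], participants: Spec = None, exclude: Spec = None) -> Set[str]:
    """Hypothetical helper: which participant codes survive the filter."""
    if participants is not None and exclude is not None:
        # Assumed exception type; the docs do not say which error is raised.
        raise TypeError("participants and exclude cannot be used together")
    if participants is not None:
        return available & _as_set(participants)
    if exclude is not None:
        return available - _as_set(exclude)
    return set(available)


codes = {"CHI", "MOT", "INV"}
print(sorted(select_participants(codes, participants="CHI")))
print(sorted(select_participants(codes, exclude="CHI")))
```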

word_frequencies(keep_case=True, participants=None, exclude=None, by_files=False) → Union[collections.Counter, List[collections.Counter]]

Return word frequencies.

Parameters

participants (str or iterable of str, optional): Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.
exclude (str or iterable of str, optional): Participants to exclude. You may pass in a string (e.g., "CHI" to leave out the target child and keep child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.
by_files (bool, optional): If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return a single result with the per-file distinction collapsed.
keep_case (bool, optional): If True (the default), case distinctions are kept, e.g., word tokens like “the” and “The” are treated as distinct. If False, all word tokens are forced to be in lowercase as a preprocessing step. CHAT data from CHILDES intentionally does not follow the orthographic convention of capitalizing the first letter of a sentence in the transcriptions (as would have been done in many European languages), and so leaving keep_case as True is appropriate in most cases.

Returns

collections.Counter if by_files is False,
otherwise List[collections.Counter]
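
The two return shapes can be illustrated with collections.Counter directly. This sketch assumes the words are already available as one list per file; `count_words` is a hypothetical helper and the sample words are invented.

```python
from collections import Counter
from typing import List, Union


def count_words(files: List[List[str]], by_files: bool = False) -> Union[Counter, List[Counter]]:
    """Hypothetical helper: one Counter per file, or a single merged Counter."""
    per_file = [Counter(words) for words in files]
    if by_files:
        return per_file
    merged: Counter = Counter()
    for counter in per_file:
        # Counter.update adds counts rather than replacing them.
        merged.update(counter)
    return merged


files = [["我", "食", "飯"], ["我", "飲", "茶"]]
print(count_words(files).most_common(1))
print(len(count_words(files, by_files=True)))
```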

word_ngrams(n, keep_case=True, participants=None, exclude=None, by_files=False) → Union[collections.Counter, List[collections.Counter]]

Return word ngrams.

Parameters

participants (str or iterable of str, optional): Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.
exclude (str or iterable of str, optional): Participants to exclude. You may pass in a string (e.g., "CHI" to leave out the target child and keep child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.
by_files (bool, optional): If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return a single result with the per-file distinction collapsed.
keep_case (bool, optional): If True (the default), case distinctions are kept, e.g., word tokens like “the” and “The” are treated as distinct. If False, all word tokens are forced to be in lowercase as a preprocessing step. CHAT data from CHILDES intentionally does not follow the orthographic convention of capitalizing the first letter of a sentence in the transcriptions (as would have been done in many European languages), and so leaving keep_case as True is appropriate in most cases.

Returns

collections.Counter if by_files is False,
otherwise List[collections.Counter]
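
For readers unfamiliar with n-gram counting, the idea can be sketched with a sliding window over each utterance. Two points here are assumptions rather than documented behavior: that n-grams are keyed by tuples of words, and that windows do not cross utterance boundaries. `count_ngrams` is a hypothetical helper and the sample utterances are invented.

```python
from collections import Counter
from typing import List


def count_ngrams(utterances: List[List[str]], n: int) -> Counter:
    """Hypothetical helper: count word n-grams within each utterance."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    counts: Counter = Counter()
    for words in utterances:
        # Slide a window of length n; utterances shorter than n contribute nothing.
        for start in range(len(words) - n + 1):
            counts[tuple(words[start:start + n])] += 1
    return counts


utterances = [["我", "食", "飯"], ["我", "食", "麵"]]
print(count_ngrams(utterances, 2).most_common(1))
```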

words(participants=None, exclude=None, by_utterances=False, by_files=False) → Union[List[str], List[List[str]], List[List[List[str]]]]

Return the words.

Parameters

participants (str or iterable of str, optional): Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.
exclude (str or iterable of str, optional): Participants to exclude. You may pass in a string (e.g., "CHI" to leave out the target child and keep child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.
by_utterances (bool, optional): If True, the results are grouped into one list per utterance. If False (the default), there is no utterance-level grouping.
by_files (bool, optional): If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return a single result with the per-file distinction collapsed.

Returns

List[List[List[str]]] if both by_utterances and by_files are True
List[List[str]] if by_utterances is True and by_files is False
List[List[str]] if by_utterances is False and by_files is True
List[str] if both by_utterances and by_files are False
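
The four return shapes listed above differ only in which levels of nesting are kept. The sketch below starts from fully nested data (files, then utterances, then words) and flattens it as each flag combination would; `arrange` is a hypothetical helper and the sample data is invented.

```python
from typing import List

# Invented sample: two files, the first with two utterances.
nested: List[List[List[str]]] = [
    [["你", "好"], ["早晨"]],
    [["食", "飯"]],
]


def arrange(data: List[List[List[str]]], by_utterances: bool = False, by_files: bool = False):
    """Hypothetical helper: keep or collapse the file and utterance levels."""
    if by_utterances and by_files:
        return data
    if by_utterances:
        # Drop the file level, keep one list per utterance.
        return [utterance for file in data for utterance in file]
    if by_files:
        # Drop the utterance level, keep one list per file.
        return [[word for utterance in file for word in utterance] for file in data]
    return [word for file in data for utterance in file for word in utterance]


print(arrange(nested))
print(arrange(nested, by_files=True))
```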

`Token`

class pycantonese.corpus.Token(word: str, pos: Optional[str], jyutping: Optional[str], mor: Optional[str], gloss: Optional[str], gra: Optional[pylangacq.objects.Gra])[source]

Token with attributes as parsed from a CHAT utterance.

Attributes

word (str): Word form of the token
pos (str): Part-of-speech tag
jyutping (str): Jyutping romanization
mor (str): Morphological information
gloss (str): Gloss in English
gra (Gra): Grammatical relation

Methods

to_gra_tier
to_mor_tier
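
Because every attribute except word is typed Optional in the class signature, code that consumes tokens should expect None values. The sketch below uses a stand-in NamedTuple with the same field names rather than the real class; `TokenSketch` and `romanized` are hypothetical, and the sample tags and glosses are invented for illustration.

```python
from typing import List, NamedTuple, Optional


class TokenSketch(NamedTuple):
    """Stand-in with the same field names as Token; not the library class."""
    word: str
    pos: Optional[str]
    jyutping: Optional[str]
    mor: Optional[str]
    gloss: Optional[str]
    gra: Optional[object]


def romanized(tokens: List[TokenSketch]) -> List[str]:
    # Skip tokens that carry no romanization, such as punctuation.
    return [t.jyutping for t in tokens if t.jyutping is not None]


tokens = [
    TokenSketch("你", "PRON", "nei5", None, "you", None),
    TokenSketch("好", "ADJ", "hou2", None, "good", None),
    TokenSketch("!", None, None, None, None, None),
]
print(romanized(tokens))
```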

`Jyutping`

class pycantonese.jyutping.Jyutping(onset: str, nucleus: str, coda: str, tone: str)[source]

Jyutping representation of a Chinese/Cantonese character.

Attributes

onset (str): Onset
nucleus (str): Nucleus
coda (str): Coda
tone (str): Tone

__eq__(other): Return self==value.

__hash__ = None

__init__(onset: str, nucleus: str, coda: str, tone: str) → None

__repr__(): Return repr(self).

__str__()[source]: Combine onset + nucleus + coda + tone.

property final: Return the final (= nucleus + coda).
