Corpus Reader Methods
After you have created a corpus reader (see Corpus Data), the headers, transcriptions, and annotations are all accessible through the methods of the corpus reader object.
Let’s say we have a corpus reader with the built-in HKCanCor data.
>>> import pycantonese
>>> corpus = pycantonese.hkcancor()
Headers
A CHAT data file typically has metadata around the top of the file,
with lines that begin with @.
The metadata include participants’ demographics (age, gender, etc),
date of recording, and languages used in the data.
Specifically for HKCanCor, the participants in all the data files are anonymous.
In PyCantonese’s rendition of HKCanCor,
their names are simply placeholders such as A, B, etc, and their
corresponding three-letter codes are XXA, XXB, etc.
In contrast, many CHILDES and TalkBank datasets have their participants identified.
By convention, the target child’s code is CHI, the child’s mother’s MOT,
and the child’s father’s FAT.
Since PyCantonese uses PyLangAcq to parse CHAT data files, the way in which header information is accessed is identical between the two packages. Please see PyLangAcq’s documentation on headers.
To see how a header from HKCanCor translates
to its representation in PyCantonese,
here is the header from FC-001_v2.cha, the first (by filename) of the 58 CHAT files:
@UTF8
@Begin
@Languages: yue , eng
@Participants: XXA A Adult , XXB B Adult
@ID: yue , eng|HKCanCor|XXA|34;|female|||Adult||origin:HK|
@ID: yue , eng|HKCanCor|XXB|37;|female|||Adult||origin:HK|
@Date: 30-APR-1997
@Tape Number: 001
In this example, this recording session was between two Hong Kong female speakers
(ages 34 and 37), recorded on April 30th, 1997. The languages in this data file
are both Cantonese and English (in that order of usage frequency;
the ordering in yue , eng is meaningful).
Through the corpus reader object corpus we’ve just created,
we see the same information by calling the method headers()
(which returns a list of dicts; [0] gets the first dict, which
corresponds to FC-001_v2.cha):
>>> corpus.headers()[0]
{'UTF8': '',
'Languages': ['yue', 'eng'],
'Participants': {'XXA': {'name': 'A',
'language': 'yue , eng',
'corpus': 'HKCanCor',
'age': '34;',
'sex': 'female',
'group': '',
'ses': '',
'role': 'Adult',
'education': '',
'custom': 'origin:HK'},
'XXB': {'name': 'B',
'language': 'yue , eng',
'corpus': 'HKCanCor',
'age': '37;',
'sex': 'female',
'group': '',
'ses': '',
'role': 'Adult',
'education': '',
'custom': 'origin:HK'}},
'Date': {datetime.date(1997, 4, 30)},
'Tape Number': '001'}
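The correspondence between a raw @ID line and the per-participant dict above can be sketched with the standard library alone. This is only an illustration, not how PyLangAcq actually parses headers: the helper name parse_id_line is hypothetical, and the field order is read off the output shown above.

```python
# Field names in the order they appear in an @ID line, read off the
# headers() output above. parse_id_line is a hypothetical helper, not part
# of PyCantonese or PyLangAcq.
ID_FIELDS = ("language", "corpus", "code", "age", "sex",
             "group", "ses", "role", "education", "custom")


def parse_id_line(line: str) -> dict:
    """Split one @ID header line into a dict keyed by field name."""
    prefix = "@ID:"
    if not line.startswith(prefix):
        raise ValueError(f"not an @ID line: {line!r}")
    values = line[len(prefix):].strip().split("|")
    # zip stops after the ten named fields, which discards the empty
    # string left by the trailing "|".
    return dict(zip(ID_FIELDS, (value.strip() for value in values)))


fields = parse_id_line("@ID: yue , eng|HKCanCor|XXA|34;|female|||Adult||origin:HK|")
print(fields["code"], fields["age"], fields["sex"], fields["custom"])
```

Empty fields such as group and ses come out as empty strings, which matches the empty values in the dict returned by headers().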
Here are the currently implemented methods for header information:
| ages() | Return the ages of the given participant in the data. |
| dates_of_recording() | Return the dates of recording. |
| headers() | Return the headers. |
| languages() | Return the languages in the data. |
| participants() | Return the participants (e.g., CHI, MOT). |
Whenever it is appropriate, these header information methods, as well as
the data access methods to be introduced below, have optional arguments
to (i) control the output data structure (by_utterances and by_files)
and (ii) filter by participants (participants and exclude).
The methods’ hyperlinks point to more detailed documentation.
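What these two kinds of arguments do can be pictured with a toy example in plain Python. The function select_words below is a hypothetical stand-in that only mimics the idea of participants, exclude, and by_utterances; it is not the library's implementation, and the real methods may accept other argument types. The sample data are a few utterances from FC-001_v2.cha.

```python
# Toy utterances as (participant, words) pairs.
# select_words is a hypothetical helper for illustration only.
UTTERANCES = [
    ("XXA", ["你", "老公", "有冇", "平", "機票", "啊", "?"]),
    ("XXB", ["而家", "旺", "-", "."]),
    ("XXA", ["冇得", "去", "嗱", "."]),
]


def select_words(utterances, participants=None, exclude=None, by_utterances=False):
    """Filter utterances by participant, then choose the output shape."""
    kept = [
        words
        for participant, words in utterances
        if (participants is None or participant in participants)
        and (exclude is None or participant not in exclude)
    ]
    if by_utterances:
        return kept  # one inner list per utterance
    return [word for words in kept for word in words]  # one flat list


flat = select_words(UTTERANCES, participants={"XXA"})
nested = select_words(UTTERANCES, exclude={"XXA"}, by_utterances=True)
print(flat)
print(nested)
```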
Transcriptions and Annotations
A PyCantonese corpus reader is an instance of the CHATReader class.
While this class inherits the CHAT handling capabilities from the underlying
PyLangAcq package, CHATReader has additional functionality
to deal with Cantonese-specific elements,
particularly Jyutping romanization and Chinese characters.
CHATReader has convenience methods to give you an overview
of the data in the reader.
| info() | Print a summary of this Reader's data. |
| head() | Return the first several utterances. |
| tail() | Return the last several utterances. |
>>> corpus.info()
58 files
16162 utterances
153654 words
Utterance Count Word Count File Path
-- ----------------- ------------ --------------
#1 245 1998 FC-001_v2.cha
#2 134 2581 FC-005a_v2.cha
#3 172 2721 FC-005b_v2.cha
#4 111 1208 FC-009b_v.cha
#5 139 1308 FC-011_v.cha
...
(set `verbose` to True for all the files)
>>> corpus.head()
*XXA: 喂 遲 啲 去 唔 去 旅行 啊 ?
%mor: E|wai3 A|ci4 U|di1 V|heoi3 D|m4 V|heoi3 VN|leoi5hang4 Y|aa3 ?
*XXA: 你 老公 有冇 平 機票 啊 ?
%mor: R|nei5 N|lou5gung1 V1|jau5mou5 A|peng4 N|gei1piu3 Y|aa3 ?
*XXB: 平 機票 要 淡季 先 有得 平 𡃉 喎 .
%mor: A|peng4 N|gei1piu3 VU|jiu3 AN|daam6gwai3 D|sin1 VU|jau5dak1 A|peng4 Y|gaa3 Y|wo3 .
*XXB: 而家 旺 - .
%mor: T|ji4gaa1 A|wong6 -| .
*XXA: 冇得 去 嗱 .
%mor: VU|mou5dak1 V|heoi3 Y|laa4 .
Here are the major CHATReader methods to access data
at different levels of data structure:
| words() | Return the words. |
| tokens() | Return the tokens. |
| utterances() | Return the utterances. |
Words are the usual text strings. Think of tokens as words but with annotations (part-of-speech tags, morphological information, etc). An utterance is a list of tokens plus associated information (the participant of the utterance, time markers if there are associated audio-visual materials, etc).
>>> corpus.words()[:10]
['喂', '遲', '啲', '去', '唔', '去', '旅行', '啊', '?', '你']
>>>
>>> corpus.tokens()[:10]
[Token(word='喂', pos='E', jyutping='wai3', mor=None, gloss=None, gra=None),
Token(word='遲', pos='A', jyutping='ci4', mor=None, gloss=None, gra=None),
Token(word='啲', pos='U', jyutping='di1', mor=None, gloss=None, gra=None),
Token(word='去', pos='V', jyutping='heoi3', mor=None, gloss=None, gra=None),
Token(word='唔', pos='D', jyutping='m4', mor=None, gloss=None, gra=None),
Token(word='去', pos='V', jyutping='heoi3', mor=None, gloss=None, gra=None),
Token(word='旅行', pos='VN', jyutping='leoi5hang4', mor=None, gloss=None, gra=None),
Token(word='啊', pos='Y', jyutping='aa3', mor=None, gloss=None, gra=None),
Token(word='?', pos='?', jyutping=None, mor=None, gloss=None, gra=None),
Token(word='你', pos='R', jyutping='nei5', mor=None, gloss=None, gra=None)]
>>>
>>> corpus.utterances()[:1]
[Utterance(participant='XXA',
tokens=[Token(word='喂', pos='E', jyutping='wai3', mor=None, gloss=None, gra=None),
Token(word='遲', pos='A', jyutping='ci4', mor=None, gloss=None, gra=None),
Token(word='啲', pos='U', jyutping='di1', mor=None, gloss=None, gra=None),
Token(word='去', pos='V', jyutping='heoi3', mor=None, gloss=None, gra=None),
Token(word='唔', pos='D', jyutping='m4', mor=None, gloss=None, gra=None),
Token(word='去', pos='V', jyutping='heoi3', mor=None, gloss=None, gra=None),
Token(word='旅行', pos='VN', jyutping='leoi5hang4', mor=None, gloss=None, gra=None),
Token(word='啊', pos='Y', jyutping='aa3', mor=None, gloss=None, gra=None),
Token(word='?', pos='?', jyutping=None, mor=None, gloss=None, gra=None)],
time_marks=None,
tiers={'XXA': '喂 遲 啲 去 唔 去 旅行 啊 ?',
'%mor': 'e|wai3 a|ci4 u|di1 v|heoi3 d|m4 v|heoi3 vn|leoi5hang4 y|aa3 ?'})]
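The relationship among the three levels can be sketched with plain dataclasses. SketchToken and SketchUtterance below are simplified, hypothetical stand-ins for illustration; the real Token and Utterance objects carry more fields, as the output above shows.

```python
from dataclasses import dataclass
from typing import List, Optional


# Simplified stand-ins for illustration only; the real classes also have
# fields such as mor, gloss, gra, time_marks, and tiers.
@dataclass
class SketchToken:
    word: str
    pos: Optional[str] = None
    jyutping: Optional[str] = None


@dataclass
class SketchUtterance:
    participant: str
    tokens: List[SketchToken]


utterance = SketchUtterance(
    participant="XXA",
    tokens=[
        SketchToken("旅行", "VN", "leoi5hang4"),
        SketchToken("啊", "Y", "aa3"),
        SketchToken("?", "?"),
    ],
)

# Words are what remains when the annotations are dropped from the tokens.
words = [token.word for token in utterance.tokens]
print(words)
```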
PyCantonese has an augmented representation of tokens, where Jyutping romanization and glosses have their own dedicated attributes.
Jyutping Romanization
Tokens, as annotated words, are instances of the Token class.
A Token instance has the PyCantonese-specific attribute jyutping
to accommodate Jyutping romanization.
To illustrate, below is the first utterance in FC-001_v2.cha,
where Jyutping romanization is found in the %mor tier:
*XXA: 喂 遲 啲 去 唔 去 旅行 啊 ?
%mor: e|wai3 a|ci4 u|di1 v|heoi3 d|m4 v|heoi3 vn|leoi5hang4 y|aa3 ?
Here are the corresponding tokens from PyCantonese,
where the data in CHAT format has been parsed into Token objects,
with the attribute jyutping storing Jyutping romanization:
>>> some_tokens = corpus.tokens(by_utterances=True)[0]
>>> some_tokens
[Token(word='喂', pos='E', jyutping='wai3', mor=None, gloss=None, gra=None),
Token(word='遲', pos='A', jyutping='ci4', mor=None, gloss=None, gra=None),
Token(word='啲', pos='U', jyutping='di1', mor=None, gloss=None, gra=None),
Token(word='去', pos='V', jyutping='heoi3', mor=None, gloss=None, gra=None),
Token(word='唔', pos='D', jyutping='m4', mor=None, gloss=None, gra=None),
Token(word='去', pos='V', jyutping='heoi3', mor=None, gloss=None, gra=None),
Token(word='旅行', pos='VN', jyutping='leoi5hang4', mor=None, gloss=None, gra=None),
Token(word='啊', pos='Y', jyutping='aa3', mor=None, gloss=None, gra=None),
Token(word='?', pos='?', jyutping=None, mor=None, gloss=None, gra=None)]
>>> for token in some_tokens:
... print(token.jyutping)
...
wai3
ci4
di1
heoi3
m4
heoi3
leoi5hang4
aa3
None
Given the ubiquitous status of Jyutping in the study of Cantonese,
the jyutping() method is also defined for convenience:
>>> corpus.jyutping(by_utterances=True)[0]
['wai3', 'ci4', 'di1', 'heoi3', 'm4', 'heoi3', 'leoi5hang4', 'aa3', None]
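How an HKCanCor %mor tier maps onto the pos and jyutping attributes can be approximated in a few lines. This is a sketch for illustration only: parse_mor_item is a hypothetical helper, the uppercasing of the tag is inferred from the Token output shown above, and the actual parsing inside PyCantonese may differ.

```python
from typing import Optional, Tuple


def parse_mor_item(item: str) -> Tuple[str, Optional[str]]:
    """Split one %mor item such as 'vn|leoi5hang4' into (pos, jyutping)."""
    if "|" not in item:
        # Punctuation such as "?" carries no romanization.
        return item, None
    pos, jyutping = item.split("|", 1)
    # An empty romanization (nothing after "|") is reported as None.
    return pos.upper(), jyutping or None


mor_tier = "e|wai3 a|ci4 u|di1 v|heoi3 d|m4 v|heoi3 vn|leoi5hang4 y|aa3 ?"
parsed = [parse_mor_item(item) for item in mor_tier.split()]
jyutping = [jp for _, jp in parsed]
print(jyutping)
```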
For further processing of Jyutping romanization, please see the Jyutping Romanization page.
Chinese Characters
Corpus data in the CHAT format is word-segmented,
and the same word segmentation is preserved in the output of
the CHATReader methods words(), tokens(), and utterances().
For Cantonese data, a (segmented) word can be, say, 廣東話 (“Cantonese”) with
three Chinese characters.
To work with data at the character level, characters() is available:
>>> corpus.characters(by_utterances=True)[0]
['喂', '遲', '啲', '去', '唔', '去', '旅', '行', '啊', '?']
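Conceptually, going from segmented words to characters amounts to flattening each word into its individual characters. The plain-Python sketch below reproduces the output above from the word list of the same utterance; it illustrates the idea and is not necessarily how characters() is implemented.

```python
# Segmented words of the first utterance in FC-001_v2.cha.
words = ["喂", "遲", "啲", "去", "唔", "去", "旅行", "啊", "?"]

# Iterating over a Python str yields one character (code point) at a time,
# so the two-character word 旅行 contributes two items.
characters = [character for word in words for character in word]
print(characters)
```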
If you independently have Cantonese data in Chinese characters, PyCantonese has tools for word segmentation and part-of-speech tagging.
Word Frequencies and Ngrams
For word counts in various flavors, use the methods
word_frequencies() and word_ngrams():
>>> word_freq = corpus.word_frequencies() # A collections.Counter object
>>> word_freq.most_common(10)
[('.', 13251),
(',', 9282),
('係', 5019),
('啊', 4110),
('?', 2911),
('我', 2755),
('噉', 2741),
('呢', 2734),
('你', 2570),
('佢', 2259)]
>>>
>>> trigrams = corpus.word_ngrams(3) # A collections.Counter object
>>> trigrams.most_common(10)
[(('係', '啊', '.'), 527),
((',', '誒', ','), 520),
(('呢', ',', '就'), 219),
(('係', '啊', ','), 209),
(('係', '囖', '.'), 202),
(('吖', '嗎', '.'), 202),
(('𡃉', '喎', '.'), 186),
(('𠺢', '嗎', '.'), 167),
(('係', '喇', '.'), 140),
(('係', '喇', ','), 134)]
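Since both methods return collections.Counter objects, the underlying counting can be sketched on a handful of utterances with the standard library. The helper ngrams is hypothetical, and counting n-grams within each utterance (rather than across utterance boundaries) is an assumption of this sketch, not a statement about how word_ngrams() behaves.

```python
from collections import Counter

# The first three utterances of FC-001_v2.cha, as lists of words.
utterances = [
    ["喂", "遲", "啲", "去", "唔", "去", "旅行", "啊", "?"],
    ["你", "老公", "有冇", "平", "機票", "啊", "?"],
    ["平", "機票", "要", "淡季", "先", "有得", "平", "𡃉", "喎", "."],
]


def ngrams(words, n):
    """Yield successive n-word tuples from a list of words."""
    return zip(*(words[i:] for i in range(n)))


# Word frequencies: count every word across all utterances.
word_freq = Counter(word for utterance in utterances for word in utterance)

# Bigram frequencies: n-grams are taken within each utterance (assumption).
bigrams = Counter(gram for utterance in utterances for gram in ngrams(utterance, 2))

print(word_freq.most_common(1))
print(bigrams[("平", "機票")])
```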