Parsing Cantonese Text
To take advantage of the myriad of functions offered by CHATReader,
the previous documentation pages (Corpus Data, Corpus Reader Methods, Corpus Search Queries) assume that
you have CHAT-formatted data that is already processed for word segmentation,
characters-to-Jyutping conversion, and part-of-speech tagging.
What if you have your own unprocessed Cantonese data?
This is where the parse_text() function comes in handy.
parse_text() takes raw Cantonese text as input
and returns a CHATReader object containing the processed data.
Input 1: A Plain String
If you have unprocessed Cantonese text (prose, conversational data, etc.),
then you can simply pass in a plain Python string to parse_text():
>>> import pycantonese
>>> data = "你食咗飯未呀?食咗喇!你聽日得唔得閒呀?"
>>> corpus = pycantonese.parse_text(data)
>>> corpus.head()
*X: 你 食 咗 飯 未 呀 ?
%mor: PRON|nei5 VERB|sik6 PART|zo2 NOUN|faan6 ADV|mei6 PART|aa4 ?
*X: 食 咗 喇 !
%mor: VERB|sik6 PART|zo2 PART|laa1 !
*X: 你 聽日 得 唔 得閒 呀 ?
%mor: PRON|nei5 ADV|ting1jat6 VERB|dak1 ADV|m4 ADJ|dak1haan4 PART|aa4 ?
Note:
Because the output of parse_text() is a CHATReader object, all methods and attributes for CHATReader will work (words(), tokens(), utterances(), search(), etc.).
Since CHAT is designed for conversational data and your input data is a string, parse_text() attempts simple utterance segmentation (by the Chinese full-width punctuation marks {",", "!", "。"} as well as the end-of-line character "\n").
By default, a dummy participant "X" is assigned to each utterance. To provide your own participant, pass it to the participant keyword argument of parse_text().
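To make the utterance segmentation described in the note more concrete, here is a minimal sketch of punctuation-based splitting in plain Python. It is not the code that parse_text() runs: the function name split_utterances is hypothetical, and the mark set (which adds question marks and half-width forms) is an assumption made for illustration.

```python
import re

# Assumed set of utterance-final marks, in full-width and half-width forms.
# The marks that parse_text() itself uses may differ.
MARKS = "，！。？,!?"


def split_utterances(text, marks=MARKS):
    """Split text after each mark in `marks` and at each newline."""
    mark_class = re.escape(marks)
    # A run of non-mark, non-newline characters, plus an optional closing mark.
    pattern = f"[^{mark_class}\n]+[{mark_class}]?"
    pieces = (m.group().strip() for m in re.finditer(pattern, text))
    # Drop pieces that are empty after stripping whitespace.
    return [piece for piece in pieces if piece]


utterances = split_utterances("你食咗飯未呀?食咗喇!\n你聽日得唔得閒呀?")
```

Each piece keeps its closing punctuation mark, which matches the way the marks appear at the end of each utterance in the head() output above.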
Since the input string data is a vanilla Python string,
we can also pipe raw Cantonese text from a local file into the parse_text() function:
import pycantonese
# Suppose you have Cantonese text in data.txt.
with open("data.txt", encoding="utf-8") as f:
    corpus = pycantonese.parse_text(f.read())
Input 2: A List of Strings
If you want to control utterance segmentation on your own,
you can provide parse_text() with a list of strings instead of a single string.
Each string in the list will be treated as an utterance:
>>> import pycantonese
>>> data = ["你食咗飯未呀?", "食咗喇!你聽日得唔得閒呀?"]
>>> corpus = pycantonese.parse_text(data)
>>> corpus.head()
*X: 你 食 咗 飯 未 呀 ?
%mor: PRON|nei5 VERB|sik6 PART|zo2 NOUN|faan6 ADV|mei6 PART|aa4 ?
*X: 食 咗 喇 ! 你 聽日 得 唔 得閒 呀 ?
%mor: VERB|sik6 PART|zo2 PART|laa1 ! PRON|nei5 ADV|ting1jat6 VERB|dak1 ADV|m4 ADJ|dak1haan4 PART|aa4 ?
See how the input "食咗喇!你聽日得唔得閒呀?" was treated as a single utterance,
with no utterance segmentation at the exclamation point "!" in the middle.
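One common way to end up with such a list is a text file that holds one utterance per line. The sketch below, in plain Python, reads such a file into a list of strings. The helper name read_utterances and the file name utterances.txt are hypothetical, and a temporary file stands in for your own data.

```python
import os
import tempfile


def read_utterances(path):
    """Return the non-empty lines of a UTF-8 text file, one utterance per line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Write a small sample file so that the example is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "utterances.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("你食咗飯未呀?\n\n食咗喇!你聽日得唔得閒呀?\n")

utterances = read_utterances(path)
# The list can then be passed on: pycantonese.parse_text(utterances)
```

Blank lines are skipped, so empty strings are not passed on as utterances.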
Input 3: A List of Tuples of Strings
If your data has multiple participants (e.g., a dialog, a play or drama script)
and you would like to encode such participant information for downstream analysis,
then you can provide parse_text() with a list of tuples of strings.
In each tuple, the first element is the participant,
and the second one is the unparsed utterance string:
>>> import pycantonese
>>> data = [
...     ("小麗", "你食咗飯未呀?"),
...     ("小怡", "食咗喇!你聽日得唔得閒呀?"),
... ]
>>> corpus = pycantonese.parse_text(data)
>>> corpus.head()
*小麗: 你 食 咗 飯 未 呀 ?
%mor: PRON|nei5 VERB|sik6 PART|zo2 NOUN|faan6 ADV|mei6 PART|aa4 ?
*小怡: 食 咗 喇 ! 你 聽日 得 唔 得閒 呀 ?
%mor: VERB|sik6 PART|zo2 PART|laa1 ! PRON|nei5 ADV|ting1jat6 VERB|dak1 ADV|m4 ADJ|dak1haan4 PART|aa4 ?
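If your dialog is stored as plain text with one turn per line, a small helper can build the list of tuples for you. This is only a sketch under an assumed layout of "participant: utterance" on each line; the function name parse_script and the separator are hypothetical choices, not part of PyCantonese.

```python
def parse_script(text, sep=":"):
    """Turn 'participant: utterance' lines into (participant, utterance) tuples."""
    turns = []
    for line in text.splitlines():
        # Split on the first separator only, so later colons stay in the utterance.
        participant, found, utterance = line.partition(sep)
        # Lines without a separator, or with an empty side, are skipped.
        if found and participant.strip() and utterance.strip():
            turns.append((participant.strip(), utterance.strip()))
    return turns


script = """小麗: 你食咗飯未呀?
小怡: 食咗喇!你聽日得唔得閒呀?
"""
data = parse_script(script)
# `data` now has the list-of-tuples shape that parse_text() accepts.
```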
Customizing Word Segmentation
parse_text() has an optional argument called segment_kwargs.
You can pass in a dictionary here to customize the behavior of word segmentation.
The key-value pairs in this dictionary are passed as keyword arguments to the underlying segment() function.
>>> import pycantonese
>>> from pycantonese.word_segmentation import Segmenter
>>> # The ``Segmenter`` class can take an "allow" or "disallow" list of words.
>>> # The example below shows the use of an "allow" list that happens to be
>>> # a hard-coded set of strings (with only one string: ``"得唔得閒"``).
>>> # You can create your own allow/disallow list so long as the list is a container
>>> # of strings (e.g., from memory, from a local file).
>>> my_segmenter = Segmenter(allow={"得唔得閒"})
>>> data = [
...     ("小麗", "你食咗飯未呀?"),
...     ("小明", "食咗喇!你聽日得唔得閒呀?"),
... ]
>>> # The pycantonese.segment function takes the `cls` kwarg for a custom segmenter,
>>> # which is why we can pass in ``{"cls": my_segmenter}`` to ``segment_kwargs``.
>>> corpus = pycantonese.parse_text(data, segment_kwargs={"cls": my_segmenter})
>>> corpus.head()
*小麗: 你 食 咗 飯 未 呀 ?
%mor: PRON|nei5 VERB|sik6 PART|zo2 NOUN|faan6 ADV|mei6 PART|aa4 ?
*小明: 食 咗 喇 ! 你 聽日 得唔得閒 呀 ?
%mor: VERB|sik6 PART|zo2 PART|laa1 ! PRON|nei5 ADV|ting1jat6 VERB|dak1m4dak1haan4 PART|aa4 ?
Note the difference in the way "得唔得閒" is segmented between here and previous examples.
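As the comments in the example mention, the allow list can also come from a local file. The following sketch assumes a UTF-8 file with one word per line. The helper names load_word_list and build_segmenter and the file name allow.txt are hypothetical, and a temporary file stands in for your own word list.

```python
import os
import tempfile


def load_word_list(path):
    """Read one word per line from a UTF-8 file into a set of strings."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def build_segmenter(path):
    """Create a Segmenter whose allow list comes from the file at `path`."""
    # Imported here so that load_word_list() works without PyCantonese installed.
    from pycantonese.word_segmentation import Segmenter

    return Segmenter(allow=load_word_list(path))


# Write a small sample word list so that the example is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "allow.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("得唔得閒\n聽日\n")

allow_words = load_word_list(path)
```

The segmenter returned by build_segmenter(path) could then be passed along as segment_kwargs={"cls": ...}, in the same way as my_segmenter in the example above.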
Customizing Part-of-Speech Tagging
parse_text() has an optional argument called pos_tag_kwargs.
You can pass in a dictionary here to customize the behavior of part-of-speech tagging.
The key-value pairs in this dictionary are passed as keyword arguments to the underlying pos_tag() function.
>>> import pycantonese
>>> data = [
...     ("小麗", "你食咗飯未呀?"),
...     ("小明", "食咗喇!你聽日得唔得閒呀?"),
... ]
>>> corpus = pycantonese.parse_text(data, pos_tag_kwargs={"tagset": "hkcancor"})
>>> corpus.head()
*小麗: 你 食 咗 飯 未 呀 ?
%mor: R|nei5 V|sik6 U|zo2 N|faan6 D|mei6 Y|aa4 ?
*小明: 食 咗 喇 ! 你 聽日 得 唔 得閒 呀 ?
%mor: V|sik6 U|zo2 Y|laa1 ! R|nei5 T|ting1jat6 V|dak1 D|m4 A|dak1haan4 Y|aa4 ?
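Both segment_kwargs and pos_tag_kwargs follow the same pattern: a dictionary whose items are forwarded as keyword arguments to an underlying function. The toy sketch below illustrates that pattern only. The functions toy_segment, toy_pos_tag, and toy_parse_text are hypothetical stand-ins, not PyCantonese code, and they do no real segmentation or tagging.

```python
def toy_segment(text, sep=" "):
    """Stand-in segmenter: split the text on `sep`."""
    return text.split(sep)


def toy_pos_tag(words, default_tag="X"):
    """Stand-in tagger: give every word the same tag."""
    return [(word, default_tag) for word in words]


def toy_parse_text(text, segment_kwargs=None, pos_tag_kwargs=None):
    """Forward each dictionary to its underlying function as keyword arguments."""
    # `or {}` lets callers leave either dictionary out entirely.
    words = toy_segment(text, **(segment_kwargs or {}))
    return toy_pos_tag(words, **(pos_tag_kwargs or {}))


default = toy_parse_text("a b")
custom = toy_parse_text(
    "a-b",
    segment_kwargs={"sep": "-"},
    pos_tag_kwargs={"default_tag": "N"},
)
```

Seen this way, a key in either dictionary only makes sense if the underlying function accepts a keyword argument of that name, such as cls for segment() and tagset for pos_tag() in the examples above.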
Outputting CHAT Data
Once you have created a CHATReader object using your own data,
you may want to export the CHAT-formatted data to a local file.
This way, you can more easily share the processed data with your colleagues,
reload the data (see Corpus Data) for further processing and analysis in your workflow,
and so forth.
With a CHATReader object, simply call the to_chat() method with a local file path.
file_path = "result.cha"
corpus.to_chat(file_path)
# If you're running code on Google Colab,
# you can download the file like this:
from google.colab import files
files.download(file_path)
More Customization
Under the hood, parse_text() calls the existing functions from PyCantonese.
While parse_text() is designed to cover the basic use cases
with limited customization, a more custom workflow may require you to put
the various pieces together in your own way.
Please see the individual documentation pages for details
(Jyutping Romanization, Word Segmentation, Part-of-Speech Tagging).
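As a starting point for such a workflow, the sketch below chains word segmentation and part-of-speech tagging by hand. It assumes that pycantonese.segment() returns a list of words and that pycantonese.pos_tag() accepts that list together with a tagset keyword argument; the wrapper name segment_and_tag is hypothetical. Please check the individual documentation pages for the exact signatures in your installed version.

```python
def segment_and_tag(text, tagset="universal"):
    """Segment `text` into words, then tag each word with a part of speech."""
    # Imported lazily so that defining this helper does not require PyCantonese.
    import pycantonese

    words = pycantonese.segment(text)
    return pycantonese.pos_tag(words, tagset=tagset)
```

Calling segment_and_tag("你食咗飯未呀?", tagset="hkcancor") would then give you the tagged words directly, without building a CHATReader object, so you can insert your own processing between the two steps.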