A basic part-of-speech tagger is provided by
pos_tag(), which takes a segmented phrase or sentence as its input:
>>> import pycantonese
>>> unsegmented = '我噚日買嗰對鞋。'  # I bought that pair of shoes yesterday.
>>> segmented = pycantonese.segment(unsegmented)
>>> segmented
['我', '噚日', '買', '嗰', '對', '鞋', '。']
>>> pycantonese.pos_tag(segmented)
[('我', 'PRON'), ('噚日', 'ADV'), ('買', 'VERB'), ('嗰', 'PRON'), ('對', 'NOUN'), ('鞋', 'NOUN'), ('。', 'PUNCT')]
The part-of-speech tagger uses an averaged perceptron model trained on HKCanCor.
HKCanCor has already been annotated for part-of-speech tags,
with a tagset of over 100 tags
(46 of which are documented).
pos_tag() maps the HKCanCor tagset to the
Universal Dependencies v2 tagset
(17 tags),
to facilitate cross-linguistic natural language processing work.
If you would like the original HKCanCor tagset instead,
pos_tag() accepts the keyword argument tagset:
>>> pycantonese.pos_tag(segmented, tagset="hkcancor")
[('我', 'R'), ('噚日', 'T'), ('買', 'V'), ('嗰', 'R'), ('對', 'Q'), ('鞋', 'N'), ('。', '。')]
The helper function hkcancor_to_ud()
exposes the tagset mapping from HKCanCor to Universal Dependencies.
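The shape of this mapping can be illustrated with a plain dictionary reconstructed from the tag pairs in the two examples above. This is only an illustrative slice, not the library's full mapping, which covers the entire HKCanCor tagset:

```python
# A small slice of an HKCanCor -> Universal Dependencies tag mapping,
# reconstructed from the two tagged examples above (illustrative only).
HKCANCOR_TO_UD_SAMPLE = {
    "R": "PRON",    # pronoun (我, 嗰)
    "T": "ADV",     # time word (噚日)
    "V": "VERB",    # verb (買)
    "Q": "NOUN",    # classifier (對)
    "N": "NOUN",    # noun (鞋)
    "。": "PUNCT",  # punctuation marks are tagged as themselves in HKCanCor
}

def to_ud(hkcancor_tags):
    """Map HKCanCor tags to UD tags, falling back to 'X' (other)."""
    return [HKCANCOR_TO_UD_SAMPLE.get(tag, "X") for tag in hkcancor_tags]

print(to_ud(["R", "T", "V", "R", "Q", "N", "。"]))
# ['PRON', 'ADV', 'VERB', 'PRON', 'NOUN', 'NOUN', 'PUNCT']
```

Note that the mapping is many-to-one: UD's 17 coarse tags collapse distinctions that HKCanCor makes (e.g., both classifiers and nouns map to NOUN).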
Due to the statistical nature of part-of-speech tagging,
the quality of results from
pos_tag() depends on
(i) the training data and
(ii) the quality of the word segmentation, since the function expects segmented input.
Currently, a major limitation is the fact that HKCanCor is perhaps still
the only Cantonese corpus with a permissive license that comes annotated
with part-of-speech tags.
Its relatively small size (about 150,000 tagged words) means that models
more sophisticated than a standard averaged perceptron would be unlikely to justify their extra computational cost.
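To give a sense of what an averaged perceptron tagger involves, here is a toy sketch under heavily simplified assumptions: one feature per token (the word itself), a made-up three-tag inventory, and hypothetical training data. PyCantonese's actual tagger is trained on HKCanCor and uses richer contextual features:

```python
from collections import defaultdict

# Hypothetical toy training data (NOT the HKCanCor corpus).
TRAIN = [
    (["我", "買", "鞋"], ["PRON", "VERB", "NOUN"]),
    (["我", "買", "書"], ["PRON", "VERB", "NOUN"]),
]
TAGS = ["PRON", "VERB", "NOUN"]

weights = defaultdict(float)   # (word, tag) -> current weight
totals = defaultdict(float)    # (word, tag) -> accumulated weight sum
timestamps = defaultdict(int)  # (word, tag) -> step of last update
step = 0

def predict(word, w):
    """Pick the highest-scoring tag (ties broken by TAGS order)."""
    return max(TAGS, key=lambda t: w[(word, t)])

def update(key, delta):
    """Perceptron update, lazily accumulating totals for averaging."""
    totals[key] += (step - timestamps[key]) * weights[key]
    timestamps[key] = step
    weights[key] += delta

for epoch in range(3):
    for words, gold_tags in TRAIN:
        for word, gold in zip(words, gold_tags):
            step += 1
            guess = predict(word, weights)
            if guess != gold:          # reward the gold tag, punish the guess
                update((word, gold), +1.0)
                update((word, guess), -1.0)

# Averaging: each weight is averaged over all training steps, which
# smooths out the noise from late, corpus-specific updates.
averaged = defaultdict(float)
for key, w in weights.items():
    averaged[key] = (totals[key] + (step - timestamps[key]) * w) / step

print([(wd, predict(wd, averaged)) for wd in ["我", "買", "鞋"]])
# [('我', 'PRON'), ('買', 'VERB'), ('鞋', 'NOUN')]
```

The averaging step is what makes the model robust despite its simplicity; the whole algorithm is a few dozen lines and trains in one pass per epoch, which suits a corpus of HKCanCor's size well.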
If the results from
pos_tag() look odd,
the cause is likely either the HKCanCor training data
(e.g., specific occurrences of word + tag combinations in the corpus might have thrown off the tagger)
or the quality of the word segmentation, especially if your segmented input comes from
segment(). Please get in touch
if you would like further investigation.