New in version 3.1.0.
A basic part-of-speech tagger is provided by pos_tag(),
which takes a segmented phrase or sentence as input:
>>> import pycantonese as pc
>>> unsegmented = '我噚日買嗰對鞋。'  # I bought that pair of shoes yesterday.
>>> segmented = pc.segment(unsegmented)
>>> segmented
['我', '噚日', '買', '嗰', '對', '鞋', '。']
>>> pc.pos_tag(segmented)
[('我', 'PRON'), ('噚日', 'ADV'), ('買', 'VERB'), ('嗰', 'PRON'), ('對', 'NOUN'), ('鞋', 'NOUN'), ('。', 'PUNCT')]
The part-of-speech tagger uses an averaged perceptron model trained on HKCanCor.
HKCanCor is already annotated with part-of-speech tags,
using a tagset of over 100 tags (46 of which are documented).
pos_tag() maps the HKCanCor tagset to the
Universal Dependencies v2 tagset
(with 17 tags, https://universaldependencies.org/u/pos/index.html),
for cross-linguistic natural language processing work.
If you would like the original HKCanCor tagset,
pos_tag() accepts the keyword argument tagset="hkcancor":
>>> pc.pos_tag(segmented, tagset="hkcancor")
[('我', 'R'), ('噚日', 'T'), ('買', 'V'), ('嗰', 'R'), ('對', 'Q'), ('鞋', 'N'), ('。', '。')]
A helper function is also available that
exposes the tagset mapping from HKCanCor to Universal Dependencies.
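To illustrate what such a tagset mapping does, the sketch below rebuilds the Universal Dependencies output from the HKCanCor-tagged output shown above. The names `PARTIAL_MAPPING` and `map_tags` are hypothetical and not part of PyCantonese. The six entries are read off the two example outputs in this section only, so this is not the library's full mapping table.

```python
# Partial HKCanCor -> Universal Dependencies mapping, inferred only from the
# two example outputs in this section. The real table covers far more tags.
PARTIAL_MAPPING = {
    "R": "PRON",
    "T": "ADV",
    "V": "VERB",
    "Q": "NOUN",
    "N": "NOUN",
    "。": "PUNCT",
}


def map_tags(tagged_words, mapping, unknown="X"):
    """Convert (word, HKCanCor tag) pairs to (word, UD tag) pairs.

    Tags missing from `mapping` fall back to `unknown`. "X" is the UD tag
    for "other"; using it as a fallback is an assumption of this sketch.
    """
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged_words]


hkcancor_tagged = [
    ("我", "R"), ("噚日", "T"), ("買", "V"), ("嗰", "R"),
    ("對", "Q"), ("鞋", "N"), ("。", "。"),
]
ud_tagged = map_tags(hkcancor_tagged, PARTIAL_MAPPING)
print(ud_tagged)
```

Because several HKCanCor tags can collapse into one Universal Dependencies tag (here both Q and N become NOUN), the mapping is lossy and cannot be reversed.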
Due to the statistical nature of part-of-speech tagging,
the quality of results from pos_tag() depends on
(i) the training data, and
(ii) the quality of word segmentation, since the function expects segmented input.
Currently, a major limitation is that HKCanCor is perhaps still
the only permissively licensed Cantonese corpus annotated
with part-of-speech tags.
Its relatively small size (about 150,000 tagged words) means that models
more sophisticated than a standard averaged perceptron are unlikely to be worth the added complexity.
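For readers unfamiliar with the approach, here is a minimal sketch of an averaged perceptron tagger. It is not PyCantonese's implementation: the feature set, the class and function names, and the single training sentence (the example from the top of this section) are all assumptions chosen to keep the sketch short.

```python
from collections import defaultdict


class AveragedPerceptron:
    """Multiclass perceptron over sparse binary features, with weight averaging."""

    def __init__(self, classes):
        self.classes = sorted(classes)
        self.weights = defaultdict(lambda: defaultdict(float))  # feature -> tag -> weight
        self._totals = defaultdict(float)  # (feature, tag) -> weight summed over steps
        self._stamps = defaultdict(int)    # (feature, tag) -> step of last change
        self._step = 0

    def predict(self, features):
        scores = dict.fromkeys(self.classes, 0.0)
        for feat in features:
            for tag, weight in self.weights.get(feat, {}).items():
                scores[tag] += weight
        # Highest score wins; ties go to the alphabetically last tag.
        return max(self.classes, key=lambda tag: (scores[tag], tag))

    def update(self, truth, guess, features):
        self._step += 1
        if truth == guess:
            return
        for feat in features:
            for tag, delta in ((truth, 1.0), (guess, -1.0)):
                key = (feat, tag)
                # Credit the old weight for every step it was in effect.
                elapsed = self._step - self._stamps[key]
                self._totals[key] += elapsed * self.weights[feat][tag]
                self._stamps[key] = self._step
                self.weights[feat][tag] += delta

    def average(self):
        """Replace each weight by its mean over all training steps (call once)."""
        for feat, tag_weights in self.weights.items():
            for tag, weight in tag_weights.items():
                key = (feat, tag)
                total = self._totals[key] + (self._step - self._stamps[key]) * weight
                tag_weights[tag] = total / self._step


def features(words, i, prev_tag):
    """Hypothetical feature set: current word, neighbours, previous tag."""
    prev_word = words[i - 1] if i > 0 else "<s>"
    next_word = words[i + 1] if i + 1 < len(words) else "</s>"
    return ["bias", f"word={words[i]}", f"prev_tag={prev_tag}",
            f"prev_word={prev_word}", f"next_word={next_word}"]


def train(sentences, epochs=5):
    model = AveragedPerceptron({tag for sent in sentences for _, tag in sent})
    for _ in range(epochs):
        for sent in sentences:
            words = [word for word, _ in sent]
            prev_tag = "<s>"
            for i, (_, truth) in enumerate(sent):
                feats = features(words, i, prev_tag)
                guess = model.predict(feats)
                model.update(truth, guess, feats)
                prev_tag = guess  # train on the model's own previous prediction
    model.average()
    return model


def tag(model, words):
    tagged, prev_tag = [], "<s>"
    for i, word in enumerate(words):
        prev_tag = model.predict(features(words, i, prev_tag))
        tagged.append((word, prev_tag))
    return tagged


# Toy training set: only the example sentence from this section.
training = [[("我", "PRON"), ("噚日", "ADV"), ("買", "VERB"), ("嗰", "PRON"),
             ("對", "NOUN"), ("鞋", "NOUN"), ("。", "PUNCT")]]
model = train(training)
print(tag(model, [word for word, _ in training[0]]))
```

Averaging the weights over all training steps, rather than keeping only the final values, dampens the effect of the last few updates. With so little data the sketch can do no more than memorize its training sentence, which says nothing about accuracy on unseen text.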
If the results from pos_tag() look odd,
the cause may be the HKCanCor training data
(e.g., specific occurrences of word + tag combinations might have thrown off the tagger)
or the quality of word segmentation, especially if your segmented input comes from
segment() (also trained on HKCanCor).
Please get in touch
if you would like further investigation.
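One way to start investigating an odd tag yourself is to check how the word is tagged across the training data. The sketch below assumes the data is available as a flat list of (word, tag) pairs. The `tagged_words` list is made-up toy data rather than HKCanCor content, and `tag_distribution` is a hypothetical helper, not a PyCantonese function.

```python
from collections import Counter, defaultdict


def tag_distribution(tagged_words):
    """Count how often each word occurs with each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return counts


# Made-up toy data for illustration only; not drawn from HKCanCor.
tagged_words = [
    ("對", "Q"), ("對", "Q"), ("對", "P"), ("鞋", "N"), ("買", "V"),
]
counts = tag_distribution(tagged_words)
print(counts["對"].most_common())
```

A word that appears with several tags, or only a handful of times, is a likely place for the tagger to go wrong.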