Tag the words for their parts of speech.
The part-of-speech tagger is trained on the HKCanCor data. HKCanCor uses a part-of-speech tagset of over 100 tags (46 of which are described at http://compling.hss.ntu.edu.sg/hkcancor/). For training this POS tagger, these tags have been mapped to the much smaller Universal Dependencies v2 tagset of 17 tags (https://universaldependencies.org/u/pos/index.html).
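The tagset reduction described above is, in essence, a many-to-one lookup. The following is a minimal sketch of that idea, not PyCantonese's actual mapping: the names `ILLUSTRATIVE_MAP` and `to_ud` are hypothetical, the handful of HKCanCor-to-UD pairs are assumptions chosen for illustration, and the fallback to `X` for unlisted tags is likewise an assumption. Only the 17 Universal Dependencies v2 tag names are taken from the UD specification.

```python
# The 17 Universal Dependencies v2 part-of-speech tags.
UD_TAGS = frozenset({
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
})

# Hypothetical excerpt of a many-to-one mapping (assumption, for illustration
# only). The real table covers 100+ HKCanCor tags and may differ.
ILLUSTRATIVE_MAP = {
    "N": "NOUN",
    "V": "VERB",
    "A": "ADJ",
    "D": "ADV",
    "R": "PRON",
    "P": "ADP",
    "M": "NUM",
}


def to_ud(hkcancor_tag: str) -> str:
    """Map a fine-grained tag to a UD tag, falling back to 'X' if unlisted."""
    return ILLUSTRATIVE_MAP.get(hkcancor_tag.upper(), "X")
```

Collapsing the fine-grained tags this way trades detail for a smaller label space, which gives the tagger more training examples per label.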
New in version 3.1.0.
As of November 2020, PyCantonese v3.1.0 hasn’t been released yet. The availability and behavior of this function are subject to change in the upcoming release.
A segmented sentence or phrase, where each word is a string of Cantonese characters.
- list[tuple[str, str]]
The segmented sentence/phrase where each word is paired with its predicted POS tag.
Raised if the input is a string (e.g., an unsegmented string of Cantonese) rather than a list of segmented words.
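The input check described above can be sketched as a small guard. This is an illustration only: the helper name `check_segmented` is hypothetical, and the use of `TypeError` is an assumption, since this page does not name the exception type.

```python
def check_segmented(words):
    """Return the words as a list, rejecting a bare (unsegmented) string.

    Hypothetical helper. TypeError is an assumed exception type.
    """
    if isinstance(words, str):
        # A str is iterable, so without this check it would silently be
        # treated as a sequence of single characters.
        raise TypeError(
            "expected a list of segmented words, got a string; "
            "segment the text into words first"
        )
    return list(words)
```

Rejecting strings explicitly matters because iterating over a Python string yields individual characters, which would otherwise be tagged as if each character were a word.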
>>> words = ['我', '噚日', '買', '嗰', '對', '鞋', '。']
>>> pos_tag(words)  # I bought those shoes yesterday.
[('我', 'PRON'), ('噚日', 'ADV'), ('買', 'VERB'), ('嗰', 'PRON'), ('對', 'ADP'), ('鞋', 'NOUN'), ('。', 'PUNCT')]
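Because the return value is a plain list of `(word, tag)` tuples, it can be processed with standard Python tools. The sketch below reuses the output shown in the example above as literal data, so it runs without PyCantonese installed. The variable names and the choice of which tags count as "content" tags are illustrative assumptions.

```python
from collections import Counter

# Output copied from the example above, used here as literal data.
tagged = [
    ('我', 'PRON'), ('噚日', 'ADV'), ('買', 'VERB'), ('嗰', 'PRON'),
    ('對', 'ADP'), ('鞋', 'NOUN'), ('。', 'PUNCT'),
]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)

# Keep only words whose tag is in an (assumed) set of content-word tags.
CONTENT_TAGS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}
content_words = [word for word, tag in tagged if tag in CONTENT_TAGS]
```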