pycantonese.pos_tag

pycantonese.pos_tag(words)

Tag the words for their parts of speech.

The part-of-speech tagger is trained on the HKCanCor data. HKCanCor uses a part-of-speech tagset of over 100 tags (46 of which are described at http://compling.hss.ntu.edu.sg/hkcancor/). For training this POS tagger, these tags have been mapped to the much smaller Universal Dependencies v2 tagset of 17 tags (https://universaldependencies.org/u/pos/index.html).

New in version 3.1.0.

Warning

As of November 2020, PyCantonese v3.1.0 hasn’t been released yet. The availability and behavior of this function are subject to change in the upcoming release.

Parameters
words : list[str]

A segmented sentence or phrase, where each word is a string of Cantonese characters.

Returns
list[tuple[str, str]]

The segmented sentence/phrase where each word is paired with its predicted POS tag.

Raises
TypeError

If the input is a string (e.g., an unsegmented string of Cantonese).
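
Because a plain string is rejected, code that may receive unsegmented text might check its input before calling the tagger. The following is a minimal sketch of such a check. The helper name `check_segmented` is hypothetical and not part of PyCantonese, and the error message is an illustration rather than the library's own wording.

```python
def check_segmented(words):
    """Hypothetical helper (not part of PyCantonese): validate tagger input."""
    # A bare string would otherwise be iterated character by character,
    # so reject it, mirroring the documented TypeError for string input.
    if isinstance(words, str):
        raise TypeError(
            f"expected a list of words, got an unsegmented string: {words!r}"
        )
    return list(words)


# A segmented sentence passes through unchanged.
segmented = check_segmented(['我', '噚日', '買', '嗰', '對', '鞋', '。'])

# An unsegmented string is rejected.
try:
    check_segmented('我噚日買嗰對鞋。')
except TypeError as exc:
    rejected_message = str(exc)
```

The validated list can then be passed to pos_tag. An unsegmented string needs to be segmented into words first.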

Examples

>>> from pycantonese import pos_tag
>>> words = ['我', '噚日', '買', '嗰', '對', '鞋', '。']
>>> pos_tag(words)  # I bought those shoes yesterday.
[('我', 'PRON'), ('噚日', 'ADV'), ('買', 'VERB'), ('嗰', 'PRON'), ('對', 'ADP'), ('鞋', 'NOUN'), ('。', 'PUNCT')]
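
Since the return value is an ordinary list of (word, tag) tuples, it can be processed with standard Python tools. The sketch below counts tags and picks out nouns and verbs. It copies the output of the example above as literal data so that it runs without PyCantonese installed. The variable names are illustrative only.

```python
from collections import Counter

# Tagged sentence copied from the example above (not produced by a live call).
tagged = [('我', 'PRON'), ('噚日', 'ADV'), ('買', 'VERB'), ('嗰', 'PRON'),
          ('對', 'ADP'), ('鞋', 'NOUN'), ('。', 'PUNCT')]

# Count how often each Universal Dependencies tag occurs.
tag_counts = Counter(tag for _, tag in tagged)

# Keep only the words tagged as nouns or verbs, in sentence order.
content_words = [word for word, tag in tagged if tag in {'NOUN', 'VERB'}]
```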