pycantonese.pos_tag

pycantonese.pos_tag(words, tagset='universal')

Tag the words for their parts of speech.

The part-of-speech tagger uses an averaged perceptron model trained on the HKCanCor data.

New in version 3.1.0.

Parameters

words : list[str]

A segmented sentence or phrase, where each word is a string of Cantonese characters.

tagset : str, {"universal", "hkcancor"}

The part-of-speech tagset that the returned tags are in. Supported options:

"hkcancor", for the tagset used by the original HKCanCor data. There are over 100 tags, 46 of which are described at http://compling.hss.ntu.edu.sg/hkcancor/.
"universal" (default option), for the Universal Dependencies v2 tagset. There are 17 tags; see https://universaldependencies.org/u/pos/index.html. Internally, this option applies hkcancor_to_ud() to convert HKCanCor tags to UD tags.
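
The conversion step described for the "universal" option can be pictured as a tag lookup. The sketch below is only an illustration: PARTIAL_HKCANCOR_TO_UD and convert_tags are hypothetical names, and the six entries are inferred from the example outputs on this page rather than taken from the library. The real hkcancor_to_ud() covers the full HKCanCor tagset.

```python
# Hypothetical, partial mapping inferred from this page's example outputs.
# It is NOT the library's table; hkcancor_to_ud() handles the full tagset.
PARTIAL_HKCANCOR_TO_UD = {
    "R": "PRON",
    "T": "ADV",
    "V": "VERB",
    "Q": "NOUN",
    "N": "NOUN",
    "。": "PUNCT",
}


def convert_tags(tagged_words):
    """Convert (word, HKCanCor tag) pairs to (word, UD tag) pairs.

    Raises KeyError for any tag missing from the partial table above.
    """
    return [(word, PARTIAL_HKCANCOR_TO_UD[tag]) for word, tag in tagged_words]


# HKCanCor-tagged output copied from the Examples section of this page.
hkcancor_tagged = [
    ("我", "R"), ("噚日", "T"), ("買", "V"), ("嗰", "R"),
    ("對", "Q"), ("鞋", "N"), ("。", "。"),
]
ud_tagged = convert_tags(hkcancor_tagged)
```

Note that the mapping is many-to-one in this excerpt (both "Q" and "N" become "NOUN"), so UD tags cannot be converted back to HKCanCor tags.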

Returns

list[tuple[str, str]]: The segmented sentence/phrase where each word is paired with its predicted POS tag.

Raises

TypeError: If the input is a string (e.g., an unsegmented string of Cantonese).
ValueError: If the tagset argument is not one of the allowed options from {"universal", "hkcancor"}.
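
The two error conditions above can be mirrored in a small stand-alone check, which may be useful for validating arguments before calling the tagger. This is a sketch under assumptions: check_pos_tag_args, describe_call, and ALLOWED_TAGSETS are hypothetical names, and the code is not the library's implementation or its error messages.

```python
# Hypothetical stand-in that mirrors the documented error conditions.
ALLOWED_TAGSETS = {"universal", "hkcancor"}


def check_pos_tag_args(words, tagset="universal"):
    """Raise the documented exception types for invalid arguments."""
    if isinstance(words, str):
        # An unsegmented string is rejected; a list of words is expected.
        raise TypeError("words must be a list of strings, not a single string")
    if tagset not in ALLOWED_TAGSETS:
        raise ValueError(f"tagset must be one of {sorted(ALLOWED_TAGSETS)}")


def describe_call(words, tagset="universal"):
    """Return 'ok' or the name of the exception the arguments would trigger."""
    try:
        check_pos_tag_args(words, tagset)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return "ok"


unsegmented = describe_call("我噚日買嗰對鞋。")           # a plain string
bad_tagset = describe_call(["我", "買"], tagset="ptb")    # unsupported tagset
valid = describe_call(["我", "買"], tagset="hkcancor")    # valid arguments
```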

Examples

>>> from pycantonese import pos_tag
>>> words = ['我', '噚日', '買', '嗰', '對', '鞋', '。']  # I bought that pair of shoes yesterday.
>>> pos_tag(words)
[('我', 'PRON'), ('噚日', 'ADV'), ('買', 'VERB'), ('嗰', 'PRON'), ('對', 'NOUN'), ('鞋', 'NOUN'), ('。', 'PUNCT')]
>>> pos_tag(words, tagset="hkcancor")
[('我', 'R'), ('噚日', 'T'), ('買', 'V'), ('嗰', 'R'), ('對', 'Q'), ('鞋', 'N'), ('。', '。')]
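
Because the return value is a plain list of (word, tag) tuples, it can be post-processed with standard Python tools. The sketch below uses the UD-tagged output shown above as literal data, so it runs without the library; the variable names are illustrative only.

```python
from collections import defaultdict

# UD-tagged output copied from the example above.
tagged = [
    ("我", "PRON"), ("噚日", "ADV"), ("買", "VERB"), ("嗰", "PRON"),
    ("對", "NOUN"), ("鞋", "NOUN"), ("。", "PUNCT"),
]

# Split the pairs into parallel tuples of words and tags.
words, tags = zip(*tagged)

# Group the words by their predicted tag, preserving sentence order.
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

nouns = words_by_tag["NOUN"]
```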