pycantonese.pos_tag

pycantonese.pos_tag(words, tagset='universal')

Tag the words for their parts of speech.

The part-of-speech tagger is an averaged perceptron model trained on the HKCanCor data.

New in version 3.1.0.

Parameters
words : list[str]

A segmented sentence or phrase, where each word is a string of Cantonese characters.

tagset : str, {"universal", "hkcancor"}

The part-of-speech tagset that the returned tags are in. Supported options are "universal" (the default) and "hkcancor".

Returns
list[tuple[str, str]]

The segmented sentence/phrase where each word is paired with its predicted POS tag.

Raises
TypeError

If the input is a string (e.g., an unsegmented string of Cantonese).

ValueError

If the tagset argument is not one of the allowed options, "universal" or "hkcancor".
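
The two error conditions above can be guarded against in calling code. The following is a minimal sketch, not part of the library: `tag_text` is a hypothetical helper, and its `segmenter` and `tagger` parameters are assumptions added so that stand-in callables can be supplied. By default it uses `pycantonese.segment` and `pycantonese.pos_tag`. Falling back to the default tagset on a ValueError is a design choice of this sketch only.

```python
def tag_text(text_or_words, tagset="universal", segmenter=None, tagger=None):
    """Hypothetical helper: accept raw text or a word list and return tagged words."""
    if segmenter is None or tagger is None:
        # Imported lazily so stand-in callables can be supplied instead.
        import pycantonese
        segmenter = segmenter or pycantonese.segment
        tagger = tagger or pycantonese.pos_tag

    if isinstance(text_or_words, str):
        # pos_tag raises TypeError on a plain string, so segment it first.
        words = segmenter(text_or_words)
    else:
        words = list(text_or_words)

    try:
        return tagger(words, tagset=tagset)
    except ValueError:
        # Raised for a tagset outside {"universal", "hkcancor"}.
        # This sketch falls back to the default tagset; re-raising is equally valid.
        return tagger(words, tagset="universal")
```

With pycantonese installed, `tag_text("我噚日買嗰對鞋。")` would segment the string and then tag it; the resulting tags depend on how the segmenter splits the input.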

Examples

>>> from pycantonese import pos_tag
>>> words = ['我', '噚日', '買', '嗰', '對', '鞋', '。']  # I bought that pair of shoes yesterday.
>>> pos_tag(words)
[('我', 'PRON'), ('噚日', 'ADV'), ('買', 'VERB'), ('嗰', 'PRON'), ('對', 'NOUN'), ('鞋', 'NOUN'), ('。', 'PUNCT')]
>>> pos_tag(words, tagset="hkcancor")
[('我', 'R'), ('噚日', 'T'), ('買', 'V'), ('嗰', 'R'), ('對', 'Q'), ('鞋', 'N'), ('。', '。')]
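
Because the return value is a plain list of (word, tag) tuples, it can be processed with standard Python tools. The sketch below is illustrative only: it copies the "universal" output from the example above as literal data so that it runs without pycantonese, and the variable names are arbitrary.

```python
from collections import Counter

# Output copied from the documented example above (tagset="universal").
tagged = [('我', 'PRON'), ('噚日', 'ADV'), ('買', 'VERB'), ('嗰', 'PRON'),
          ('對', 'NOUN'), ('鞋', 'NOUN'), ('。', 'PUNCT')]

# Count how often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

# Keep only the words tagged as nouns.
nouns = [word for word, tag in tagged if tag == 'NOUN']

# Recover the original word sequence by dropping the tags.
words_only = [word for word, _ in tagged]

print(tag_counts['PRON'])  # 2
print(nouns)               # ['對', '鞋']
```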