pycantonese.pos_tag
- pycantonese.pos_tag(words, tagset='universal')
Tag the words for their parts of speech.
The part-of-speech tagger uses an averaged perceptron model trained on the HKCanCor data.
New in version 3.1.0.
- Parameters
- words : list[str]
A segmented sentence or phrase, where each word is a string of Cantonese characters.
- tagset : str, {"universal", "hkcancor"}
The part-of-speech tagset that the returned tags are in. Supported options:
  - "hkcancor", for the tagset used by the original HKCanCor data. There are over 100 tags, 46 of which are described at http://compling.hss.ntu.edu.sg/hkcancor/.
  - "universal" (default option), for the Universal Dependencies v2 tagset. There are 17 tags; see https://universaldependencies.org/u/pos/index.html. Internally, this option applies hkcancor_to_ud() to convert HKCanCor tags to UD tags.
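The conversion step for the "universal" option can be pictured as a dictionary lookup. The sketch below is not the library's implementation: `TOY_HKCANCOR_TO_UD` and `to_universal` are hypothetical names, the mapping covers only the six tag pairs visible in the Examples on this page, and falling back to "X" for unlisted tags is an assumption of this sketch.

```python
# Toy mapping built only from the tag pairs shown in the Examples section;
# the real HKCanCor tagset has over 100 tags.
TOY_HKCANCOR_TO_UD = {
    "R": "PRON",
    "T": "ADV",
    "V": "VERB",
    "Q": "NOUN",
    "N": "NOUN",
    "。": "PUNCT",
}


def to_universal(tagged_words, fallback="X"):
    """Convert (word, HKCanCor tag) pairs to (word, UD tag) pairs.

    Tags missing from the toy mapping become `fallback` (an assumption
    of this sketch, not documented behavior).
    """
    return [
        (word, TOY_HKCANCOR_TO_UD.get(tag, fallback))
        for word, tag in tagged_words
    ]


hkcancor_tagged = [
    ("我", "R"), ("噚日", "T"), ("買", "V"), ("嗰", "R"),
    ("對", "Q"), ("鞋", "N"), ("。", "。"),
]
universal_tagged = to_universal(hkcancor_tagged)
print(universal_tagged)
```

Several HKCanCor tags can collapse into one UD tag (here both "Q" and "N" become "NOUN"), so the conversion is lossy and cannot be reversed.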
- Returns
- list[tuple[str, str]]
The segmented sentence/phrase where each word is paired with its predicted POS tag.
- Raises
- TypeError
If the input is a string (e.g., an unsegmented string of Cantonese).
- ValueError
If the tagset argument is not one of the allowed options, {"universal", "hkcancor"}.
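The two error conditions can be checked before calling the tagger, for instance when the input comes from user-supplied text. This is a minimal sketch under stated assumptions: `check_pos_tag_args` and `describe_error` are hypothetical helpers written for illustration, and they mirror the documented conditions rather than reproduce the library's internal checks.

```python
ALLOWED_TAGSETS = {"universal", "hkcancor"}


def check_pos_tag_args(words, tagset="universal"):
    """Hypothetical pre-check mirroring the documented error conditions."""
    # A plain string means the text has not been segmented into words yet.
    if isinstance(words, str):
        raise TypeError(
            "words must be a list of segmented words, not a single string"
        )
    if tagset not in ALLOWED_TAGSETS:
        raise ValueError(
            f"tagset must be one of {sorted(ALLOWED_TAGSETS)}, got {tagset!r}"
        )


def describe_error(words, tagset="universal"):
    """Return the name of the raised exception, or None if the arguments pass."""
    try:
        check_pos_tag_args(words, tagset)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


print(describe_error("我噚日買嗰對鞋。"))            # unsegmented string
print(describe_error(["我", "噚日"], tagset="ptb"))  # unsupported tagset
print(describe_error(["我", "噚日"]))                # valid arguments
```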
Examples
>>> from pycantonese import pos_tag
>>> words = ['我', '噚日', '買', '嗰', '對', '鞋', '。']  # I bought that pair of shoes yesterday.
>>> pos_tag(words)
[('我', 'PRON'), ('噚日', 'ADV'), ('買', 'VERB'), ('嗰', 'PRON'), ('對', 'NOUN'), ('鞋', 'NOUN'), ('。', 'PUNCT')]
>>> pos_tag(words, tagset="hkcancor")
[('我', 'R'), ('噚日', 'T'), ('買', 'V'), ('嗰', 'R'), ('對', 'Q'), ('鞋', 'N'), ('。', '。')]
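Because the return value is a plain list of (word, tag) tuples, it can be processed with ordinary Python. The sketch below groups words by tag; `group_by_tag` is a hypothetical helper, and the `tagged` list is copied from the universal-tagset output above so that the snippet runs without the library installed.

```python
from collections import defaultdict

# Copied from the documented output of pos_tag(words) with the default tagset.
tagged = [
    ("我", "PRON"), ("噚日", "ADV"), ("買", "VERB"), ("嗰", "PRON"),
    ("對", "NOUN"), ("鞋", "NOUN"), ("。", "PUNCT"),
]


def group_by_tag(tagged_words):
    """Map each POS tag to the words carrying it, in sentence order."""
    groups = defaultdict(list)
    for word, tag in tagged_words:
        groups[tag].append(word)
    return dict(groups)


groups = group_by_tag(tagged)
nouns = groups.get("NOUN", [])
print(nouns)
```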