pycantonese.pos_tagging.hkcancor_to_ud

pycantonese.pos_tagging.hkcancor_to_ud(tag: str = None)

Map a part-of-speech tag from HKCanCor to Universal Dependencies.

HKCanCor uses a part-of-speech tagset of over 100 tags (46 of which are described at http://compling.hss.ntu.edu.sg/hkcancor/). Some applications benefit from a less granular tagset (e.g., part-of-speech tagging when only major word classes are of interest, or when there is not enough annotated data for training). For these cases, this function maps the HKCanCor tagset to the Universal Dependencies v2 tagset of 17 tags (https://universaldependencies.org/u/pos/index.html).

Any unrecognized tag is mapped to "X".

New in version 3.1.0.

Warning

As of November 2020, PyCantonese v3.1.0 hasn’t been released yet. The availability and behavior of this function are subject to change in the upcoming release.

Parameters
tag : str

A tag from the original HKCanCor annotated data. If not provided or None, this function returns the entire dictionary of the tagset mapping from HKCanCor to UD.

Returns
str or dict[str, str]

A tag from the Universal Dependencies v2 tagset, or a dictionary mapping HKCanCor tags to UD tags if no input is given.

Examples

>>> hkcancor_to_ud("V")
'VERB'