Stop Words
In many natural language processing tasks, it is necessary to filter out
stop words, which are very common words that carry little content of
their own. English examples include function words such as pronouns and
determiners. PyCantonese provides the function stop_words(), which
returns a set of about 100 Cantonese stop words:
>>> import pycantonese
>>> stop_words = pycantonese.stop_words()
>>> len(stop_words)
104
>>> stop_words
{'一啲', '一定', '不如', '不過', ...}
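As a minimal sketch of how such a set is typically used, the snippet below drops stop words from a list of tokens. The helper name filter_stop_words, the sample sentence, and the four-word sample set are illustrative assumptions and not part of PyCantonese; the sample set only reuses the words displayed above so that the snippet runs without the library.

```python
def filter_stop_words(tokens, stop_words):
    """Return the tokens that are not stop words, preserving their order."""
    return [token for token in tokens if token not in stop_words]


# Small sample taken from the set displayed above. In practice, pass the
# full set returned by pycantonese.stop_words() instead.
sample_stop_words = {'一啲', '一定', '不如', '不過'}

# Hypothetical, already word-segmented sentence:
# "I will definitely go to Hong Kong."
tokens = ['我', '一定', '會', '去', '香港']

filtered = filter_stop_words(tokens, sample_stop_words)
print(filtered)  # ['我', '會', '去', '香港']
```

Because the stop words come as a set, each membership check is a constant-time lookup, so filtering stays cheap even for long token lists.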
Depending on your use case, you may want to add stop words to, or remove
stop words from, the default set.
The stop_words() function has the optional arguments add and remove.
add can be either a string (e.g., treat "香港" as a stop word if your
data is all about Hong Kong) or an iterable of strings:
>>> import pycantonese
>>> stop_words_1 = pycantonese.stop_words(add='香港')
>>> len(stop_words_1)
105
>>> '香港' in stop_words_1
True
>>> stop_words_2 = pycantonese.stop_words(add=['香港島', '九龍', '新界']) # Hong Kong Island, Kowloon, the New Territories
>>> len(stop_words_2)
107
>>> {'香港島', '九龍', '新界'}.issubset(stop_words_2)
True
Similarly, the remove argument can take either a string or an iterable
of strings.
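The string-or-iterable convention matters because a Python string is itself an iterable of characters, so a bare string has to be treated as one word rather than split up. The sketch below illustrates that idea with plain sets. The helper customize is hypothetical and is not PyCantonese's actual implementation; in particular, applying add before remove is an assumption, since the text above does not say what happens when a word appears in both arguments.

```python
def customize(default_words, add=None, remove=None):
    """Return a new set: default_words plus `add`, minus `remove`.

    Hypothetical helper for illustration; not PyCantonese's actual code.
    """
    def as_set(arg):
        if arg is None:
            return set()
        if isinstance(arg, str):
            # A bare string counts as one word, not as its characters.
            return {arg}
        return set(arg)

    # Assumption: additions are applied first, then removals.
    return (set(default_words) | as_set(add)) - as_set(remove)


# Sample taken from the set displayed earlier, not the full default set.
defaults = {'一啲', '一定', '不如', '不過'}

result = customize(defaults, add='香港', remove=['一定', '不如'])
print(sorted(result))
```

With the library itself, the corresponding call would pass the same kinds of values, for example pycantonese.stop_words(remove='一定') for a single word or remove=['一定', '不如'] for several.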