|Introduction to LIVAC SEGMENT Online|
|All guest users are welcome to try the LIVAC Chinese segmentation system. As a guest, you can submit up to 1000 characters to the LIVAC segmentation engine at a time. If you register an account with your email address, you can segment up to 2000 characters at a time. We apologize for this restriction, which is necessary because of limited total capacity.|
Chinese word segmentation, or tokenisation, is the process of breaking a string of Chinese characters into meaningful units (words). The task is non-trivial because Chinese text has no explicit word boundaries, and it is a crucial step in processing Chinese-language text.
There are many approaches to this problem, ranging from dictionary-based and rule-based methods to statistical and hybrid ones.
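To make the dictionary-based family concrete, here is a minimal sketch of forward maximum matching, one well-known dictionary-based technique. It is not the LIVAC engine: the function name, the `max_len` parameter, and the toy lexicon are all assumptions made up for illustration.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: always take the longest lexicon match."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until the lexicon matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon, for illustration only.
toy_lexicon = {"研究", "研究生", "生命", "起源"}

print(forward_max_match("研究生命起源", toy_lexicon))
# → ['研究生', '命', '起源']
```

The greedy match picks 研究生 ("graduate student") first and strands 命, whereas the intended reading is 研究 / 生命 / 起源 ("study the origin of life"). Ambiguities like this are one reason a pure dictionary lookup is often not enough on its own.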
We tackle this problem with machine learning, using the manually verified LIVAC corpus (http://www.livac.org) to train a segmentation engine. The Traditional Chinese version currently draws on around 10 years of Hong Kong materials, while the Simplified Chinese version draws on Beijing materials from about the same period.
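The training setup of the LIVAC engine is not described here. As a rough illustration of how a manually segmented corpus can serve as training data in general, the sketch below converts segmented words into per-character B/M/E/S labels (begin, middle, end, single) and back, a common way to cast segmentation as a supervised tagging task. The function names and the label scheme are assumptions, not the LIVAC format.

```python
def words_to_bmes(words):
    """Turn a list of segmented words into one B/M/E/S label per character."""
    labels = []
    for word in words:
        if len(word) == 1:
            labels.append("S")  # single-character word
        else:
            # First character B, last character E, anything in between M.
            labels.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return labels


def bmes_to_words(chars, labels):
    """Rebuild words from characters and their predicted B/M/E/S labels."""
    words, current = [], ""
    for ch, label in zip(chars, labels):
        current += ch
        if label in ("E", "S"):  # a word ends here
            words.append(current)
            current = ""
    if current:  # keep any trailing characters left by an incomplete label run
        words.append(current)
    return words


gold = ["研究", "生命", "起源"]  # hypothetical manually verified segmentation
labels = words_to_bmes(gold)
print(labels)
print(bmes_to_words("".join(gold), labels))
```

Under this framing, a statistical model learns to predict the label of each character from its context, and the predicted labels are decoded back into words with a function like `bmes_to_words`.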
|Try Segmentation System|