Shared Task in NLPCC 2015:
Chinese Word Segmentation and POS Tagging for Micro-Blog Texts

1 Introduction

Word segmentation and part-of-speech (POS) tagging are two fundamental tasks in Chinese language processing. In recent years, both tasks have seen substantial progress. The prevailing approach is to treat them as sequence labeling problems, which can be handled with supervised learning algorithms such as Conditional Random Fields (CRF).
However, the performance of state-of-the-art systems is still relatively low on informal texts, such as micro-blogs and forums.
In this shared task, we wish to investigate the performance of Chinese word segmentation and POS tagging on micro-blog texts.
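
To make the sequence labeling view concrete, the following is a minimal sketch that converts a segmented sentence into one tag per character and back. It assumes the widely used B/M/E/S scheme (Begin, Middle, End, Single); the shared task itself does not prescribe a tag set, and the function names are illustrative only.

```python
def words_to_tags(words):
    """Map a list of words to one B/M/E/S tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, any middle characters, last character
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild the word list from characters and their B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


sentence = ["我", "喜欢", "微博客"]
tags = words_to_tags(sentence)
restored = tags_to_words("".join(sentence), tags)
```

A sequence labeler such as a CRF would then be trained to predict these tags from the characters. For joint segmentation and POS tagging, one common choice is to attach the POS tag to each position tag (for example B-NN), though other encodings are possible.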

2 Description of the Task

2.1 Subtasks

This task focuses on two fundamental problems of Chinese language processing, word segmentation and POS tagging, and is divided into two subtasks:
1. Chinese word segmentation
2. Joint Chinese word segmentation and POS Tagging

2.2 Tracks

Each participant may submit three runs for each subtask: a closed track run, a semi-open track run, and an open track run.

1. In the closed track, participants may use only information found in the provided training data.
Information such as externally obtained word counts, part-of-speech information, or name lists is excluded.

2. In the semi-open track, participants may use information extracted from the provided background data in addition to the provided training data.
Information such as externally obtained word counts, part-of-speech information, or name lists is excluded.

3. In the open track, participants may use any information that is public and easily obtained.
However, obtaining results by manual labeling or crowdsourcing is not allowed.

3 Data

Instead of the commonly used news datasets, we use relatively informal texts from Sina Weibo. The training and test data consist of micro-blogs on various topics, such as finance, sports, and entertainment.

Both the training and test files are UTF-8 encoded. Statistics of the dataset are shown in Table 1.

The dataset uses 36 POS tags in total. A detailed list of POS tags is shown in Table 2.

3.1 Background Data

Besides the training data, we also provide the background data, from which the training and test data are drawn.
The purpose is to let participants derive more sophisticated features in an unsupervised way.

4 Download

The dataset, including the micro-blog texts, a standard train/test split, and the word segmentation and POS tagging annotations, can be obtained by sending us a request by email.
Specifically, researchers interested in the dataset should download and fill out the Agreement Form and send it back to Xipeng Qiu (xpqiu@fudan.edu.cn; Email title: NLPCC2015 data request).
We will then send you the download instructions at our discretion.

Please cite this paper if the dataset helps your research.

  @ARTICLE{Qiu:2015,
	   author  =  {Xipeng Qiu and Peng Qian and Liusong Yin and Shiyu Wu and Xuanjing Huang},
	   title   =   {Overview of the {NLPCC} 2015 Shared Task: {Chinese} Word Segmentation
	                and {POS} Tagging for Micro-blog Texts},
	   journal = {arXiv preprint arXiv:1505.07599},
	   year    =  {2015}}

5 Evaluation Metrics

We use the standard SIGHAN bake-off scoring program to calculate precision, recall, F1-score, and out-of-vocabulary (OOV) word recall.
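
The metrics above follow standard definitions, which can be sketched as below. This is not the official SIGHAN scoring program, only an illustration under stated assumptions: each sentence is a list of words, gold and predicted sentences cover the same characters, and a word counts as correct when its character span matches a gold span exactly. The function names and the `train_vocab` parameter are hypothetical.

```python
def to_spans(words):
    """Turn a word list into (start, end) character offsets."""
    spans, start = [], 0
    for word in words:
        spans.append((start, start + len(word)))
        start += len(word)
    return spans


def score(gold_sents, pred_sents, train_vocab):
    """Return precision, recall, F1 and OOV recall for segmentation."""
    n_gold = n_pred = n_correct = 0
    n_oov = n_oov_correct = 0
    for gold, pred in zip(gold_sents, pred_sents):
        gold_spans = to_spans(gold)
        pred_spans = set(to_spans(pred))
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
        for word, span in zip(gold, gold_spans):
            hit = span in pred_spans
            if hit:
                n_correct += 1
            if word not in train_vocab:  # gold word unseen in training
                n_oov += 1
                if hit:
                    n_oov_correct += 1
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    oov_recall = n_oov_correct / n_oov if n_oov else 0.0
    return precision, recall, f1, oov_recall


gold = [["我", "喜欢", "微博"]]
pred = [["我", "喜", "欢", "微博"]]
p, r, f1, oov_r = score(gold, pred, train_vocab={"我", "喜欢"})
```

In this toy case two of the four predicted words match gold spans, so precision is 0.5, recall is 2/3, and F1 is 4/7; the only OOV gold word is segmented correctly, so OOV recall is 1.0. For the joint subtask, the same counting would typically apply to (span, POS tag) pairs instead of spans alone.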


Table 1: Statistical information of the dataset.
Dataset    Sents    Words     Chars     Word Types   Char Types   OOV Rate
Training   10,000   215,567   348,551   28,355       3,973        -
Test        5,000   106,843   172,342   18,785       3,540        9.75%
Total      15,000   322,410   520,555   35,277       4,243        -
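
The OOV rate in Table 1 can be read as the share of test words that never occur in the training data. The sketch below computes it over word tokens; this is an assumption, since the table does not state whether the rate is counted over tokens or types, and the function name is illustrative.

```python
def oov_rate(train_sents, test_sents):
    """Fraction of test word tokens that are absent from the training data."""
    train_vocab = {word for sent in train_sents for word in sent}
    test_tokens = [word for sent in test_sents for word in sent]
    if not test_tokens:
        return 0.0
    n_oov = sum(1 for word in test_tokens if word not in train_vocab)
    return n_oov / len(test_tokens)


train = [["我", "喜欢", "微博"]]
test = [["我", "喜欢", "围脖"], ["微博"]]
rate = oov_rate(train, test)  # one unseen token out of four
```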

6 Contact

Please feel free to send any questions or comments to xpqiu@fudan.edu.cn