- Chinese Word Segmentation and POS Tagging for Micro-Blog Texts
The training and test data consist of micro-blog posts on a variety of topics, such as finance, sports, and entertainment.
A newer and larger dataset is available at https://github.com/FudanNLP/NLPCC-WordSeg-Weibo .
- Neural Sentence Ordering
Because paper abstracts are generally well written and have strong logical cues, we collected all abstracts of papers (before 2016-5-25) from arXiv.com. Abstracts from arXiv fall mainly into 7 categories: statistics, quantitative biology, physics, computer science, nonlinear sciences, quantitative finance, and mathematics. After shuffling the data, the first 10% of the abstracts form the development set, the last 10% form the test set, and the remaining abstracts form the training set. Details of the arXiv dataset are given in https://arxiv.org/abs/1607.06952 .
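
The split described above can be sketched roughly as follows. This is only an illustration, not the script used to build the released dataset: the function name `split_abstracts`, the `seed` parameter, and the rounding of the 10% boundary are all assumptions made for the example.

```python
import random


def split_abstracts(abstracts, seed=0, frac=0.1):
    """Shuffle abstracts, then take the first `frac` as dev and the last `frac` as test.

    `seed` and the floor rounding of the slice size are assumptions;
    the original split may have used different choices.
    """
    shuffled = list(abstracts)
    random.Random(seed).shuffle(shuffled)   # reproducible shuffle
    n = int(len(shuffled) * frac)           # size of dev and of test
    dev = shuffled[:n]                      # first 10%
    test = shuffled[len(shuffled) - n:]     # last 10%
    train = shuffled[n:len(shuffled) - n]   # everything in between
    return train, dev, test


# Usage with placeholder data standing in for real abstracts.
train, dev, test = split_abstracts([f"abstract {i}" for i in range(100)])
print(len(train), len(dev), len(test))  # 80 10 10
```

Fixing the seed keeps the three sets stable across runs, which matters if results are to be compared against the same test set.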