Khmer Word Segmentation Using Conditional Random Fields
Word segmentation is a critical task that underpins much natural language processing research. This paper studies Khmer word segmentation using an approach based on conditional random fields (CRFs). A large manually segmented corpus was developed to train the segmenter, and we describe the set of word segmentation strategies that the human annotators followed during manual annotation. The trained CRF segmenter was compared empirically to a baseline approach based on maximum matching, which used a dictionary extracted from the manually segmented corpus. The CRF segmenter outperformed the baseline by a wide margin in terms of precision, recall and F-score. The segmenter was also evaluated as a pre-processing step in a statistical machine translation system, where it yielded substantial increases in BLEU score of up to 7.7 points relative to the maximum matching baseline.
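For readers unfamiliar with the baseline, the following is a minimal sketch of greedy forward maximum matching, not the authors' implementation. The function name `max_match`, the `max_len` parameter, and the toy Latin-script dictionary are illustrative assumptions. The sketch also operates on Unicode code points, whereas a Khmer segmenter may well operate on character clusters instead.

```python
def max_match(text, dictionary, max_len=None):
    """Greedy forward maximum matching (hypothetical helper, for illustration).

    At each position, take the longest dictionary word that starts there;
    if no word matches, emit a single character and move on.
    """
    if max_len is None:
        # Assumption: bound the search window by the longest dictionary entry.
        max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                break
        else:
            j = i + 1  # no match: fall back to a single-character token
        tokens.append(text[i:j])
        i = j
    return tokens


# Toy dictionary (an assumption; the paper extracts its dictionary from the corpus).
toy_dictionary = {"the", "them", "me", "men", "a"}
segmented = max_match("themen", toy_dictionary)
```

On this toy input the greedy choice of "them" leaves "e" and "n" stranded, although "the" followed by "men" would have been a full dictionary segmentation. This kind of error is one reason a sequence model such as a CRF, which scores boundaries in context, can outperform a purely dictionary-driven baseline.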