Machine Translation (MT) is a research area of high theoretical and practical value. Research on machine translation requires the joint efforts of several disciplines: linguistics, mathematics, artificial intelligence, computer science, and others. This thesis describes the design and implementation of THCEMT, a Chinese-English machine translation system based on stratified analysis of syntax and semantics.

We first discuss the characteristics that distinguish Chinese from other natural languages. Our system uses a hybrid grammar that incorporates Context Free Grammar, Attribute-Constraint Grammar, and Case Grammar. The system is rule-based and combines top-down and bottom-up analysis methods. Particular emphasis is placed on semantic analysis and on the handling of certain special Chinese words.

At the lexical analysis level, we propose an efficient automatic Chinese word segmentation method that combines the Adjacent Matching method with post-segmentation correction based on syntactic and semantic constraints. This method resolves the common segmentation errors caused by reiterative locution, cross-link ambiguity, and polysemantic ambiguity. Within the coverage of our segmentation rules, the accuracy of Chinese word segmentation approaches 100%. Various methods are used to disambiguate words at both the syntactic and the semantic level.
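
As a rough illustration of why dictionary matching alone needs a correction step, the sketch below uses forward and backward maximum matching. This is a common textbook technique substituted here for illustration; it is not the Adjacent Matching method of the thesis, and the tiny lexicon and function names are assumptions made for the example. When the two directions disagree, the sentence contains the kind of overlapping (cross-link) ambiguity that a post-segmentation correction would have to resolve.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: longest lexicon word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan makes progress.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def backward_max_match(text, lexicon, max_len=4):
    """Greedy right-to-left segmentation: longest lexicon word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in lexicon:
                words.insert(0, candidate)
                j -= size
                break
    return words


# Hypothetical toy lexicon, chosen only to trigger an overlapping ambiguity.
LEXICON = {"研究", "研究生", "生命", "起源"}
sentence = "研究生命起源"

forward = forward_max_match(sentence, LEXICON)    # 研究生 / 命 / 起源
backward = backward_max_match(sentence, LEXICON)  # 研究 / 生命 / 起源

# Disagreement between the two passes flags a span that needs correction.
needs_correction = forward != backward
```

Here the forward pass greedily takes 研究生 ("graduate student") and strands 命, while the backward pass recovers 研究 / 生命 / 起源 ("study the origin of life"). Choosing between such candidates is where syntactic and semantic constraints come in.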

At the syntactic and semantic analysis level, we propose a Chinese sentence analysis method that combines top-down segmentation of word groups with bottom-up unification. The method makes efficient use of the syntactic and semantic constraints in Chinese sentences and emphasizes semantic analysis. It is also well stratified and easy to implement.
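
To make the idea of unification under syntactic and semantic constraints concrete, here is a minimal sketch of feature-structure unification over nested dictionaries. The feature names (`cat`, `sem`, `animate`, `human`, `head`) and the sample entries are hypothetical and are not taken from THCEMT; the sketch only shows how a semantic constraint can accept or reject a candidate constituent.

```python
def unify(a, b):
    """Unify two feature structures; return the merged structure, or None on a clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, b_value in b.items():
            if key in result:
                merged = unify(result[key], b_value)
                if merged is None:
                    return None  # conflicting values for the same feature
                result[key] = merged
            else:
                result[key] = b_value
        return result
    # Atomic values unify only when they are equal.
    return a if a == b else None


# Hypothetical constraint: the agent slot of some verb requires an animate NP.
agent_slot = {"cat": "NP", "sem": {"animate": True}}

student = {"cat": "NP", "head": "学生", "sem": {"animate": True, "human": True}}
stone = {"cat": "NP", "head": "石头", "sem": {"animate": False}}

accepted = unify(agent_slot, student)  # features merge, so the NP can fill the slot
rejected = unify(agent_slot, stone)    # None: the animacy constraint fails
```

In a bottom-up pass, a check of this kind would run each time smaller word groups are combined into a larger one, so that semantically implausible analyses are discarded early.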

At the rule processing level, we have implemented an extensible system in which the rule bases are separated from the main program, together with a natural, easy-to-interpret knowledge representation. This avoids frequent changes to the program code whenever the rules change. Rule interpretation is also divided into several levels, which partially resolves the rule conflict problem typical of rule-based systems.
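
The separation of rule bases from the program, and the use of levels to settle conflicts, can be sketched as follows. The rule file format (`level | source words -> target`), the two sample rules, and the function names are all assumptions made for illustration; the actual THCEMT rule representation is not described here.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    level: int      # lower level is interpreted first (assumed convention)
    pattern: tuple  # sequence of source words to match
    target: str     # target-language output


def load_rules(text):
    """Parse rules from plain text so they can be edited without touching the program."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        level, body = line.split("|", 1)
        pattern, target = body.split("->", 1)
        rules.append(Rule(int(level), tuple(pattern.split()), target.strip()))
    # Sorting by level decides which rule wins when several match.
    return sorted(rules, key=lambda rule: rule.level)


def apply_rules(words, rules):
    """Rewrite a word sequence, trying rules in level order at each position."""
    output, i = [], 0
    while i < len(words):
        for rule in rules:
            size = len(rule.pattern)
            if tuple(words[i:i + size]) == rule.pattern:
                output.append(rule.target)
                i += size
                break
        else:
            output.append(words[i])  # no rule applies; pass the word through
            i += 1
    return output


# Hypothetical rule base; in practice this text would live in a separate file.
RULE_TEXT = """
# level | source words -> target
2 | 打 -> hit
1 | 打 电话 -> make a phone call
"""

rules = load_rules(RULE_TEXT)
specific = apply_rules(["打", "电话"], rules)  # the level-1 rule takes precedence
general = apply_rules(["打", "人"], rules)     # only the level-2 rule matches 打
```

Because both rules match 打, an unordered rule base would be ambiguous; assigning the more specific rule to an earlier level is one simple way to resolve such a conflict without changing the program.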

We have built an automatic Chinese-English machine translation system that runs on PCs, with the rule bases separated from the translation program. The whole translation process, from the input of source text to the output of target text, is fully automatic. Tests on sample Chinese texts show that the system produces accurate and intelligible English within the limits of the existing rule bases.


  Hao Yan (Pattern Recognition & Intelligent Control)

  Directed by Professor Mao, Yuhang

Keywords: machine translation, rule-based, word segmentation, unification

Here is my thesis on Chinese-English machine translation. Sorry, it's in Chinese, and it's huge.

Here is a presentation that I gave to my group about my previous work at Tsinghua University.

