
Motivation

 

The work presented here should be seen as part of a larger framework whose goal is a system for computer-assisted correction of written texts. However, the algorithms presented in this thesis are general, and they can be used in a variety of applications. We are concerned mostly with natural language processing (NLP), but finite automata are useful in other domains as well, such as compilers, text processors, DNA sequence matching, and VLSI design. Most NLP applications have a layered architecture (fig. 1.1). In a real-life NLP system, some layers may be missing; the higher a level is situated in the diagram, the greater the chance that it is absent. This happens for two reasons: the lower levels are much easier to implement, and in many applications the higher ones can be omitted without a significant loss of performance. The boundaries between layers are blurred. One cannot perform good syntactic analysis without knowledge of semantics, good semantic analysis without pragmatics, and so on; statistical methods can be used in much the same way on all levels. The divisions are nevertheless useful, because they correspond to different modules, or groups of modules, in NLP software.

Figure 1.1: NLP layers

The bottom layer comprises the lexicon and the morphology. The morphology provides information about the features of individual words, either by dictionary lookup or by computation. The set of features depends on the requirements of the other system components. The features included most frequently are morphological in nature, such as part of speech, number, and gender. Other features to be considered are subcategorization and meaning. All algorithms presented in this thesis can be situated in this lowest layer.
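As a minimal sketch of what this lowest layer delivers, consider mapping an inflected form to its canonical form and morphological features by plain dictionary lookup. The entries and tag names below are hypothetical illustrations, not taken from the thesis:

```python
# Toy morphological lexicon: inflected form -> list of analyses,
# each analysis being (canonical form, part of speech, features).
LEXICON = {
    "geese": [("goose", "NOUN", "plural")],
    "ran":   [("run",   "VERB", "past")],
    "runs":  [("run",   "VERB", "3sg-present"),
              ("run",   "NOUN", "plural")],  # ambiguous form: two analyses
}

def analyze(form):
    """Return all analyses of a word form; empty list if unknown."""
    return LEXICON.get(form, [])
```

A real system would compress such a table heavily; the finite-state representations discussed below serve exactly that purpose.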

The layer can be implemented in a variety of ways. The method chosen here is to use finite-state automata and transducers. There are two main features of automata that justify the choice:

They should be seen jointly in comparison with other methods.

We use deterministic, acyclic finite-state automata (DAFSA). Acyclic automata are particularly suited to representing dictionaries, as they accept finite languages - a finite set of words, which corresponds to the definition of a dictionary. If we extend the notion of a word to an inflected form with annotations, we can construct morphological dictionaries - dictionaries that contain not only words, but their canonical forms and morphological categories as well. Deterministic automata are much faster in recognition than their nondeterministic counterparts: recognizing a word means following a single path in the automaton, so no (costly) backtracking is required.
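The single-path recognition described above can be sketched as follows. The hand-built automaton (a hypothetical example, not one from the thesis) accepts the finite language {"cat", "car", "cart"}, sharing the common prefix "ca"; determinism means each state has at most one transition per letter:

```python
# States are numbered; each state maps a letter to its unique successor.
TRANSITIONS = {
    0: {"c": 1},
    1: {"a": 2},
    2: {"t": 3, "r": 4},
    3: {},            # final state ("cat")
    4: {"t": 5},      # final state ("car"), path continues to "cart"
    5: {},            # final state ("cart")
}
FINAL = {3, 4, 5}

def accepts(word):
    """Follow the single deterministic path; no backtracking needed."""
    state = 0
    for letter in word:
        if letter not in TRANSITIONS[state]:
            return False
        state = TRANSITIONS[state][letter]
    return state in FINAL
```

Lookup time is proportional to the length of the word alone, independent of the size of the dictionary.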

The use of finite-state automata in NLP is not new. What is new is the incremental method of their construction, which lowers the memory requirements. The type of automata we use is also original: they do not have final states - they have final transitions. The origins of the work presented here date back to a lecture given by Lauri Karttunen ([Kar94]) at a seminar in Archamps, France.
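The distinction between final states and final transitions can be illustrated with a toy sketch (the encoding below is an assumption for illustration, not the thesis's actual data structure): each edge carries a flag saying whether a word may end after traversing it.

```python
# Edges: state -> {letter: (target state, is_final)}.
# This toy automaton accepts the finite language {"a", "ab"}.
EDGES = {
    0: {"a": (1, True)},   # "a" may end on this transition
    1: {"b": (2, True)},   # "ab" may end on this transition
    2: {},
}

def accepts(word):
    """Accept iff the path exists and its last transition is final."""
    state, ended_on_final = 0, False
    for letter in word:
        if letter not in EDGES[state]:
            return False
        state, ended_on_final = EDGES[state][letter]
    return ended_on_final
```

Note that under this scheme the empty word is never accepted, since acceptance requires traversing at least one final transition.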

Chapter 5 discusses a new method for constructing automata that guess the lexemes and morphological categories corresponding to unknown inflected word forms. These automata are constructed from already existing morphological lexicons. A method for guessing based on rules inferred from a lexicon was proposed by Andrei Mikheev ([Mik97]), and the conversion of such rules to finite-state transducers by Emmanuel Roche and Yves Schabes ([RS95]); both, however, rely on Eric Brill's learning paradigm.

The present dissertation upholds the following theses:

We realize those objectives by:



Jan Daciuk
Wed Jun 3 14:37:17 CEST 1998

Software at http://www.pg.gda.pl/~jandac/fsa.html