next up previous contents index
Next: Overview of the Dissertation Up: Introduction Previous: Motivation

Related Work

 

For an excellent general introduction to automata, see [HU79]. A taxonomy of finite-state automata minimization  algorithms can be found in [Wat93b], and a related taxonomy of finite-state automata construction algorithms can be found in [Wat93a]. Both papers discuss finite-state automata in general, and they do not include the algorithms that are described here. The algorithms that are presented here are for deterministic, acyclic FSAs. Since they are specific, they are faster than traditional ones, and they use less memory. They could form a separate branch in Watson's taxonomy. The algorithm for data in arbitrary order was also developed independently by Richard E. Watson and Bruce W. Watson. An alternative way for an incremental construction of automata and transducers was proposed by Dominique Revuz ([Rev91]). While doing a pseudo-minimization  during the construction phase, and economizing the use of memory, his algorithms construct the minimal automaton  in two steps, the second being the proper minimization. The first stage results in a automaton that is not minimal and that needs more memory than the minimal one.

The classical work on automata and two-level rules in morphology  is [KK94]. Although that paper was published in 1994, its unpublished versions were in circulation from about 15 years earlier. Other fundamental papers from this field are two articles ([Kos83] and [Kos84]) by Kimmo Koskenniemi, all influenced by [KK94]. Two-level rules and their implementation as transducers are discussed in [RRBP92] and [Spr92], probably the most popular books on morphology. A survey on the use of transducers in NLP is presented in [Moh97]. Several articles on the subject are in a book [RS97] edited by Emmanuel Roche and Yves Schabes.

The (physically) smallest automata are produced by using the concepts and algorithms developed by Tomasz Kowaltowski et al. ([KLS93b]). They introduce the notion of binary automata, and perform a very efficient optimization on the bit level (see section 4.4).

The morphological analysis  of unknown words is discussed by Eric Brill ([Bri93] and [Bri95]), Andrei Mikheev ([Mik97]), and mentioned by Emmanuel Roche and Yves Schabes ([RS95]). Eric Brill tries to discover rules governing the assignment of morphological categories to words. The assignment is based on prefixes and endings, and on context. They need an annotated corpus to learn the rules. Various rules generated from templates are applied to the corpus in turns, the best one, i.e. the one that gives categories that are closest to those in corpus is chosen and applied to the corpus, and the process continues until new rules do not improve the results. Emmanuel Roche and Yves Schabes ([RS95]) show how the rules can be transformed into a subsequential transducer which results in significant improvement in processing speed, and provides compact representation of rules. Andrei Mikheev ([Mik97]) uses a morphological lexicon instead of corpus to learn the rules using Brill's algorithm. Statistical techniques (e.g. [WMStex2html_wrap_inline570893]) have also been used for this purpose.

Spelling correction  has a rich bibliography. Three classical works are [Dam64], [Pet80], and [DLS83]. Those papers deal with isolated words. Systems that use context in spelling correction have also been described ([CH83], [CDKtex2html_wrap_inline570891], [CDGK92], [CGM93], [HJMtex2html_wrap_inline570882], [JHMR83], [KS81]). Shallow parsing for spelling correction was advocated in [Atw87]. A pioneering work on the use of finite-state methods in spelling correction was done by Kemal Oflazer ([OG94] and [Ofl96]).

Perfect hashing   by using finite-state automata is described in a tutorial ([Roc95]), which was presented at the ACL'95 conference.


next up previous contents index
Next: Overview of the Dissertation Up: Introduction Previous: Motivation

Jan Daciuk
Wed Jun 3 14:37:17 CEST 1998

Software at http://www.pg.gda.pl/~jandac/fsa.html