Next: Spelling and Restoration of Up: POLITECHNIKA GDANSKA Technical Previous: Finite State Transducers

Data for Applications

In this chapter, we discuss the problem of what should the automata contain (what languages they should accept) in various applications. The construction algorithms perceive data as lists of words, with words being a sequence of symbols from a given alphabet . They do not interpret the symbols themselves. However, the programs that use the automata do interpret them, and they expect the words in automata to be formed in a particular way.

The data formats discussed here are appropriate for the algorithms presented in the next two chapters, i.e. they are suitable for the incremental construction of automata and transducers. Although this does not mean they cannot be used with other construction algorithms, this does not mean they are suitable for every construction algorithm either. In fact, there is a whole family of methods ([Wat93a], also see section 4.1) that build automata from regular expressions. Another widely used method for building lexical transducers ([KK94], [Kar94]) is to merge transducers representing lexemes and two-level rules by using intersection and composition. That method makes it possible to construct transducers directly form morphological descriptions of lexicon entries. Finally, our algorithms deal with lexicalized data, while automata and transducers can be used as an implementation of translating rules. For example, two-level rules are compiled into transducers ([Kos83], [Kos84]), and simple automata can be used to model phonological phenomena ([BE94]).

The disadvantage of the data format presented here is that the data takes more space than e.g. regular expressions. However, it can be used on-the-fly as it is produced, i.e. occupying no space at all. In addition, this format is flexible, as it does not assume that a particular tool was used to form the data.

We believe that the method for coding the lexemes by specifying how many characters from the inflected form are present in it, and then appending the suffix of the inflected form is a common knowledge among researchers in the field. However, we are not aware of the use elsewhere of the methods for coding prefixes and infixes that we describe here.

Next: Spelling and Restoration of Up: POLITECHNIKA GDANSKA Technical Previous: Finite State Transducers

Jan Daciuk
Wed Jun 3 14:37:17 CEST 1998

Software at http://www.pg.gda.pl/~jandac/fsa.html