
Motivation

Morphological analysis of the words in a text is needed in many applications. It is a prerequisite for natural language parsing and for all applications that use parsing; it is also useful in document retrieval. Such analysis is usually lexicon-based, i.e. it requires a morphological lexicon.

Unfortunately, real-world texts contain correct words that cannot be found in a lexicon. It seems impossible to record all words of a living language in a lexicon: a lexicon is static in nature, whereas new words are coined continually. Another reason why words are missing from the lexicon is Zipf's law [Bri93], which states that the rank of an element multiplied by its frequency of occurrence is approximately constant. For example, in the Brown corpus, two percent of the distinct words account for sixty-nine percent of the text. About seventy-five percent of the distinct words occur five or fewer times in the corpus, fifty-eight percent occur two or fewer times, and forty-four percent occur only once. A consequence of Zipf's law is that doubling the number of words in the lexicon increases the coverage of arbitrary unrestricted text by only a few percent. Increasing the size of the lexicon is therefore a very costly effort that yields minute results.
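The effect of doubling the lexicon can be illustrated with a minimal sketch. It assumes an idealized distribution in which the frequency of a word is exactly proportional to 1/rank, and a hypothetical vocabulary of one million distinct words; neither assumption comes from the Brown corpus figures above, and the function names are chosen only for this illustration.

```python
from math import fsum


def harmonic(n: int) -> float:
    """Return the n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return fsum(1.0 / rank for rank in range(1, n + 1))


def coverage(lexicon_size: int, vocabulary_size: int) -> float:
    """Fraction of running text covered by the `lexicon_size` most frequent
    words, assuming frequency is exactly proportional to 1 / rank."""
    return harmonic(lexicon_size) / harmonic(vocabulary_size)


def gain_from_doubling(lexicon_size: int, vocabulary_size: int) -> float:
    """Extra coverage obtained by doubling the lexicon under the same assumption."""
    return (coverage(2 * lexicon_size, vocabulary_size)
            - coverage(lexicon_size, vocabulary_size))


if __name__ == "__main__":
    # Hypothetical sizes: 50,000 lexicon entries, 1,000,000 distinct words.
    print(f"{gain_from_doubling(50_000, 1_000_000):.3f}")
```

Under these assumptions, doubling a 50,000-word lexicon adds a little under five percentage points of coverage. Real corpora deviate from the idealized law, so the figure only indicates the order of magnitude.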

New words are also formed by derivation or compounding. While the analysis of such words is relatively easy, the number that actually occur is very small compared to the number of potential words that could be formed in the same way. It is therefore not practical to store all such derivatives and compounds in the lexicon. In many cases there are several ways to form a new word, and it is not possible to predict which one will be chosen.

Additionally, texts may contain incorrect words. For purposes such as spelling correction, the morpho-syntactic categories of a misspelled word may help reduce the list of possible corrections. If the misspelling does not affect the word's inflectional ending (and its prefix, if present), these categories may still be easily obtainable from the corrupted version.

It is possible to use textbooks to write rules that associate word endings with specific tags. Indeed, this seems to be a natural way to write a category guesser. The rules can be enhanced so that they predict not only categories but lexemes as well. However, this process requires a considerable amount of linguistic knowledge. Textbook rules do not capture the many exceptions that are inherent in human languages, so the rules must be refined over and over again.
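A hand-written category guesser of this kind might look like the following minimal sketch. The rule table, the tag names and the function `guess_categories` are all hypothetical and chosen only for illustration; they are not taken from any textbook or from the software described here.

```python
from typing import Dict, Set

# Hypothetical hand-written rules: word ending -> candidate categories.
SUFFIX_RULES: Dict[str, Set[str]] = {
    "ness": {"NOUN"},
    "ing": {"VERB-GERUND", "NOUN"},
    "ed": {"VERB-PAST"},
    "ly": {"ADVERB"},
    "s": {"NOUN-PLURAL", "VERB-3SG"},
}


def guess_categories(word: str) -> Set[str]:
    """Return the categories attached to the longest matching ending,
    or an empty set if no rule applies."""
    lowered = word.lower()
    # Try the longest endings first so that "ness" wins over "s".
    # The whole word is never treated as an ending.
    for length in range(len(lowered) - 1, 0, -1):
        tags = SUFFIX_RULES.get(lowered[-length:])
        if tags is not None:
            return set(tags)
    return set()


if __name__ == "__main__":
    for word in ("kindness", "quickly", "fly"):
        print(word, sorted(guess_categories(word)))
```

The sketch labels "fly" as an adverb because it ends in "ly". Exceptions of this sort are what force hand-written rules to be refined repeatedly.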

Our aim is to

use an already existing morphological lexicon;
reduce the size of the lexicon by cutting out useless information (beginnings of words);
generalize the knowledge contained in the lexicon so that morphological information for unknown words can be predicted accurately.


Jan Daciuk
Wed Jun 3 14:37:17 CEST 1998

Software at http://www.pg.gda.pl/~jandac/fsa.html