
Morphological Analysis of Unknown Words

 

For most inflectional languages, such as Polish, French, or German, the information about the categories of a word, and about how to obtain its canonical form, is carried by the ending of the inflected form. As we intend to recognize word endings, our lexicon is constructed from inverted words: the last letter of a word is the first letter of the string stored in the lexicon, the penultimate letter is the second one, and so on. The beginning of a word is marked with a special character, which makes it possible to distinguish complete words from suffixes. We append annotations at the end of the inverted word. Another special character (an annotation separator) separates words from annotations.

As before, we use the term ``annotation'' to refer to everything past the annotation separator. Again, the annotations may include lexemes and morphological categories. If lexemes are to be included, they need the same (or a similar) coding scheme as shown in the previous subsection.

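The annotations are not inverted. For example, the word spała, which could have the annotation Cć+V[aspect=imperf mode=ind tense=past num=sg pers=3 gender=fem] (``C'' removes two letters from the end of the word and ``ć'' is appended, giving the lexeme spać), may be represented as shown in fig. 3.8.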

Figure 3.8: Input data for "spała" for a morphological guesser
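The construction of such an entry can be sketched in a few lines of Python. This is only an illustration of the description above, not the exact format of fig. 3.8: the marker characters `^` (beginning of word) and `+` (annotation separator), and the function name `guesser_entry`, are assumptions made for this sketch.

```python
# Assumed special characters; the source does not state which ones are used.
WORD_START = "^"   # hypothetical marker for the beginning of a word
ANNOT_SEP = "+"    # hypothetical annotation separator


def guesser_entry(word: str, annotation: str) -> str:
    """Build one input line for a guesser lexicon.

    The word is inverted, so its ending comes first. The beginning-of-word
    marker then follows the (inverted) word, and the annotation is appended
    unchanged, i.e. not inverted.
    """
    inverted = word[::-1]
    return inverted + WORD_START + ANNOT_SEP + annotation


entry = guesser_entry(
    "spała",
    "Cć+V[aspect=imperf mode=ind tense=past num=sg pers=3 gender=fem]",
)
print(entry)
```

Because the word is stored from its last letter onwards, entries that share an ending share a path from the start of the automaton.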

Prefixes introduce problems similar to those in automata representing lexicons. If lexemes are to be guessed, they cannot be represented in full form, as they would inflate automata. Exactly the same coding scheme as that described in the previous subsection can be used to solve that problem.

We cannot assume that the categories of inflected forms depend only on suffixes. This is true e.g. for French, but many other languages do not follow that rule. In Polish, superlatives of adjectives and adverbs are formed by adding the prefix ``naj'' to the corresponding comparatives. In German, past participles are formed by adding the prefix ``ge'' and a suffix ``t'' that is also used in other forms.

Prefixes also inflate guessing automata, because the parts that distinguish between various annotations are at the beginnings of words. The transitions that code them cannot be removed, and neither can any of the transitions that precede them. So where prefixes are used, whole words are stored in the automaton.

There is, however, a solution to this problem. We augment the annotations so that prefixes can be moved there from the inflected forms. To separate prefixes from the other annotations we use another annotation separator. E.g. for the Polish adjective ``szybszy'' (``faster'') we have the input data shown in fig. 3.9.

Figure 3.9: Input data for "szybszy" for a morphological guesser

This means that ``szybszy'' is an inflected form of an adjective in the comparative degree with the given categories, that this form has no prefix, and that the corresponding lexeme is formed from ``szybszy'' by removing ``D'' - ``A'' = 3 letters from the end and appending ``ki''. The data for the superlative form of the same adjective is given in fig. 3.10.

Figure 3.10: Input data for "najszybszy" for a morphological guesser

This means that the inflected form is formed from ``szybszy'' by prepending the prefix ``naj'' (thus giving ``najszybszy''), and that the lexeme is formed from ``szybszy'' by removing the last 3 letters and adding ``ki''. Note that the endings of ``szybszy'' and ``najszybszy'' are the same, but the forms have different morphological categories.

By moving prefixes to annotations, we arrive at a situation where the inflected form (without the prefix) has the same beginning as the corresponding lexeme, at least in regular words. To obtain the lexeme from the inflected form we therefore need only cut some letters from the end of the inflected form and append some others, just as if there were no prefixes.
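The decoding described for ``szybszy'' and ``najszybszy'' can be sketched as follows. The function names and the representation of the code as a single string (a count letter followed by the ending to append) are assumptions made for this illustration; only the arithmetic (``D'' - ``A'' = 3) comes from the text.

```python
def decode_count(letter: str) -> int:
    """Turn a code letter into a number: 'A' -> 0, 'B' -> 1, 'C' -> 2, ..."""
    return ord(letter) - ord("A")


def apply_lexeme_code(form: str, code: str) -> str:
    """Remove the coded number of letters from the end, then append the rest.

    `code` is assumed to be one count letter followed by the new ending,
    e.g. "Dki" means: drop 3 letters, append "ki".
    """
    k = decode_count(code[0])
    stem = form[: len(form) - k]   # safe for k == 0, unlike form[:-k]
    return stem + code[1:]


def expand(form: str, prefix: str, code: str) -> tuple[str, str]:
    """Return (inflected form with its prefix restored, lexeme)."""
    return prefix + form, apply_lexeme_code(form, code)


print(expand("szybszy", "", "Dki"))      # comparative: no prefix
print(expand("szybszy", "naj", "Dki"))   # superlative: prefix "naj"
```

Both calls use the same stored form and the same lexeme code; only the prefix field differs, which is why moving the prefix into the annotation lets the two entries share their path in the automaton.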

Unfortunately, this approach does not solve all problems with the German past participle. Some verbs have prefixes that can be separated from the stem. In such cases the prefix ``ge'' becomes an infix, e.g. the past participle for ``einladen'' is ``eingeladet''. The solution is the same as in the case of data for morphological analysis: an additional code (the infix code) specifies the offset of the infix from the beginning of the word. The code is inserted in front of the infix. If there is no infix and the inflected form contains a prefix, the offset is zero, i.e. the code is ``A''. The code for eingeladet is given in fig. 3.11. The three capital letters after the first annotation separator specify what should be deleted from the inflected form, and where, before the suffix n can be appended to obtain the lexeme: ``C'' means that 2 characters are to be deleted near the beginning of the inflected form, ``D'' means that these characters follow the third character of the inflected form, and ``B'' means that the last character of the inflected form is to be deleted.

Figure 3.11: Coded infixes in data for a morphological guesser
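The three-letter code for eingeladet can be decoded along the following lines. This is a sketch under stated assumptions: the function name is hypothetical, and the code is taken to be the three letters in the order given in the text (infix length, infix offset, number of final letters to cut) followed directly by the ending to append, which may differ from the exact layout of fig. 3.11.

```python
def lexeme_from_infixed(form: str, code: str) -> str:
    """Recover the lexeme from an inflected form that contains an infix.

    Assumed layout of `code`:
      code[0]  - length of the infix to delete      ('C' -> 2)
      code[1]  - offset of the infix from the start ('D' -> 3, 'A' -> 0)
      code[2]  - number of letters to cut at the end ('B' -> 1)
      code[3:] - ending to append
    """
    infix_len = ord(code[0]) - ord("A")
    offset = ord(code[1]) - ord("A")
    cut = ord(code[2]) - ord("A")
    ending = code[3:]

    # Delete the infix (or the prefix, when the offset is zero).
    without_infix = form[:offset] + form[offset + infix_len:]
    # Cut the old ending and append the new one.
    return without_infix[: len(without_infix) - cut] + ending


print(lexeme_from_infixed("eingeladet", "CDBn"))
```

With offset ``A'' the same function removes an ordinary prefix, so the prefix case needs no separate handling in this sketch.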

The removable prefix can be removed from the inflected form and placed somewhere else in the sentence: in ``er ladet sie ein'', ladet and ein come from the lexeme einladen. The data for the guesser can take that into account in a way similar to the handling of prefixes. The question is whether this should be done in the dictionary for the guesser. Many prefixes behave in that way, and they could be listed in many guesses, making the output less clear. It seems that the problem should rather be solved at the syntactic level.



Jan Daciuk
Wed Jun 3 14:37:17 CEST 1998

Software at http://www.pg.gda.pl/~jandac/fsa.html