More Generalization

Sometimes it is impossible to devise a rule that associates an ending with the correct annotation, because the choice is lexicalized, i.e. it depends on a particular word and seems arbitrary from the morphological point of view. For example, in Polish, there is a rule that transforms the adjectival ending -sny in lexemes into -śniejszy in comparatives and superlatives. There is, however, another rule that transforms the ending -śny into -śniejszy in comparatives and superlatives. So there is no way of recovering the lexeme from a comparative or superlative ending other than a dictionary lookup. In such cases R6 introduces artificial divisions, e.g.:

-raśniejszy -raśny
-iaśniejszy -iasny
-maśniejszy -maśny
-waśniejszy -waśny
jaśniejszy jasny
-ośniejszy -ośny
-dośniejszy -dosny
-ześniejszy -zesny
-oleśniejszy -olesny
-bleśniejszy -bleśny
-uśniejszy -uśny

while the right answer is that both annotations must be considered:

-śniejszy -śny
-śniejszy -sny

To cope with that situation, we introduce a new rule that strives to accommodate such cases. We will use the term first annotated state to name a state that is a target of a transition labeled with the annotation separator (a state that begins an annotation or a set of annotations).

R7. If, for a given state, the number of first annotated states reachable from that state does not exceed a given limit, then:

replace the first annotated states by their union;
replace all the states and transitions between the chosen state and the union of the first annotated states by a single transition labeled with the annotation separator.

Note that it is possible to introduce a lower limit on the number of states to be removed in order to ensure that we are dealing with a case such as the one described above (-sny and -śny). The rule can then work in parallel with R6.

To make things clear, we need to describe what we mean by a union of states. We form it by constructing a new state that has all the transitions of the contributing states. For pairs of transitions that have the same label but go to different states, we construct transitions going to the unions of those states:

A union of states A and B is a state that has: all transitions from A and B whose labels occur in only one of the two states; all transitions from A and B that have the same label and go to the same state; and, for each pair of transitions from A and B that have the same label but go to different states, a transition with that label going to the union of the two target states.
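The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the software: a state is modeled as a plain dictionary mapping labels to target states, the key END (a hypothetical marker, not part of the definition) flags a final state, and the automaton is assumed to be acyclic. The helpers chain and language are likewise hypothetical and exist only to build and inspect small examples.

```python
END = None  # hypothetical key marking a final (accepting) state


def union(a, b):
    """Return a new state combining the transitions of states a and b."""
    merged = dict(a)  # start with every transition of a
    for label, target in b.items():
        if label is not END and label in merged and merged[label] is not target:
            # Same label, different targets: go to the union of the targets.
            merged[label] = union(merged[label], target)
        else:
            # Label present in only one state, or both go to the same state.
            merged[label] = target
    return merged


def chain(word):
    """Build a linear chain of states spelling `word` (illustrative helper)."""
    state = {END: True}
    for ch in reversed(word):
        state = {ch: state}
    return state


def language(state, prefix=""):
    """Collect every string accepted when starting from `state`."""
    words = set()
    for label, target in state.items():
        if label is END:
            words.add(prefix)
        else:
            words |= language(target, prefix + label)
    return words


a = chain("śny")
b = chain("sny")
print(sorted(language(union(a, b))))
```

The union is built as a new dictionary, so the contributing states are left unchanged; targets that need no merging are shared rather than copied.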

It is worth noting that while rule R6 introduces very detailed distinctions, R7 discards details. For the guesser, the result of applying R7 is that one gets more choices than if neither R6 nor R7 had been applied. As to the lexicon size, R7 removes small differences between similar word forms, making it possible to infer more general and compact relations between endings and annotations.

Note that although no annotation possibility is lost and the automaton is much smaller, the answers for known words are no longer 100% accurate. The correct answer always appears, but it may be accompanied by other, incorrect possibilities. In many cases exceptions are merged with regular rules. Imposing a lower limit on the number of states to be removed by this rule can solve the problem.
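The effect of R7 on the -sny/-śny example can be illustrated with a small Python sketch. It is a simplified model under several assumptions, not the actual implementation: states are dictionaries mapping labels to target states, "+" stands in for the annotation separator, the key END is a hypothetical marker for final states, annotations are reduced to bare lexeme endings, and the optional lower limit on removed states is not modeled. The function names (apply_r7, first_annotated_states, add, language) are made up for this example.

```python
from functools import reduce

SEP = "+"   # assumed symbol for the annotation separator
END = None  # hypothetical key marking a final (accepting) state


def union(a, b):
    """Union of two states, following the definition given earlier."""
    merged = dict(a)
    for label, target in b.items():
        if label is not END and label in merged and merged[label] is not target:
            merged[label] = union(merged[label], target)
        else:
            merged[label] = target
    return merged


def first_annotated_states(state, seen=None):
    """Collect targets of separator transitions reachable from `state`."""
    seen = {} if seen is None else seen
    for label, target in state.items():
        if label is END:
            continue
        if label == SEP:
            seen[id(target)] = target  # do not look past the separator
        else:
            first_annotated_states(target, seen)
    return list(seen.values())


def apply_r7(state, limit):
    """Apply the sketch of R7 to `state` in place; return True if it fired."""
    targets = first_annotated_states(state)
    if not targets or len(targets) > limit:
        return False
    merged = reduce(union, targets)
    state.clear()        # drop everything between the state and the annotations
    state[SEP] = merged  # single separator transition to the union
    return True


def add(root, string):
    """Insert `string` into a trie-shaped automaton rooted at `root`."""
    node = root
    for ch in string:
        node = node.setdefault(ch, {})
    node[END] = True


def language(state, prefix=""):
    """Collect every string accepted when starting from `state`."""
    words = set()
    for label, target in state.items():
        if label is END:
            words.add(prefix)
        else:
            words |= language(target, prefix + label)
    return words


# State reached after reading the (inverted) ending -śniejszy; the remaining
# letters are the preceding characters of the word, then separator and lexeme ending.
root = {}
for entry in ["ar+śny", "ai+sny", "o+śny", "od+sny"]:
    add(root, entry)

apply_r7(root, limit=4)
print(sorted(language(root)))
```

After the rule fires, both annotations are offered regardless of the preceding letters, which is exactly the loss of precision for known words described above: the correct ending is always among the answers, but so is the other one.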


Jan Daciuk
Wed Jun 3 14:37:17 CEST 1998

Software at http://www.pg.gda.pl/~jandac/fsa.html