
Restoration of Diacritics

 

Restoration of diacritics can be seen as a special case of spelling correction. In particular, the same division into word-based methods and context-based methods is present. The context-based methods are similar to those used in spelling correction; for a description of them, see [Yar94a] and [Yar94b].

The word-based methods for the restoration of diacritics also serve as the basis for the context-based methods, providing them with the candidate words to choose from. Their techniques differ in part from those used in spelling correction. In particular, the presence of a word from the text in the lexicon does not mean that the word is correct: all lexicon words that yield the given word when deprived of their diacritics must be returned.

We define a function strip that converts each letter with a diacritic in its argument word into the letter of the same shape without the diacritic. This function defines a relation:


\[ R_{strip} = \{ (w', w) : strip(w') = w \} \]
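As an illustration, the strip function described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation used in this work: it relies on Unicode canonical decomposition to separate base letters from combining marks, and the table EXTRA is a hypothetical, deliberately incomplete list of letters (such as the Polish ł) that Unicode does not decompose.

```python
import unicodedata

# Assumption: letters whose diacritic is not encoded as a combining mark,
# so NFD decomposition leaves them unchanged. Illustrative, not complete.
EXTRA = str.maketrans({"ł": "l", "Ł": "L", "ø": "o", "Ø": "O", "đ": "d", "Đ": "D"})


def strip(word: str) -> str:
    """Return the word with every letter replaced by its diacritic-free form."""
    # Split each precomposed letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    # Drop the combining marks, keeping only the base letters.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Handle the letters that have no canonical decomposition.
    return base.translate(EXTRA)


print(strip("żółw"))  # zolw
```

A word that already lacks diacritics is returned unchanged, so strip is idempotent.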

The purpose of the word-based method for the restoration of diacritics is to find all words that stand in this relation to the word in question. This can be achieved with statistical techniques that measure the frequencies of letter sequences in words (see [Dac97]), but the standard technique is to use a lexicon. In particular, a lexicon in the form of a finite-state automaton offers the usual advantages: compact representation and great speed of processing. Figure 6.3 gives the algorithm for finding all words in the lexicon that stand in this relation to the given word.

Figure 6.3: Restoration of diacritics with finite-state automata

The dot operator in Figure 6.3 represents concatenation. Note that the same algorithm can also be used with morphological dictionaries, whether they are implemented as transducers or as acceptors. The changes needed to adapt it are trivial.
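The traversal idea can be sketched in Python as follows. This is a minimal sketch, not the algorithm of Figure 6.3 itself: an unminimized trie built from nested dictionaries stands in for the finite-state automaton, the names build_trie, restore and END are hypothetical, the input word is assumed to carry no diacritics, and the strip helper is an assumption based on Unicode decomposition with a small illustrative table for letters such as ł.

```python
import unicodedata

# Assumption: illustrative table for letters that Unicode does not decompose.
EXTRA = str.maketrans({"ł": "l", "Ł": "L"})
END = ""  # hypothetical key marking a final state


def strip(text):
    """Remove diacritics from every letter of the text."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.translate(EXTRA)


def build_trie(words):
    """Build a trie (a deterministic acyclic automaton, not minimized)."""
    root = {}
    for word in words:
        node = root
        # NFC keeps each letter with a diacritic as a single transition label.
        for ch in unicodedata.normalize("NFC", word):
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def restore(node, word, prefix=""):
    """Yield every lexicon word that strip maps onto the given word."""
    if not word:
        if END in node:
            yield prefix
        return
    for label, target in node.items():
        # Follow each transition whose stripped label matches the next letter.
        if label != END and strip(label) == word[0]:
            # prefix + label plays the role of concatenation.
            yield from restore(target, word[1:], prefix + label)


lexicon = build_trie(["las", "laska", "łaska", "łaską"])
print(sorted(restore(lexicon, "laska")))  # ['laska', 'łaska', 'łaską']
```

Only transitions compatible with the input are followed, so the cost depends on the length of the word and the number of diacritic variants per letter rather than on the size of the lexicon.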



Jan Daciuk
Wed Jun 3 14:37:17 CEST 1998

Software at http://www.pg.gda.pl/~jandac/fsa.html