Morphological dictionary acquisition tool

Introduction

The purpose of this tool is to assist in gathering data for a morphological dictionary. It is assumed that a morphological dictionary for the given language already exists, and that it was created using mmorph - a morphology tool developed at ISSCO, Geneva. It is also assumed that you use the fsa utilities available from http://www.pg.gda.pl/~jandac/fsa.html.

In order to use this tool you need mmorph and fsa_guess, as well as a guessing automaton for fsa_guess. You can produce that automaton with fsa_build or fsa_ubuild, and prepare data for them with scripts available from the same fsa utilities package.

The basic procedure is as follows:

  1. Produce a guessing automaton from the dictionary. Consult the README file from the fsa utilities package, and manual pages for fsa_build(1), fsa_guess(1), and fsa_guess(5). You will also find appropriate scripts in the same package.
  2. Produce a list of words not present in the dictionary. You can use fsa_spell for that purpose - consult the manual page for fsa_spell(1).
  3. Run fsa_guess with the guessing automaton on the list of unknown words, and save the resulting file (the guesses).
  4. Optionally, use the chkmorph.pl script to eliminate those guesses that do not produce the inflected form they are supposed to produce.
  5. Load the guesses file into the Morphological Dictionary Acquisition Tool.
  6. Use the Morphological Dictionary Acquisition Tool to produce descriptions in mmorph format.
  7. Save descriptions in a file.
  8. Merge the new descriptions with existing ones.
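
Step 8 is left to the user. The following is a minimal sketch in Python of one way to do it. It assumes (this is not stated above) that each description occupies a single line and that an exact duplicate line is redundant. The function name and the sample entries are hypothetical, and the entries are not meant to be real mmorph syntax.

```python
def merge_descriptions(existing, new):
    """Return the existing lines, followed by new lines not already present.

    Assumption: one description per line; duplicates are detected by
    comparing lines after stripping surrounding whitespace.
    """
    # Keep the existing dictionary untouched, apart from trailing newlines.
    merged = [line.rstrip("\n") for line in existing]
    seen = {line.strip() for line in merged}
    for line in new:
        entry = line.strip()
        # Skip blank lines and descriptions that are already in the dictionary.
        if entry and entry not in seen:
            seen.add(entry)
            merged.append(entry)
    return merged


if __name__ == "__main__":
    # Illustrative entries only.
    old = ['Noun[case=nom] "dom"', 'Noun[case=nom] "kot"']
    saved = ['Noun[case=nom] "kot"', 'Noun[case=nom] "pies"']
    for entry in merge_descriptions(old, saved):
        print(entry)
```

With real files, the two arguments could be the lines read from the existing dictionary source and from the file saved by the tool.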

Loading the guesses file.

The guesses file can be loaded from the menus, specified on the command line, or loaded with the "Load new" button under the "Word form" window. To load the file from the menus, choose File/Open guesses, and then choose the appropriate file. To specify the file on the command line, use -G guessesfile. It has to be a capital G, because tcl/tk steals -g. To load guesses with the button, just press it.

Producing descriptions in mmorph format.

For each word form you are interested in, do:

  1. Click the form. One or more descriptions should appear in the Descriptions pane.
  2. Choose a description from the Descriptions pane. If you are not sure which of them is correct, click on the description and press the Mmorph button. All forms derived from that description should appear in the Mmorph output pane. If you cannot see the difference between two descriptions, select both (hold the Control key) and press the Mmorph button. The Mmorph pane should show the difference between the forms produced by those descriptions. If you want to see whether all required forms are generated, click on "yes" in the "Expand alternatives" field. Then, if the description contains more than one possible value for a feature, e.g. "case=nom|acc", the description will be split in two. The first one will contain "case=nom", and the second one "case=acc".
  3. If none of the descriptions in the Descriptions pane is correct, you can correct one by clicking on the Correct button. A popup window will appear in which you can make corrections.

     Another possibility is to correct the entries that appear in the mmorph output window, and then to press mAtch mmorph. The tool will try to find the matching description. In order to do that, it needs additional information, usually found in the file "paradigm", or in another file specified with cusTomize/paradigm file.

     The first character of that file is the character that begins a comment (you can change it if you like). All lines in the file that begin with that character are ignored. The other lines consist of three columns, separated with spaces or horizontal tabulation characters. The first column contains a part of speech (POS); the other two columns are relevant only for descriptions containing that POS. There can be more than one line with the same POS. The second column contains a regular expression. If the expression matches the description, the third column lists the features whose values may be changed in order to arrive at the correct description. A feature name can be followed by an asterisk; in that case all possible combinations of the values of that feature will be generated.

     If the correct description is found, the background color of the corrected entry changes to green. Note that the search may take some time, during which the mAtch mmorph button remains pressed.

     You can also use guided correction: press the right mouse button on the description you want to change.
  4. Press the Save button. The description is added to a list of descriptions that will be saved at the end of the session (i.e. when you quit the tool). Depending on the "Save removes" radio buttons, saving the description removes from the word form pane either all word forms generated by it, only the current form (the one that was used for guessing), or nothing.
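
Two mechanisms from the steps above lend themselves to a short illustration: the splitting done by "Expand alternatives" in step 2, and the layout of the paradigm file in step 3. The Python sketch below is not the tool's own code (the tool itself is written in tcl/tk). All function and type names are hypothetical. It further assumes that features in the third column are separated by commas or whitespace, and that the regular expressions are also valid for Python's re module; neither assumption is confirmed by this document.

```python
import itertools
import re
from typing import NamedTuple

# A feature with two or more alternative values, e.g. "case=nom|acc".
ALTERNATIVES = re.compile(r"(\w+)=(\w+(?:\|\w+)+)")


def expand_alternatives(description):
    """Split a description so that each copy has one value per feature."""
    matches = list(ALTERNATIVES.finditer(description))
    if not matches:
        return [description]
    options = [[f"{m.group(1)}={value}" for value in m.group(2).split("|")]
               for m in matches]
    results = []
    for combination in itertools.product(*options):
        pieces, last = [], 0
        for match, replacement in zip(matches, combination):
            pieces.append(description[last:match.start()])
            pieces.append(replacement)
            last = match.end()
        pieces.append(description[last:])
        results.append("".join(pieces))
    return results


class ParadigmRule(NamedTuple):
    pos: str              # first column: part of speech
    pattern: re.Pattern   # second column: regular expression
    features: list        # third column: (name, has_asterisk) pairs


def parse_paradigm(text):
    """Parse paradigm-file text into a list of ParadigmRule."""
    if not text:
        return []
    comment_char = text[0]  # the first character of the file starts a comment
    rules = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(comment_char):
            continue
        columns = line.split(None, 2)  # columns separated by spaces or tabs
        if len(columns) != 3:
            continue  # assumption: silently skip malformed lines
        pos, pattern, feature_field = columns
        # Assumption: features are separated by commas or whitespace.
        names = [n for n in re.split(r"[,\s]+", feature_field) if n]
        features = [(n[:-1], True) if n.endswith("*") else (n, False)
                    for n in names]
        rules.append(ParadigmRule(pos, re.compile(pattern), features))
    return rules


def candidate_features(rules, pos, description):
    """Return the feature lists of all rules that apply to a description."""
    return [rule.features for rule in rules
            if rule.pos == pos and rule.pattern.search(description)]
```

For instance, expand_alternatives applied to a description containing "case=nom|acc" returns two descriptions, one with "case=nom" and one with "case=acc", in that order. The description syntax around the feature is left untouched, so the sketch does not depend on the exact mmorph notation.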

Saving descriptions in a file.

The descriptions are saved automatically when you quit by pressing the Exit button or by choosing Exit from the File menu. You can specify the output file with the -o command line option.

Customization.

Command line options can be used to change the behavior of the tool. Remember to separate the option from its value with a space.

You can change the language of menus, buttons, and labels either by specifying the language on the command line with -l, or by choosing the Customize/Language menu entry. A description for that language must exist in the language description file. The file itself may be specified with the -c command line option.

You can specify the font used for displaying word forms, descriptions, and mmorph output either by choosing Customize/Font from the menus, or with the -f command line option.

When you press the Mmorph button, the tool writes a file that serves as input for mmorph, which then expands it. You can specify the name of that file with the -m command line option.


Jan Daciuk, e-mail: jandac.eti.pg.gda.pl (replace the first dot with "@")