Morphological dictionary acquisition tool
Introduction
The purpose of this tool is to assist in gathering data for a
morphological dictionary. It is assumed that a morphological
dictionary for a given language already exists, and that it was
created using mmorph, a morphology tool developed at ISSCO, Geneva.
It is also assumed that you use the fsa utilities available from
http://www.pg.gda.pl/~jandac/fsa.html.
In order to use this tool you need mmorph and fsa_guess, as well as a
guessing automaton for fsa_guess. You can produce that automaton with
fsa_build or fsa_ubuild, and prepare data for them with the scripts
available from the same fsa utilities package.
The basic procedure is as follows:
- Produce a guessing automaton from the dictionary. Consult the
README file from the fsa utilities package, and manual pages for
fsa_build(1), fsa_guess(1), and fsa_guess(5). You will also find appropriate scripts in the same package.
- Produce a list of words not present in the dictionary. You can use
fsa_spell for that purpose; consult the manual page for fsa_spell(1).
- Run fsa_guess with the guessing automaton on the list of unknown
words, and save the resulting file (the guesses).
- You can use the chkmorph.pl script to eliminate those guesses that
do not produce the inflected form they are supposed to produce.
- Load the guesses file into the Morphological Dictionary Acquisition Tool.
- Use the Morphological Dictionary Acquisition Tool to produce
descriptions in mmorph format.
- Save descriptions in a file.
- Merge the new descriptions with existing ones.
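The last step of the procedure, merging, can be sketched in Python. This is only an illustration under stated assumptions: it treats every description as a single line of text and skips exact textual duplicates. It does not understand mmorph syntax, and the function name merge_descriptions is hypothetical, not part of the tool or of mmorph.

```python
def merge_descriptions(existing, new):
    """Append new descriptions that are not already present.

    Assumption: one description per line; two descriptions are the same
    only if their text is identical. Existing lines are kept untouched.
    """
    merged = [line.rstrip("\n") for line in existing]
    seen = set(merged)
    for line in new:
        entry = line.rstrip("\n")
        # Skip blank lines and anything already present.
        if entry.strip() and entry not in seen:
            merged.append(entry)
            seen.add(entry)
    return merged
```

The function accepts any iterable of lines, so open file objects can be passed directly. The merged lines still have to be placed in the right part of the existing dictionary source by hand.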
Loading the guesses file.
The guesses file can be loaded from the menus, specified on the
command line, or loaded with the "Load new" button under the "Word form" window.
To load the file from the menus, choose File/Open guesses, and then choose
the appropriate file. To specify the file on the command line, use -G
guessesfile. It has to be a capital G, because tcl/tk steals -g. To load
guesses using the button, just press it.
Producing descriptions in mmorph format.
For each word form you are interested in, do:
- Click the form. One or more descriptions should appear in
the Descriptions pane.
- Choose a description from the Descriptions pane. If you are not
sure which of them is correct, click on the description and press the
Mmorph button. All forms derived from that description should
appear in the Mmorph output pane. If you cannot see the difference
between two descriptions, select both (hold the Control key) and
press the Mmorph button. The Mmorph output pane should show the
difference between the forms produced by those descriptions. If you
want to see whether all required forms are generated, click on "yes"
in the "Expand alternatives" field. Then, if the description contains
e.g. "case=nom|acc", i.e. more than one possible value, the
description will be broken into two. The first one will contain
"case=nom", the second one "case=acc".
- If none of the descriptions in the Descriptions pane is correct, you
can correct one by clicking on the Correct button. A new popup window
will appear, in which you can make corrections. Another
possibility is to correct entries that appear in the mmorph output
window, and then to press mAtch mmorph. The tool will try to find
the matching description. In order to do that, it needs additional
information, usually found in the file "paradigm", or in another
file that is specified with cusTomize/paradigm file. The first
character of that file is a character that begins a comment (you
can change it if you like). All lines in the file beginning with
that character are ignored. Other lines are formed in three
columns. The columns are separated with spaces or horizontal
tabulation characters. The first column contains a part of speech
(POS). The other two columns are relevant only for descriptions
containing that POS. There can be more than one line with the same
POS. The second column contains a regular expression. If the
expression matches the description, the third column applies: it
contains a list of features whose values can be changed in an attempt
to arrive at the correct description. A feature name can be followed
by an asterisk. In that case all possible combinations of the
values of that feature will be generated. If the correct
description is found, the background color of the corrected entry
changes to green. Note that this may take some time, during
which the mAtch mmorph button stays pressed.
You can also use guided correction. Press the right mouse button on
the description you want to change.
- Press the Save button. The description is added to a list of
descriptions that will be saved at the end of the session
(i.e. when you quit the tool). Depending on the "Save removes"
radio buttons, saving the description removes from the word form pane
either all word forms generated by it, only the current form (the one
that was used for guessing), or nothing.
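Two mechanisms described in the steps above, expanding alternatives and reading the paradigm file, can be sketched in Python. This is a minimal illustration, not the tool's own code (the tool is written in tcl/tk). Assumptions that do not come from this document: a description is treated as a whitespace-separated list of feature=value pairs, the features in the third column are separated by commas or whitespace, and Python's regular expression syntax stands in for whatever syntax the tool accepts. All function names are hypothetical.

```python
import itertools
import re


def expand_alternatives(description):
    """Return one description per combination of alternative values.

    Assumption: whitespace-separated feature=value pairs, with "|"
    separating the alternative values of a feature.
    """
    choices = []
    for pair in description.split():
        if "=" not in pair:
            choices.append([pair])  # not a feature=value pair, keep as is
            continue
        feature, _, values = pair.partition("=")
        choices.append([f"{feature}={value}" for value in values.split("|")])
    return [" ".join(combo) for combo in itertools.product(*choices)]


def parse_paradigm(text):
    """Parse paradigm-file text into (pos, regex, features) rules.

    The first character of the text is the comment character. Each
    feature is returned as (name, starred), where starred is True if
    the name was followed by an asterisk.
    """
    if not text:
        return []
    comment_char = text[0]
    rules = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(comment_char):
            continue
        fields = line.split(None, 2)  # columns: POS, regex, feature list
        if len(fields) < 3:
            continue  # malformed line, skipped in this sketch
        pos, pattern, feature_list = fields
        features = [(name.rstrip("*"), name.endswith("*"))
                    for name in re.split(r"[,\s]+", feature_list.strip())
                    if name]
        rules.append((pos, re.compile(pattern), features))
    return rules


def features_to_vary(rules, pos, description):
    """Collect features from every rule for this POS whose regex matches."""
    found = []
    for rule_pos, pattern, features in rules:
        if rule_pos == pos and pattern.search(description):
            found.extend(features)
    return found
```

For example, expand_alternatives("case=nom|acc") yields the two descriptions "case=nom" and "case=acc", matching the behavior described for the "Expand alternatives" field.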
Saving descriptions in a file.
The descriptions are saved automatically when you quit by pressing
the Exit button, or by choosing Exit from the File menu. You can specify
the output file with the -o command line option.
Customization.
Command line options can be used to change the behavior of the
tool. Remember to separate the option from its value with a space.
You can change the language of menus, buttons, and labels either by
specifying the language on the command line with -l, or by choosing the
Customize/Language menu entry. A description for that language must
exist in the language description file. The file itself may be
specified using the -c command line option.
You can specify the font used for displaying word forms, descriptions,
and mmorph output either by choosing Customize/Font from the menus, or
with the -f command line option.
You can specify the name of the file that is produced when you press
the Mmorph button and that serves as the input for mmorph to expand.
Use the -m command line option for that.
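The command line options mentioned in this document can be assembled as in the following Python sketch. It only builds the argument list. The option letters (-G, -o, -l, -c, -f, -m) come from the text above, while the function and parameter names are hypothetical. Each option and its value are kept as separate arguments, which corresponds to separating them with a space.

```python
def build_arguments(guesses=None, output=None, language=None,
                    language_file=None, font=None, mmorph_input=None):
    """Build the option list for the tool; unset options are left out."""
    options = [
        ("-G", guesses),        # guesses file (capital G)
        ("-o", output),         # file for saved descriptions
        ("-l", language),       # language of menus, buttons, and labels
        ("-c", language_file),  # language description file
        ("-f", font),           # font for word forms and descriptions
        ("-m", mmorph_input),   # file written as input for mmorph
    ]
    args = []
    for flag, value in options:
        if value is not None:
            # Option and value are separate arguments.
            args.extend([flag, str(value)])
    return args
```

To start the tool, prepend whatever command launches it on your system to the returned list. The file names you pass are your own; none are prescribed here.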
Jan Daciuk,
e-mail: jandac.eti.pg.gda.pl (replace the first dot with "@")