Table of Contents

NAME

fsa_guess - guess lexeme and categories of a word

SYNOPSIS

fsa_guess [ options ] [ <infile ] [ >outfile ]

DESCRIPTION

fsa_guess reads lines from the input. Each line contains one word. For each word (inflected form), its probable categories are printed, based on the contents of a dictionary of (category, word ending) pairs. If the program has been compiled with the GUESS_LEXEMES option, and the dictionary has been prepared accordingly, lexemes are printed as well as categories. If the program has been compiled with the GUESS_MMORPH option, and the -m option has been given, morphological descriptions of words are printed.

OPTIONS

-d dictionary
use the given dictionary. Several dictionaries may be given; at least one must be specified. Dictionaries are automata built using fsa_ubuild or fsa_build with the -X option. Data for the automata must be prepared in a special way.

If the automata are to be used to predict only the categories, each line of the input to fsa_build should contain the inverted word, with the beginning of the word (its end when inverted) marked with the filler character, followed by an annotation separator, and followed by tags. See the prep_atg.awk script available in the package. To handle such dictionaries, fsa_guess should either be compiled without the GUESS_LEXEMES compile option, or be called with the -p and -g options. The standard name extension for dictionaries prepared in this way is atg.

If fsa_guess is also to guess lexemes, it must be compiled with the GUESS_LEXEMES compile option, and each line of the input to fsa_build must contain: the inflected form, an annotation separator, a code, the lexeme ending, an annotation separator, and tags (annotations). The code specifies how many characters must be deleted from the end of the inflected form before the lexeme ending is appended to obtain the lexeme. The code is a single character. To calculate the number, take the character code and subtract 65 (the character code for 'A') from it, so 'A' means 0, 'B' means 1, and so on. See the prep_atl.awk script available in the package. The standard name extension for automata prepared in this way is atl.
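The arithmetic behind the code character can be sketched in a few lines of Python. This only illustrates the rule stated above; it is not code from the package, and the function names decode_count and guess_lexeme are hypothetical.

```python
def decode_count(code: str) -> int:
    """Turn a one-character code into a count: 'A' -> 0, 'B' -> 1, ..."""
    return ord(code) - ord("A")  # ord("A") is 65


def guess_lexeme(inflected: str, code: str, lexeme_ending: str) -> str:
    """Delete the coded number of characters from the end, then append the ending."""
    keep = len(inflected) - decode_count(code)
    return inflected[:keep] + lexeme_ending


if __name__ == "__main__":
    # 'D' means "delete 3 characters": "cities" -> "cit" + "y"
    print(guess_lexeme("cities", "D", "y"))
```

Computing the kept length explicitly avoids the Python pitfall that a slice ending at -0 yields an empty string when the code is 'A'.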

To make fsa_guess take into account information included in prefixes, it must be compiled with GUESS_PREFIX. In data lines for fsa_build, the first annotation separator is replaced by two annotation separators for entries that do not contain prefixes. For entries that do contain a prefix, the prefix is deleted from the inverted inflected form, leaving the filler character, and is placed between the two annotation separators. The prefix is stored as is, i.e. not inverted. The standard name extension for automata prepared in this way is atp.
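One possible reading of the prefix rule is sketched below in Python. It builds only the beginning of a data line, up to and including the second annotation separator; whatever follows (code, ending, tags) is left out. The filler '_' and separator '+' are assumptions, the function name atp_head is hypothetical, and the exact layout should be checked against the scripts shipped with the package.

```python
FILLER = "_"     # assumed filler character
SEPARATOR = "+"  # assumed annotation separator


def atp_head(word: str, prefix: str = "") -> str:
    """Return the start of a prefix-aware data line for one inflected form.

    Without a prefix the two separators are adjacent; with a prefix, the
    prefix is removed from the word and stored, not inverted, between them.
    """
    if prefix and not word.startswith(prefix):
        raise ValueError("word does not start with the given prefix")
    stem = word[len(prefix):]  # the word with the prefix removed
    return stem[::-1] + FILLER + SEPARATOR + prefix + SEPARATOR


if __name__ == "__main__":
    print(atp_head("happy"))          # entry without a prefix
    print(atp_head("unhappy", "un"))  # entry with the prefix "un"
```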

If fsa_guess is to predict morphological descriptions for mmorph, it must be compiled with GUESS_MMORPH. To see whether fsa_guess was compiled with that option, invoke it with -v. The input lines for fsa_build have the format: inverted_+K1e1+K2K3K4e2+a1+categories, where:

inverted is an inverted inflected form.
K1 is a character code describing how many characters should be deleted from the end of the inflected form (coded as explained above) in order to get the canonical form (possibly followed by an ending).
e1 is the ending that should be appended to the inflected form, after K1 characters have been stripped from its end, to obtain the lexeme (more precisely, the canonical or base form).
K2 is a character code describing how many characters should be deleted from the end of the canonical form (coded as explained above) in order to get the lexical form (possibly followed by an ending).
K3 is a character code indicating the position of an archphoneme in the lexical form. If no archphoneme is present, the code is 'A'. Otherwise the code is 'B' for the last character, 'C' for the penultimate one, and so on (counted after removal of the K2 characters).
K4 says how many characters the archphoneme replaces: 'A' means 0, 'B' means 1, etc. This code is present only when K3 is not 'A', i.e. when there is an archphoneme.
e2 is the ending of the lexical form. To obtain the lexical form, it should be appended to what is left of the canonical form after removing K2 characters from the end and, if needed, replacing some characters by an archphoneme.
a1 is the archphoneme (as specified in mmorph).
'+' is the annotation separator. It is stored in the header of a dictionary, and can be specified as an option to fsa_build.
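To make the field layout concrete, here is a minimal Python sketch that splits one such line into its parts and applies the K1 and K2 deletions. It is an illustration under stated assumptions, not the package's code: the filler is assumed to be '_' and the separator '+', the names MmorphFields and parse_mmorph_line are hypothetical, and the sample line is made up. The archphoneme substitution itself is not applied; the sketch only reports its position and length codes.

```python
from typing import NamedTuple, Optional


class MmorphFields(NamedTuple):
    inflected: str           # inflected form, restored to normal order
    canonical: str           # inflected form minus K1 characters, plus e1
    stripped: str            # canonical form minus K2 characters
    arch_pos: int            # K3 as a number: 0 = no archphoneme, 1 = last char
    arch_len: Optional[int]  # K4 as a number, or None when K3 is 'A'
    e2: str                  # ending of the lexical form
    archphoneme: str         # a1
    categories: str


def count(code: str) -> int:
    """'A' -> 0, 'B' -> 1, ... (character code minus 65)."""
    return ord(code) - ord("A")


def parse_mmorph_line(line: str, separator: str = "+",
                      filler: str = "_") -> MmorphFields:
    inverted, first, second, a1, categories = line.split(separator, 4)
    if inverted.endswith(filler):
        inverted = inverted[:-len(filler)]  # drop the filler marker
    inflected = inverted[::-1]

    k1, e1 = count(first[0]), first[1:]
    canonical = inflected[:len(inflected) - k1] + e1

    k2, k3 = count(second[0]), count(second[1])
    if k3 == 0:                  # no archphoneme, so K4 is absent
        k4, e2 = None, second[2:]
    else:
        k4, e2 = count(second[2]), second[3:]
    stripped = canonical[:len(canonical) - k2]

    return MmorphFields(inflected, canonical, stripped, k3, k4, e2, a1, categories)


if __name__ == "__main__":
    # Made-up entry: "cities" -> canonical "city", no archphoneme.
    print(parse_mmorph_line("seitic_+Dy+AA++Noun"))
```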

-g
makes fsa_guess work as if it were compiled without GUESS_LEXEMES. This option is available only if the program was compiled with GUESS_LEXEMES. The result is that the program assumes that the dictionaries do not contain information about lexemes (or more precisely, the canonical or base forms). Without this option, fsa_guess (when compiled with GUESS_LEXEMES) will try to guess lexemes, and it will assume that information about lexemes is included in the dictionaries. To see the compile options used to build fsa_guess, call it with the -v option.
-p
makes fsa_guess work as if it were compiled without GUESS_PREFIX. This option is available only if the program was compiled with GUESS_PREFIX. The result is that the program assumes that the dictionaries do not contain information about prefixes. Without this option, fsa_guess (when compiled with GUESS_PREFIX) will try to use information about prefixes, and it will assume that such information is stored in the dictionaries. To see the compile options used to build fsa_guess, call it with the -v option.
-i input_file
specifies an input file, i.e. a file that contains the words whose categories should be guessed. More than one file can be specified (i.e. the option can be used more than once). In the absence of the -i option, standard input is used.
-l language_file
specifies a file that holds language-specific information, i.e. (for now) the characters that form words, and pairs of (lowercase, uppercase) characters for case conversion. If the option is not specified, Latin letters with standard case conversions will be used.
-m
specifies that the dictionary contains information that makes it possible to predict mmorph entries (morphological descriptions) of unknown inflected forms. fsa_guess will take a word, and produce an entry for the Lexicon section of mmorph input data (see mmorph(5)). This option is only available when fsa_guess was compiled with the GUESS_MMORPH compile option.
-v
print version details. This includes compile options used to build fsa_guess.

EXIT STATUS

  1. OK
  2. Invalid options, or lack of a required option.
  3. Dictionary file could not be opened.
  4. Not enough memory.
  5. Possible cycle in the automaton detected.

SEE ALSO

fsa_accent(1), fsa_build(1), fsa_guess(5), fsa_hash(1), fsa_morph(1), fsa_morph(5), fsa_prefix(1), fsa_spell(1), fsa_ubuild(1), fsa_visual(1).

BUGS

Send bug reports to the author: Jan Daciuk, jjaannddaacc@eti.pg.gda.pl (correct the stuttering).

