
OPTIONS

-O
makes the resulting automaton smaller, at the cost of a much longer build time. How much longer depends on the compile options used when building fsa_build; see Makefile and INSTALL in the distribution for an explanation of the various compile options. The default compile options compress the automaton the most. This option cannot be used with the -N option.

-i input_file
specifies the input file. That file should contain a list of words, one word per line. In the absence of the -i option, standard input is used instead.

-o output_file
specifies the output file, i.e. where the automaton should be placed. In the absence of the -o option, standard output is used instead.

-A annotation_separator
specifies a character that separates words from morphological annotations.

-X
prepares an a tergo index that is used to predict word categories. This option is available only if the program was compiled with the A_TERGO compile option. Specifying the PRUNE_ARCS compile option helps make the resulting automaton smaller and faster. Both compile options are on by default. The required format of the input data depends on the compile options used to build the fsa_guess program, and it affects the results of that program.

For fsa_guess compiled without GUESS_LEXEMES, the input data should be a list of inverted words with annotations. Each line should contain an inverted word (i.e. the first character should be the last character of the word, the second one the penultimate one, and so on). The inverted word should be followed immediately by a filler character and an annotation separator, and then by grammatical annotations. The annotations specify some morphological properties of words, such as number, gender, etc.

Assuming that a file named file contains data in three columns (inflected word, canonical form, annotations), the following incantation:

awk '{s=""; for(i=1;i<=length($1);i++) s = substr($1,i,1) s;
printf "%s_+%s\n",s,$3;}' file | sort -u > file.idx

prepares data for the a tergo index. The incantation should be entered all on one line. For more detail, see the prep_atg.awk file included in the distribution. The standard file name extension for automata prepared in this way is atg.
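As an illustration only, the same transformation can be sketched in Python. The function names atg_line and atg_lines are hypothetical and not part of the distribution; the filler character '_' and the annotation separator '+' are taken from the incantation above. Note that sort -u orders lines according to the current locale, whereas this sketch sorts by code point, so the order may differ for non-ASCII data.

```python
def atg_line(inflected, annotations, filler="_", separator="+"):
    """Return one index line: inverted word, filler, separator, annotations."""
    # inflected[::-1] reverses the word, as the awk loop does character by character.
    return inflected[::-1] + filler + separator + annotations


def atg_lines(rows):
    """Build sorted, de-duplicated lines from (inflected, canonical, annotations) rows.

    The canonical form (second column) is ignored, as in the awk incantation.
    """
    return sorted({atg_line(word, tags) for word, _canonical, tags in rows})


if __name__ == "__main__":
    sample = [("cats", "cat", "NNS"), ("dog", "dog", "NN"), ("cats", "cat", "NNS")]
    for line in atg_lines(sample):
        print(line)
```

The annotation strings in the sample (NNS, NN) are made up for the example; real data uses whatever annotations the lexicon provides.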

For fsa_guess compiled with GUESS_LEXEMES but without GUESS_PREFIX, each data line should contain the same information as above, but an additional annotation separator, a code, and the ending of the corresponding lexeme must be inserted in front of the first annotation separator. The code specifies how many characters must be removed from the end of the inflected word before the ending of the lexeme is appended. The code is a letter: 'A' means there are no characters to remove, 'B' means there is one, 'C' means two, and so on. For more detail, see the prep_atl.awk file included in the distribution. The standard file name extension for automata prepared in this way is atl.
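A minimal sketch of how such a line could be assembled, assuming the code is derived from the longest common prefix of the inflected word and the lexeme. The function names lexeme_code and atl_line are hypothetical, and the filler '_' and separator '+' follow the earlier incantation. The description above does not say what happens past 'Z', so this sketch simply refuses such input; prep_atl.awk remains the authoritative reference.

```python
def lexeme_code(inflected, lexeme):
    """Return (code letter, lexeme ending) for an inflected word and its lexeme."""
    # Length of the longest common prefix of the two forms.
    common = 0
    for a, b in zip(inflected, lexeme):
        if a != b:
            break
        common += 1
    removed = len(inflected) - common  # characters to drop from the end of the word
    if removed > 25:
        # Assumption of this sketch: only the letters 'A'..'Z' are used as codes.
        raise ValueError("too many characters to remove for a single-letter code")
    return chr(ord("A") + removed), lexeme[common:]


def atl_line(inflected, lexeme, annotations, filler="_", separator="+"):
    """Inverted word, filler, separator, code, lexeme ending, separator, annotations."""
    code, ending = lexeme_code(inflected, lexeme)
    return inflected[::-1] + filler + separator + code + ending + separator + annotations


if __name__ == "__main__":
    print(atl_line("flies", "fly", "NNS"))
```

For "flies" and "fly" the common prefix is "fl", so three characters are removed (code 'D') and the ending "y" is appended.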

For fsa_guess compiled with both GUESS_LEXEMES and GUESS_PREFIX, data lines are similar to those specified above. For inflected forms that do not contain inflectional prefixes, an additional annotation separator is added after the first one (see the prep_atp.awk file included in the distribution). For inflected forms that do contain inflectional prefixes, the prefix is removed from the inverted word, leaving the filler character, and is placed between the two annotation separators in plain, non-inverted form. The prep_atp.awk file does not contain code for recognizing prefixes; it should be modified for individual languages so that it recognizes the specific morphological categories. Only prefixes that differentiate between forms that have the same suffix should be recognized. The standard file name extension for automata prepared in this way is atp.
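The layout described above might be sketched as follows. This is an interpretation, not code from the distribution: atp_line is a hypothetical name, the prefix must be supplied by the caller (just as prep_atp.awk leaves prefix recognition to the user), and the sketch assumes that the code letter is computed from the word with its prefix already stripped. The filler '_' and separator '+' follow the earlier incantation.

```python
def atp_line(inflected, lexeme, annotations, prefix="", filler="_", separator="+"):
    """Build a data line with an (optionally empty) prefix field.

    Layout: inverted stem, filler, separator, prefix, separator,
    code, lexeme ending, separator, annotations.
    """
    if not inflected.startswith(prefix):
        raise ValueError("word does not start with the given prefix")
    stem = inflected[len(prefix):]  # inflected form without the prefix

    # Assumption: the code counts characters removed from the prefix-free form.
    common = 0
    for a, b in zip(stem, lexeme):
        if a != b:
            break
        common += 1
    removed = len(stem) - common
    if removed > 25:
        raise ValueError("too many characters to remove for a single-letter code")
    code = chr(ord("A") + removed)
    ending = lexeme[common:]

    return (stem[::-1] + filler + separator + prefix + separator
            + code + ending + separator + annotations)


if __name__ == "__main__":
    print(atp_line("flies", "fly", "NNS"))                 # no prefix: two adjacent separators
    print(atp_line("gemacht", "machen", "PP", prefix="ge"))  # prefix placed between separators
```

With an empty prefix the two separators end up adjacent, which matches the case of forms without prefixes; the annotation strings are invented for the example.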

-N
numbers entries. All entries are numbered according to their position (line number) in the input stream. This is so-called perfect hashing. This option works only if the program was compiled with the NUMBERS compile option. It cannot be used with the -O option.
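To make the idea of numbering concrete, the mapping between words and their positions can be mimicked with ordinary dictionaries. This is only a model of the result, not of how the automaton stores it. The name number_entries is hypothetical, and whether numbering starts at 0 or 1 is not stated here, so the sketch exposes it as the first_index parameter (an assumption).

```python
def number_entries(lines, first_index=0):
    """Map each word to its position in the input and back again.

    first_index is an assumption of this sketch; the text above does not
    say whether the first entry is numbered 0 or 1.
    """
    words = [line.rstrip("\n") for line in lines]
    word_to_number = {w: i for i, w in enumerate(words, start=first_index)}
    number_to_word = {i: w for w, i in word_to_number.items()}
    return word_to_number, number_to_word


if __name__ == "__main__":
    to_number, to_word = number_entries(["ant\n", "bee\n", "cat\n"])
    print(to_number["bee"], to_word[2])
```

Because every word gets a distinct number and no numbers in the range are skipped, the mapping can be used in both directions, which is what makes the hashing "perfect".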

-v
prints version details together with the compile options used.
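The options above can be combined on one command line, except that -O and -N exclude each other. A small helper that assembles the argument list and enforces that rule might look like the sketch below. The function name fsa_build_args and its keyword parameters are hypothetical; only the option letters and the program name fsa_build come from this page.

```python
def fsa_build_args(input_file=None, output_file=None, optimize=False,
                   number=False, annotation_separator=None, a_tergo=False):
    """Return an argument list for fsa_build built from the documented options."""
    if optimize and number:
        # Documented restriction: -O cannot be used with -N.
        raise ValueError("-O cannot be combined with -N")
    args = ["fsa_build"]
    if optimize:
        args.append("-O")
    if number:
        args.append("-N")
    if a_tergo:
        args.append("-X")
    if annotation_separator is not None:
        args += ["-A", annotation_separator]
    if input_file is not None:   # otherwise standard input is used
        args += ["-i", input_file]
    if output_file is not None:  # otherwise standard output is used
        args += ["-o", output_file]
    return args


if __name__ == "__main__":
    print(fsa_build_args("file.idx", "file.atg", annotation_separator="+", a_tergo=True))
```

If the program is installed, the resulting list can be passed to subprocess.run(args, check=True); the file names used here are placeholders.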



Jan Daciuk
Wed Jun 3 14:37:17 CEST 1998

Software at http://www.pg.gda.pl/~jandac/fsa.html