intl/hyphenation/src/README.hyphen

changeset 0
6474c204b198
     1.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     1.2 +++ b/intl/hyphenation/src/README.hyphen	Wed Dec 31 06:09:35 2014 +0100
     1.3 @@ -0,0 +1,108 @@
     1.4 +Brief explanation of the hyphenation algorithm herein.[1]
     1.5 +
     1.6 +Raph Levien <raph@acm.org>
     1.7 +4 Aug 1998
     1.8 +
     1.9 +   The hyphenation algorithm is basically the same as Knuth's TeX
    1.10 +algorithm. However, the implementation is quite a bit faster.
    1.11 +
    1.12 +   The hyphenation files from TeX can almost be used directly. There
    1.13 +is a preprocessing step, however. If you don't do the preprocessing
    1.14 +step, you'll get bad hyphenations (i.e. a silent failure).
    1.15 +
    1.16 +   Start with a file such as hyphen.us. This is the TeX ushyph1.tex
    1.17 +file, with the exception dictionary encoded using the same rules as
    1.18 +the main portion of the file. Any line beginning with % is a comment.
    1.19 +Each other line should contain exactly one rule.
    1.20 +
    1.21 +   Then, do the preprocessing - "perl substrings.pl hyphen.us". The
    1.22 +resulting file is hyphen.mashed. It's in Perl, and it's fairly slow
    1.23 +(it uses brute force algorithms; about 17 seconds on a P100), but it
    1.24 +could probably be redone in C with clever algorithms. This would be
    1.25 +valuable, for example, if it was handle user-supplied exception
    1.26 +dictionaries by integrating them into the rule table.[2]
    1.27 +
    1.28 +   Once the rules are preprocessed, loading them is quite quick -
    1.29 +about 200ms on a P100. It then hyphenates at about 40,000 words per
    1.30 +second on a P100. I haven't benchmarked it against other
    1.31 +implementations (both TeX and groff contain essentially the same
    1.32 +algorithm), but expect that it runs quite a bit faster than any of
    1.33 +them.
    1.34 +
    1.35 +Knuth's algorithm
    1.36 +
    1.37 +   This section contains a brief explanation of Knuth's algorithm, in
    1.38 +case you missed it from the TeX books. We'll use the semi-word
    1.39 +"example" as our running example.
    1.40 +
    1.41 +   Since the beginning and end of a word are special, the algorithm is
    1.42 +actually run over the prepared word (prep_word in the source)
    1.43 +".example.". Knuths algorithm basically just does pattern matches from
    1.44 +the rule set, then applies the matches. The patterns in this case that
    1.45 +match are "xa", "xam", "mp", and "pl". These are actually stored as
    1.46 +"x1a", "xam3", "4m1p", and "1p2l2". Whenever numbers appear between
    1.47 +the letters, they are added in. If two (or more) patterns have numbers
    1.48 +in the same place, the highest number wins. Here's the example:
    1.49 +
    1.50 + . e x a m p l e .
    1.51 +     x1a
    1.52 +     x a m3
    1.53 +        4m1p
    1.54 +          1p2l2
    1.55 + -----------------
    1.56 + . e x1a4m3p2l2e .
    1.57 +
    1.58 +   Finally, hyphens are placed wherever odd numbers appear. They are,
    1.59 +however, suppressed after the first letter and before the last letter
    1.60 +of the word (TeX actually suppresses them before the next-to-last, as
    1.61 +well). So, it's "ex-am-ple", which is correct.
    1.62 +
    1.63 +   Knuth uses a trie to implement this. I.e. he stores each rule in a
    1.64 +trie structure. For each position in the word, he searches the trie,
    1.65 +searching for a match. Most patterns are short, so efficiency should
    1.66 +be quite good.
    1.67 +
    1.68 +Theory of the algorithm
    1.69 +
    1.70 +   The algorithm works as a slightly modified finite state machine.
    1.71 +There are two kinds of transitions: those that consume one letter of
    1.72 +input (which work just like your regular finite state machine), and
    1.73 +"fallback" transitions, which don't consume any input. If no
    1.74 +transition matching the next letter is found, the fallback is used.
    1.75 +One way of looking at this is a form of compression of the transition
    1.76 +tables - i.e. it behaves the same as a completely vanilla state
    1.77 +machine in which the actual transition table of a node is made up of
    1.78 +the union of transition tables of the node itself, plus its fallbacks.
    1.79 +
    1.80 +   Each state is represented by a string. Thus, if the current state
    1.81 +is "am" and the next letter is "p", then the next state is "amp".
    1.82 +Fallback transitions go to states which chop off one or (sometimes)
    1.83 +more letters from the beginning. For example, if none of the
    1.84 +transitions from "amp" match the next letter, then it will fall back
    1.85 +to "mp". Similarly, if none of the transitions from "mp" match the
    1.86 +next letter, it will fall back to "m".
    1.87 +
    1.88 +   Each state is also associated with a (possibly null) "match"
    1.89 +string. This represents the union of all patterns which are
    1.90 +right-justified substrings of the match string. I.e. the pattern "mp"
    1.91 +is a right-justified substring of the state "amp", so it's numbers get
    1.92 +added in. The actual calculation of this union is done by the
    1.93 +Perl preprocessing script, but could probably be done in C just about
    1.94 +as easily.
    1.95 +
    1.96 +   Because each state transition either consumes one input character
    1.97 +or shortens the state string by one character, the total number of
    1.98 +state transitions is linear in the length of the word.
    1.99 +
   1.100 +[1] Documentations:
   1.101 +
   1.102 +Franklin M. Liang: Word Hy-phen-a-tion by Com-put-er.
   1.103 +Stanford University, 1983. http://www.tug.org/docs/liang.
   1.104 +
   1.105 +László Németh: Automatic non-standard hyphenation in OpenOffice.org,
   1.106 +TUGboat (27), 2006. No. 2., http://hunspell.sourceforge.net/tb87nemeth.pdf
   1.107 +
   1.108 +[2] There is the C version of pattern converter "substrings.c"
   1.109 +in the distribution written by Nanning Buitenhuis. Unfortunatelly,
   1.110 +this version hasn't handled the non standard extension of the
   1.111 +algorithm, yet.

mercurial