Brief explanation of the hyphenation algorithm herein.[1]

Raph Levien
4 Aug 1998

The hyphenation algorithm is basically the same as Knuth's TeX
algorithm. However, the implementation is quite a bit faster.

The hyphenation files from TeX can almost be used directly. There
is a preprocessing step, however. If you don't do the preprocessing
step, you'll get bad hyphenations (i.e. a silent failure).

Start with a file such as hyphen.us. This is the TeX ushyph1.tex
file, with the exception dictionary encoded using the same rules as
the main portion of the file. Any line beginning with % is a comment.
Every other line should contain exactly one rule.

Then, do the preprocessing - "perl substrings.pl hyphen.us". The
resulting file is hyphen.mashed. The script is written in Perl and is
fairly slow (it uses brute-force algorithms; about 17 seconds on a
P100), but it could probably be redone in C with clever algorithms.
That would be valuable, for example, if it were to handle
user-supplied exception dictionaries by integrating them into the
rule table.[2]

Once the rules are preprocessed, loading them is quite quick -
about 200ms on a P100. It then hyphenates at about 40,000 words per
second on a P100. I haven't benchmarked it against other
implementations (both TeX and groff contain essentially the same
algorithm), but I expect that it runs quite a bit faster than either
of them.

Knuth's algorithm

This section contains a brief explanation of Knuth's algorithm, in
case you missed it in the TeX books. We'll use the semi-word
"example" as our running example.

Since the beginning and end of a word are special, the algorithm is
actually run over the prepared word (prep_word in the source),
".example.". Knuth's algorithm basically just does pattern matches
from the rule set, then applies the matches. The patterns that match
in this case are "xa", "xam", "mp", and "pl". These are actually
stored as "x1a", "xam3", "4m1p", and "1p2l2". Wherever numbers appear
between the letters, they are added in. If two (or more) patterns have
numbers in the same place, the highest number wins. Here's the
example:

. e x a m p l e .
    x1a
    x a m3
       4m1p
         1p2l2
-----------------
. e x1a4m3p2l2e .

Finally, hyphens are placed wherever odd numbers appear. They are,
however, suppressed after the first letter and before the last letter
of the word (TeX actually suppresses them before the next-to-last
letter as well). So the result is "ex-am-ple", which is correct.

Knuth uses a trie to implement this: he stores each rule in a trie
structure, and for each position in the word he searches the trie,
looking for a match. Most patterns are short, so efficiency should
be quite good.

Theory of the algorithm

The algorithm works as a slightly modified finite state machine.
There are two kinds of transitions: those that consume one letter of
input (which work just like your regular finite state machine), and
"fallback" transitions, which don't consume any input. If no
transition matching the next letter is found, the fallback is used.
One way of looking at this is as a form of compression of the
transition tables - i.e.
it behaves the same as a completely vanilla state
machine in which the actual transition table of a node is made up of
the union of the transition tables of the node itself and its
fallbacks.

Each state is represented by a string. Thus, if the current state
is "am" and the next letter is "p", then the next state is "amp".
Fallback transitions go to states which chop off one or (sometimes)
more letters from the beginning. For example, if none of the
transitions from "amp" match the next letter, then it will fall back
to "mp". Similarly, if none of the transitions from "mp" match the
next letter, it will fall back to "m".

Each state is also associated with a (possibly null) "match" string.
This represents the union of all patterns which are right-justified
substrings of the state. For example, the pattern "mp" is a
right-justified substring of the state "amp", so its numbers get
added in. The actual calculation of this union is done by the Perl
preprocessing script, but could probably be done in C just about as
easily.

Because each state transition either consumes one input character
or shortens the state string by one character, the total number of
state transitions is linear in the length of the word.

[1] References:

Franklin M. Liang: Word Hy-phen-a-tion by Com-put-er.
Stanford University, 1983. http://www.tug.org/docs/liang.

László Németh: Automatic non-standard hyphenation in OpenOffice.org,
TUGboat (27), 2006, No. 2. http://hunspell.sourceforge.net/tb87nemeth.pdf

[2] A C version of the pattern converter, "substrings.c", written by
Nanning Buitenhuis, is included in the distribution.
Unfortunately, this version does not yet handle the non-standard
extensions of the algorithm.