Brief explanation of the hyphenation algorithm herein.[1]

Raph Levien <raph@acm.org>
4 Aug 1998

The hyphenation algorithm is basically the same as Knuth's TeX
algorithm. However, the implementation is quite a bit faster.

The hyphenation files from TeX can almost be used directly. There
is a preprocessing step, however. If you don't do the preprocessing
step, you'll get bad hyphenations (i.e. a silent failure).

Start with a file such as hyphen.us. This is the TeX ushyph1.tex
file, with the exception dictionary encoded using the same rules as
the main portion of the file. Any line beginning with % is a comment.
Each non-comment line should contain exactly one rule.
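
For illustration, a rule file in this format might contain lines like
the following (these are just the four patterns used in the worked
example later in this note; a real hyphen.us holds thousands of rules):

    % lines starting with % are comments
    x1a
    xam3
    4m1p
    1p2l2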

Then do the preprocessing: "perl substrings.pl hyphen.us". The
resulting file is hyphen.mashed. The preprocessor is written in Perl
and is fairly slow (it uses brute-force algorithms; about 17 seconds
on a P100), but it could probably be redone in C with clever
algorithms. This would be valuable, for example, if it were to handle
user-supplied exception dictionaries by integrating them into the
rule table.[2]

Once the rules are preprocessed, loading them is quite quick -
about 200ms on a P100. It then hyphenates at about 40,000 words per
second on a P100. I haven't benchmarked it against other
implementations (both TeX and groff contain essentially the same
algorithm), but I expect that it runs quite a bit faster than either
of them.
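
As a rough sketch of how a caller might use the preprocessed rules,
assuming the hnj_hyphen_load / hnj_hyphen_hyphenate / hnj_hyphen_free
entry points declared in hyphen.h (the file name, buffer sizing, and
output formatting below are illustrative assumptions, not taken from
this note):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include "hyphen.h"

    int main(void)
    {
        /* Load the preprocessed rule table (path is an assumption). */
        HyphenDict *dict = hnj_hyphen_load("hyphen.mashed");
        const char *word = "example";
        int len = (int) strlen(word);
        /* One value per letter; odd values mark hyphenation points.
           A few extra bytes of slack, to be conservative. */
        char *hyphens = malloc(len + 5);

        if (!dict || !hyphens)
            return 1;

        hnj_hyphen_hyphenate(dict, word, len, hyphens);
        hyphens[len] = '\0';

        /* The positions holding odd values are the allowed break
           points (giving ex-am-ple for this word). */
        printf("%s  %s\n", word, hyphens);

        free(hyphens);
        hnj_hyphen_free(dict);
        return 0;
    }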

Knuth's algorithm

This section contains a brief explanation of Knuth's algorithm, in
case you missed it from the TeX books. We'll use the semi-word
"example" as our running example.

Since the beginning and end of a word are special, the algorithm is
actually run over the prepared word (prep_word in the source)
".example.". Knuth's algorithm basically just does pattern matches from
the rule set, then applies the matches. The patterns that match in this
case are "xa", "xam", "mp", and "pl". These are actually stored as
"x1a", "xam3", "4m1p", and "1p2l2". Wherever numbers appear among the
letters, they are added in at that position. If two (or more) patterns
have numbers in the same place, the highest number wins. Here's the
example:

   . e x a m p l e .
       x1a
       x a m3
          4m1p
            1p2l2
   -----------------
   . e x1a4m3p2l2e .

Finally, hyphens are placed wherever odd numbers appear. They are,
however, suppressed after the first letter and before the last letter
of the word (TeX actually suppresses them before the next-to-last, as
well). So, it's "ex-am-ple", which is correct.
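
To make this step concrete, here is a small self-contained sketch
(not the code in this tree) that applies the four stored patterns
above to ".example." exactly as described: digits are merged position
by position, the highest value wins, and hyphens are then read off
the odd values:

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* Apply one stored pattern (e.g. "4m1p") at offset pos of the
       prepared word; merge its digits into vals, keeping the maximum
       at each inter-letter position. */
    static void apply_pattern(const char *prep_word, const char *pat,
                              int pos, int *vals)
    {
        int nletters = 0;
        int i;

        /* First check that the letters of the pattern match. */
        for (i = 0; pat[i]; i++) {
            if (isdigit((unsigned char) pat[i]))
                continue;
            if (prep_word[pos + nletters] != pat[i])
                return;          /* no match, nothing to add in */
            nletters++;
        }

        /* Then add in the digits: a digit before the k-th letter of
           the pattern applies just before prep_word[pos + k]. */
        nletters = 0;
        for (i = 0; pat[i]; i++) {
            if (isdigit((unsigned char) pat[i])) {
                int v = pat[i] - '0';
                if (v > vals[pos + nletters])
                    vals[pos + nletters] = v;
            } else {
                nletters++;
            }
        }
    }

    int main(void)
    {
        const char *word = "example";
        const char *patterns[] = { "x1a", "xam3", "4m1p", "1p2l2" };
        char prep_word[64];
        int vals[64] = { 0 };
        int len = (int) strlen(word);
        int i, p, k;

        /* Run over ".example." so the word boundaries can match. */
        snprintf(prep_word, sizeof prep_word, ".%s.", word);

        /* Try every pattern at every starting position. */
        for (p = 0; p < (int) (sizeof patterns / sizeof patterns[0]); p++)
            for (i = 0; prep_word[i]; i++)
                apply_pattern(prep_word, patterns[p], i, vals);

        /* A hyphen goes wherever an odd value landed, except after
           the first and before the last letter of the word. */
        for (k = 0; k < len; k++) {
            putchar(word[k]);
            if (k >= 1 && k <= len - 3 && (vals[k + 2] & 1))
                putchar('-');
        }
        putchar('\n');           /* prints "ex-am-ple" */
        return 0;
    }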

Knuth uses a trie to implement this. That is, he stores each rule in
a trie structure. For each position in the word, he walks the trie,
looking for matches. Most patterns are short, so efficiency should
be quite good.
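
A minimal sketch of the kind of trie this implies (an illustration of
Knuth's scheme, not the data structure used by this implementation):
each node has one child per letter, and the node reached by spelling
out a pattern's letters carries the full pattern, digits included:

    #include <ctype.h>
    #include <stdlib.h>
    #include <string.h>

    #define NUM_LETTERS 27       /* 'a'..'z' plus the '.' boundary marker */

    /* One node of the pattern trie: one child per letter, plus the
       stored pattern that ends exactly at this node, if any. */
    typedef struct TrieNode {
        struct TrieNode *child[NUM_LETTERS];
        const char *pattern;     /* e.g. "4m1p", or NULL */
    } TrieNode;

    static int letter_index(char c)
    {
        return c == '.' ? 26 : c - 'a';
    }

    /* Insert one stored pattern: its letters spell the path down the
       trie, and the full pattern (with digits) hangs off the last node. */
    static void trie_insert(TrieNode *root, const char *pattern)
    {
        TrieNode *node = root;
        const char *p;

        for (p = pattern; *p; p++) {
            if (isdigit((unsigned char) *p))
                continue;        /* digits are payload, not path */
            if (!node->child[letter_index(*p)])
                node->child[letter_index(*p)] = calloc(1, sizeof(TrieNode));
            node = node->child[letter_index(*p)];
        }
        node->pattern = pattern;
    }

    int main(void)
    {
        static const char *patterns[] = { "x1a", "xam3", "4m1p", "1p2l2" };
        TrieNode root;
        size_t i;

        memset(&root, 0, sizeof root);
        for (i = 0; i < sizeof patterns / sizeof patterns[0]; i++)
            trie_insert(&root, patterns[i]);
        /* The trie can now be walked from each position of ".example.". */
        return 0;
    }

Hyphenation then walks the trie from each starting position of the
prepared word, adding in the digits of every pattern node it passes.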

Theory of the algorithm

The algorithm works as a slightly modified finite state machine.
There are two kinds of transitions: those that consume one letter of
input (which work just like your regular finite state machine), and
"fallback" transitions, which don't consume any input. If no
transition matching the next letter is found, the fallback is used.
One way of looking at this is as a form of compression of the
transition tables - i.e. it behaves the same as a completely vanilla
state machine in which the actual transition table of a node is made
up of the union of the transition tables of the node itself, plus its
fallbacks.
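
Here is a minimal sketch of that lookup, under an assumed
representation in which each state carries a small table of letter
transitions plus a fallback pointer (the real tables in this library
are laid out differently):

    #include <stdio.h>

    typedef struct State State;

    typedef struct {
        char letter;             /* letter consumed by this transition */
        State *target;           /* state reached after consuming it   */
    } Transition;

    struct State {
        Transition *trans;       /* explicit transitions of this state */
        int num_trans;
        State *fallback;         /* consulted without consuming input  */
    };

    /* Advance by one input letter: try the state's own transitions,
       and otherwise keep following fallbacks (consuming nothing)
       until some state can consume the letter.  The root has a NULL
       fallback and simply stays put, consuming the letter. */
    static State *next_state(State *state, char letter)
    {
        while (state) {
            int i;
            for (i = 0; i < state->num_trans; i++)
                if (state->trans[i].letter == letter)
                    return state->trans[i].target;
            if (!state->fallback)
                return state;    /* at the root: stay, letter consumed */
            state = state->fallback;
        }
        return NULL;             /* not reached */
    }

    int main(void)
    {
        /* Toy machine: root --'a'--> A; A falls back to the root. */
        State root = { NULL, 0, NULL };
        Transition root_trans[] = { { 'a', NULL } };
        State a_state = { NULL, 0, &root };
        State *s;

        root_trans[0].target = &a_state;
        root.trans = root_trans;
        root.num_trans = 1;

        s = next_state(&root, 'a');  /* root -> A                      */
        s = next_state(s, 'z');      /* A has no 'z': fall back to the
                                        root, which consumes 'z'       */
        printf("%s\n", s == &root ? "back at root" : "elsewhere");
        return 0;
    }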

Each state is represented by a string. Thus, if the current state
is "am" and the next letter is "p", then the next state is "amp".
Fallback transitions go to states which chop off one or (sometimes)
more letters from the beginning. For example, if none of the
transitions from "amp" match the next letter, then it will fall back
to "mp". Similarly, if none of the transitions from "mp" match the
next letter, it will fall back to "m".

Each state is also associated with a (possibly null) "match"
string. This represents the union of all patterns which are
right-justified substrings of the state string. For example, the
pattern "mp" is a right-justified substring of the state "amp", so its
numbers get added in. The actual calculation of this union is done by
the Perl preprocessing script, but could probably be done in C just
about as easily.
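
As an illustration of that calculation (in C rather than Perl, and
not the actual preprocessing script; the on-disk format of the match
strings in hyphen.mashed may differ), a state's match values can be
computed by merging the numbers of every pattern whose letters form a
suffix of the state string, keeping the maximum at each position:

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* Split a stored pattern such as "4m1p" into its letters ("mp")
       and its numbers, one slot per gap: {4, 1, 0} here, i.e. before
       'm', before 'p', and after 'p'. */
    static void split_pattern(const char *pat, char *letters, int *nums)
    {
        int n = 0;
        nums[0] = 0;
        for (; *pat; pat++) {
            if (isdigit((unsigned char) *pat)) {
                nums[n] = *pat - '0';
            } else {
                letters[n++] = *pat;
                nums[n] = 0;
            }
        }
        letters[n] = '\0';
    }

    /* Merge the numbers of every pattern whose letters are a
       right-justified substring (suffix) of the state string. */
    static void match_for_state(const char *state, const char **patterns,
                                int num_patterns, int *vals)
    {
        int state_len = (int) strlen(state);
        int p, k;

        for (k = 0; k <= state_len; k++)
            vals[k] = 0;

        for (p = 0; p < num_patterns; p++) {
            char letters[16];
            int nums[16];
            int plen, offset;

            split_pattern(patterns[p], letters, nums);
            plen = (int) strlen(letters);
            if (plen > state_len ||
                strcmp(state + state_len - plen, letters) != 0)
                continue;        /* not a suffix of this state */

            offset = state_len - plen;
            for (k = 0; k <= plen; k++)
                if (nums[k] > vals[offset + k])
                    vals[offset + k] = nums[k];
        }
    }

    int main(void)
    {
        const char *patterns[] = { "x1a", "xam3", "4m1p", "1p2l2" };
        int vals[16];
        int k;

        match_for_state("amp", patterns, 4, vals);
        for (k = 0; k <= 3; k++)     /* prints 0410: only the letters  */
            printf("%d", vals[k]);   /* of "4m1p" form a suffix of the */
        printf("\n");                /* state "amp"                    */
        return 0;
    }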

Because each state transition either consumes one input character
or shortens the state string by at least one character, the total
number of state transitions is linear in the length of the word.

[1] References:

Franklin M. Liang: Word Hy-phen-a-tion by Com-put-er.
Stanford University, 1983. http://www.tug.org/docs/liang

László Németh: Automatic non-standard hyphenation in OpenOffice.org.
TUGboat 27 (2006), No. 2. http://hunspell.sourceforge.net/tb87nemeth.pdf

[2] There is a C version of the pattern converter, "substrings.c",
in the distribution, written by Nanning Buitenhuis. Unfortunately,
this version does not yet handle the non-standard extensions of the
algorithm.