intl/hyphenation/src/README.hyphen

Sat, 03 Jan 2015 20:18:00 +0100

author
Michael Schloh von Bennewitz <michael@schloh.com>
date
Sat, 03 Jan 2015 20:18:00 +0100
branch
TOR_BUG_3246
changeset 7
129ffea94266
permissions
-rw-r--r--

Conditionally enable double key logic according to:
private browsing mode or privacy.thirdparty.isolate preference and
implement in GetCookieStringCommon and FindCookie where it counts...
With some reservations of how to convince FindCookie users to test
condition and pass a nullptr when disabling double key logic.

michael@0 1 Brief explanation of the hyphenation algorithm herein.[1]
michael@0 2
michael@0 3 Raph Levien <raph@acm.org>
michael@0 4 4 Aug 1998
michael@0 5
michael@0 6 The hyphenation algorithm is basically the same as Knuth's TeX
michael@0 7 algorithm. However, the implementation is quite a bit faster.
michael@0 8
michael@0 9 The hyphenation files from TeX can almost be used directly. There
michael@0 10 is a preprocessing step, however. If you don't do the preprocessing
michael@0 11 step, you'll get bad hyphenations (i.e. a silent failure).
michael@0 12
michael@0 13 Start with a file such as hyphen.us. This is the TeX ushyph1.tex
michael@0 14 file, with the exception dictionary encoded using the same rules as
michael@0 15 the main portion of the file. Any line beginning with % is a comment.
michael@0 16 Each other line should contain exactly one rule.
michael@0 17
michael@0 18 Then, do the preprocessing - "perl substrings.pl hyphen.us". The
michael@0 19 resulting file is hyphen.mashed. It's in Perl, and it's fairly slow
michael@0 20 (it uses brute force algorithms; about 17 seconds on a P100), but it
michael@0 21 could probably be redone in C with clever algorithms. This would be
michael@0 22 valuable, for example, if it was handle user-supplied exception
michael@0 23 dictionaries by integrating them into the rule table.[2]
michael@0 24
michael@0 25 Once the rules are preprocessed, loading them is quite quick -
michael@0 26 about 200ms on a P100. It then hyphenates at about 40,000 words per
michael@0 27 second on a P100. I haven't benchmarked it against other
michael@0 28 implementations (both TeX and groff contain essentially the same
michael@0 29 algorithm), but expect that it runs quite a bit faster than any of
michael@0 30 them.
michael@0 31
michael@0 32 Knuth's algorithm
michael@0 33
michael@0 34 This section contains a brief explanation of Knuth's algorithm, in
michael@0 35 case you missed it from the TeX books. We'll use the semi-word
michael@0 36 "example" as our running example.
michael@0 37
michael@0 38 Since the beginning and end of a word are special, the algorithm is
michael@0 39 actually run over the prepared word (prep_word in the source)
michael@0 40 ".example.". Knuths algorithm basically just does pattern matches from
michael@0 41 the rule set, then applies the matches. The patterns in this case that
michael@0 42 match are "xa", "xam", "mp", and "pl". These are actually stored as
michael@0 43 "x1a", "xam3", "4m1p", and "1p2l2". Whenever numbers appear between
michael@0 44 the letters, they are added in. If two (or more) patterns have numbers
michael@0 45 in the same place, the highest number wins. Here's the example:
michael@0 46
michael@0 47 . e x a m p l e .
michael@0 48 x1a
michael@0 49 x a m3
michael@0 50 4m1p
michael@0 51 1p2l2
michael@0 52 -----------------
michael@0 53 . e x1a4m3p2l2e .
michael@0 54
michael@0 55 Finally, hyphens are placed wherever odd numbers appear. They are,
michael@0 56 however, suppressed after the first letter and before the last letter
michael@0 57 of the word (TeX actually suppresses them before the next-to-last, as
michael@0 58 well). So, it's "ex-am-ple", which is correct.
michael@0 59
michael@0 60 Knuth uses a trie to implement this. I.e. he stores each rule in a
michael@0 61 trie structure. For each position in the word, he searches the trie,
michael@0 62 searching for a match. Most patterns are short, so efficiency should
michael@0 63 be quite good.
michael@0 64
michael@0 65 Theory of the algorithm
michael@0 66
michael@0 67 The algorithm works as a slightly modified finite state machine.
michael@0 68 There are two kinds of transitions: those that consume one letter of
michael@0 69 input (which work just like your regular finite state machine), and
michael@0 70 "fallback" transitions, which don't consume any input. If no
michael@0 71 transition matching the next letter is found, the fallback is used.
michael@0 72 One way of looking at this is a form of compression of the transition
michael@0 73 tables - i.e. it behaves the same as a completely vanilla state
michael@0 74 machine in which the actual transition table of a node is made up of
michael@0 75 the union of transition tables of the node itself, plus its fallbacks.
michael@0 76
michael@0 77 Each state is represented by a string. Thus, if the current state
michael@0 78 is "am" and the next letter is "p", then the next state is "amp".
michael@0 79 Fallback transitions go to states which chop off one or (sometimes)
michael@0 80 more letters from the beginning. For example, if none of the
michael@0 81 transitions from "amp" match the next letter, then it will fall back
michael@0 82 to "mp". Similarly, if none of the transitions from "mp" match the
michael@0 83 next letter, it will fall back to "m".
michael@0 84
michael@0 85 Each state is also associated with a (possibly null) "match"
michael@0 86 string. This represents the union of all patterns which are
michael@0 87 right-justified substrings of the match string. I.e. the pattern "mp"
michael@0 88 is a right-justified substring of the state "amp", so it's numbers get
michael@0 89 added in. The actual calculation of this union is done by the
michael@0 90 Perl preprocessing script, but could probably be done in C just about
michael@0 91 as easily.
michael@0 92
michael@0 93 Because each state transition either consumes one input character
michael@0 94 or shortens the state string by one character, the total number of
michael@0 95 state transitions is linear in the length of the word.
michael@0 96
michael@0 97 [1] Documentations:
michael@0 98
michael@0 99 Franklin M. Liang: Word Hy-phen-a-tion by Com-put-er.
michael@0 100 Stanford University, 1983. http://www.tug.org/docs/liang.
michael@0 101
michael@0 102 László Németh: Automatic non-standard hyphenation in OpenOffice.org,
michael@0 103 TUGboat (27), 2006. No. 2., http://hunspell.sourceforge.net/tb87nemeth.pdf
michael@0 104
michael@0 105 [2] There is the C version of pattern converter "substrings.c"
michael@0 106 in the distribution written by Nanning Buitenhuis. Unfortunatelly,
michael@0 107 this version hasn't handled the non standard extension of the
michael@0 108 algorithm, yet.

mercurial