intl/hyphenation/src/README.nonstandard

Wed, 31 Dec 2014 07:22:50 +0100

author
Michael Schloh von Bennewitz <michael@schloh.com>
date
Wed, 31 Dec 2014 07:22:50 +0100
branch
TOR_BUG_3246
changeset 4
fc2d59ddac77
permissions
-rw-r--r--

Correct previous dual key logic pending first delivery installment.

Non-standard hyphenation
------------------------

Some languages use non-standard hyphenation: `discretionary'
character changes at hyphenation points. For example,
Catalan: paral·lel -> paral-lel,
Dutch: omaatje -> oma-tje,
German (before the new orthography): Schiffahrt -> Schiff-fahrt,
Hungarian: asszonnyal -> asz-szony-nyal (multiple occurrences!),
Swedish: tillata -> till-lata.

Using this extended library, you can define
non-standard hyphenation patterns. For example:

l·1l/l=l
a1atje./a=t,1,3
.schif1fahrt/ff=f,5,2
.as3szon/sz=sz,2,3
n1nyal./ny=ny,1,3
.til1lata./ll=l,3,2

or with narrow boundaries:

l·1l/l=,1,2
a1atje./a=,1,2
.schif1fahrt/ff=,5,1
.as3szon/sz=,2,1
n1nyal./ny=,1,1
.til1lata./ll=,3,1

Note: Libhnj uses patterns that have been preprocessed by substrings.pl.
Unfortunately, this conversion step (non-standard -> standard pattern
conversion) can now generate bad non-standard patterns, so using
narrow boundaries may be better for recent Libhnj. For example,
substrings.pl generates a few bad patterns for the Hungarian hyphenation
patterns, resulting in bad non-standard hyphenation in a few cases. Using
narrow boundaries solves this problem. The Java HyFo module can check for
this problem.

Syntax of the non-standard hyphenation patterns
-----------------------------------------------

pat1tern/change[,start,cut]

If this pattern matches the word, and this pattern wins (see README.hyphen)
in the change region of the pattern, then the substring
pattern[start, start + cut - 1] will be replaced with the "change".

For example, a German ff -> ff-f hyphenation:

f1f/ff=f

or, written out with explicit start and cut,

f1f/ff=f,1,2

will replace every "ff" with "ff=f" at hyphenation.

A more realistic example:

% simple ff -> f-f hyphenation
f1f
% Schiffahrt -> Schiff-fahrt hyphenation
%
schif3fahrt/ff=f,5,2

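The replacement rule above can be sketched in a few lines of Python. This is
only an illustration, not the library's code: parse_pattern() and
apply_pattern() are hypothetical helper names, the competition between
patterns ("wins") is ignored, and the default for an omitted start,cut pair
is an assumption read off the f1f/ff=f and f1f/ff=f,1,2 pair. Positions
ignore digits and boundary dots, as described in the Specification.

```python
import re


def parse_pattern(line):
    """Split 'pat1tern/change[,start,cut]' into its parts.

    Returns (pattern, letters, change, start, cut). 'letters' is the
    pattern without digits and boundary dots; start is 1-based and both
    start and cut are counted in characters.
    """
    pattern, slash, rest = line.partition("/")
    if not slash:
        raise ValueError("not a non-standard pattern: " + line)
    letters = re.sub(r"[0-9.]", "", pattern)
    fields = rest.split(",")
    if len(fields) == 3:
        change, start, cut = fields[0], int(fields[1]), int(fields[2])
    else:
        # Assumption: without start and cut, the whole pattern is replaced.
        change, start, cut = rest, 1, len(letters)
    return pattern, letters, change, start, cut


def apply_pattern(word, line):
    """Apply one non-standard pattern to a lowercase word, if it matches."""
    pattern, letters, change, start, cut = parse_pattern(line)
    at_word_start = pattern.startswith(".")
    at_word_end = pattern.endswith(".")
    for at in range(len(word) - len(letters) + 1):
        if at_word_start and at != 0:
            continue
        if at_word_end and at + len(letters) != len(word):
            continue
        if word.startswith(letters, at):
            begin = at + start - 1
            return word[:begin] + change + word[begin + cut:]
    return word  # no match: the word is left unchanged


print(apply_pattern("schiffahrt", ".schif1fahrt/ff=f,5,2"))
print(apply_pattern("schiffahrt", ".schif1fahrt/ff=,5,1"))
print(apply_pattern("omaatje", "a1atje./a=t,1,3"))
```

Under these assumptions the wide and the narrow German pattern both give
"schiff=fahrt", and the Dutch pattern gives "oma=tje".
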
Specification

- Pattern: matching patterns of Liang's original algorithm
   - patterns must contain only one hyphenation point in the change region,
     marked with a one-digit odd number (1, 3, 5, 7 or 9).
     This point may be at a subregion boundary: schif3fahrt/ff=,5,1
   - only the greater value guarantees the win (don't mix standard and
     non-standard patterns with the same value; for example,
     instead of f3f and schif3fahrt/ff=f,5,2 use f3f and schif5fahrt/ff=f,5,2)

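A pattern file could be checked against the two rules above with a small
script. The following Python sketch is one possible reading, not part of the
library: gap_values() and region_point() are hypothetical names, and counting
the gaps on both edges of the change region as "subregion boundaries" is an
assumption.

```python
DIGITS = "0123456789"


def gap_values(pattern):
    """Return (letters, values) for a Liang-style pattern.

    values[i] is the digit written before letters[i] (0 if none) and
    values[len(letters)] is the digit after the last letter. Boundary
    dots are dropped because they do not count as positions.
    """
    letters, values = [], [0]
    for ch in pattern.strip("."):
        if ch in DIGITS:
            values[-1] = int(ch)
        else:
            letters.append(ch)
            values.append(0)
    return "".join(letters), values


def region_point(line):
    """Return the value of the single odd point in the change region.

    Raises ValueError unless exactly one odd value is found there.
    """
    pattern, _, rest = line.partition("/")
    letters, values = gap_values(pattern)
    fields = rest.split(",")
    if len(fields) == 3:
        start, cut = int(fields[1]), int(fields[2])
    else:
        start, cut = 1, len(letters)  # assumption: whole pattern
    # Gaps from just before the first region letter to just after the last.
    gaps = values[start - 1:start + cut]
    odd = [v for v in gaps if v % 2 == 1]
    if len(odd) != 1:
        raise ValueError("need exactly one odd point in region: " + line)
    return odd[0]


standard = max(gap_values("f3f")[1])
print(region_point("schif3fahrt/ff=f,5,2") > standard)  # same value: no guaranteed win
print(region_point("schif5fahrt/ff=f,5,2") > standard)  # greater value: wins
```

The two print calls compare the non-standard point with the value of the
standard pattern f3f: only schif5fahrt/ff=f,5,2 has the greater value.
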
- Change: new characters.
  Arbitrary character sequence. The equals sign (=) marks hyphenation points
  for OpenOffice.org (as in the example). (In a possible German LaTeX
  preprocessor, ff could be replaced with "ff, and in a Hungarian one, ssz
  with `ssz, according to the German and Hungarian Babel settings.)

- Start: starting position of the change region.
   - begins with 1 (not 0): schif3fahrt/ff=f,5,2
   - start dot doesn't matter: .schif3fahrt/ff=f,5,2
   - numbers don't matter: .s2c2h2i2f3f2ahrt/ff=f,5,2
   - In UTF-8 encoding, use Unicode character positions: össze/sz=sz,2,3
     ("össze" looks like "Ã¶ssze" in an ISO 8859-1 8-bit editor).

- Cut: length of the removed character sequence in the original word.
   - In UTF-8 encoding, use Unicode character length: paral·1lel/l=l,5,3
     ("paral·1lel" looks like "paralÂ·1lel" in an ISO 8859-1 8-bit editor).

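The difference between character and byte positions can be checked with a
short Python sketch. The helper names char_start_and_cut() and
byte_start_and_cut() are made up for this illustration; the second one shows
the values that an 8-bit view of the file would suggest, which are the ones
not to use.

```python
def char_start_and_cut(letters, removed):
    """Return the 1-based (start, cut) of 'removed' in Unicode characters."""
    return letters.index(removed) + 1, len(removed)


def byte_start_and_cut(letters, removed):
    """Same computation on the UTF-8 bytes: the values NOT to use."""
    raw = letters.encode("utf-8")
    part = removed.encode("utf-8")
    return raw.index(part) + 1, len(part)


# Hungarian: the non-ASCII letter comes before the change region,
# so the byte-based start is off by one.
print(char_start_and_cut("össze", "ssz"), byte_start_and_cut("össze", "ssz"))

# Catalan: the non-ASCII character is inside the change region,
# so the byte-based cut is off by one.
print(char_start_and_cut("paral·lel", "l·l"), byte_start_and_cut("paral·lel", "l·l"))

# How a UTF-8 pattern looks when it is read as ISO 8859-1.
print("össze".encode("utf-8").decode("latin-1"))
```

The character-based results, (2, 3) and (5, 3), match the start and cut
values used in the two patterns above.
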
Dictionary development
----------------------

There is no extended PatGen pattern generator for non-standard
hyphenation patterns yet.

Fortunately, non-standard hyphenation points are forbidden in the
PatGen-generated hyphenation patterns, so non-standard hyphenation
patterns can still be developed in this case with a small patch.

Warning: If you use UTF-8 Unicode encoding in your patterns, call
substrings.pl with the UTF-8 parameter to calculate the right
character positions for non-standard hyphenation:

./substrings.pl input output UTF-8

Programming
-----------

Use hyphenate2() or hyphenate3() to handle non-standard hyphenation.
See hyphen.h for the documentation of the hyphenate*() functions.
See example.c for processing the output of the hyphenate*() functions.

Warning: change characters are lowercased in the source, so you may need
to convert the case of the change characters based on the case of the
input word. For an example, see the OpenOffice.org source
(lingucomponent/source/hyphenator/altlinuxhyph/hyphen/hyphenimp.cxx).

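The case conversion mentioned in the warning might look roughly like the
Python sketch below. It does not reproduce the OpenOffice.org code;
restore_case() is a hypothetical helper that only distinguishes all-uppercase,
capitalized and other input words.

```python
def restore_case(word, hyphenated):
    """Re-apply the case of 'word' to its lowercase hyphenated form."""
    if word.isupper():
        # All capitals: uppercase everything, including change characters.
        return hyphenated.upper()
    if word[:1].isupper():
        # Capitalized word: only the first character is uppercased.
        return hyphenated[:1].upper() + hyphenated[1:]
    return hyphenated


print(restore_case("SCHIFFAHRT", "schiff=fahrt"))
print(restore_case("Schiffahrt", "schiff=fahrt"))
```

Mixed-case words, and letters whose uppercase form has a different length,
would need more careful handling than this sketch provides.
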
László Németh
<nemeth (at) openoffice.org>
