Wed, 31 Dec 2014 07:22:50 +0100
Correct previous dual key logic pending first delivery installment.
michael@0 | 1 | Non-standard hyphenation |
michael@0 | 2 | ------------------------ |
michael@0 | 3 | |
michael@0 | 4 | Some languages use non-standard hyphenation; `discretionary' |
michael@0 | 5 | character changes at hyphenation points. For example, |
michael@0 | 6 | Catalan: paral·lel -> paral-lel, |
michael@0 | 7 | Dutch: omaatje -> oma-tje, |
michael@0 | 8 | German (before the new orthography): Schiffahrt -> Schiff-fahrt, |
michael@0 | 9 | Hungarian: asszonnyal -> asz-szony-nyal (multiple occurrence!) |
michael@0 | 10 | Swedish: tillata -> till-lata. |
michael@0 | 11 | |
michael@0 | 12 | Using this extended library, you can define |
michael@0 | 13 | non-standard hyphenation patterns. For example: |
michael@0 | 14 | |
michael@0 | 15 | l·1l/l=l |
michael@0 | 16 | a1atje./a=t,1,3 |
michael@0 | 17 | .schif1fahrt/ff=f,5,2 |
michael@0 | 18 | .as3szon/sz=sz,2,3 |
michael@0 | 19 | n1nyal./ny=ny,1,3 |
michael@0 | 20 | .til1lata./ll=l,3,2 |
michael@0 | 21 | |
michael@0 | 22 | or with narrow boundaries: |
michael@0 | 23 | |
michael@0 | 24 | l·1l/l=,1,2 |
michael@0 | 25 | a1atje./a=,1,2 |
michael@0 | 26 | .schif1fahrt/ff=,5,1 |
michael@0 | 27 | .as3szon/sz=,2,1 |
michael@0 | 28 | n1nyal./ny=,1,1 |
michael@0 | 29 | .til1lata./ll=,3,1 |
michael@0 | 30 | |
michael@0 | 31 | Note: Libhnj uses patterns preprocessed by substrings.pl. |
michael@0 | 32 | Unfortunately, the conversion step can now generate bad non-standard |
michael@0 | 33 | patterns (non-standard -> standard pattern conversion), so using |
michael@0 | 34 | narrow boundaries may be better for recent Libhnj. For example, |
michael@0 | 35 | substrings.pl generates a few bad patterns for the Hungarian hyphenation |
michael@0 | 36 | patterns, resulting in bad non-standard hyphenation in a few cases. Using |
michael@0 | 37 | narrow boundaries solves this problem. The Java HyFo module can check for it. |
michael@0 | 38 | |
michael@0 | 39 | Syntax of the non-standard hyphenation patterns |
michael@0 | 40 | ------------------------------------------------ |
michael@0 | 41 | |
michael@0 | 42 | pat1tern/change[,start,cut] |
michael@0 | 43 | |
michael@0 | 44 | If this pattern matches the word and wins (see README.hyphen) |
michael@0 | 45 | in the change region of the pattern, then the substring |
michael@0 | 46 | pattern[start, start + cut - 1] is replaced with the "change". |
michael@0 | 47 | |
michael@0 | 48 | For example, a German ff -> ff-f hyphenation: |
michael@0 | 49 | |
michael@0 | 50 | f1f/ff=f |
michael@0 | 51 | |
michael@0 | 52 | or with expansion |
michael@0 | 53 | |
michael@0 | 54 | f1f/ff=f,1,2 |
michael@0 | 55 | |
michael@0 | 56 | will replace every "ff" with "ff=f" at hyphenation. |
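The replacement rule above can be sketched in a few lines of Python. This is only an illustration, not Libhnj code: the function name apply_change and the match_offset parameter (the 0-based index in the word where the pattern's letters begin) are assumptions made for this example.

```python
def apply_change(word, match_offset, change, start, cut):
    """Replace pattern[start, start + cut - 1] with `change`.

    word         -- the word being hyphenated (a Unicode string)
    match_offset -- hypothetical helper value: 0-based index in `word`
                    where the letters of the pattern start matching
    start        -- 1-based start of the change region in the pattern
    cut          -- number of characters removed from the word
    """
    begin = match_offset + start - 1
    return word[:begin] + change + word[begin + cut:]


# f1f/ff=f,1,2 matching at the "ff" of "schiffahrt" (offset 4)
print(apply_change("schiffahrt", 4, "ff=f", 1, 2))  # schiff=fahrt
# schif3fahrt/ff=f,5,2 matching at the start of the word
print(apply_change("schiffahrt", 0, "ff=f", 5, 2))  # schiff=fahrt
```

Both calls describe the same change region; only the length of the matching pattern differs.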
michael@0 | 57 | |
michael@0 | 58 | A more realistic example: |
michael@0 | 59 | |
michael@0 | 60 | % simple ff -> f-f hyphenation |
michael@0 | 61 | f1f |
michael@0 | 62 | % Schiffahrt -> Schiff-fahrt hyphenation |
michael@0 | 63 | % |
michael@0 | 64 | schif3fahrt/ff=f,5,2 |
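Which of the two patterns above decides the hyphenation point can be illustrated with a small sketch of the value comparison in Liang's algorithm: at every inter-letter position the greatest value wins, and an odd winner allows hyphenation. The code below is not the Libhnj implementation; the function names are hypothetical, and keeping the first pattern seen when two values are equal is an assumption of this sketch.

```python
def pattern_values(pattern):
    """Split a Liang pattern into its letters and inter-letter values.

    values[i] is the digit written before letters[i] (0 if none);
    values[len(letters)] is the digit after the last letter.
    """
    letters, values = "", [0]
    for ch in pattern:
        if ch in "0123456789":
            values[-1] = int(ch)
        else:
            letters += ch
            values.append(0)
    return letters, values


def winners(word, patterns):
    """Return {position: (value, pattern)} for the odd winning values.

    `position` is the number of word letters before the hyphenation point.
    """
    dotted = "." + word + "."
    best = {}
    for pat in patterns:
        # Only the part before "/" takes part in the matching.
        letters, values = pattern_values(pat.split("/")[0])
        at = dotted.find(letters)
        while at != -1:
            for k, v in enumerate(values):
                pos = at + k - 1
                # Assumption: only a strictly greater value replaces a winner.
                if v > best.get(pos, (0, None))[0]:
                    best[pos] = (v, pat)
            at = dotted.find(letters, at + 1)
    return {p: b for p, b in best.items() if b[0] % 2 == 1}


print(winners("schiffahrt", ["f1f", "schif3fahrt/ff=f,5,2"]))
# {5: (3, 'schif3fahrt/ff=f,5,2')}
```

With f3f in place of f1f, both patterns carry the value 3 at the same point, and in this sketch the result would depend on the pattern order.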
michael@0 | 65 | |
michael@0 | 66 | Specification |
michael@0 | 67 | |
michael@0 | 68 | - Pattern: matching patterns of Liang's original algorithm |
michael@0 | 69 | - patterns must contain only one hyphenation point in the change region, |
michael@0 | 70 | marked with a one-digit odd number (1, 3, 5, 7 or 9). |
michael@0 | 71 | This point may be at a subregion boundary: schif3fahrt/ff=,5,1 |
michael@0 | 72 | - only the greater value guarantees the win (don't mix standard and |
michael@0 | 73 | non-standard patterns with the same value, for example |
michael@0 | 74 | instead of f3f and schif3fahrt/ff=f,5,2 use f3f and schif5fahrt/ff=f,5,2) |
michael@0 | 75 | |
michael@0 | 76 | - Change: new characters. |
michael@0 | 77 | Arbitrary character sequence. The equal sign (=) marks hyphenation points |
michael@0 | 78 | for OpenOffice.org (like in the example). (In a possible German LaTeX |
michael@0 | 79 | preprocessor, ff could be replaced with "ff, for a Hungarian one, ssz |
michael@0 | 80 | with `ssz, according to the German and Hungarian Babel settings.) |
michael@0 | 81 | |
michael@0 | 82 | - Start: starting position of the change region. |
michael@0 | 83 | - begins with 1 (not 0): schif3fahrt/ff=f,5,2 |
michael@0 | 84 | - start dot doesn't matter: .schif3fahrt/ff=f,5,2 |
michael@0 | 85 | - numbers don't matter: .s2c2h2i2f3f2ahrt/ff=f,5,2 |
michael@0 | 86 | - In UTF-8 encoding, use Unicode character positions: össze/sz=sz,2,3 |
michael@0 | 87 | ("össze" looks like "Ã¶ssze" in an ISO 8859-1 8-bit editor). |
michael@0 | 88 | |
michael@0 | 89 | - Cut: length of the removed character sequence in the original word. |
michael@0 | 90 | - In UTF-8 encoding, use Unicode character length: paral·1lel/l=l,5,3 |
michael@0 | 91 | ("paral·lel" looks like "paralÂ·lel" in an ISO 8859-1 8-bit editor). |
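The rules for Start and Cut can be checked with a small parser sketch. It is not part of Libhnj; the names parse_nonstandard and changed_region are hypothetical. The default used when start and cut are omitted (the whole pattern) is an assumption inferred from the f1f/ff=f and f1f/ff=f,1,2 pair above. Python strings are sequences of Unicode characters, so the positions are character positions, not UTF-8 byte positions.

```python
DIGITS = "0123456789"


def parse_nonstandard(line):
    """Parse 'pattern/change[,start,cut]' into (letters, change, start, cut)."""
    pattern, _, rest = line.partition("/")
    # Positions are counted on the letters only: dots and digits are skipped.
    letters = "".join(c for c in pattern if c != "." and c not in DIGITS)
    parts = rest.rsplit(",", 2)
    if len(parts) == 3 and parts[1].isdigit() and parts[2].isdigit():
        change, start, cut = parts[0], int(parts[1]), int(parts[2])
    else:
        # Assumed default: the change region covers the whole pattern.
        change, start, cut = rest, 1, len(letters)
    return letters, change, start, cut


def changed_region(line):
    """Return the part of the pattern that the change replaces."""
    letters, _change, start, cut = parse_nonstandard(line)
    return letters[start - 1:start - 1 + cut]


print(changed_region(".s2c2h2i2f3f2ahrt/ff=f,5,2"))  # ff
print(changed_region("össze/sz=sz,2,3"))             # ssz
print(changed_region("paral·1lel/l=l,5,3"))          # l·l
# Character count versus UTF-8 byte count of the same word:
print(len("össze"), len("össze".encode("utf-8")))    # 5 6
```

Counting bytes instead of characters would shift every position after a non-ASCII letter, which is why the character positions matter for UTF-8 patterns.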
michael@0 | 92 | |
michael@0 | 93 | Dictionary developing |
michael@0 | 94 | --------------------- |
michael@0 | 95 | |
michael@0 | 96 | There is no extended PatGen pattern generator for non-standard |
michael@0 | 97 | hyphenation patterns yet. |
michael@0 | 98 | |
michael@0 | 99 | Fortunately, non-standard hyphenation points are forbidden in the |
michael@0 | 100 | PatGen-generated hyphenation patterns, so non-standard hyphenation |
michael@0 | 101 | patterns can also be developed in this case with a little patch. |
michael@0 | 102 | |
michael@0 | 103 | Warning: If you use UTF-8 Unicode encoding in your patterns, call |
michael@0 | 104 | substrings.pl with the UTF-8 parameter to calculate the right |
michael@0 | 105 | character positions for non-standard hyphenation: |
michael@0 | 106 | |
michael@0 | 107 | ./substrings.pl input output UTF-8 |
michael@0 | 108 | |
michael@0 | 109 | Programming |
michael@0 | 110 | ----------- |
michael@0 | 111 | |
michael@0 | 112 | Use hyphenate2() or hyphenate3() to handle non-standard hyphenation. |
michael@0 | 113 | See hyphen.h for the documentation of the hyphenate*() functions. |
michael@0 | 114 | See example.c for processing the output of the hyphenate*() functions. |
michael@0 | 115 | |
michael@0 | 116 | Warning: change characters are lowercase in the source, so you may need |
michael@0 | 117 | to convert their case based on case detection of the input word. |
michael@0 | 118 | For example, see OpenOffice.org source |
michael@0 | 119 | (lingucomponent/source/hyphenator/altlinuxhyph/hyphen/hyphenimp.cxx). |
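The case conversion mentioned in the warning could look like the following minimal sketch. It is not the OpenOffice.org implementation; the function match_case and its begin parameter (the 0-based index in the word where the change region starts) are hypothetical, and only all-uppercase and capitalized words are handled.

```python
def match_case(change, word, begin):
    """Adapt a lowercase change string to the case of the input word."""
    if word.isupper():
        # All-caps word: uppercase the whole change ("=" is unaffected).
        return change.upper()
    if begin == 0 and word[:1].isupper():
        # Capitalized word whose first letter lies in the change region.
        return change[:1].upper() + change[1:]
    return change


print(match_case("ff=f", "SCHIFFAHRT", 4))  # FF=F
print(match_case("ff=f", "Schiffahrt", 4))  # ff=f
```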
michael@0 | 120 | |
michael@0 | 121 | László Németh |
michael@0 | 122 | <nemeth (at) openoffice.org> |