Non-standard hyphenation
------------------------

Some languages use non-standard hyphenation: `discretionary'
character changes at hyphenation points. For example,
Catalan: paral·lel -> paral-lel,
Dutch: omaatje -> oma-tje,
German (before the new orthography): Schiffahrt -> Schiff-fahrt,
Hungarian: asszonnyal -> asz-szony-nyal (multiple occurrences!),
Swedish: tillata -> till-lata.

Using this extended library, you can define
non-standard hyphenation patterns. For example:

l·1l/l=l
a1atje./a=t,1,3
.schif1fahrt/ff=f,5,2
.as3szon/sz=sz,2,3
n1nyal./ny=ny,1,3
.til1lata./ll=l,3,2

or with narrow boundaries:

l·1l/l=,1,2
a1atje./a=,1,2
.schif1fahrt/ff=,5,1
.as3szon/sz=,2,1
n1nyal./ny=,1,1
.til1lata./ll=,3,1

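Comparing the two lists for the German, Hungarian and Swedish rules, the
narrow form appears to keep the change only up to the hyphenation mark (=)
and to shorten the cut by the length of the dropped tail. The Python sketch
below illustrates that reading. to_narrow() is a hypothetical helper written
for this note; it is not part of Libhnj or substrings.pl, and it assumes that
the text after "=" is an unchanged copy of the last replaced characters.

```python
import re

def to_narrow(rule):
    """Derive a narrow-boundary rule from a wide one (illustrative only)."""
    pattern, _, rest = rule.partition("/")
    # Digits and dots do not count as pattern positions.
    letters = re.sub(r"[0-9.]", "", pattern)
    change, *numbers = rest.split(",")
    if numbers:
        start, cut = int(numbers[0]), int(numbers[1])
    else:
        # Assumption: without start and cut the whole pattern is replaced.
        start, cut = 1, len(letters)
    head, _, tail = change.partition("=")
    replaced = letters[start - 1:start - 1 + cut]
    if not replaced.endswith(tail):
        raise ValueError("text after '=' is not an unchanged suffix")
    # Keep the change up to '=', and cut only what precedes the kept tail.
    return f"{pattern}/{head}=,{start},{cut - len(tail)}"

print(to_narrow(".schif1fahrt/ff=f,5,2"))  # .schif1fahrt/ff=,5,1
print(to_narrow(".as3szon/sz=sz,2,3"))     # .as3szon/sz=,2,1
print(to_narrow(".til1lata./ll=l,3,2"))    # .til1lata./ll=,3,1
```
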
Note: Libhnj uses modified patterns prepared by substrings.pl.
Unfortunately, this conversion step (non-standard -> standard pattern
conversion) can now generate bad non-standard patterns, so using
narrow boundaries may be better for recent Libhnj. For example,
substrings.pl generates a few bad patterns from the Hungarian hyphenation
patterns, resulting in bad non-standard hyphenation in a few cases. Using
narrow boundaries solves this problem. The Java HyFo module can check for
this problem.

Syntax of the non-standard hyphenation patterns
-----------------------------------------------

pat1tern/change[,start,cut]

If this pattern matches the word, and this pattern wins (see README.hyphen)
in the change region of the pattern, then the pattern[start, start + cut - 1]
substring will be replaced with the "change".

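The replacement rule above can be sketched in a few lines of Python.
parse_rule() and apply_rule() are hypothetical names, not Libhnj functions.
The sketch makes several simplifying assumptions: it takes the first place
where the pattern letters occur, ignores the competition between patterns
and the word-boundary meaning of the dots, and assumes the change contains
no comma.

```python
import re

def parse_rule(rule):
    """Split 'pat1tern/change[,start,cut]' into (letters, change, start, cut)."""
    pattern, _, rest = rule.partition("/")
    # Digits and dots are not counted in start and cut.
    letters = re.sub(r"[0-9.]", "", pattern)
    change, *numbers = rest.split(",")
    if numbers:
        start, cut = int(numbers[0]), int(numbers[1])
    else:
        # Default: the whole pattern is replaced (f1f/ff=f equals f1f/ff=f,1,2).
        start, cut = 1, len(letters)
    return letters, change, start, cut

def apply_rule(word, rule):
    """Replace pattern[start, start + cut - 1] inside word with the change."""
    letters, change, start, cut = parse_rule(rule)
    at = word.find(letters)
    if at < 0:
        return word  # the pattern does not match
    begin = at + start - 1  # start is 1-based
    return word[:begin] + change + word[begin + cut:]

print(apply_rule("schiffahrt", "schif3fahrt/ff=f,5,2"))  # schiff=fahrt
print(apply_rule("schiffahrt", ".schif1fahrt/ff=,5,1"))  # schiff=fahrt
print(apply_rule("asszonnyal", ".as3szon/sz=sz,2,3"))    # asz=szonnyal
```
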
For example, a German ff -> ff-f hyphenation:

f1f/ff=f

or, in expanded form with explicit start and cut,

f1f/ff=f,1,2

will replace every "ff" with "ff=f" at hyphenation.

A more realistic example:

% simple ff -> f-f hyphenation
f1f
% Schiffahrt -> Schiff-fahrt hyphenation
%
schif3fahrt/ff=f,5,2

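In the example above, the non-standard rule only applies because its value 3
beats the value 1 of f1f at the same position. The Python sketch below shows
this competition in the style of Liang's algorithm: every matching pattern
proposes values for the gaps between letters, and the highest value wins.
winners() and parse_pattern() are hypothetical helpers; in particular, letting
the first listed rule win a tie is an arbitrary choice of this sketch, not
documented Libhnj behaviour.

```python
def parse_pattern(pattern):
    """Return (letters, values); values[k] belongs to the gap before letters[k]."""
    letters, values = "", [0]
    for ch in pattern:
        if ch.isdigit():
            values[-1] = int(ch)
        else:
            letters += ch
            values.append(0)
    return letters, values

def winners(word, rules):
    """For each gap of '.word.', return (value, rule) of the strongest rule."""
    dotted = "." + word + "."
    best = [(0, None)] * (len(dotted) + 1)
    for rule in rules:
        letters, values = parse_pattern(rule.split("/")[0])
        for at in range(len(dotted) - len(letters) + 1):
            if dotted.startswith(letters, at):
                for k, value in enumerate(values):
                    # Only a strictly greater value replaces the current winner.
                    if value > best[at + k][0]:
                        best[at + k] = (value, rule)
    return best

rules = ["f1f", "schif3fahrt/ff=f,5,2"]
# Index 6 is the gap between the two f's of ".schiffahrt.".
print(winners("schiffahrt", rules)[6])  # (3, 'schif3fahrt/ff=f,5,2')
# Index 5 is the gap between the two f's of ".treffen.".
print(winners("treffen", rules)[5])     # (1, 'f1f')
```

With f3f in place of f1f, both rules would propose the same value for
"schiffahrt", and the outcome would depend on how ties are resolved.
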
Specification

- Pattern: a matching pattern of Liang's original algorithm.
  - A pattern must contain only one hyphenation point in the change region,
    marked with a one-digit odd number (1, 3, 5, 7 or 9).
    This point may be at a subregion boundary: schif3fahrt/ff=,5,1
  - Only a greater value guarantees the win (don't mix standard and
    non-standard patterns with the same value; for example,
    instead of f3f and schif3fahrt/ff=f,5,2 use f3f and schif5fahrt/ff=f,5,2).

- Change: new characters.
  Arbitrary character sequence. The equals sign (=) marks hyphenation points
  for OpenOffice.org (as in the examples). (In a possible German LaTeX
  preprocessor, ff could be replaced with "ff, and in a Hungarian one, ssz
  with `ssz, according to the German and Hungarian Babel settings.)

- Start: starting position of the change region.
  - Counting begins with 1 (not 0): schif3fahrt/ff=f,5,2
  - The starting dot doesn't count: .schif3fahrt/ff=f,5,2
  - Numbers don't count: .s2c2h2i2f3f2ahrt/ff=f,5,2
  - In UTF-8 encoding, use Unicode character positions: össze/sz=sz,2,3
    ("össze" looks like "össze" in an ISO 8859-1 8-bit editor).

- Cut: length of the removed character sequence in the original word.
  - In UTF-8 encoding, use Unicode character length: paral·1lel/l=l,5,3
    ("paral·1lel" looks like "paral·1lel" in an ISO 8859-1 8-bit editor).

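The difference between character and byte positions in the Start and Cut
items can be checked with a short Python sketch. char_span() is a
hypothetical helper; the same slice applied to the UTF-8 bytes shows what
goes wrong when bytes are counted instead of characters.

```python
def char_span(text, start, cut):
    """Return items start .. start + cut - 1 of text, counting from 1."""
    return text[start - 1:start - 1 + cut]

letters = "össze"               # pattern letters of össze/sz=sz,2,3
raw = letters.encode("utf-8")   # the same text as UTF-8 bytes

print(len(letters), len(raw))   # 5 6  ("ö" takes two bytes)
print(raw.decode("latin-1"))    # össze  (view in an ISO 8859-1 editor)
print(char_span(letters, 2, 3)) # ssz  (correct change region)
print(char_span(raw, 2, 3))     # b'\xb6ss'  (byte counting is off by one)

# paral·1lel/l=l,5,3: the middle dot is one character but two bytes.
print(char_span("paral·lel", 5, 3))  # l·l
```
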
Dictionary developing
---------------------

There is no extended PatGen pattern generator for non-standard
hyphenation patterns yet.

Fortunately, non-standard hyphenation points are forbidden in the
PatGen-generated hyphenation patterns, so with a little patching,
non-standard hyphenation patterns can be developed in this case, too.

Warning: If you use UTF-8 Unicode encoding in your patterns, call
substrings.pl with the UTF-8 parameter to calculate the right
character positions for non-standard hyphenation:

./substrings.pl input output UTF-8

Programming
-----------

Use hyphenate2() or hyphenate3() to handle non-standard hyphenation.
See hyphen.h for the documentation of the hyphenate*() functions.
See example.c for processing the output of the hyphenate*() functions.

Warning: change characters are lowercase in the source, so you may need
case conversion of the change characters based on input word case detection.
For an example, see the OpenOffice.org source
(lingucomponent/source/hyphenator/altlinuxhyph/hyphen/hyphenimp.cxx).

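A minimal Python sketch of such a case conversion follows. adapt_case() is a
hypothetical helper and does not reproduce the OpenOffice.org code. It assumes
only two cases need handling: words written entirely in capitals, and
capitalized words whose first letter falls inside the change region. The
position argument is the 1-based position in the word where the change region
begins.

```python
def adapt_case(change, word, position):
    """Adjust a lowercase change string to the case of the input word."""
    if word.isupper():
        # All-capital word: the inserted characters must be capitals, too.
        return change.upper()
    if position == 1 and word[:1].isupper():
        # The capitalized first letter is replaced, so capitalize the change.
        return change[:1].upper() + change[1:]
    return change

print(adapt_case("ff=f", "SCHIFFAHRT", 5))   # FF=F
print(adapt_case("ff=f", "Schiffahrt", 5))   # ff=f
print(adapt_case("sz=sz", "Asszonnyal", 2))  # sz=sz
```
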
László Németh
<nemeth (at) openoffice.org>