--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/intl/hyphenation/src/README.nonstandard Wed Dec 31 06:09:35 2014 +0100
@@ -0,0 +1,122 @@
+Non-standard hyphenation
+------------------------
+
+Some languages use non-standard hyphenation: `discretionary'
+character changes at hyphenation points. For example,
+Catalan: paral·lel -> paral-lel,
+Dutch: omaatje -> oma-tje,
+German (before the new orthography): Schiffahrt -> Schiff-fahrt,
+Hungarian: asszonnyal -> asz-szony-nyal (multiple occurrences!)
+Swedish: tillata -> till-lata.
+
+Using this extended library, you can define
+non-standard hyphenation patterns. For example:
+
+l·1l/l=l
+a1atje./a=t,1,3
+.schif1fahrt/ff=f,5,2
+.as3szon/sz=sz,2,3
+n1nyal./ny=ny,1,3
+.til1lata./ll=l,3,2
+
+or with narrow boundaries:
+
+l·1l/l=,1,2
+a1atje./a=,1,1
+.schif1fahrt/ff=,5,1
+.as3szon/sz=,2,1
+n1nyal./ny=,1,1
+.til1lata./ll=,3,1
+
+Note: Libhnj uses patterns preprocessed by substrings.pl.
+Unfortunately, the conversion step can currently generate bad non-standard
+patterns (non-standard -> standard pattern conversion), so using
+narrow boundaries may be better for recent Libhnj. For example,
+substrings.pl generates a few bad patterns for the Hungarian hyphenation
+patterns, resulting in bad non-standard hyphenation in a few cases. Using narrow
+boundaries solves this problem. The Java HyFo module can check for this problem.
+
+Syntax of the non-standard hyphenation patterns
+------------------------------------------------
+
+pat1tern/change[,start,cut]
+
+If this pattern matches the word, and this pattern wins (see README.hyphen)
+in the change region of the pattern, then the pattern[start, start + cut - 1]
+substring will be replaced with the "change".
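Annotation (not part of the patch): the pattern syntax can be illustrated with a minimal Python sketch that splits one pattern line into its parts. The function name `parse_nonstandard` is hypothetical and is not part of Libhnj. The default for an omitted `start,cut` (the whole pattern is the change region) is an assumption inferred from the README treating `f1f/ff=f` and `f1f/ff=f,1,2` as equivalent.

```python
import re


def parse_nonstandard(spec):
    """Split 'pat1tern/change[,start,cut]' into (letters, change, start, cut).

    'letters' is the pattern without boundary dots and digits, because the
    README says neither counts toward the start position.
    For a standard pattern (no '/'), change, start and cut are None.
    """
    pattern, _, rest = spec.partition("/")
    # Drop word-boundary dots and the Liang priority digits.
    letters = re.sub(r"[0-9]", "", pattern.strip("."))
    if not rest:
        return letters, None, None, None
    parts = rest.rsplit(",", 2)
    if len(parts) == 3 and parts[1].isdigit() and parts[2].isdigit():
        return letters, parts[0], int(parts[1]), int(parts[2])
    # Assumption: with no start,cut the whole pattern is the change region.
    return letters, rest, 1, len(letters)


print(parse_nonstandard(".schif1fahrt/ff=f,5,2"))
print(parse_nonstandard("f1f/ff=f"))
```

Positions are counted in characters, so `len(letters)` is only a valid default when the pattern is handled as a decoded string rather than as raw UTF-8 bytes.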
+
+For example, a German ff -> ff-f hyphenation:
+
+f1f/ff=f
+
+or with expansion
+
+f1f/ff=f,1,2
+
+will replace every "ff" with "ff=f" at hyphenation.
+
+A more realistic example:
+
+% simple ff -> f-f hyphenation
+f1f
+% Schiffahrt -> Schiff-fahrt hyphenation
+%
+schif3fahrt/ff=f,5,2
+
+Specification
+
+- Pattern: matching patterns of the original Liang's algorithm
+  - patterns must contain only one hyphenation point in the change region,
+    marked with a one-digit odd number (1, 3, 5, 7 or 9).
+    This point may be at a subregion boundary: schif3fahrt/ff=,5,1
+  - only the greater value guarantees the win (don't mix standard and
+    non-standard patterns with the same value, for example
+    instead of f3f and schif3fahrt/ff=f,5,2 use f3f and schif5fahrt/ff=f,5,2)
+
+- Change: new characters.
+  Arbitrary character sequence. The equals sign (=) marks hyphenation points
+  for OpenOffice.org (as in the example). (In a possible German LaTeX
+  preprocessor, ff could be replaced with "ff, for a Hungarian one, ssz
+  with `ssz, according to the German and Hungarian Babel settings.)
+
+- Start: starting position of the change region.
+  - begins with 1 (not 0): schif3fahrt/ff=f,5,2
+  - start dot doesn't matter: .schif3fahrt/ff=f,5,2
+  - numbers don't matter: .s2c2h2i2f3f2ahrt/ff=f,5,2
+  - In UTF-8 encoding, use Unicode character positions: össze/sz=sz,2,3
+    ("össze" looks like "Ã¶ssze" in an ISO 8859-1 8-bit editor).
+
+- Cut: length of the removed character sequence in the original word.
+  - In UTF-8 encoding, use Unicode character length: paral·1lel/l=l,5,3
+    ("paral·lel" looks like "paralÂ·lel" in an ISO 8859-1 8-bit editor).
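Annotation (not part of the patch): the Start and Cut rules can be checked with a small Python sketch that applies one change to a word. The helper `apply_change` is hypothetical. It takes the pattern letters (digits and boundary dots already removed), uses the first place they occur in the word, and ignores the question of which pattern wins, so it only illustrates the position arithmetic and is not Libhnj's matching logic.

```python
def apply_change(word, letters, change, start, cut):
    """Replace letters[start .. start + cut - 1] (1-based) inside word.

    Positions are counted in Unicode characters, not bytes, as the
    specification requires for UTF-8 patterns.
    """
    offset = word.find(letters)      # simplification: first match only
    if offset < 0:
        return word                  # pattern does not occur in the word
    begin = offset + start - 1       # convert the 1-based start to an index
    return word[:begin] + change + word[begin + cut:]


# German: schif3fahrt/ff=f,5,2
print(apply_change("schiffahrt", "schiffahrt", "ff=f", 5, 2))
# Dutch: a1atje./a=t,1,3 (the pattern matches inside the word)
print(apply_change("omaatje", "aatje", "a=t", 1, 3))
# Catalan: paral·1lel/l=l,5,3 (the middle dot is one character, two bytes)
print(apply_change("paral·lel", "paral·lel", "l=l", 5, 3))
print(len("paral·lel"), len("paral·lel".encode("utf-8")))
```

The last line shows why byte offsets would give the wrong region: the word has 9 characters but 10 bytes in UTF-8, so a cut computed in bytes would stop one character short after the middle dot.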
+
+Dictionary development
+----------------------
+
+There is no extended PatGen pattern generator for non-standard
+hyphenation patterns yet.
+
+Fortunately, non-standard hyphenation points are forbidden in the
+PatGen-generated hyphenation patterns, so with a little patch,
+non-standard hyphenation patterns can be developed in this case as well.
+
+Warning: If you use UTF-8 Unicode encoding in your patterns, call
+substrings.pl with the UTF-8 parameter to calculate the right
+character positions for non-standard hyphenation:
+
+./substrings.pl input output UTF-8
+
+Programming
+-----------
+
+Use hyphenate2() or hyphenate3() to handle non-standard hyphenation.
+See hyphen.h for the documentation of the hyphenate*() functions.
+See example.c for processing the output of the hyphenate*() functions.
+
+Warning: change characters are lowercase in the source, so you may need
+case conversion of the change characters based on input word case detection.
+For example, see the OpenOffice.org source
+(lingucomponent/source/hyphenator/altlinuxhyph/hyphen/hyphenimp.cxx).
+
+László Németh
+<nemeth (at) openoffice.org>