intl/hyphenation/src/README.nonstandard

changeset 0
6474c204b198
     1.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     1.2 +++ b/intl/hyphenation/src/README.nonstandard	Wed Dec 31 06:09:35 2014 +0100
     1.3 @@ -0,0 +1,122 @@
     1.4 +Non-standard hyphenation
     1.5 +------------------------
     1.6 +
     1.7 +Some languages use non-standard hyphenation, that is, `discretionary'
     1.8 +character changes at hyphenation points. For example,
     1.9 +Catalan: paral·lel -> paral-lel,
    1.10 +Dutch: omaatje -> oma-tje,
    1.11 +German (before the new orthography): Schiffahrt -> Schiff-fahrt,
    1.12 +Hungarian: asszonnyal -> asz-szony-nyal (multiple occurrences!)
    1.13 +Swedish: tillata -> till-lata.
    1.14 +
    1.15 +Using this extended library, you can define 
    1.16 +non-standard hyphenation patterns. For example:
    1.17 +
    1.18 +l·1l/l=l
    1.19 +a1atje./a=t,1,3
    1.20 +.schif1fahrt/ff=f,5,2
    1.21 +.as3szon/sz=sz,2,3
    1.22 +n1nyal./ny=ny,1,3
    1.23 +.til1lata./ll=l,3,2
    1.24 +
    1.25 +or with narrow boundaries:
    1.26 +
    1.27 +l·1l/l=,1,2
    1.28 +a1atje./a=,1,1
    1.29 +.schif1fahrt/ff=,5,1
    1.30 +.as3szon/sz=,2,1
    1.31 +n1nyal./ny=,1,1
    1.32 +.til1lata./ll=,3,1
    1.33 +
    1.34 +Note: Libhnj uses patterns preprocessed by substrings.pl.
    1.35 +Unfortunately, this conversion step can currently generate bad non-standard
    1.36 +patterns (non-standard -> standard pattern conversion), so using
    1.37 +narrow boundaries may be better for recent Libhnj. For example,
    1.38 +substrings.pl generates a few bad patterns for the Hungarian hyphenation
    1.39 +patterns, resulting in bad non-standard hyphenation in a few cases. Using narrow
    1.40 +boundaries solves this problem. The Java HyFo module can check for this problem.
    1.41 +
    1.42 +Syntax of the non-standard hyphenation patterns
    1.43 +------------------------------------------------
    1.44 +
    1.45 +pat1tern/change[,start,cut]
    1.46 +
    1.47 +If this pattern matches the word, and this pattern wins (see README.hyphen)
    1.48 +in the change region of the pattern, then the substring
    1.49 +pattern[start, start + cut - 1] will be replaced with the "change".
    1.50 +
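The replacement rule above can be sketched in a few lines of Python. This is only an illustration, not the library's implementation: the function name and the `match_offset` parameter (the 0-based index in the word where the pattern's first letter matched) are assumptions made for this sketch.

```python
def apply_change(word, match_offset, change, start, cut):
    """Replace the change region of a matched pattern inside word.

    match_offset: 0-based index in word where the pattern's first letter
                  matched (hypothetical parameter for this sketch).
    start:        1-based position of the change region within the pattern.
    cut:          number of characters removed from the original word.
    """
    begin = match_offset + start - 1
    end = begin + cut
    return word[:begin] + change + word[end:]


# .schif1fahrt/ff=f,5,2 matched at the beginning of the word
print(apply_change("schiffahrt", 0, "ff=f", 5, 2))  # schiff=fahrt
# a1atje./a=t,1,3 matched at the third letter of the word
print(apply_change("omaatje", 2, "a=t", 1, 3))      # oma=tje
```

The equals sign left in the result marks the hyphenation point, as in the change strings of the patterns.
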
    1.51 +For example, a German ff -> ff-f hyphenation:
    1.52 +
    1.53 +f1f/ff=f 
    1.54 +
    1.55 +or, written out with explicit start and cut values:
    1.56 +
    1.57 +f1f/ff=f,1,2
    1.58 +
    1.59 +will replace every "ff" with "ff=f" at hyphenation.
    1.60 +
    1.61 +A more realistic example:
    1.62 +
    1.63 +% simple ff -> f-f hyphenation
    1.64 +f1f
    1.65 +% Schiffahrt -> Schiff-fahrt hyphenation
    1.66 +% 
    1.67 +schif3fahrt/ff=f,5,2
    1.68 +
    1.69 +Specification
    1.70 +
    1.71 +- Pattern: matching patterns of Liang's original algorithm
    1.72 +  - patterns must contain exactly one hyphenation point in the change region,
    1.73 +    marked with a one-digit odd number (1, 3, 5, 7 or 9).
    1.74 +    This point may be at a subregion boundary: schif3fahrt/ff=,5,1
    1.75 +  - only the greater value guarantees the win (don't mix standard and
    1.76 +    non-standard patterns with the same value; for example,
    1.77 +    instead of f3f and schif3fahrt/ff=f,5,2 use f3f and schif5fahrt/ff=f,5,2)
    1.78 +
    1.79 +- Change: new characters.
    1.80 +  Arbitrary character sequence. The equals sign (=) marks hyphenation points
    1.81 +  for OpenOffice.org (as in the example). (In a possible German LaTeX
    1.82 +  preprocessor, ff could be replaced with "ff, for a Hungarian one, ssz
    1.83 +  with `ssz, according to the German and Hungarian Babel settings.)
    1.84 +
    1.85 +- Start: starting position of the change region.
    1.86 +  - begins with 1 (not 0): schif3fahrt/ff=f,5,2
    1.87 +  - the leading dot doesn't count: .schif3fahrt/ff=f,5,2
    1.88 +  - digits don't count: .s2c2h2i2f3f2ahrt/ff=f,5,2
    1.89 +  - In UTF-8 encoding, use Unicode character positions: össze/sz=sz,2,3
    1.90 +    ("össze" looks like "Ã¶ssze" in an ISO 8859-1 8-bit editor).
    1.91 +
    1.92 +- Cut: length of the removed character sequence in the original word.
    1.93 +  - In UTF-8 encoding, use Unicode character length: paral·1lel/l=l,5,3
    1.94 +    ("paral·1lel" looks like "paralÂ·1lel" in an ISO 8859-1 8-bit editor).
    1.95 +
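As an illustration of the syntax specified above, the following Python sketch splits a pattern line into its parts. It is not how Libhnj parses patterns. The function name is hypothetical, and treating a missing start,cut pair as "1, length of the pattern letters" (with dots stripped) is an assumption based on the f1f/ff=f example.

```python
import re


def parse_pattern(line):
    """Split 'pat1tern/change[,start,cut]' into (letters, change, start, cut).

    letters is the pattern without digits and boundary dots, since neither
    counts when computing start. Standard patterns return None for the
    change, start and cut fields.
    """
    pattern, slash, rest = line.partition("/")
    letters = re.sub(r"[0-9]", "", pattern).strip(".")
    if not slash:
        return letters, None, None, None
    # Split from the right so the change itself may contain other characters.
    parts = rest.rsplit(",", 2)
    if len(parts) == 3 and parts[1].isdigit() and parts[2].isdigit():
        return letters, parts[0], int(parts[1]), int(parts[2])
    # Assumed default: the change region covers the whole pattern.
    return letters, rest, 1, len(letters)


print(parse_pattern(".s2c2h2i2f3f2ahrt/ff=f,5,2"))
print(parse_pattern("f1f/ff=f"))
```
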
    1.96 +Dictionary development
    1.97 +----------------------
    1.98 +
    1.99 +There is no extended PatGen pattern generator for non-standard
   1.100 +hyphenation patterns yet.
   1.101 +
   1.102 +Fortunately, non-standard hyphenation points are forbidden in the
   1.103 +PatGen-generated hyphenation patterns, so non-standard hyphenation
   1.104 +patterns can still be developed with a small patch in this case.
   1.105 +
   1.106 +Warning: If you use UTF-8 Unicode encoding in your patterns, call
   1.107 +substrings.pl with the UTF-8 parameter to calculate the right
   1.108 +character positions for non-standard hyphenation:
   1.109 +
   1.110 +./substrings.pl input output UTF-8
   1.111 +
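The difference between character positions and byte positions can be checked with a short Python sketch. The helper name below is hypothetical. It only compares the two ways of counting and does not reproduce what substrings.pl does.

```python
def region_positions(word, region):
    """Return (char_start, char_len, byte_start, byte_len) for region in word.

    Start positions are 1-based, as in the start field of a pattern.
    Byte values are computed on the UTF-8 encoding of the word.
    """
    index = word.index(region)
    char_start = index + 1
    byte_start = len(word[:index].encode("utf-8")) + 1
    return char_start, len(region), byte_start, len(region.encode("utf-8"))


# össze/sz=sz,2,3 uses the character values 2 and 3
print(region_positions("össze", "ssz"))      # (2, 3, 3, 3)
# paral·1lel/l=l,5,3 uses the character values 5 and 3
print(region_positions("paral·lel", "l·l"))  # (5, 3, 5, 4)
```

In both words the byte count differs from the character count as soon as a non-ASCII letter precedes or falls inside the change region, which is why the pattern fields must use character positions.
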
   1.112 +Programming
   1.113 +-----------
   1.114 +
   1.115 +Use hyphenate2() or hyphenate3() to handle non-standard hyphenation.
   1.116 +See hyphen.h for the documentation of the hyphenate*() functions.
   1.117 +See example.c for processing the output of the hyphenate*() functions.
   1.118 +
   1.119 +Warning: change characters are lowercase in the source, so you may need
   1.120 +case conversion of the change characters based on input word case detection.
   1.121 +For example, see OpenOffice.org source
   1.122 +(lingucomponent/source/hyphenator/altlinuxhyph/hyphen/hyphenimp.cxx).
   1.123 +
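A very simplified sketch of such case handling in Python follows. The function name and the all-uppercase heuristic are assumptions made for illustration. This is not the logic used in the OpenOffice.org source mentioned above, which may treat more cases.

```python
def adapt_change_case(word, change):
    """Return change in a case that matches the input word.

    Simplified heuristic for this sketch: only fully uppercase words are
    detected. Any other word keeps the lowercase change characters.
    """
    if word.isupper():
        return change.upper()
    return change


print(adapt_change_case("SCHIFFAHRT", "ff=f"))  # FF=F
print(adapt_change_case("Schiffahrt", "ff=f"))  # ff=f
```
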
   1.124 +László Németh
   1.125 +<nemeth (at) openoffice.org>
