Non-standard hyphenation
------------------------

Some languages use non-standard hyphenation: `discretionary'
character changes at hyphenation points. For example,
Catalan: paral·lel -> paral-lel,
Dutch: omaatje -> oma-tje,
German (before the new orthography): Schiffahrt -> Schiff-fahrt,
Hungarian: asszonnyal -> asz-szony-nyal (multiple occurrences!),
Swedish: tillata -> till-lata.

Using this extended library, you can define
non-standard hyphenation patterns. For example:

l·1l/l=l
a1atje./a=t,1,3
.schif1fahrt/ff=f,5,2
.as3szon/sz=sz,2,3
n1nyal./ny=ny,1,3
.til1lata./ll=l,3,2

or with narrow boundaries:

l·1l/l=,1,2
a1atje./a=,1,1
.schif1fahrt/ff=,5,1
.as3szon/sz=,2,1
n1nyal./ny=,1,1
.til1lata./ll=,3,1

Note: Libhnj uses patterns preprocessed by substrings.pl.
Unfortunately, this conversion step can currently generate bad
non-standard patterns (non-standard -> standard pattern conversion),
so using narrow boundaries may be better for recent Libhnj. For example,
substrings.pl generates a few bad patterns from the Hungarian hyphenation
patterns, resulting in bad non-standard hyphenation in a few cases. Using
narrow boundaries solves this problem. The Java HyFo module can check for
this problem.
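
The effect of the wide and narrow notations above can be tried out with
a small Python sketch. It is not Libhnj code: parse_pattern() and
apply_pattern() are hypothetical helper names, and the sketch assumes the
start/cut rules given in the syntax section of this file (start is 1-based,
dots and digits in the pattern are not counted, and a pattern without
start and cut changes the whole pattern). It ignores the competition
between pattern values and the word-boundary meaning of the dots, and it
simply uses the first place where the pattern's letters occur.

```python
import re


def parse_pattern(line):
    """Split 'pat1tern/change[,start,cut]' into (letters, change, start, cut).

    Assumption of this sketch: the change text contains no comma.
    """
    pattern, _, rest = line.partition("/")
    # Dots and digits are not counted when positions are calculated.
    letters = re.sub(r"[.\d]", "", pattern)
    fields = rest.split(",")
    change = fields[0]
    if len(fields) == 3:
        start, cut = int(fields[1]), int(fields[2])
    else:
        # Without start and cut, the whole pattern is the change region.
        start, cut = 1, len(letters)
    return letters, change, start, cut


def apply_pattern(word, line):
    """Replace the change region of the first match with the change text."""
    letters, change, start, cut = parse_pattern(line)
    offset = word.find(letters)
    if offset < 0:
        return word  # the pattern does not match this word
    begin = offset + start - 1  # start is 1-based
    return word[:begin] + change + word[begin + cut:]


# Wide and narrow boundaries give the same hyphenated form.
print(apply_pattern("schiffahrt", ".schif1fahrt/ff=f,5,2"))  # schiff=fahrt
print(apply_pattern("schiffahrt", ".schif1fahrt/ff=,5,1"))   # schiff=fahrt
print(apply_pattern("paral·lel", "l·1l/l=l"))                # paral=lel
print(apply_pattern("tillata", ".til1lata./ll=,3,1"))        # till=lata
```

The narrow form touches only the characters to the left of the hyphenation
point, which is why it depends less on the rest of the pattern.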

Syntax of the non-standard hyphenation patterns
------------------------------------------------

pat1tern/change[,start,cut]

If this pattern matches the word, and this pattern wins (see README.hyphen)
in the change region of the pattern, then the pattern[start, start + cut - 1]
substring will be replaced with the "change".

For example, a German ff -> ff-f hyphenation:

f1f/ff=f

or with expansion

f1f/ff=f,1,2

will replace every "ff" with "ff=f" at hyphenation.

A more realistic example:

% simple ff -> f-f hyphenation
f1f
% Schiffahrt -> Schiff-fahrt hyphenation
%
schif3fahrt/ff=f,5,2

Specification

- Pattern: matching patterns of Liang's original algorithm
  - patterns must contain only one hyphenation point in the change region,
    marked with a one-digit odd number (1, 3, 5, 7 or 9).
    This point may be at a subregion boundary: schif3fahrt/ff=,5,1
  - only the greater value guarantees the win (don't mix standard and
    non-standard patterns with the same value; for example,
    instead of f3f and schif3fahrt/ff=f,5,2 use f3f and schif5fahrt/ff=f,5,2)

- Change: new characters.
  Arbitrary character sequence. The equals sign (=) marks hyphenation points
  for OpenOffice.org (as in the example). (In a possible German LaTeX
  preprocessor, ff could be replaced with "ff; in a Hungarian one, ssz
  with `ssz, according to the German and Hungarian Babel settings.)

- Start: starting position of the change region.
  - begins with 1 (not 0): schif3fahrt/ff=f,5,2
  - a leading dot is not counted: .schif3fahrt/ff=f,5,2
  - digits are not counted: .s2c2h2i2f3f2ahrt/ff=f,5,2
  - In UTF-8 encoding, use Unicode character positions: össze/sz=sz,2,3
    ("össze" looks like "Ã¶ssze" in an ISO 8859-1 8-bit editor).

- Cut: length of the removed character sequence in the original word.
  - In UTF-8 encoding, use Unicode character length: paral·1lel/l=l,5,3
    ("paral·lel" looks like "paralÂ·lel" in an ISO 8859-1 8-bit editor).

Dictionary development
----------------------

No extended PatGen pattern generator exists yet for non-standard
hyphenation patterns.

Fortunately, non-standard hyphenation points are forbidden in the
PatGen-generated hyphenation patterns, so with a little patching,
non-standard hyphenation patterns can be developed in this case, too.

Warning: If you use UTF-8 Unicode encoding in your patterns, call
substrings.pl with the UTF-8 parameter to calculate the right
character positions for non-standard hyphenation:

./substrings.pl input output UTF-8

Programming
-----------

Use hyphenate2() or hyphenate3() to handle non-standard hyphenation.
See hyphen.h for the documentation of the hyphenate*() functions.
See example.c for processing the output of the hyphenate*() functions.

Warning: change characters are lowercase in the source, so you may need
case conversion of the change characters based on input word case detection.
For example, see the OpenOffice.org source
(lingucomponent/source/hyphenator/altlinuxhyph/hyphen/hyphenimp.cxx).

László Németh
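
Note on the UTF-8 positions described under Start and Cut: the following
Python sketch (not part of Libhnj; positions() is a hypothetical helper
name) shows why character positions and byte positions differ once a
pattern contains a non-ASCII letter, and what an ISO 8859-1 editor
displays for the same bytes.

```python
def positions(word, fragment):
    """Return the 1-based (character, byte) start of fragment in word."""
    char_start = word.index(fragment) + 1
    byte_start = word.encode("utf-8").index(fragment.encode("utf-8")) + 1
    return char_start, byte_start


# "ö" is one character but two bytes in UTF-8, so "ssz" starts at
# character 2 (the value to use in össze/sz=sz,2,3) but at byte 3.
print(positions("össze", "ssz"))  # (2, 3)

# The change region "l·l" is 3 characters long but 4 bytes long.
print(len("l·l"), len("l·l".encode("utf-8")))  # 3 4

# Reading the UTF-8 bytes as ISO 8859-1 gives the garbled form.
print("össze".encode("utf-8").decode("latin-1"))  # Ã¶ssze
```

Counting in an 8-bit editor therefore gives byte positions, which are
too large for every letter after a non-ASCII character.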