intl/unicharutil/tools/UCDATAREADME.txt

Author: Michael Schloh von Bennewitz <michael@schloh.com>
Date: Wed, 31 Dec 2014 06:09:35 +0100
Changeset: 0:6474c204b198

Cloned upstream origin tor-browser at tor-browser-31.3.0esr-4.5-1-build1
revision ID fc1c9ff7c1b2defdbc039f12214767608f46423f for hacking purpose.

#
# $Id: UCDATAREADME.txt,v 1.1 1999/01/08 00:19:20 ftang%netscape.com Exp $
#

                           MUTT UCData Package 1.9
                           -----------------------

This is a package that supports ctype-like operations for Unicode UCS-2 text
(and surrogates), case mapping, and decomposition lookup.  To use it, you will
need to get the "UnicodeData-2.0.14.txt" (or later) file from the Unicode Web
or FTP site.

This package consists of two parts:

  1. A program called "ucgendat" which generates five data files from the
     UnicodeData-2.*.txt file.  The files are:

     A. case.dat   - the case mappings.
     B. ctype.dat  - the character property tables.
     C. decomp.dat - the character decompositions.
     D. cmbcl.dat  - the non-zero combining classes.
     E. num.dat    - the codes representing numbers.

  2. The "ucdata.[ch]" files which implement the functions needed to
     check to see if a character matches groups of properties, to map between
     upper, lower, and title case, to look up the decomposition of a
     character, look up the combining class of a character, and get the number
     value of a character.

A short reference to the functions available is in the "api.txt" file.

Techie Details
==============

The "ucgendat" program parses the files named on the command line, all of
which are in the Unicode Character Database (UCDB) format.  An additional
properties file, "MUTTUCData.txt", provides some extra properties for some
characters.

The program looks for the two character property fields (2 and 4), the
combining class field (3), the decomposition field (5), the numeric value
field (8), and the case mapping fields (12, 13, and 14), with fields counted
from 0.  The decompositions are recursively expanded before being written out.
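
As an illustration only, the field lookup described above could be sketched in
C as follows.  The helper name split_fields() and the UCD_MAX_FIELDS constant
are hypothetical and are not taken from "ucgendat"; the sketch assumes the
semicolon-separated UnicodeData layout with fields counted from 0.

```c
#include <stddef.h>
#include <string.h>

#define UCD_MAX_FIELDS 15   /* hypothetical name: UnicodeData lines have 15 fields */

/*
 * Split one UnicodeData line in place at each ';' and store pointers to the
 * fields in `fields`.  Returns the number of fields found.  Empty fields are
 * kept (strtok() would skip them), so indexes stay aligned with the format.
 */
static size_t
split_fields(char *line, char *fields[], size_t max_fields)
{
    size_t n = 0;
    char *p = line;

    /* Drop a trailing newline, if any. */
    line[strcspn(line, "\r\n")] = '\0';

    while (n < max_fields) {
        fields[n++] = p;
        p = strchr(p, ';');
        if (p == NULL)
            break;
        *p++ = '\0';
    }
    return n;
}
```

After a call such as split_fields(line, f, UCD_MAX_FIELDS), f[2] and f[4] hold
the two property fields, f[3] the combining class, f[5] the decomposition,
f[8] the numeric value, and f[12] to f[14] the case mappings.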

The decomposition table contains all the canonical decompositions.  This means
all decompositions that do not have tags such as "<compat>" or "<font>".
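
A minimal sketch of that filter, assuming the decomposition field is a list of
space-separated hexadecimal code points optionally preceded by a tag in angle
brackets.  The function name parse_canonical_decomp() is hypothetical, and the
recursive expansion that "ucgendat" performs is not shown.

```c
#include <stdlib.h>

/*
 * Parse the decomposition field of a UnicodeData line.  Returns the number of
 * code points stored in `out`, or 0 if the field is empty or starts with a
 * tag such as "<compat>" or "<font>" (not a canonical decomposition).
 */
static size_t
parse_canonical_decomp(const char *field, unsigned long out[], size_t max_out)
{
    size_t n = 0;
    char *end;

    if (field[0] == '\0' || field[0] == '<')
        return 0;

    while (n < max_out) {
        unsigned long code = strtoul(field, &end, 16);
        if (end == field)
            break;              /* no more hex digits to read */
        out[n++] = code;
        field = end;
    }
    return n;
}
```

For example, the field "0041 030A" yields the two code points 0x0041 and
0x030A, while a field beginning with "<compat>" yields nothing.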

The data is almost all stored as unsigned longs (32 bits assumed) and the
routines that load the data take care of endian swaps when necessary.  This
also means that characters encoded with surrogates (>= 0x10000) can be placed
in the data files the "ucgendat" program parses.
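
The kind of endian swap mentioned above might look like the following sketch.
The names swap32() and swap32_array() are hypothetical, and how the loader
decides that a swap is needed is not shown here; see "format.txt" for the
actual file layout.

```c
#include <stddef.h>

/* Reverse the byte order of one 32-bit value held in an unsigned long. */
static unsigned long
swap32(unsigned long v)
{
    return ((v & 0x000000ffUL) << 24) |
           ((v & 0x0000ff00UL) <<  8) |
           ((v & 0x00ff0000UL) >>  8) |
           ((v & 0xff000000UL) >> 24);
}

/* Swap every value in a table that was written on an opposite-endian host. */
static void
swap32_array(unsigned long *vals, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++)
        vals[i] = swap32(vals[i]);
}
```

Masking before shifting keeps the result within 32 bits even on platforms
where unsigned long is wider.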

The data is written as external files and broken into five parts so it can be
selectively updated at runtime if necessary.

The data files currently generated by the "ucgendat" program total about 56K
altogether.

The format of the binary data files is documented in the "format.txt" file.

Mark Leisher <mleisher@crl.nmsu.edu>
13 December 1998

CHANGES
=======

Version 1.9
-----------
1. Fixed a problem with an incorrect amount of storage being allocated for the
   combining class nodes.

2. Fixed an invalid initialization in the number code.

3. Changed the Java template file formatting a bit.

4. Added tables and a function for getting decompositions in the Java class.

Version 1.8
-----------
1. Fixed a problem with adding certain ranges.

2. Added two more macros for testing for identifiers.

3. Tested with the UnicodeData-2.1.5.txt file.

Version 1.7
-----------
1. Fixed a problem with looking up decompositions in "ucgendat."

Version 1.6
-----------
1. Added two new properties introduced with UnicodeData-2.1.4.txt.

2. Changed the "ucgendat.c" program a little to automatically align the
   property data on a 4-byte boundary when new properties are added.

3. Changed the "ucgendat.c" program to only generate canonical
   decompositions.

4. Added two new macros ucisinitialpunct() and ucisfinalpunct() to check for
   initial and final punctuation characters.

5. Minor additions and changes to the documentation.

Version 1.5
-----------
1. Changed all file open calls to include binary mode with "b" for DOS/WIN
   platforms.

2. Wrapped the unistd.h include so it won't be included when compiled under
   Win32.

3. Fixed a bad range check for hex digits in ucgendat.c.

4. Fixed a bad endian swap for combining classes.

5. Added code to make a number table and associated lookup functions.
   Functions added are ucnumber(), ucdigit(), and ucgetnumber().  The last
   function is to maintain compatibility with John Cowan's "uctype" package.

Version 1.4
-----------
1. Fixed a bug with adding a range.

2. Fixed a bug with inserting a range in order.

3. Fixed incorrectly specified ucisdefined() and ucisundefined() macros.

4. Added the missing unload for the combining class data.

5. Fixed a bad macro placement in ucisweak().

Version 1.3
-----------
1. Bug with case mapping calculations fixed.

2. Bug with empty character property entries fixed.

3. Bug with incorrect type in the combining class lookup fixed.

4. Some corrections done to api.txt.

5. Bug in certain character property lookups fixed.

6. Added a character property table that records the defined characters.

7. Replaced ucisunknown() with ucisdefined() and ucisundefined().

Version 1.2
-----------
1. Added code to ucgendat to generate a combining class table.

2. Fixed an endian problem with the byte count of decompositions.

3. Fixed some minor problems in the "format.txt" file.

4. Removed some bogus "Ss" values from MUTTUCData.txt file.

5. Added API function to get combining class.

6. Changed the open mode to "rb" so binary data files will be opened correctly
   on DOS/WIN as well as other platforms.

7. Added the "api.txt" file.

Version 1.1
-----------
1. Added ucisxdigit() which I overlooked.

2. Added UC_LT to the ucisalpha() macro which I overlooked.

3. Changed uciscntrl() to include UC_CF.

4. Added ucisocntrl() and ucfntcntrl() macros.

5. Added a ucisblank() which I overlooked.

6. Added missing properties to ucissymbol() and ucisnumber().

7. Added ucisgraph() and ucisprint().

8. Changed the "Mr" property to "Sy" to mark this subset of mirroring
   characters as symmetric to avoid trampling the Unicode/ISO10646 sense of
   mirroring.

9. Added another property called "Ss" which includes control characters
   traditionally seen as spaces in the isspace() macro.

10. Added a bunch of macros to be API compatible with John Cowan's package.

ACKNOWLEDGEMENTS
================

Thanks go to John Cowan <cowan@locke.ccil.org> for pointing out lots of
missing things and giving me stuff, particularly a bunch of new macros.

Thanks go to Bob Verbrugge <bob_verbrugge@nl.compuware.com> for pointing out
various bugs.

Thanks go to Christophe Pierret <cpierret@businessobjects.com> for pointing
out that file modes need to have "b" for DOS/WIN machines, pointing out
unistd.h is not a Win32 header, and pointing out a problem with ucisalnum().

Thanks go to Kent Johnson <kent@pondview.mv.com> for finding a bug that caused
incomplete decompositions to be generated by the "ucgendat" program.

Thanks go to Valeriy E. Ushakov <uwe@ptc.spbu.ru> for spotting an allocation
error and an initialization error.
