intl/unicharutil/tools/UCDATAREADME.txt

changeset 0
6474c204b198
     1.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     1.2 +++ b/intl/unicharutil/tools/UCDATAREADME.txt	Wed Dec 31 06:09:35 2014 +0100
     1.3 @@ -0,0 +1,207 @@
     1.4 +#
     1.5 +# $Id: UCDATAREADME.txt,v 1.1 1999/01/08 00:19:20 ftang%netscape.com Exp $
     1.6 +#
     1.7 +
     1.8 +                           MUTT UCData Package 1.9
     1.9 +                           -----------------------
    1.10 +
    1.11 +This is a package that supports ctype-like operations for Unicode UCS-2 text
    1.12 +(and surrogates), case mapping, and decomposition lookup.  To use it, you will
    1.13 +need to get the "UnicodeData-2.0.14.txt" (or later) file from the Unicode Web
    1.14 +or FTP site.
    1.15 +
    1.16 +This package consists of two parts:
    1.17 +
    1.18 +  1. A program called "ucgendat" which generates five data files from the
    1.19 +     UnicodeData-2.*.txt file.  The files are:
    1.20 +
    1.21 +     A. case.dat   - the case mappings.
    1.22 +     B. ctype.dat  - the character property tables.
    1.23 +     C. decomp.dat - the character decompositions.
    1.24 +     D. cmbcl.dat  - the non-zero combining classes.
    1.25 +     E. num.dat    - the codes representing numbers.
    1.26 +
    1.27 +  2. The "ucdata.[ch]" files which implement the functions needed to
    1.28 +     check whether a character matches groups of properties, to map between
    1.29 +     upper, lower, and title case, to look up the decomposition of a
    1.30 +     character, to look up the combining class of a character, and to get
    1.31 +     the numeric value of a character.
    1.32 +
    1.33 +A short reference to the functions available is in the "api.txt" file.
    1.34 +
    1.35 +Techie Details
    1.36 +==============
    1.37 +
    1.38 +The "ucgendat" program parses files from the command line which are all in the
    1.39 +Unicode Character Database (UCDB) format.  An additional properties file,
    1.40 +"MUTTUCData.txt", provides some extra properties for some characters.
    1.41 +
    1.42 +The program looks for the two character property fields (2 and 4), the
    1.43 +combining class field (3), the decomposition field (5), the numeric value
    1.44 +field (8), and the case mapping fields (12, 13, and 14), counting fields from
    1.45 +zero.  The decompositions are recursively expanded before being written out.
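
As an illustration of the field numbering, here is a minimal sketch of splitting
one UnicodeData-style line into its semicolon-separated fields.  It is not taken
from "ucgendat.c"; the function name split_fields() is hypothetical.  The field
numbers quoted above count from zero, so field 0 is the code point and field 1
is the character name.

```c
#include <string.h>

/*
 * Hypothetical helper, not part of the UCData package.
 *
 * Split a UnicodeData-style line in place on ';' and store a pointer to
 * each field in fields[].  Empty fields are kept (strtok() would skip
 * them), so the indices stay aligned with the field numbers.  At most
 * max fields are stored.  Returns the number of fields stored.
 */
static int
split_fields(char *line, char *fields[], int max)
{
    int n = 0;
    char *p = line;

    while (n < max) {
        fields[n++] = p;
        p = strchr(p, ';');
        if (p == NULL)
            break;
        *p++ = '\0';    /* terminate this field, step to the next one */
    }
    return n;
}
```

Given the entry for U+00C0, fields[2] is "Lu", fields[3] is "0", fields[5] is
"0041 0300", fields[8] is empty, and fields[13] is "00E0".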
    1.46 +
    1.47 +The decomposition table contains all the canonical decompositions.  This means
    1.48 +all decompositions that do not have tags such as "<compat>" or "<font>".
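
The canonical/tagged distinction can be sketched as follows.  This is only an
assumption about how such a filter might look, not the code in "ucgendat.c",
and the name parse_canonical_decomp() is hypothetical.  A decomposition field
that starts with '<' carries a tag and is skipped; anything else is read as a
list of hexadecimal code points.

```c
#include <stdlib.h>

/*
 * Hypothetical helper, not part of the UCData package.
 *
 * Parse the decomposition field of a UnicodeData entry.  Tagged
 * decompositions ("<compat> ...", "<font> ...") and empty fields yield 0.
 * Otherwise up to max code points are stored in out[] and the number
 * stored is returned.
 */
static int
parse_canonical_decomp(const char *field, unsigned long out[], int max)
{
    int n = 0;
    char *end;

    if (field[0] == '\0' || field[0] == '<')
        return 0;

    while (n < max) {
        unsigned long code = strtoul(field, &end, 16);

        if (end == field)   /* no more hex digits to convert */
            break;
        out[n++] = code;
        field = end;
    }
    return n;
}
```

This sketch does a single level of parsing only; the recursive expansion
mentioned earlier would apply the same lookup again to each code point stored.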
    1.49 +
    1.50 +The data is almost all stored as unsigned longs (32 bits assumed) and the
    1.51 +routines that load the data take care of endian swaps when necessary.  This
    1.52 +also means that code points reached through surrogates (>= 0x10000) can be
    1.53 +placed in the data files the "ucgendat" program parses.
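
A loader-side endian swap might look like the sketch below.  The "format.txt"
file is the authority on the real file layout; the 0xFEFF byte-order marker
and the function names here are assumptions made for illustration.  The sketch
uses uint32_t rather than unsigned long because unsigned long is 64 bits wide
on many current platforms, which breaks the 32-bit assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of one 32-bit value. */
static uint32_t
swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

/*
 * Assumed convention: the writer stores the 16-bit marker 0xFEFF at the
 * start of the file.  A reader with the opposite byte order sees 0xFFFE
 * and knows every value that follows must be swapped.
 */
static int
needs_swap(uint16_t marker)
{
    return marker == 0xFFFEu;
}

/* Swap an array of n 32-bit values in place. */
static void
swap32_array(uint32_t *data, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        data[i] = swap32(data[i]);
}
```

A loader would read the marker first, call needs_swap() on it, and run
swap32_array() over each table only when the answer is nonzero.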
    1.54 +
    1.55 +The data is written as external files and broken into five parts so it can be
    1.56 +selectively updated at runtime if necessary.
    1.57 +
    1.58 +The data files currently generated by the "ucgendat" program total about 56K
    1.59 +altogether.
    1.60 +
    1.61 +The format of the binary data files is documented in the "format.txt" file.
    1.62 +
    1.63 +Mark Leisher <mleisher@crl.nmsu.edu>
    1.64 +13 December 1998
    1.65 +
    1.66 +CHANGES
    1.67 +=======
    1.68 +
    1.69 +Version 1.9
    1.70 +-----------
    1.71 +1. Fixed a problem with an incorrect amount of storage being allocated for the
    1.72 +   combining class nodes.
    1.73 +
    1.74 +2. Fixed an invalid initialization in the number code.
    1.75 +
    1.76 +3. Changed the Java template file formatting a bit.
    1.77 +
    1.78 +4. Added tables and function for getting decompositions in the Java class.
    1.79 +
    1.80 +Version 1.8
    1.81 +-----------
    1.82 +1. Fixed a problem with adding certain ranges.
    1.83 +
    1.84 +2. Added two more macros for testing for identifiers.
    1.85 +
    1.86 +3. Tested with the UnicodeData-2.1.5.txt file.
    1.87 +
    1.88 +Version 1.7
    1.89 +-----------
    1.90 +1. Fixed a problem with looking up decompositions in "ucgendat".
    1.91 +
    1.92 +Version 1.6
    1.93 +-----------
    1.94 +1. Added two new properties introduced with UnicodeData-2.1.4.txt.
    1.95 +
    1.96 +2. Changed the "ucgendat.c" program a little to automatically align the
    1.97 +   property data on a 4-byte boundary when new properties are added.
    1.98 +
    1.99 +3. Changed the "ucgendat.c" program to only generate canonical
   1.100 +   decompositions.
   1.101 +
   1.102 +4. Added two new macros ucisinitialpunct() and ucisfinalpunct() to check for
   1.103 +   initial and final punctuation characters.
   1.104 +
   1.105 +5. Minor additions and changes to the documentation.
   1.106 +
   1.107 +Version 1.5
   1.108 +-----------
   1.109 +1. Changed all file open calls to include binary mode with "b" for DOS/WIN
   1.110 +   platforms.
   1.111 +
   1.112 +2. Wrapped the unistd.h include so it won't be included when compiled under
   1.113 +   Win32.
   1.114 +
   1.115 +3. Fixed a bad range check for hex digits in ucgendat.c.
   1.116 +
   1.117 +4. Fixed a bad endian swap for combining classes.
   1.118 +
   1.119 +5. Added code to make a number table and associated lookup functions.
   1.120 +   Functions added are ucnumber(), ucdigit(), and ucgetnumber().  The last
   1.121 +   function is to maintain compatibility with John Cowan's "uctype" package.
   1.122 +
   1.123 +Version 1.4
   1.124 +-----------
   1.125 +1. Fixed a bug with adding a range.
   1.126 +
   1.127 +2. Fixed a bug with inserting a range in order.
   1.128 +
   1.129 +3. Fixed incorrectly specified ucisdefined() and ucisundefined() macros.
   1.130 +
   1.131 +4. Added the missing unload for the combining class data.
   1.132 +
   1.133 +5. Fixed a bad macro placement in ucisweak().
   1.134 +
   1.135 +Version 1.3
   1.136 +-----------
   1.137 +1. Bug with case mapping calculations fixed.
   1.138 +
   1.139 +2. Bug with empty character property entries fixed.
   1.140 +
   1.141 +3. Bug with incorrect type in the combining class lookup fixed.
   1.142 +
   1.143 +4. Some corrections done to api.txt.
   1.144 +
   1.145 +5. Bug in certain character property lookups fixed.
   1.146 +
   1.147 +6. Added a character property table that records the defined characters.
   1.148 +
   1.149 +7. Replaced ucisunknown() with ucisdefined() and ucisundefined().
   1.150 +
   1.151 +Version 1.2
   1.152 +-----------
   1.153 +1. Added code to ucgendat to generate a combining class table.
   1.154 +
   1.155 +2. Fixed an endian problem with the byte count of decompositions.
   1.156 +
   1.157 +3. Fixed some minor problems in the "format.txt" file.
   1.158 +
   1.159 +4. Removed some bogus "Ss" values from MUTTUCData.txt file.
   1.160 +
   1.161 +5. Added API function to get combining class.
   1.162 +
   1.163 +6. Changed the open mode to "rb" so binary data files will be opened correctly
   1.164 +   on DOS/WIN as well as other platforms.
   1.165 +
   1.166 +7. Added the "api.txt" file.
   1.167 +
   1.168 +Version 1.1
   1.169 +-----------
   1.170 +1. Added ucisxdigit() which I overlooked.
   1.171 +
   1.172 +2. Added UC_LT to the ucisalpha() macro which I overlooked.
   1.173 +
   1.174 +3. Changed uciscntrl() to include UC_CF.
   1.175 +
   1.176 +4. Added ucisocntrl() and ucfntcntrl() macros.
   1.177 +
   1.178 +5. Added a ucisblank() which I overlooked.
   1.179 +
   1.180 +6. Added missing properties to ucissymbol() and ucisnumber().
   1.181 +
   1.182 +7. Added ucisgraph() and ucisprint().
   1.183 +
   1.184 +8. Changed the "Mr" property to "Sy" to mark this subset of mirroring
   1.185 +   characters as symmetric to avoid trampling the Unicode/ISO10646 sense of
   1.186 +   mirroring.
   1.187 +
   1.188 +9. Added another property called "Ss" which includes control characters
   1.189 +   traditionally seen as spaces in the isspace() macro.
   1.190 +
   1.191 +10. Added a bunch of macros to be API compatible with John Cowan's package.
   1.192 +
   1.193 +ACKNOWLEDGEMENTS
   1.194 +================
   1.195 +
   1.196 +Thanks go to John Cowan <cowan@locke.ccil.org> for pointing out lots of
   1.197 +missing things and giving me stuff, particularly a bunch of new macros.
   1.198 +
   1.199 +Thanks go to Bob Verbrugge <bob_verbrugge@nl.compuware.com> for pointing out
   1.200 +various bugs.
   1.201 +
   1.202 +Thanks go to Christophe Pierret <cpierret@businessobjects.com> for pointing
   1.203 +out that file modes need to have "b" for DOS/WIN machines, pointing out
   1.204 +unistd.h is not a Win32 header, and pointing out a problem with ucisalnum().
   1.205 +
   1.206 +Thanks go to Kent Johnson <kent@pondview.mv.com> for finding a bug that caused
   1.207 +incomplete decompositions to be generated by the "ucgendat" program.
   1.208 +
   1.209 +Thanks go to Valeriy E. Ushakov <uwe@ptc.spbu.ru> for spotting an allocation
   1.210 +error and an initialization error.