intl/unicharutil/tools/UCDATAREADME.txt

Thu, 22 Jan 2015 13:21:57 +0100

author
Michael Schloh von Bennewitz <michael@schloh.com>
branch
TOR_BUG_9701
changeset 15
b8a032363ba2
permissions
-rw-r--r--

Incorporate requested changes from Mozilla in review:
https://bugzilla.mozilla.org/show_bug.cgi?id=1123480#c6

michael@0 1 #
michael@0 2 # $Id: UCDATAREADME.txt,v 1.1 1999/01/08 00:19:20 ftang%netscape.com Exp $
michael@0 3 #
michael@0 4
michael@0 5 MUTT UCData Package 1.9
michael@0 6 -----------------------
michael@0 7
michael@0 8 This is a package that supports ctype-like operations for Unicode UCS-2 text
michael@0 9 (and surrogates), case mapping, and decomposition lookup. To use it, you will
michael@0 10 need to get the "UnicodeData-2.0.14.txt" (or later) file from the Unicode Web
michael@0 11 or FTP site.
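The "(and surrogates)" remark above refers to characters outside the 16-bit range, which UTF-16 text carries as a pair of 16-bit units. As a rough sketch, and not code taken from this package, a pair can be combined into a single 32-bit code before any lookup. The name combine_surrogates() is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of ucdata.[ch]: combine a UTF-16 high/low
 * surrogate pair into the code point it encodes (0x10000 through 0x10FFFF).
 */
static uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    /* The caller is expected to pass a well-formed pair. */
    assert(hi >= 0xD800 && hi <= 0xDBFF);
    assert(lo >= 0xDC00 && lo <= 0xDFFF);

    /* Each surrogate contributes 10 bits; the result is offset by 0x10000. */
    return 0x10000u
         + ((uint32_t)(hi - 0xD800u) << 10)
         + (uint32_t)(lo - 0xDC00u);
}
```

For instance, the pair 0xD834, 0xDD1E combines to 0x1D11E.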
michael@0 12
michael@0 13 This package consists of two parts:
michael@0 14
michael@0 15 1. A program called "ucgendat" which generates five data files from the
michael@0 16 UnicodeData-2.*.txt file. The files are:
michael@0 17
michael@0 18 A. case.dat - the case mappings.
michael@0 19 B. ctype.dat - the character property tables.
michael@0 20 C. decomp.dat - the character decompositions.
michael@0 21 D. cmbcl.dat - the non-zero combining classes.
michael@0 22 E. num.dat - the codes representing numbers.
michael@0 23
michael@0 24 2. The "ucdata.[ch]" files, which implement the functions needed to check
michael@0 25 whether a character matches groups of properties, to map between upper,
michael@0 26 lower, and title case, to look up the decomposition of a character, to
michael@0 27 look up the combining class of a character, and to get the numeric value
michael@0 28 of a character.
michael@0 29
michael@0 30 A short reference to the functions available is in the "api.txt" file.
michael@0 31
michael@0 32 Techie Details
michael@0 33 ==============
michael@0 34
michael@0 35 The "ucgendat" program parses the files named on the command line, all of which
michael@0 36 are expected to be in the Unicode Character Database (UCDB) format. An
michael@0 37 additional file, "MUTTUCData.txt", provides extra properties for some characters.
michael@0 38
michael@0 39 The program reads the two character property fields (2 and 4), the combining
michael@0 40 class field (3), the decomposition field (5), the numeric value field (8), and
michael@0 41 the case mapping fields (12, 13, and 14), counting fields from 0. The
michael@0 42 decompositions are recursively expanded before being written out.
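To make the field numbering concrete, here is a minimal C sketch that splits one semicolon-separated record into fields indexed from 0. It is not the parser from ucgendat.c; split_fields() and UCD_FIELDS are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A UnicodeData record has 15 semicolon-separated fields, numbered 0 to 14. */
#define UCD_FIELDS 15

/*
 * Hypothetical helper, not taken from ucgendat.c: split a record in place.
 * Each ';' is overwritten with '\0' and fields[i] is pointed at the start of
 * field i. Empty fields are kept, so the indices match the field numbers
 * used in the text. Returns the number of fields stored (at most max).
 */
static int split_fields(char *line, char *fields[], int max)
{
    int n = 0;
    char *p = line;

    assert(line != NULL && fields != NULL && max > 0);

    while (n < max) {
        fields[n++] = p;
        p = strchr(p, ';');
        if (p == NULL)
            break;          /* last field reached */
        *p++ = '\0';        /* terminate this field, step past the ';' */
    }
    return n;
}
```

Given the record for U+00C0, "00C0;LATIN CAPITAL LETTER A WITH GRAVE;Lu;0;L;0041 0300;;;;N;LATIN CAPITAL LETTER A GRAVE;;;00E0;", field 2 is "Lu", field 3 is "0", field 5 is "0041 0300", field 8 is empty, and field 13 (the lower case mapping) is "00E0".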
michael@0 43
michael@0 44 The decomposition table contains the canonical decompositions only. These are
michael@0 45 the decompositions that do not have tags such as "<compat>" or "<font>".
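The tag test described here can be sketched in a few lines of C. This assumes the decomposition field has already been extracted as a string, and that a tagged decomposition starts with '<'. The function name is_canonical_decomp() is hypothetical and does not come from the package.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not taken from ucgendat.c: return 1 if the
 * decomposition field holds a canonical decomposition, 0 if the field is
 * empty or starts with a tag such as "<compat>" or "<font>".
 */
static int is_canonical_decomp(const char *field)
{
    assert(field != NULL);

    /* Tolerate leading blanks before the first code or tag. */
    while (*field == ' ')
        field++;

    if (*field == '\0')
        return 0;           /* the character has no decomposition */

    return *field != '<';   /* a leading '<' marks a tagged decomposition */
}
```

With this test, "0041 0300" counts as canonical, while "<compat> 0020 0308" and an empty field do not.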
michael@0 46
michael@0 47 The data is almost all stored as unsigned longs (32 bits assumed) and the
michael@0 48 routines that load the data take care of endian swaps when necessary. This
michael@0 49 also means that codes at or above 0x10000 (those that UTF-16 represents with
michael@0 50 surrogate pairs) can be placed in the data files the "ucgendat" program parses.
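An endian swap of 32-bit values can be sketched as follows. This is an illustration only: it uses uint32_t instead of the package's unsigned long, the names swap32() and swap_table() are made up, and how the loader decides that a swap is needed is left to "format.txt" rather than assumed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of one 32-bit value. */
static uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

/*
 * Hypothetical loader step: swap every element of a table that was written
 * on a machine with the opposite byte order.
 */
static void swap_table(uint32_t *table, size_t count)
{
    size_t i;

    assert(table != NULL || count == 0);

    for (i = 0; i < count; i++)
        table[i] = swap32(table[i]);
}
```

Because every element is a full 32-bit value, a code such as 0x1D11E survives the round trip unchanged once the swap is applied on load.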
michael@0 51
michael@0 52 The data is written as external files and broken into five parts so it can be
michael@0 53 selectively updated at runtime if necessary.
michael@0 54
michael@0 55 The data files currently generated by the "ucgendat" program total about 56K
michael@0 56 in size.
michael@0 57
michael@0 58 The format of the binary data files is documented in the "format.txt" file.
michael@0 59
michael@0 60 Mark Leisher <mleisher@crl.nmsu.edu>
michael@0 61 13 December 1998
michael@0 62
michael@0 63 CHANGES
michael@0 64 =======
michael@0 65
michael@0 66 Version 1.9
michael@0 67 -----------
michael@0 68 1. Fixed a problem with an incorrect amount of storage being allocated for the
michael@0 69 combining class nodes.
michael@0 70
michael@0 71 2. Fixed an invalid initialization in the number code.
michael@0 72
michael@0 73 3. Changed the Java template file formatting a bit.
michael@0 74
michael@0 75 4. Added tables and a function for getting decompositions to the Java class.
michael@0 76
michael@0 77 Version 1.8
michael@0 78 -----------
michael@0 79 1. Fixed a problem with adding certain ranges.
michael@0 80
michael@0 81 2. Added two more macros for testing for identifiers.
michael@0 82
michael@0 83 3. Tested with the UnicodeData-2.1.5.txt file.
michael@0 84
michael@0 85 Version 1.7
michael@0 86 -----------
michael@0 87 1. Fixed a problem with looking up decompositions in "ucgendat".
michael@0 88
michael@0 89 Version 1.6
michael@0 90 -----------
michael@0 91 1. Added two new properties introduced with UnicodeData-2.1.4.txt.
michael@0 92
michael@0 93 2. Changed the "ucgendat.c" program a little to automatically align the
michael@0 94 property data on a 4-byte boundary when new properties are added.
michael@0 95
michael@0 96 3. Changed the "ucgendat.c" program to generate only canonical
michael@0 97 decompositions.
michael@0 98
michael@0 99 4. Added two new macros ucisinitialpunct() and ucisfinalpunct() to check for
michael@0 100 initial and final punctuation characters.
michael@0 101
michael@0 102 5. Minor additions and changes to the documentation.
michael@0 103
michael@0 104 Version 1.5
michael@0 105 -----------
michael@0 106 1. Changed all file open calls to include binary mode with "b" for DOS/WIN
michael@0 107 platforms.
michael@0 108
michael@0 109 2. Wrapped the unistd.h include so it won't be included when compiled under
michael@0 110 Win32.
michael@0 111
michael@0 112 3. Fixed a bad range check for hex digits in ucgendat.c.
michael@0 113
michael@0 114 4. Fixed a bad endian swap for combining classes.
michael@0 115
michael@0 116 5. Added code to make a number table and associated lookup functions.
michael@0 117 Functions added are ucnumber(), ucdigit(), and ucgetnumber(). The last
michael@0 118 function is to maintain compatibility with John Cowan's "uctype" package.
michael@0 119
michael@0 120 Version 1.4
michael@0 121 -----------
michael@0 122 1. Fixed a bug with adding a range.
michael@0 123
michael@0 124 2. Fixed a bug with inserting a range in order.
michael@0 125
michael@0 126 3. Fixed incorrectly specified ucisdefined() and ucisundefined() macros.
michael@0 127
michael@0 128 4. Added the missing unload for the combining class data.
michael@0 129
michael@0 130 5. Fixed a bad macro placement in ucisweak().
michael@0 131
michael@0 132 Version 1.3
michael@0 133 -----------
michael@0 134 1. Bug with case mapping calculations fixed.
michael@0 135
michael@0 136 2. Bug with empty character property entries fixed.
michael@0 137
michael@0 138 3. Bug with incorrect type in the combining class lookup fixed.
michael@0 139
michael@0 140 4. Some corrections done to api.txt.
michael@0 141
michael@0 142 5. Bug in certain character property lookups fixed.
michael@0 143
michael@0 144 6. Added a character property table that records the defined characters.
michael@0 145
michael@0 146 7. Replaced ucisunknown() with ucisdefined() and ucisundefined().
michael@0 147
michael@0 148 Version 1.2
michael@0 149 -----------
michael@0 150 1. Added code to ucgendat to generate a combining class table.
michael@0 151
michael@0 152 2. Fixed an endian problem with the byte count of decompositions.
michael@0 153
michael@0 154 3. Fixed some minor problems in the "format.txt" file.
michael@0 155
michael@0 156 4. Removed some bogus "Ss" values from MUTTUCData.txt file.
michael@0 157
michael@0 158 5. Added API function to get combining class.
michael@0 159
michael@0 160 6. Changed the open mode to "rb" so binary data files will be opened correctly
michael@0 161 on DOS/WIN as well as other platforms.
michael@0 162
michael@0 163 7. Added the "api.txt" file.
michael@0 164
michael@0 165 Version 1.1
michael@0 166 -----------
michael@0 167 1. Added ucisxdigit() which I overlooked.
michael@0 168
michael@0 169 2. Added UC_LT to the ucisalpha() macro which I overlooked.
michael@0 170
michael@0 171 3. Changed uciscntrl() to include UC_CF.
michael@0 172
michael@0 173 4. Added ucisocntrl() and ucfntcntrl() macros.
michael@0 174
michael@0 175 5. Added a ucisblank() which I overlooked.
michael@0 176
michael@0 177 6. Added missing properties to ucissymbol() and ucisnumber().
michael@0 178
michael@0 179 7. Added ucisgraph() and ucisprint().
michael@0 180
michael@0 181 8. Changed the "Mr" property to "Sy" to mark this subset of mirroring
michael@0 182 characters as symmetric to avoid trampling the Unicode/ISO10646 sense of
michael@0 183 mirroring.
michael@0 184
michael@0 185 9. Added another property called "Ss" which includes control characters
michael@0 186 traditionally seen as spaces in the isspace() macro.
michael@0 187
michael@0 188 10. Added a bunch of macros to be API compatible with John Cowan's package.
michael@0 189
michael@0 190 ACKNOWLEDGEMENTS
michael@0 191 ================
michael@0 192
michael@0 193 Thanks go to John Cowan <cowan@locke.ccil.org> for pointing out lots of
michael@0 194 missing things and giving me stuff, particularly a bunch of new macros.
michael@0 195
michael@0 196 Thanks go to Bob Verbrugge <bob_verbrugge@nl.compuware.com> for pointing out
michael@0 197 various bugs.
michael@0 198
michael@0 199 Thanks go to Christophe Pierret <cpierret@businessobjects.com> for pointing
michael@0 200 out that file modes need to have "b" for DOS/WIN machines, pointing out
michael@0 201 unistd.h is not a Win32 header, and pointing out a problem with ucisalnum().
michael@0 202
michael@0 203 Thanks go to Kent Johnson <kent@pondview.mv.com> for finding a bug that caused
michael@0 204 incomplete decompositions to be generated by the "ucgendat" program.
michael@0 205
michael@0 206 Thanks go to Valeriy E. Ushakov <uwe@ptc.spbu.ru> for spotting an allocation
michael@0 207 error and an initialization error.
