#
# $Id: UCDATAREADME.txt,v 1.1 1999/01/08 00:19:20 ftang%netscape.com Exp $
#

MUTT UCData Package 1.9
-----------------------

This is a package that supports ctype-like operations for Unicode UCS-2 text
(and surrogates), case mapping, and decomposition lookup. To use it, you will
need to get the "UnicodeData-2.0.14.txt" (or later) file from the Unicode Web
or FTP site.

This package consists of two parts:

  1. A program called "ucgendat" which generates five data files from the
     UnicodeData-2.*.txt file. The files are:

     A. case.dat   - the case mappings.
     B. ctype.dat  - the character property tables.
     C. decomp.dat - the character decompositions.
     D. cmbcl.dat  - the non-zero combining classes.
     E. num.dat    - the codes representing numbers.

  2. The "ucdata.[ch]" files, which implement the functions needed to check
     whether a character matches groups of properties, map between upper,
     lower, and title case, look up the decomposition of a character, look
     up the combining class of a character, and get the numeric value of a
     character.
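
As a rough illustration of what "matching groups of properties" can look
like, here is a minimal sketch. It is not the code in "ucdata.[ch]": every
name starting with "demo_" or "DEMO_" is hypothetical, the three property
bits are invented for the example, and the table covers only a few ASCII
ranges. It assumes properties are kept as sorted code point ranges that are
searched with a binary search.

```c
#include <stddef.h>

/* Hypothetical property bits; the real package defines its own masks. */
#define DEMO_LU 0x01UL  /* uppercase letter */
#define DEMO_LL 0x02UL  /* lowercase letter */
#define DEMO_ND 0x04UL  /* decimal digit */

/* One inclusive range of code points sharing the same property bits. */
struct demo_range {
    unsigned long first;
    unsigned long last;
    unsigned long props;
};

/* Tiny illustrative table, sorted by code point (ASCII only). */
static const struct demo_range demo_table[] = {
    { 0x0030UL, 0x0039UL, DEMO_ND },
    { 0x0041UL, 0x005AUL, DEMO_LU },
    { 0x0061UL, 0x007AUL, DEMO_LL },
};

/* Return nonzero if `code` has at least one of the properties in `mask`. */
static int demo_isprop(unsigned long mask, unsigned long code)
{
    size_t lo = 0;
    size_t hi = sizeof(demo_table) / sizeof(demo_table[0]);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (code < demo_table[mid].first)
            hi = mid;
        else if (code > demo_table[mid].last)
            lo = mid + 1;
        else
            return (demo_table[mid].props & mask) != 0;
    }
    return 0;   /* not in any range: no properties */
}

/* ctype-like wrappers built from groups of properties. */
#define demo_isalpha(cc) demo_isprop(DEMO_LU | DEMO_LL, (cc))
#define demo_isalnum(cc) demo_isprop(DEMO_LU | DEMO_LL | DEMO_ND, (cc))
```

With this table, demo_isalpha(0x0041) is nonzero and demo_isalpha(0x0031) is
zero. Grouping properties in a mask lets one lookup answer a question such
as "letter or digit" without searching once per property.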

A short reference to the functions available is in the "api.txt" file.

Techie Details
==============

The "ucgendat" program parses the files named on the command line, all of
which must be in the Unicode Character Database (UCDB) format. An additional
properties file, "MUTTUCData.txt", provides some extra properties for some
characters.
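
For readers unfamiliar with the UCDB format: each line describes one
character as a list of fields separated by semicolons. The sketch below
shows one way to split such a line. It is an assumption-laden example, not
the parser in "ucgendat"; the names "demo_split_fields" and
"DEMO_MAX_FIELDS" are hypothetical.

```c
#include <stddef.h>

#define DEMO_MAX_FIELDS 15   /* fields 0 through 14 of a UnicodeData line */

/*
 * Split `line` in place at each ';' and store pointers to the fields in
 * `fields`. Empty fields are kept (strtok() would skip them, which would
 * shift the field numbers). Returns the number of fields found.
 */
static size_t demo_split_fields(char *line, char *fields[], size_t max)
{
    size_t n = 0;
    char *p = line;

    if (max == 0)
        return 0;
    fields[n++] = p;
    for (; *p != '\0'; p++) {
        if (*p == '\n' || *p == '\r') {   /* strip the line terminator */
            *p = '\0';
            break;
        }
        if (*p == ';') {
            *p = '\0';                    /* terminate the current field */
            if (n == max)
                break;
            fields[n++] = p + 1;
        }
    }
    return n;
}
```

Given the line for U+0041,
"0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;", the function returns 15
fields, with "Lu" at index 2 and "0061" at index 13.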

The program looks for the two character properties fields (2 and 4), the
combining class field (3), the decomposition field (5), the numeric value
field (8), and the case mapping fields (12, 13, and 14). Field numbers count
from zero, so field 0 is the character code. The decompositions are
recursively expanded before being written out.
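
Recursive expansion means that a decomposition whose parts can themselves be
decomposed is replaced by its fully expanded form. The following is a
minimal sketch of the idea and not the code in "ucgendat". The names
"demo_decomp", "demo_find", and "demo_expand" are hypothetical, and the
two-entry table stands in for the real data.

```c
#include <stddef.h>

/* One raw decomposition entry: a code point and its (up to two) parts. */
struct demo_decomp {
    unsigned long code;
    unsigned long parts[2];
    size_t nparts;
};

/* Small illustrative table with two entries from the Unicode data. */
static const struct demo_decomp demo_decomps[] = {
    { 0x00DCUL, { 0x0055UL, 0x0308UL }, 2 },  /* U WITH DIAERESIS */
    { 0x01D5UL, { 0x00DCUL, 0x0304UL }, 2 },  /* ... AND MACRON */
};

/* Linear search is enough for a two-entry example table. */
static const struct demo_decomp *demo_find(unsigned long code)
{
    size_t i;

    for (i = 0; i < sizeof(demo_decomps) / sizeof(demo_decomps[0]); i++)
        if (demo_decomps[i].code == code)
            return &demo_decomps[i];
    return NULL;
}

/*
 * Append the full expansion of `code` to out[], starting at index `n`.
 * Returns the new length; output beyond `max` entries is dropped.
 */
static size_t demo_expand(unsigned long code, unsigned long out[],
                          size_t n, size_t max)
{
    const struct demo_decomp *d = demo_find(code);
    size_t i;

    if (d == NULL) {              /* no decomposition: emit the code itself */
        if (n < max)
            out[n++] = code;
        return n;
    }
    for (i = 0; i < d->nparts; i++)
        n = demo_expand(d->parts[i], out, n, max);
    return n;
}
```

Expanding U+01D5 this way yields the three codes 0055 0308 0304, because
its first part, U+00DC, is itself expanded to 0055 0308.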

The decomposition table contains all the canonical decompositions, that is,
all decompositions that do not carry a tag such as "<compat>" or "<font>".
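
In terms of the decomposition field, a canonical decomposition is simply a
list of hexadecimal codes with no leading "<...>" tag. A minimal sketch of
that test follows. The names "demo_is_canonical" and "demo_parse_decomp" are
hypothetical, and the code is an illustration rather than what "ucgendat"
does internally.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Return 1 if the decomposition field holds a canonical decomposition,
 * 0 if it is empty or starts with a tag such as "<compat>" or "<font>".
 */
static int demo_is_canonical(const char *field)
{
    while (*field == ' ' || *field == '\t')   /* skip leading blanks */
        field++;
    if (*field == '\0')
        return 0;                             /* no decomposition at all */
    return *field != '<';
}

/*
 * Parse the hex codes of a canonical decomposition into out[].
 * Returns the number of codes stored (0 for tagged or empty fields).
 */
static size_t demo_parse_decomp(const char *field, unsigned long out[],
                                size_t max)
{
    size_t n = 0;
    char *end;

    if (!demo_is_canonical(field))
        return 0;
    while (n < max) {
        unsigned long v = strtoul(field, &end, 16);

        if (end == field)         /* nothing more to convert */
            break;
        out[n++] = v;
        field = end;
    }
    return n;
}
```

For example, the field "0041 030A" (from U+00C5) parses to two codes, while
"<compat> 0020 0308" (from U+00A8) is rejected and yields none.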

The data is almost all stored as unsigned longs (32 bits assumed), and the
routines that load the data take care of endian swaps when necessary. This
also means that characters encoded with surrogates (code points >= 0x10000)
can be placed in the data files the "ucgendat" program parses.
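
The endian handling can be pictured roughly as follows. This is a sketch
under stated assumptions, not the loader in "ucdata.c": it assumes each file
starts with a 16-bit marker that reads back byte-reversed when the file was
written on a machine of the opposite byte order. The names "DEMO_BOM",
"demo_swap32", and "demo_fix_endian" are hypothetical; the real layout is
described in "format.txt".

```c
#include <stddef.h>

#define DEMO_BOM 0xFEFFU   /* hypothetical marker written in the header */

/* Reverse the byte order of the low 32 bits of `v`. */
static unsigned long demo_swap32(unsigned long v)
{
    return ((v & 0x000000FFUL) << 24) | ((v & 0x0000FF00UL) << 8) |
           ((v & 0x00FF0000UL) >> 8)  | ((v & 0xFF000000UL) >> 24);
}

/*
 * Fix up `count` values in place after loading. `marker` is the 16-bit
 * marker exactly as read from the file. Returns 0 on success, or -1 if
 * the marker matches neither byte order.
 */
static int demo_fix_endian(unsigned short marker, unsigned long *data,
                           size_t count)
{
    size_t i;

    if (marker == DEMO_BOM)
        return 0;                 /* same byte order: nothing to do */
    if (marker != 0xFFFEU)
        return -1;                /* not a recognizable data file */
    for (i = 0; i < count; i++)
        data[i] = demo_swap32(data[i]);
    return 0;
}
```

The masks in demo_swap32() keep the result within 32 bits even on platforms
where an unsigned long is wider, which matches the "32 bits assumed" note
above.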

The data is written as external files and broken into five parts so it can be
selectively updated at runtime if necessary.

The data files currently generated by the "ucgendat" program total about 56K
in size.

The format of the binary data files is documented in the "format.txt" file.

Mark Leisher <mleisher@crl.nmsu.edu>
13 December 1998

CHANGES
=======

Version 1.9
-----------
1. Fixed a problem with an incorrect amount of storage being allocated for the
   combining class nodes.

2. Fixed an invalid initialization in the number code.

3. Changed the Java template file formatting a bit.

4. Added tables and a function for getting decompositions in the Java class.

Version 1.8
-----------
1. Fixed a problem with adding certain ranges.

2. Added two more macros for testing for identifiers.

3. Tested with the UnicodeData-2.1.5.txt file.

Version 1.7
-----------
1. Fixed a problem with looking up decompositions in "ucgendat".

Version 1.6
-----------
1. Added two new properties introduced with UnicodeData-2.1.4.txt.

2. Changed the "ucgendat.c" program a little to automatically align the
   property data on a 4-byte boundary when new properties are added.

3. Changed the "ucgendat.c" program to generate only canonical
   decompositions.

4. Added two new macros ucisinitialpunct() and ucisfinalpunct() to check for
   initial and final punctuation characters.

5. Minor additions and changes to the documentation.

Version 1.5
-----------
1. Changed all file open calls to include binary mode with "b" for DOS/WIN
   platforms.

2. Wrapped the unistd.h include so it won't be included when compiled under
   Win32.

3. Fixed a bad range check for hex digits in ucgendat.c.

4. Fixed a bad endian swap for combining classes.

5. Added code to make a number table and associated lookup functions.
   Functions added are ucnumber(), ucdigit(), and ucgetnumber(). The last
   function is to maintain compatibility with John Cowan's "uctype" package.

Version 1.4
-----------
1. Fixed a bug with adding a range.

2. Fixed a bug with inserting a range in order.

3. Fixed incorrectly specified ucisdefined() and ucisundefined() macros.

4. Added the missing unload for the combining class data.

5. Fixed a bad macro placement in ucisweak().

Version 1.3
-----------
1. Bug with case mapping calculations fixed.

2. Bug with empty character property entries fixed.

3. Bug with incorrect type in the combining class lookup fixed.

4. Some corrections done to api.txt.

5. Bug in certain character property lookups fixed.

6. Added a character property table that records the defined characters.

7. Replaced ucisunknown() with ucisdefined() and ucisundefined().

Version 1.2
-----------
1. Added code to ucgendat to generate a combining class table.

2. Fixed an endian problem with the byte count of decompositions.

3. Fixed some minor problems in the "format.txt" file.

4. Removed some bogus "Ss" values from the MUTTUCData.txt file.

5. Added an API function to get the combining class.

6. Changed the open mode to "rb" so binary data files will be opened correctly
   on DOS/WIN as well as other platforms.

7. Added the "api.txt" file.

Version 1.1
-----------
1. Added ucisxdigit() which I overlooked.

2. Added UC_LT to the ucisalpha() macro which I overlooked.

3. Changed uciscntrl() to include UC_CF.

4. Added ucisocntrl() and ucfntcntrl() macros.

5. Added a ucisblank() which I overlooked.

6. Added missing properties to ucissymbol() and ucisnumber().

7. Added ucisgraph() and ucisprint().

8. Changed the "Mr" property to "Sy" to mark this subset of mirroring
   characters as symmetric to avoid trampling the Unicode/ISO10646 sense of
   mirroring.

9. Added another property called "Ss" which includes control characters
   traditionally seen as spaces in the isspace() macro.

10. Added a bunch of macros to be API compatible with John Cowan's package.

ACKNOWLEDGEMENTS
================

Thanks go to John Cowan <cowan@locke.ccil.org> for pointing out lots of
missing things and giving me stuff, particularly a bunch of new macros.

Thanks go to Bob Verbrugge <bob_verbrugge@nl.compuware.com> for pointing out
various bugs.

Thanks go to Christophe Pierret <cpierret@businessobjects.com> for pointing
out that file modes need to have "b" for DOS/WIN machines, pointing out that
unistd.h is not a Win32 header, and pointing out a problem with ucisalnum().

Thanks go to Kent Johnson <kent@pondview.mv.com> for finding a bug that caused
incomplete decompositions to be generated by the "ucgendat" program.

Thanks go to Valeriy E. Ushakov <uwe@ptc.spbu.ru> for spotting an allocation
error and an initialization error.