--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/intl/unicharutil/tools/UCDATAREADME.txt	Wed Dec 31 06:09:35 2014 +0100
@@ -0,0 +1,207 @@
+#
+# $Id: UCDATAREADME.txt,v 1.1 1999/01/08 00:19:20 ftang%netscape.com Exp $
+#
+
+                           MUTT UCData Package 1.9
+                           -----------------------
+
+This is a package that supports ctype-like operations for Unicode UCS-2 text
+(and surrogates), case mapping, and decomposition lookup. To use it, you will
+need to get the "UnicodeData-2.0.14.txt" (or later) file from the Unicode Web
+or FTP site.
+
+This package consists of two parts:
+
+  1. A program called "ucgendat" which generates five data files from the
+     UnicodeData-2.*.txt file. The files are:
+
+     A. case.dat   - the case mappings.
+     B. ctype.dat  - the character property tables.
+     C. decomp.dat - the character decompositions.
+     D. cmbcl.dat  - the non-zero combining classes.
+     E. num.dat    - the codes representing numbers.
+
+  2. The "ucdata.[ch]" files which implement the functions needed to
+     check to see if a character matches groups of properties, to map between
+     upper, lower, and title case, to look up the decomposition of a
+     character, look up the combining class of a character, and get the number
+     value of a character.
+
+A short reference to the functions available is in the "api.txt" file.
+
+Techie Details
+==============
+
+The "ucgendat" program parses files from the command line which are all in the
+Unicode Character Database (UCDB) format. An additional properties file,
+"MUTTUCData.txt", provides some extra properties for some characters.
+
+The program looks for the two character properties fields (2 and 4), the
+combining class field (3), the decomposition field (5), the numeric value
+field (8), and the case mapping fields (12, 13, and 14). The decompositions
+are recursively expanded before being written out.
+
+The decomposition table contains all the canonical decompositions. This means
+all decompositions that do not have tags such as "<compat>" or "<font>".
+
+The data is almost all stored as unsigned longs (32-bits assumed) and the
+routines that load the data take care of endian swaps when necessary. This
+also means that surrogates (>= 0x10000) can be placed in the data files the
+"ucgendat" program parses.
+
+The data is written as external files and broken into five parts so it can be
+selectively updated at runtime if necessary.
+
+The data files currently generated from the "ucgendat" program total about 56K
+in size all together.
+
+The format of the binary data files is documented in the "format.txt" file.
+
+Mark Leisher <mleisher@crl.nmsu.edu>
+13 December 1998
+
+CHANGES
+=======
+
+Version 1.9
+-----------
+1. Fixed a problem with an incorrect amount of storage being allocated for the
+   combining class nodes.
+
+2. Fixed an invalid initialization in the number code.
+
+3. Changed the Java template file formatting a bit.
+
+4. Added tables and function for getting decompositions in the Java class.
+
+Version 1.8
+-----------
+1. Fixed a problem with adding certain ranges.
+
+2. Added two more macros for testing for identifiers.
+
+3. Tested with the UnicodeData-2.1.5.txt file.
+
+Version 1.7
+-----------
+1. Fixed a problem with looking up decompositions in "ucgendat."
+
+Version 1.6
+-----------
+1. Added two new properties introduced with UnicodeData-2.1.4.txt.
+
+2. Changed the "ucgendat.c" program a little to automatically align the
+   property data on a 4-byte boundary when new properties are added.
+
+3. Changed the "ucgendat.c" programs to only generate canonical
+   decompositions.
+
+4. Added two new macros ucisinitialpunct() and ucisfinalpunct() to check for
+   initial and final punctuation characters.
+
+5. Minor additions and changes to the documentation.
+
+Version 1.5
+-----------
+1. Changed all file open calls to include binary mode with "b" for DOS/WIN
+   platforms.
+
+2. Wrapped the unistd.h include so it won't be included when compiled under
+   Win32.
+
+3. Fixed a bad range check for hex digits in ucgendat.c.
+
+4. Fixed a bad endian swap for combining classes.
+
+5. Added code to make a number table and associated lookup functions.
+   Functions added are ucnumber(), ucdigit(), and ucgetnumber(). The last
+   function is to maintain compatibility with John Cowan's "uctype" package.
+
+Version 1.4
+-----------
+1. Fixed a bug with adding a range.
+
+2. Fixed a bug with inserting a range in order.
+
+3. Fixed incorrectly specified ucisdefined() and ucisundefined() macros.
+
+4. Added the missing unload for the combining class data.
+
+5. Fixed a bad macro placement in ucisweak().
+
+Version 1.3
+-----------
+1. Bug with case mapping calculations fixed.
+
+2. Bug with empty character property entries fixed.
+
+3. Bug with incorrect type in the combining class lookup fixed.
+
+4. Some corrections done to api.txt.
+
+5. Bug in certain character property lookups fixed.
+
+6. Added a character property table that records the defined characters.
+
+7. Replaced ucisunknown() with ucisdefined() and ucisundefined().
+
+Version 1.2
+-----------
+1. Added code to ucgendat to generate a combining class table.
+
+2. Fixed an endian problem with the byte count of decompositions.
+
+3. Fixed some minor problems in the "format.txt" file.
+
+4. Removed some bogus "Ss" values from MUTTUCData.txt file.
+
+5. Added API function to get combining class.
+
+6. Changed the open mode to "rb" so binary data files will be opened correctly
+   on DOS/WIN as well as other platforms.
+
+7. Added the "api.txt" file.
+
+Version 1.1
+-----------
+1. Added ucisxdigit() which I overlooked.
+
+2. Added UC_LT to the ucisalpha() macro which I overlooked.
+
+3. Change uciscntrl() to include UC_CF.
+
+4. Added ucisocntrl() and ucfntcntrl() macros.
+
+5. Added a ucisblank() which I overlooked.
+
+6. Added missing properties to ucissymbol() and ucisnumber().
+
+7. Added ucisgraph() and ucisprint().
+
+8. Changed the "Mr" property to "Sy" to mark this subset of mirroring
+   characters as symmetric to avoid trampling the Unicode/ISO10646 sense of
+   mirroring.
+
+9. Added another property called "Ss" which includes control characters
+   traditionally seen as spaces in the isspace() macro.
+
+10. Added a bunch of macros to be API compatible with John Cowan's package.
+
+ACKNOWLEDGEMENTS
+================
+
+Thanks go to John Cowan <cowan@locke.ccil.org> for pointing out lots of
+missing things and giving me stuff, particularly a bunch of new macros.
+
+Thanks go to Bob Verbrugge <bob_verbrugge@nl.compuware.com> for pointing out
+various bugs.
+
+Thanks go to Christophe Pierret <cpierret@businessobjects.com> for pointing
+out that file modes need to have "b" for DOS/WIN machines, pointing out
+unistd.h is not a Win 32 header, and pointing out a problem with ucisalnum().
+
+Thanks go to Kent Johnson <kent@pondview.mv.com> for finding a bug that caused
+incomplete decompositions to be generated by the "ucgendat" program.
+
+Thanks go to Valeriy E. Ushakov <uwe@ptc.spbu.ru> for spotting an allocation
+error and an initialization error.