#
# $Id: UCDATAREADME.txt,v 1.1 1999/01/08 00:19:20 ftang%netscape.com Exp $
#

MUTT UCData Package 1.9
-----------------------

This is a package that supports ctype-like operations for Unicode UCS-2 text
(and surrogates), case mapping, and decomposition lookup.  To use it, you will
need to get the "UnicodeData-2.0.14.txt" (or later) file from the Unicode Web
or FTP site.

This package consists of two parts:

  1. A program called "ucgendat" which generates five data files from the
     UnicodeData-2.*.txt file.  The files are:

     A. case.dat   - the case mappings.
     B. ctype.dat  - the character property tables.
     C. decomp.dat - the character decompositions.
     D. cmbcl.dat  - the non-zero combining classes.
     E. num.dat    - the codes representing numbers.

  2. The "ucdata.[ch]" files which implement the functions needed to
     check to see if a character matches groups of properties, to map between
     upper, lower, and title case, to look up the decomposition of a
     character, look up the combining class of a character, and get the number
     value of a character.

A short reference to the functions available is in the "api.txt" file.

Techie Details
==============

The "ucgendat" program parses files from the command line which are all in the
Unicode Character Database (UCDB) format.  An additional properties file,
"MUTTUCData.txt", provides some extra properties for some characters.
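
Each line of a UnicodeData file is a list of semicolon-separated fields, some
of which may be empty.  The following is only a sketch of how such a line
could be split into fields numbered from zero; it is not taken from
"ucgendat.c", and the name split_fields() is hypothetical.  Note that strtok()
is avoided because it would skip the empty fields.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the package: split a UnicodeData-style
 * line in place on ';', keeping empty fields.  Stores a pointer to each
 * field in fields[] and returns the number of fields stored (at most max).
 * If the line has more than max fields, the last slot holds the unsplit
 * remainder of the line.
 */
int
split_fields(char *line, char *fields[], int max)
{
    int n = 0;
    char *p = line;

    assert(line != NULL && fields != NULL && max > 0);

    while (n < max) {
        fields[n++] = p;          /* current field starts here */
        p = strchr(p, ';');       /* find the end of the field */
        if (p == NULL)
            break;                /* last field on the line */
        *p++ = '\0';              /* terminate it and step past ';' */
    }
    return n;
}
```

Given the line "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;", this
yields 15 fields, with "Lu" in field 2, "0" in field 3, "L" in field 4, an
empty string in field 5, and "0061" in field 13.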

The program looks for the two character properties fields (2 and 4, counting
fields from zero), the combining class field (3), the decomposition field (5),
the numeric value field (8), and the case mapping fields (12, 13, and 14).
The decompositions are recursively expanded before being written out.

The decomposition table contains all the canonical decompositions.  This means
all decompositions that do not have tags such as "<compat>" or "<font>".

The data is almost all stored as unsigned longs (32 bits assumed) and the
routines that load the data take care of endian swaps when necessary.  This
also means that characters encoded with surrogates (code points >= 0x10000)
can be placed in the data files the "ucgendat" program parses.

The data is written as external files and broken into five parts so it can be
selectively updated at runtime if necessary.

The data files currently generated by the "ucgendat" program total about 56K
in size.

The format of the binary data files is documented in the "format.txt" file.

Mark Leisher
13 December 1998

CHANGES
=======

Version 1.9
-----------
1. Fixed a problem with an incorrect amount of storage being allocated for the
   combining class nodes.

2. Fixed an invalid initialization in the number code.

3. Changed the Java template file formatting a bit.

4. Added tables and function for getting decompositions in the Java class.

Version 1.8
-----------
1. Fixed a problem with adding certain ranges.

2. Added two more macros for testing for identifiers.

3. Tested with the UnicodeData-2.1.5.txt file.

Version 1.7
-----------
1. Fixed a problem with looking up decompositions in "ucgendat".

Version 1.6
-----------
1. Added two new properties introduced with UnicodeData-2.1.4.txt.

2. Changed the "ucgendat.c" program a little to automatically align the
   property data on a 4-byte boundary when new properties are added.

3. Changed the "ucgendat.c" program to only generate canonical
   decompositions.

4. Added two new macros ucisinitialpunct() and ucisfinalpunct() to check for
   initial and final punctuation characters.

5. Minor additions and changes to the documentation.

Version 1.5
-----------
1. Changed all file open calls to include binary mode with "b" for DOS/WIN
   platforms.

2. Wrapped the unistd.h include so it won't be included when compiled under
   Win32.

3. Fixed a bad range check for hex digits in ucgendat.c.

4. Fixed a bad endian swap for combining classes.

5. Added code to make a number table and associated lookup functions.
   Functions added are ucnumber(), ucdigit(), and ucgetnumber().  The last
   function is to maintain compatibility with John Cowan's "uctype" package.

Version 1.4
-----------
1. Fixed a bug with adding a range.

2. Fixed a bug with inserting a range in order.

3. Fixed incorrectly specified ucisdefined() and ucisundefined() macros.

4. Added the missing unload for the combining class data.

5. Fixed a bad macro placement in ucisweak().

Version 1.3
-----------
1. Bug with case mapping calculations fixed.

2. Bug with empty character property entries fixed.

3. Bug with incorrect type in the combining class lookup fixed.

4. Some corrections done to api.txt.

5. Bug in certain character property lookups fixed.

6. Added a character property table that records the defined characters.

7. Replaced ucisunknown() with ucisdefined() and ucisundefined().

Version 1.2
-----------
1. Added code to ucgendat to generate a combining class table.

2. Fixed an endian problem with the byte count of decompositions.

3. Fixed some minor problems in the "format.txt" file.

4. Removed some bogus "Ss" values from MUTTUCData.txt file.

5. Added API function to get combining class.

6. Changed the open mode to "rb" so binary data files will be opened correctly
   on DOS/WIN as well as other platforms.

7. Added the "api.txt" file.

Version 1.1
-----------
1. Added ucisxdigit() which I overlooked.

2. Added UC_LT to the ucisalpha() macro which I overlooked.

3. Changed uciscntrl() to include UC_CF.

4. Added ucisocntrl() and ucfntcntrl() macros.

5. Added a ucisblank() which I overlooked.

6. Added missing properties to ucissymbol() and ucisnumber().

7. Added ucisgraph() and ucisprint().

8. Changed the "Mr" property to "Sy" to mark this subset of mirroring
   characters as symmetric to avoid trampling the Unicode/ISO10646 sense of
   mirroring.

9. Added another property called "Ss" which includes control characters
   traditionally seen as spaces in the isspace() macro.

10. Added a bunch of macros to be API compatible with John Cowan's package.

ACKNOWLEDGEMENTS
================

Thanks go to John Cowan for pointing out lots of missing things and giving me
stuff, particularly a bunch of new macros.

Thanks go to Bob Verbrugge for pointing out various bugs.

Thanks go to Christophe Pierret for pointing out that file modes need to have
"b" for DOS/WIN machines, pointing out unistd.h is not a Win32 header, and
pointing out a problem with ucisalnum().

Thanks go to Kent Johnson for finding a bug that caused incomplete
decompositions to be generated by the "ucgendat" program.

Thanks go to Valeriy E. Ushakov for spotting an allocation error and an
initialization error.