#
# $Id: UCDATAREADME.txt,v 1.1 1999/01/08 00:19:20 ftang%netscape.com Exp $
#

MUTT UCData Package 1.9
-----------------------

This is a package that supports ctype-like operations for Unicode UCS-2 text
(and surrogates), case mapping, and decomposition lookup. To use it, you will
need to get the "UnicodeData-2.0.14.txt" (or later) file from the Unicode Web
or FTP site.

This package consists of two parts:

1. A program called "ucgendat" which generates five data files from the
   UnicodeData-2.*.txt file. The files are:

   A. case.dat - the case mappings.
   B. ctype.dat - the character property tables.
   C. decomp.dat - the character decompositions.
   D. cmbcl.dat - the non-zero combining classes.
   E. num.dat - the codes representing numbers.

2. The "ucdata.[ch]" files, which implement the functions needed to check
   whether a character matches groups of properties, to map between upper,
   lower, and title case, to look up the decomposition of a character, to
   look up the combining class of a character, and to get the numeric value
   of a character.

A short reference to the functions available is in the "api.txt" file.
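
The property checks in part 2 can be pictured as bit-mask tests against a
sorted table of code point ranges. What follows is only a minimal sketch of
that idea and not the package's code: the SK_* flags, the table contents, and
the sk_* names are all assumptions made for illustration. The real functions
and macros are the ones listed in "api.txt".

```c
#include <stddef.h>

/* Hypothetical property flags; the real ones are defined by the package. */
#define SK_LU 0x01UL /* uppercase letter */
#define SK_LL 0x02UL /* lowercase letter */
#define SK_ND 0x04UL /* decimal digit */

/* One range of code points that share the same property flags. */
struct sk_range {
    unsigned long first;
    unsigned long last;
    unsigned long flags;
};

/* Tiny illustrative table, sorted and non-overlapping (ASCII only). */
static const struct sk_range sk_ranges[] = {
    { 0x0030UL, 0x0039UL, SK_ND },
    { 0x0041UL, 0x005AUL, SK_LU },
    { 0x0061UL, 0x007AUL, SK_LL },
};

/* Binary search for the range holding code; returns its flags or 0. */
unsigned long sk_flags(unsigned long code)
{
    size_t lo = 0;
    size_t hi = sizeof(sk_ranges) / sizeof(sk_ranges[0]);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (code < sk_ranges[mid].first)
            hi = mid;
        else if (code > sk_ranges[mid].last)
            lo = mid + 1;
        else
            return sk_ranges[mid].flags;
    }
    return 0;
}

/* Nonzero if code has at least one of the properties in mask. */
int sk_isprop(unsigned long code, unsigned long mask)
{
    return (sk_flags(code) & mask) != 0;
}

/* A ctype-like test built by grouping several properties. */
#define sk_isalpha(cc) sk_isprop((cc), SK_LU | SK_LL)
```

Grouping flags in a macro is what lets a single lookup answer a question such
as "is this a letter of any case".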

Techie Details
==============

The "ucgendat" program parses the files named on the command line, all of
which are in the Unicode Character Database (UCDB) format. An additional
properties file, "MUTTUCData.txt", provides some extra properties for some
characters.

The program looks for the two character properties fields (2 and 4), the
combining class field (3), the decomposition field (5), the numeric value
field (8), and the case mapping fields (12, 13, and 14). Fields are counted
from zero, so field 0 is the character code itself. The decompositions are
recursively expanded before being written out.
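
As an illustration of the field numbers above, here is a minimal sketch of
splitting one UnicodeData line on its ';' separators. The field indices match
the ones given in this file when counting from zero. The names ucd_split,
ucd_hex, and the UCD_* constants are made up for this sketch and are not taken
from "ucgendat.c".

```c
#include <stddef.h>
#include <stdlib.h>

/* Zero-based indices of the fields named in this README. */
enum {
    UCD_CODE = 0, UCD_CATEGORY = 2, UCD_COMBINING = 3, UCD_BIDI = 4,
    UCD_DECOMP = 5, UCD_NUMERIC = 8,
    UCD_UPPER = 12, UCD_LOWER = 13, UCD_TITLE = 14,
    UCD_MAX_FIELDS = 15
};

/*
 * Split a line in place on ';'. Empty fields are kept, which strtok()
 * would not do. Returns the number of fields stored in fields[].
 */
size_t ucd_split(char *line, char *fields[], size_t max_fields)
{
    size_t n = 0;
    char *p = line;

    if (max_fields == 0)
        return 0;

    fields[n++] = p;
    while (*p != '\0') {
        if (*p == '\n' || *p == '\r') {
            *p = '\0';              /* drop the line terminator */
            break;
        }
        if (*p == ';') {
            *p = '\0';              /* terminate the current field */
            if (n == max_fields)
                break;
            fields[n++] = p + 1;
        }
        p++;
    }
    return n;
}

/* Parse a hexadecimal code field; an empty field yields 0. */
unsigned long ucd_hex(const char *field)
{
    return strtoul(field, NULL, 16);
}
```

For the line "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;" this yields
15 fields, with "Lu" in field 2, an empty decomposition in field 5, and the
lower case mapping 0061 in field 13.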

The decomposition table contains all the canonical decompositions, that is,
all decompositions that do not have tags such as "<compat>" or "<font>".
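
The canonical test and the recursive expansion can be sketched as follows.
This is an illustration under assumptions, not the code in "ucgendat.c": the
two table entries are real UnicodeData mappings, but the structure, the linear
search, and the names is_canonical, find_raw, and expand are invented here.

```c
#include <stddef.h>

/* One raw canonical mapping as read from the decomposition field. */
struct raw_decomp {
    unsigned long code;
    unsigned long parts[2];
    size_t nparts;
};

/* U+00C2 -> U+0041 U+0302 and U+1EA4 -> U+00C2 U+0301. */
static const struct raw_decomp raw_table[] = {
    { 0x00C2UL, { 0x0041UL, 0x0302UL }, 2 },
    { 0x1EA4UL, { 0x00C2UL, 0x0301UL }, 2 },
};

/* A decomposition field is canonical if it is non-empty and untagged. */
int is_canonical(const char *field)
{
    return field[0] != '\0' && field[0] != '<';
}

/* Linear search is enough for a two-entry illustration. */
static const struct raw_decomp *find_raw(unsigned long code)
{
    size_t i;

    for (i = 0; i < sizeof(raw_table) / sizeof(raw_table[0]); i++)
        if (raw_table[i].code == code)
            return &raw_table[i];
    return NULL;
}

/*
 * Append the full expansion of code to out[], starting at index n.
 * Returns the new count. Nothing is written past cap, so a result
 * larger than cap tells the caller the buffer was too small.
 */
size_t expand(unsigned long code, unsigned long *out, size_t n, size_t cap)
{
    const struct raw_decomp *d = find_raw(code);
    size_t i;

    if (d == NULL) {                /* no mapping: the code stands for itself */
        if (n < cap)
            out[n] = code;
        return n + 1;
    }
    for (i = 0; i < d->nparts; i++)
        n = expand(d->parts[i], out, n, cap);
    return n;
}
```

Expanding U+1EA4 this way gives U+0041 U+0302 U+0301, because the U+00C2 in
its raw mapping is itself decomposed.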

The data is almost all stored as unsigned longs (32 bits assumed) and the
routines that load the data take care of endian swaps when necessary. This
also means that characters encoded with surrogates (code points >= 0x10000)
can be placed in the data files the "ucgendat" program parses.
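
A minimal sketch of the kind of endian handling described above follows. It
assumes that a data file starts with a 16-bit marker written as 0xFEFF in the
generator's byte order, so a reader that sees 0xFFFE knows it must swap. That
marker, and the names swap32, needs_swap, and fix_table, are assumptions for
this example; "format.txt" is the authority on the real layout. The sketch
uses uint32_t so it does not depend on the size of unsigned long.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of one 32-bit value. */
uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
           ((v & 0x00FF0000u) >> 8)  | ((v & 0xFF000000u) >> 24);
}

/*
 * Decide from the marker, as read in host byte order, whether the file
 * needs swapping: 0 = no, 1 = yes, -1 = not a recognized marker.
 */
int needs_swap(uint16_t marker)
{
    if (marker == 0xFEFFu)
        return 0;
    if (marker == 0xFFFEu)
        return 1;
    return -1;
}

/* Swap every entry of a freshly loaded table when the file requires it. */
void fix_table(uint32_t *vals, size_t n, int swap)
{
    size_t i;

    if (!swap)
        return;
    for (i = 0; i < n; i++)
        vals[i] = swap32(vals[i]);
}
```

Because the swap is done once at load time, the lookup routines can treat the
tables as native integers afterwards.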

The data is written as external files and broken into five parts so it can be
selectively updated at runtime if necessary.

The data files currently generated by the "ucgendat" program total about 56K
in size.

The format of the binary data files is documented in the "format.txt" file.

Mark Leisher <mleisher@crl.nmsu.edu>
13 December 1998

CHANGES
=======

Version 1.9
-----------
1. Fixed a problem with an incorrect amount of storage being allocated for the
   combining class nodes.

2. Fixed an invalid initialization in the number code.

3. Changed the Java template file formatting a bit.

4. Added tables and function for getting decompositions in the Java class.

Version 1.8
-----------
1. Fixed a problem with adding certain ranges.

2. Added two more macros for testing for identifiers.

3. Tested with the UnicodeData-2.1.5.txt file.

Version 1.7
-----------
1. Fixed a problem with looking up decompositions in "ucgendat".

Version 1.6
-----------
1. Added two new properties introduced with UnicodeData-2.1.4.txt.

2. Changed the "ucgendat.c" program a little to automatically align the
   property data on a 4-byte boundary when new properties are added.

3. Changed the "ucgendat.c" program to only generate canonical
   decompositions.

4. Added two new macros ucisinitialpunct() and ucisfinalpunct() to check for
   initial and final punctuation characters.

5. Minor additions and changes to the documentation.

Version 1.5
-----------
1. Changed all file open calls to include binary mode with "b" for DOS/WIN
   platforms.

2. Wrapped the unistd.h include so it won't be included when compiled under
   Win32.

3. Fixed a bad range check for hex digits in ucgendat.c.

4. Fixed a bad endian swap for combining classes.

5. Added code to make a number table and associated lookup functions.
   Functions added are ucnumber(), ucdigit(), and ucgetnumber(). The last
   function is to maintain compatibility with John Cowan's "uctype" package.

Version 1.4
-----------
1. Fixed a bug with adding a range.

2. Fixed a bug with inserting a range in order.

3. Fixed incorrectly specified ucisdefined() and ucisundefined() macros.

4. Added the missing unload for the combining class data.

5. Fixed a bad macro placement in ucisweak().

Version 1.3
-----------
1. Bug with case mapping calculations fixed.

2. Bug with empty character property entries fixed.

3. Bug with incorrect type in the combining class lookup fixed.

4. Some corrections done to api.txt.

5. Bug in certain character property lookups fixed.

6. Added a character property table that records the defined characters.

7. Replaced ucisunknown() with ucisdefined() and ucisundefined().

Version 1.2
-----------
1. Added code to ucgendat to generate a combining class table.

2. Fixed an endian problem with the byte count of decompositions.

3. Fixed some minor problems in the "format.txt" file.

4. Removed some bogus "Ss" values from MUTTUCData.txt file.

5. Added API function to get combining class.

6. Changed the open mode to "rb" so binary data files will be opened correctly
   on DOS/WIN as well as other platforms.

7. Added the "api.txt" file.

Version 1.1
-----------
1. Added ucisxdigit() which I overlooked.

2. Added UC_LT to the ucisalpha() macro which I overlooked.

3. Changed uciscntrl() to include UC_CF.

4. Added ucisocntrl() and ucfntcntrl() macros.

5. Added a ucisblank() which I overlooked.

6. Added missing properties to ucissymbol() and ucisnumber().

7. Added ucisgraph() and ucisprint().

8. Changed the "Mr" property to "Sy" to mark this subset of mirroring
   characters as symmetric to avoid trampling the Unicode/ISO10646 sense of
   mirroring.

9. Added another property called "Ss" which includes control characters
   traditionally seen as spaces in the isspace() macro.

10. Added a bunch of macros to be API compatible with John Cowan's package.

ACKNOWLEDGEMENTS
================

Thanks go to John Cowan <cowan@locke.ccil.org> for pointing out lots of
missing things and giving me stuff, particularly a bunch of new macros.

Thanks go to Bob Verbrugge <bob_verbrugge@nl.compuware.com> for pointing out
various bugs.

Thanks go to Christophe Pierret <cpierret@businessobjects.com> for pointing
out that file modes need to have "b" for DOS/WIN machines, pointing out
unistd.h is not a Win32 header, and pointing out a problem with ucisalnum().

Thanks go to Kent Johnson <kent@pondview.mv.com> for finding a bug that caused
incomplete decompositions to be generated by the "ucgendat" program.

Thanks go to Valeriy E. Ushakov <uwe@ptc.spbu.ru> for spotting an allocation
error and an initialization error.