Wed, 31 Dec 2014 06:09:35 +0100
Cloned upstream origin tor-browser at tor-browser-31.3.0esr-4.5-1-build1
revision ID fc1c9ff7c1b2defdbc039f12214767608f46423f for hacking purpose.
#
# $Id: format.txt,v 1.1 1999/01/08 00:19:20 ftang%netscape.com Exp $
#

CHARACTER DATA
==============

This package generates some data files that contain character properties useful
for text processing.

CHARACTER PROPERTIES
====================

The first data file is called "ctype.dat" and contains a compressed form of
the character properties found in the Unicode Character Database (UCDB).
Additional properties can be specified in limited UCDB format in another file
to avoid modifying the original UCDB.

The following is a property name and code table to be used with the character
data:

NAME CODE DESCRIPTION
---------------------
Mn   0    Mark, Non-Spacing
Mc   1    Mark, Spacing Combining
Me   2    Mark, Enclosing
Nd   3    Number, Decimal Digit
Nl   4    Number, Letter
No   5    Number, Other
Zs   6    Separator, Space
Zl   7    Separator, Line
Zp   8    Separator, Paragraph
Cc   9    Other, Control
Cf   10   Other, Format
Cs   11   Other, Surrogate
Co   12   Other, Private Use
Cn   13   Other, Not Assigned
Lu   14   Letter, Uppercase
Ll   15   Letter, Lowercase
Lt   16   Letter, Titlecase
Lm   17   Letter, Modifier
Lo   18   Letter, Other
Pc   19   Punctuation, Connector
Pd   20   Punctuation, Dash
Ps   21   Punctuation, Open
Pe   22   Punctuation, Close
Po   23   Punctuation, Other
Sm   24   Symbol, Math
Sc   25   Symbol, Currency
Sk   26   Symbol, Modifier
So   27   Symbol, Other
L    28   Left-To-Right
R    29   Right-To-Left
EN   30   European Number
ES   31   European Number Separator
ET   32   European Number Terminator
AN   33   Arabic Number
CS   34   Common Number Separator
B    35   Block Separator
S    36   Segment Separator
WS   37   Whitespace
ON   38   Other Neutrals
Pi   47   Punctuation, Initial
Pf   48   Punctuation, Final
#
# Implementation specific properties.
#
Cm   39   Composite
Nb   40   Non-Breaking
Sy   41   Symmetric (characters which are part of open/close pairs)
Hd   42   Hex Digit
Qm   43   Quote Mark
Mr   44   Mirroring
Ss   45   Space, Other (controls viewed as spaces in ctype isspace())
Cp   46   Defined character

The actual binary data is formatted as follows:

Assumptions: unsigned short is at least 16 bits in size and unsigned long
is at least 32 bits in size.

    unsigned short ByteOrderMark
    unsigned short OffsetArraySize
    unsigned long  Bytes
    unsigned short Offsets[OffsetArraySize + 1]
    unsigned long  Ranges[N], N = value of Offsets[OffsetArraySize]

The Bytes field provides the total byte count used for the Offsets[] and
Ranges[] arrays. The Offsets[] array is aligned on a 4-byte boundary and
there is always one extra node on the end to hold the final index of the
Ranges[] array. The Ranges[] array contains pairs of 4-byte values, each
pair representing a range of Unicode characters. The pairs are arranged in
increasing order by the first character code in the range.

Determining if a particular character has a given property requires a
simple binary search to determine if the character is in any of the ranges
for the property.

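The property lookup described above might be sketched in C as follows. This
is only an illustration, not the package's own code: the ctype_table struct
and the has_property() name are hypothetical, the data is assumed to be
already loaded and byte-swapped, Offsets[p] and Offsets[p + 1] are assumed
to delimit the Ranges[] elements of property p, and both ends of a range are
assumed to be inclusive.

```c
#include <stddef.h>

/* Hypothetical in-memory view of ctype.dat (names are illustrative only). */
typedef struct {
    const unsigned short *offsets; /* OffsetArraySize + 1 entries */
    const unsigned long *ranges;   /* flat array of (first, last) pairs */
    size_t num_props;              /* OffsetArraySize */
} ctype_table;

/* Return 1 if code lies in any range listed for property prop, else 0. */
int has_property(const ctype_table *t, size_t prop, unsigned long code)
{
    size_t lo, hi;

    if (prop >= t->num_props)
        return 0;

    /* Assumption: offsets index single Ranges[] elements, two per pair. */
    lo = t->offsets[prop] / 2;
    hi = t->offsets[prop + 1] / 2; /* one past the last pair */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (code < t->ranges[2 * mid])
            hi = mid;              /* code is below this range */
        else if (code > t->ranges[2 * mid + 1])
            lo = mid + 1;          /* code is above this range */
        else
            return 1;              /* first <= code <= last */
    }
    return 0;
}
```

A property whose two offsets are equal has no ranges, so the loop body never
runs and the function reports 0.
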
If the ByteOrderMark is equal to 0xFFFE, then the data was generated on a
machine with a different endian order and the values must be byte-swapped.

To swap a 16-bit value:

    c = (c >> 8) | ((c & 0xff) << 8)

To swap a 32-bit value:

    c = ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
        (((c >> 16) & 0xff) << 8) | (c >> 24)

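As a sketch, the two swaps can be wrapped in small C functions. The function
names are hypothetical. The extra masks are an addition to the expressions
above: they keep the results correct when unsigned short or unsigned long is
wider than 16 or 32 bits, which the stated assumptions allow.

```c
/* Swap the two bytes of a 16-bit value held in an unsigned short. */
unsigned short swap16(unsigned short c)
{
    return (unsigned short)(((c >> 8) & 0xff) | ((c & 0xff) << 8));
}

/* Swap the four bytes of a 32-bit value held in an unsigned long. */
unsigned long swap32(unsigned long c)
{
    return ((c & 0xffUL) << 24) | (((c >> 8) & 0xffUL) << 16) |
           (((c >> 16) & 0xffUL) << 8) | ((c >> 24) & 0xffUL);
}

/* Hypothetical helper: nonzero when the header value calls for swapping. */
int needs_swap(unsigned short byte_order_mark)
{
    return byte_order_mark == 0xFFFE;
}
```

A loader would read ByteOrderMark first and, if needs_swap() reports true,
apply swap16() to every unsigned short field and swap32() to every unsigned
long field that follows.
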
CASE MAPPINGS
=============

The next data file is called "case.dat" and contains three case mapping tables
in the following order: upper, lower, and title case. Each table is in
increasing order by character code and each mapping node contains 3 unsigned
longs which represent the possible mappings.

The format for the binary form of these tables is:

    unsigned short ByteOrderMark
    unsigned short NumMappingNodes, count of all mapping nodes
    unsigned short CaseTableSizes[2], upper and lower mapping node counts
    unsigned long  CaseTables[NumMappingNodes * 3]

The starting indexes of the case tables are calculated as follows:

    UpperIndex = 0;
    LowerIndex = CaseTableSizes[0] * 3;
    TitleIndex = LowerIndex + CaseTableSizes[1] * 3;

The order of the fields for the three tables is:

    Upper case
    ----------
    unsigned long upper;
    unsigned long lower;
    unsigned long title;

    Lower case
    ----------
    unsigned long lower;
    unsigned long upper;
    unsigned long title;

    Title case
    ----------
    unsigned long title;
    unsigned long upper;
    unsigned long lower;

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

Because the tables are in increasing order by character code, locating a
mapping requires a simple binary search on the first of the 3 codes that
make up each node.

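A minimal sketch of such a lookup in C is shown here. The function names are
hypothetical, and the code assumes that the search key is the first value of
each 3-value node (the code in the table's own case) and that the data has
already been loaded and byte-swapped.

```c
#include <stddef.h>

/*
 * Search one case table for code. table points at the first value of the
 * table (CaseTables + LowerIndex, for example) and num_nodes is the number
 * of 3-value nodes it holds. Returns the matching node or NULL.
 */
const unsigned long *find_case_node(const unsigned long *table,
                                    size_t num_nodes, unsigned long code)
{
    size_t lo = 0, hi = num_nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const unsigned long *node = table + mid * 3;

        if (code < node[0])
            hi = mid;
        else if (code > node[0])
            lo = mid + 1;
        else
            return node;
    }
    return NULL;
}

/* Map an upper case code to lower case; codes without a node map to
 * themselves. Upper table nodes are ordered upper, lower, title. */
unsigned long to_lower(const unsigned long *case_tables,
                       const unsigned short sizes[2], unsigned long code)
{
    const size_t upper_index = 0;
    const unsigned long *node =
        find_case_node(case_tables + upper_index, sizes[0], code);

    return node ? node[1] : code;
}
```

The lower and title case tables can be searched the same way by passing
CaseTables + LowerIndex or CaseTables + TitleIndex and reading the field
positions listed for that table.
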
It is important to note that NumMappingNodes is normally a 16-bit count, so
the three tables together can hold at most 65535 mapping nodes, which split
evenly allows 21845 nodes for each case mapping table. An individual table
may hold more or fewer than 21845 nodes, but the total cannot exceed 65535.

DECOMPOSITIONS
==============

The next data file is called "decomp.dat" and contains the decomposition data
for all characters whose decompositions contain more than one character and
are *not* compatibility decompositions. Compatibility decompositions are
signaled in the UCDB format by the use of the <compat> tag in the
decomposition field. Each list of character codes represents a full
decomposition of a composite character. The nodes are arranged in increasing
order by character code.

The format for the binary form of this table is:

    unsigned short ByteOrderMark
    unsigned short NumDecompNodes, count of all decomposition nodes
    unsigned long  Bytes
    unsigned long  DecompNodes[(NumDecompNodes * 2) + 1]
    unsigned long  Decomp[N], N = total length of all decomposition lists

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

The DecompNodes[] array consists of pairs of unsigned longs, the first of
which is the character code and the second of which is the index in Decomp[]
where the list of character codes representing the decomposition begins.
A single extra element after the last pair holds the final index into
Decomp[].

Locating the decomposition of a composite character requires a binary search
for the character code in the DecompNodes[] array; the index stored with it
locates the start of the decomposition in Decomp[]. The length of the
decomposition list is the index stored in the following node of
DecompNodes[] minus the current index.

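The lookup just described could look roughly like this in C. The function
name is hypothetical. The sketch assumes that the one extra element at the
end of DecompNodes[] holds the end index for the last node, so that the
length of every list, including the last one, can be computed by
subtraction.

```c
#include <stddef.h>

/*
 * nodes:  DecompNodes[], (num_nodes * 2) + 1 values
 * decomp: Decomp[]
 * On success, stores the list length in *len and returns a pointer to the
 * first code of the decomposition. Returns NULL if code has no entry.
 */
const unsigned long *find_decomp(const unsigned long *nodes, size_t num_nodes,
                                 const unsigned long *decomp,
                                 unsigned long code, size_t *len)
{
    size_t lo = 0, hi = num_nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        unsigned long key = nodes[2 * mid];

        if (code < key) {
            hi = mid;
        } else if (code > key) {
            lo = mid + 1;
        } else {
            unsigned long start = nodes[2 * mid + 1];
            /* Assumption: the trailing element ends the last list. */
            unsigned long end = (mid + 1 < num_nodes)
                                    ? nodes[2 * mid + 3]
                                    : nodes[2 * num_nodes];

            *len = (size_t)(end - start);
            return decomp + start;
        }
    }
    return NULL;
}
```
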
COMBINING CLASSES
=================

The fourth data file is called "cmbcl.dat" and contains the characters with
non-zero combining classes.

The format for the binary form of this table is:

    unsigned short ByteOrderMark
    unsigned short NumCCLNodes
    unsigned long  Bytes
    unsigned long  CCLNodes[NumCCLNodes * 3]

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

The CCLNodes[] array consists of groups of three unsigned longs. The first
and second are the beginning and ending of a range and the third is the
combining class of that range.

If a character is not found in this table, then the combining class is
assumed to be 0.

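A combining class lookup over this array might be sketched as below. The
function name is hypothetical. The text does not state the ordering of the
groups, so the sketch assumes they are sorted by range start, do not
overlap, and have inclusive end points.

```c
#include <stddef.h>

/*
 * ccl_nodes: CCLNodes[], num_nodes groups of (first, last, class).
 * Returns the combining class of code, or 0 when no range contains it.
 */
unsigned long combining_class(const unsigned long *ccl_nodes,
                              size_t num_nodes, unsigned long code)
{
    size_t lo = 0, hi = num_nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const unsigned long *node = ccl_nodes + mid * 3;

        if (code < node[0])
            hi = mid;          /* code is below this range */
        else if (code > node[1])
            lo = mid + 1;      /* code is above this range */
        else
            return node[2];    /* inside the range: third value is the class */
    }
    return 0;                  /* not listed: combining class 0 */
}
```
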
It is important to note that at most 65535 distinct ranges, each with its
combining class, can be specified because NumCCLNodes is usually a 16-bit
number.

NUMBER TABLE
============

The final data file is called "num.dat" and contains the characters that have
a numeric value associated with them.

The format for the binary form of the table is:

    unsigned short ByteOrderMark
    unsigned short NumNumberNodes
    unsigned long  Bytes
    unsigned long  NumberNodes[NumNumberNodes]
    unsigned short ValueNodes[(Bytes - (NumNumberNodes * sizeof(unsigned long)))
                              / sizeof(short)]

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

The NumberNodes array contains pairs of values, the first of which is the
character code and the second an index into the ValueNodes array. The
ValueNodes array contains pairs of integers which represent the numerator
and denominator of the numeric value of the character. If the character
happens to map to an integer, both values in its ValueNodes pair will be the
same.
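
A numeric value lookup could be sketched in C as follows. The uc_number type
and both function names are hypothetical. The sketch assumes that
NumNumberNodes counts individual unsigned long values (two per pair), that
the pairs are sorted by character code, and that the stored index points at
the numerator element in ValueNodes.

```c
#include <stddef.h>

/* Hypothetical result type for a numeric value. */
typedef struct {
    unsigned short numerator;
    unsigned short denominator;
} uc_number;

/*
 * number_nodes: NumberNodes[], num_number_nodes values in (code, index) pairs
 * value_nodes:  ValueNodes[], (numerator, denominator) pairs
 * Returns 1 and fills *out if code has a numeric value, otherwise 0.
 */
int lookup_number(const unsigned long *number_nodes, size_t num_number_nodes,
                  const unsigned short *value_nodes, unsigned long code,
                  uc_number *out)
{
    size_t lo = 0, hi = num_number_nodes / 2; /* number of pairs */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        unsigned long key = number_nodes[2 * mid];

        if (code < key) {
            hi = mid;
        } else if (code > key) {
            lo = mid + 1;
        } else {
            unsigned long vi = number_nodes[2 * mid + 1];

            out->numerator = value_nodes[vi];
            out->denominator = value_nodes[vi + 1];
            return 1;
        }
    }
    return 0;
}

/* Per the text, an integer is stored with both members of the pair equal. */
int number_is_integer(const uc_number *n)
{
    return n->numerator == n->denominator;
}
```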