intl/unicharutil/tools/format.txt

Sat, 03 Jan 2015 20:18:00 +0100

author
Michael Schloh von Bennewitz <michael@schloh.com>
date
Sat, 03 Jan 2015 20:18:00 +0100
branch
TOR_BUG_3246
changeset 7
129ffea94266
permissions
-rw-r--r--

Conditionally enable double key logic according to:
private browsing mode or privacy.thirdparty.isolate preference and
implement in GetCookieStringCommon and FindCookie where it counts...
With some reservations of how to convince FindCookie users to test
condition and pass a nullptr when disabling double key logic.

michael@0 1 #
michael@0 2 # $Id: format.txt,v 1.1 1999/01/08 00:19:20 ftang%netscape.com Exp $
michael@0 3 #
michael@0 4
michael@0 5 CHARACTER DATA
michael@0 6 ==============
michael@0 7
michael@0 8 This package generates some data files that contain character properties useful
michael@0 9 for text processing.
michael@0 10
michael@0 11 CHARACTER PROPERTIES
michael@0 12 ====================
michael@0 13
michael@0 14 The first data file is called "ctype.dat" and contains a compressed form of
michael@0 15 the character properties found in the Unicode Character Database (UCDB).
michael@0 16 Additional properties can be specified in limited UCDB format in another file
michael@0 17 to avoid modifying the original UCDB.
michael@0 18
michael@0 19 The following is a property name and code table to be used with the character
michael@0 20 data:
michael@0 21
michael@0 22 NAME CODE DESCRIPTION
michael@0 23 ---------------------
michael@0 24 Mn 0 Mark, Non-Spacing
michael@0 25 Mc 1 Mark, Spacing Combining
michael@0 26 Me 2 Mark, Enclosing
michael@0 27 Nd 3 Number, Decimal Digit
michael@0 28 Nl 4 Number, Letter
michael@0 29 No 5 Number, Other
michael@0 30 Zs 6 Separator, Space
michael@0 31 Zl 7 Separator, Line
michael@0 32 Zp 8 Separator, Paragraph
michael@0 33 Cc 9 Other, Control
michael@0 34 Cf 10 Other, Format
michael@0 35 Cs 11 Other, Surrogate
michael@0 36 Co 12 Other, Private Use
michael@0 37 Cn 13 Other, Not Assigned
michael@0 38 Lu 14 Letter, Uppercase
michael@0 39 Ll 15 Letter, Lowercase
michael@0 40 Lt 16 Letter, Titlecase
michael@0 41 Lm 17 Letter, Modifier
michael@0 42 Lo 18 Letter, Other
michael@0 43 Pc 19 Punctuation, Connector
michael@0 44 Pd 20 Punctuation, Dash
michael@0 45 Ps 21 Punctuation, Open
michael@0 46 Pe 22 Punctuation, Close
michael@0 47 Po 23 Punctuation, Other
michael@0 48 Sm 24 Symbol, Math
michael@0 49 Sc 25 Symbol, Currency
michael@0 50 Sk 26 Symbol, Modifier
michael@0 51 So 27 Symbol, Other
michael@0 52 L 28 Left-To-Right
michael@0 53 R 29 Right-To-Left
michael@0 54 EN 30 European Number
michael@0 55 ES 31 European Number Separator
michael@0 56 ET 32 European Number Terminator
michael@0 57 AN 33 Arabic Number
michael@0 58 CS 34 Common Number Separator
michael@0 59 B 35 Block Separator
michael@0 60 S 36 Segment Separator
michael@0 61 WS 37 Whitespace
michael@0 62 ON 38 Other Neutrals
michael@0 63 Pi 47 Punctuation, Initial
michael@0 64 Pf 48 Punctuation, Final
michael@0 65 #
michael@0 66 # Implementation specific properties.
michael@0 67 #
michael@0 68 Cm 39 Composite
michael@0 69 Nb 40 Non-Breaking
michael@0 70 Sy 41 Symmetric (characters which are part of open/close pairs)
michael@0 71 Hd 42 Hex Digit
michael@0 72 Qm 43 Quote Mark
michael@0 73 Mr 44 Mirroring
michael@0 74 Ss 45 Space, Other (controls viewed as spaces in ctype isspace())
michael@0 75 Cp 46 Defined character
michael@0 76
michael@0 77 The actual binary data is formatted as follows:
michael@0 78
michael@0 79 Assumptions: unsigned short is at least 16 bits in size and unsigned long
michael@0 80 is at least 32 bits in size.
michael@0 81
michael@0 82 unsigned short ByteOrderMark
michael@0 83 unsigned short OffsetArraySize
michael@0 84 unsigned long Bytes
michael@0 85 unsigned short Offsets[OffsetArraySize + 1]
michael@0 86 unsigned long Ranges[N], N = value of Offsets[OffsetArraySize]
michael@0 87
michael@0 88 The Bytes field provides the total byte count used for the Offsets[] and
michael@0 89 Ranges[] arrays. The Offsets[] array is aligned on a 4-byte boundary and
michael@0 90 there is always one extra node on the end to hold the final index of the
michael@0 91 Ranges[] array. The Ranges[] array contains pairs of 4-byte values
michael@0 92 representing a range of Unicode characters. The pairs are arranged in
michael@0 93 increasing order by the first character code in the range.
michael@0 94
michael@0 95 Determining if a particular character is in the property list requires a
michael@0 96 simple binary search to determine if a character is in any of the ranges
michael@0 97 for the property.
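
The search described above might be sketched in C as follows. This is only
an illustration, not the package's own code: the function name has_property
is hypothetical, fixed-width types stand in for unsigned short and unsigned
long, and it assumes that Offsets[p] and Offsets[p + 1] delimit the slice of
Ranges[] belonging to property code p, with each slice holding (first, last)
pairs. The arrays are assumed to be already loaded and byte-swapped if
necessary.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Returns 1 if `code` falls inside one of the ranges stored for property
 * `prop`, 0 otherwise.
 * Assumption: offsets[prop] and offsets[prop + 1] are indexes into ranges[]
 * that bracket this property's (first, last) pairs.
 */
static int has_property(const uint16_t *offsets, const uint32_t *ranges,
                        unsigned prop, uint32_t code)
{
    size_t lo = offsets[prop] / 2;      /* index of the first pair */
    size_t hi = offsets[prop + 1] / 2;  /* one past the last pair  */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (code < ranges[2 * mid])
            hi = mid;                   /* below this range */
        else if (code > ranges[2 * mid + 1])
            lo = mid + 1;               /* above this range */
        else
            return 1;                   /* inside the range */
    }
    return 0;
}
```

A caller would pass the property code from the table above, for example 3
for "Nd", together with the character code to test.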
michael@0 98
michael@0 99 If the ByteOrderMark is equal to 0xFFFE, then the data was generated on a
michael@0 100 machine with a different endian order and the values must be byte-swapped.
michael@0 101
michael@0 102 To swap a 16-bit value:
michael@0 103 c = (c >> 8) | ((c & 0xff) << 8)
michael@0 104
michael@0 105 To swap a 32-bit value:
michael@0 106 c = ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
michael@0 107 (((c >> 16) & 0xff) << 8) | (c >> 24)
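
As a minimal sketch, the two swap expressions can be wrapped in small C
helpers. The function names are hypothetical, and the fixed-width types from
<stdint.h> are an assumption made here so that the shifts behave the same
on platforms where unsigned short or unsigned long is wider than the minimum.

```c
#include <stdint.h>

/* Swap the two bytes of a 16-bit value. */
static uint16_t swap16(uint16_t c)
{
    return (uint16_t)((c >> 8) | ((c & 0xff) << 8));
}

/* Reverse the four bytes of a 32-bit value. */
static uint32_t swap32(uint32_t c)
{
    return ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
           (((c >> 16) & 0xff) << 8) | (c >> 24);
}

/* Returns 1 when the mark read from the file says swapping is required. */
static int needs_swap(uint16_t byte_order_mark)
{
    return byte_order_mark == 0xFFFE;
}
```

A loader would read ByteOrderMark first and, if needs_swap() reports 1,
apply swap16() to every unsigned short field and swap32() to every unsigned
long field that follows.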
michael@0 108
michael@0 109 CASE MAPPINGS
michael@0 110 =============
michael@0 111
michael@0 112 The next data file is called "case.dat" and contains three case mapping tables
michael@0 113 in the following order: upper, lower, and title case. Each table is in
michael@0 114 increasing order by character code and each mapping contains 3 unsigned longs
michael@0 115 which represent the possible mappings.
michael@0 116
michael@0 117 The format for the binary form of these tables is:
michael@0 118
michael@0 119 unsigned short ByteOrderMark
michael@0 120 unsigned short NumMappingNodes, count of all mapping nodes
michael@0 121 unsigned short CaseTableSizes[2], upper and lower mapping node counts
michael@0 122 unsigned long CaseTables[NumMappingNodes]
michael@0 123
michael@0 124 The starting indexes of the case tables are calculated as follows:
michael@0 125
michael@0 126 UpperIndex = 0;
michael@0 127 LowerIndex = CaseTableSizes[0] * 3;
michael@0 128 TitleIndex = LowerIndex + CaseTableSizes[1] * 3;
michael@0 129
michael@0 130 The order of the fields for the three tables is:
michael@0 131
michael@0 132 Upper case
michael@0 133 ----------
michael@0 134 unsigned long upper;
michael@0 135 unsigned long lower;
michael@0 136 unsigned long title;
michael@0 137
michael@0 138 Lower case
michael@0 139 ----------
michael@0 140 unsigned long lower;
michael@0 141 unsigned long upper;
michael@0 142 unsigned long title;
michael@0 143
michael@0 144 Title case
michael@0 145 ----------
michael@0 146 unsigned long title;
michael@0 147 unsigned long upper;
michael@0 148 unsigned long lower;
michael@0 149
michael@0 150 If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
michael@0 151 same way as described in the CHARACTER PROPERTIES section.
michael@0 152
michael@0 153 Because the tables are in increasing order by character code, locating a
michael@0 154 mapping requires a simple binary search on one of the 3 codes that make up
michael@0 155 each node.
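
A lookup along these lines could be sketched in C as shown below. The names
find_case_node and to_lower are hypothetical, and the sketch assumes that
the first word of each node is the sort key of its table, so the upper case
table is keyed on the upper case code. It only consults the upper case
table; a fuller version would presumably fall back to the title case table.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Binary search over `count` three-word nodes starting at tables[start].
 * Assumption: the first word of each node is the key the table is sorted by.
 * Returns the matching node, or NULL if `code` has no node in that table.
 */
static const uint32_t *find_case_node(const uint32_t *tables, size_t start,
                                      size_t count, uint32_t code)
{
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const uint32_t *node = tables + start + mid * 3;
        if (code < node[0])
            hi = mid;
        else if (code > node[0])
            lo = mid + 1;
        else
            return node;
    }
    return NULL;
}

/* Map an upper case character to lower case; others are returned as is. */
static uint32_t to_lower(const uint32_t *tables, const uint16_t sizes[2],
                         uint32_t code)
{
    /* UpperIndex is 0 and the upper case table holds sizes[0] nodes. */
    const uint32_t *node = find_case_node(tables, 0, sizes[0], code);
    return node ? node[1] : code;   /* node layout: upper, lower, title */
}
```

The lower case table would be searched the same way with a start of
CaseTableSizes[0] * 3 and a count of CaseTableSizes[1].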
michael@0 156
michael@0 157 It is important to note that there can be at most 65535 mapping nodes in
michael@0 158 total, because NumMappingNodes is usually a 16-bit number. Divided evenly,
michael@0 159 that allows 21845 nodes for each case mapping table. An individual table may
michael@0 160 hold more or fewer than 21845 nodes, as long as the total stays within 65535.
michael@0 161
michael@0 162 DECOMPOSITIONS
michael@0 163 ==============
michael@0 164
michael@0 165 The next data file is called "decomp.dat" and contains the decomposition data
michael@0 166 for all characters whose decompositions contain more than one character and
michael@0 167 are *not* compatibility decompositions. Compatibility decompositions are
michael@0 168 signaled in the UCDB format by the use of the <compat> tag in the
michael@0 169 decomposition field. Each list of character codes represents a full
michael@0 170 decomposition of a composite character. The nodes are arranged in increasing
michael@0 171 order by character code.
michael@0 172
michael@0 173 The format for the binary form of this table is:
michael@0 174
michael@0 175 unsigned short ByteOrderMark
michael@0 176 unsigned short NumDecompNodes, count of all decomposition nodes
michael@0 177 unsigned long Bytes
michael@0 178 unsigned long DecompNodes[(NumDecompNodes * 2) + 1]
michael@0 179 unsigned long Decomp[N], N = sum of all counts in DecompNodes[]
michael@0 180
michael@0 181 If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
michael@0 182 same way as described in the CHARACTER PROPERTIES section.
michael@0 183
michael@0 184 The DecompNodes[] array consists of pairs of unsigned longs, the first of
michael@0 185 which is the character code and the second is the initial index of the list
michael@0 186 of character codes representing the decomposition.
michael@0 187
michael@0 188 Locating the decomposition of a composite character requires a binary search
michael@0 189 for a character code in the DecompNodes[] array and using its index to
michael@0 190 locate the start of the decomposition. The length of the decomposition list
michael@0 191 is the index in the following element in DecompNodes[] minus the current
michael@0 192 index.
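
The lookup just described might look like this in C. The function name
find_decomp is hypothetical, and the sketch assumes that the single extra
element at the end of DecompNodes[] holds the end index for the last node,
since that node has no following pair to read its end index from.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Looks up the decomposition of `code`.
 * nodes:     (code, start index) pairs followed by one trailing end index
 * num_nodes: number of pairs (NumDecompNodes)
 * decomp:    the Decomp[] array
 * On success stores the start of the list in *out and returns its length;
 * otherwise stores NULL and returns 0.
 */
static size_t find_decomp(const uint32_t *nodes, size_t num_nodes,
                          const uint32_t *decomp, uint32_t code,
                          const uint32_t **out)
{
    size_t lo = 0, hi = num_nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        uint32_t key = nodes[2 * mid];
        if (code < key) {
            hi = mid;
        } else if (code > key) {
            lo = mid + 1;
        } else {
            uint32_t start = nodes[2 * mid + 1];
            /* End index: the next pair's index, or the trailing element
               for the last pair (an assumption, see above). */
            uint32_t end = (mid + 1 < num_nodes) ? nodes[2 * mid + 3]
                                                 : nodes[2 * num_nodes];
            *out = decomp + start;
            return (size_t)(end - start);
        }
    }
    *out = NULL;
    return 0;
}
```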
michael@0 193
michael@0 194 COMBINING CLASSES
michael@0 195 =================
michael@0 196
michael@0 197 The fourth data file is called "cmbcl.dat" and contains the characters with
michael@0 198 non-zero combining classes.
michael@0 199
michael@0 200 The format for the binary form of this table is:
michael@0 201
michael@0 202 unsigned short ByteOrderMark
michael@0 203 unsigned short NumCCLNodes
michael@0 204 unsigned long Bytes
michael@0 205 unsigned long CCLNodes[NumCCLNodes * 3]
michael@0 206
michael@0 207 If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
michael@0 208 same way as described in the CHARACTER PROPERTIES section.
michael@0 209
michael@0 210 The CCLNodes[] array consists of groups of three unsigned longs. The first
michael@0 211 and second are the beginning and ending of a range and the third is the
michael@0 212 combining class of that range.
michael@0 213
michael@0 214 If a character is not found in this table, then the combining class is
michael@0 215 assumed to be 0.
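
A combining class lookup over this table could be sketched in C as below.
The function name combining_class is hypothetical, and the sketch assumes
the ranges are stored in increasing order and do not overlap, which this
section does not state explicitly.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Returns the combining class of `code`, or 0 if no range contains it.
 * ccl holds num_nodes groups of (first, last, class).
 * Assumption: the groups are sorted by `first` and do not overlap.
 */
static uint32_t combining_class(const uint32_t *ccl, size_t num_nodes,
                                uint32_t code)
{
    size_t lo = 0, hi = num_nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const uint32_t *node = ccl + mid * 3;
        if (code < node[0])
            hi = mid;
        else if (code > node[1])
            lo = mid + 1;
        else
            return node[2];
    }
    return 0;   /* not listed: combining class 0 */
}
```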
michael@0 216
michael@0 217 It is important to note that at most 65535 distinct ranges plus combining
michael@0 218 class can be specified because NumCCLNodes is usually a 16-bit number.
michael@0 219
michael@0 220 NUMBER TABLE
michael@0 221 ============
michael@0 222
michael@0 223 The final data file is called "num.dat" and contains the characters that have
michael@0 224 a numeric value associated with them.
michael@0 225
michael@0 226 The format for the binary form of the table is:
michael@0 227
michael@0 228 unsigned short ByteOrderMark
michael@0 229 unsigned short NumNumberNodes
michael@0 230 unsigned long Bytes
michael@0 231 unsigned long NumberNodes[NumNumberNodes]
michael@0 232 unsigned short ValueNodes[(Bytes - (NumNumberNodes * sizeof(unsigned long)))
michael@0 233 / sizeof(short)]
michael@0 234
michael@0 235 If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
michael@0 236 same way as described in the CHARACTER PROPERTIES section.
michael@0 237
michael@0 238 The NumberNodes array contains pairs of values, the first of which is the
michael@0 239 character code and the second an index into the ValueNodes array. The
michael@0 240 ValueNodes array contains pairs of integers which represent the numerator
michael@0 241 and denominator of the numeric value of the character. If the character
michael@0 242 happens to map to an integer, both the values in ValueNodes will be the
michael@0 243 same.
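
Reading a numeric value from these two arrays might be sketched in C as
follows. The function name number_lookup is hypothetical. The sketch assumes
that NumNumberNodes counts individual array elements, so there are
NumNumberNodes / 2 pairs, that the pairs are sorted by character code, and
that the stored index is an element index into ValueNodes[]. None of these
points is stated explicitly above.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Looks up the numeric value of `code`.
 * number_nodes:     (code, index into value_nodes) pairs
 * num_number_nodes: number of elements in number_nodes (assumed 2 per pair)
 * value_nodes:      (numerator, denominator) pairs
 * Returns 1 and fills *num and *den on success, 0 if `code` is not listed.
 */
static int number_lookup(const uint32_t *number_nodes,
                         size_t num_number_nodes,
                         const uint16_t *value_nodes, uint32_t code,
                         unsigned *num, unsigned *den)
{
    size_t lo = 0, hi = num_number_nodes / 2;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        uint32_t key = number_nodes[2 * mid];
        if (code < key) {
            hi = mid;
        } else if (code > key) {
            lo = mid + 1;
        } else {
            const uint16_t *v = value_nodes + number_nodes[2 * mid + 1];
            *num = v[0];
            *den = v[1];
            return 1;
        }
    }
    return 0;
}
```

Per the description above, a caller can treat the result as an integer when
the numerator and denominator are equal, and as a fraction otherwise.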
