#
# $Id: format.txt,v 1.1 1999/01/08 00:19:20 ftang%netscape.com Exp $
#

CHARACTER DATA
==============

This package generates some data files that contain character properties useful
for text processing.

CHARACTER PROPERTIES
====================

The first data file is called "ctype.dat" and contains a compressed form of
the character properties found in the Unicode Character Database (UCDB).
Additional properties can be specified in limited UCDB format in another file
to avoid modifying the original UCDB.

The following is a property name and code table to be used with the character
data:

NAME  CODE  DESCRIPTION
---------------------------------------
Mn    0     Mark, Non-Spacing
Mc    1     Mark, Spacing Combining
Me    2     Mark, Enclosing
Nd    3     Number, Decimal Digit
Nl    4     Number, Letter
No    5     Number, Other
Zs    6     Separator, Space
Zl    7     Separator, Line
Zp    8     Separator, Paragraph
Cc    9     Other, Control
Cf    10    Other, Format
Cs    11    Other, Surrogate
Co    12    Other, Private Use
Cn    13    Other, Not Assigned
Lu    14    Letter, Uppercase
Ll    15    Letter, Lowercase
Lt    16    Letter, Titlecase
Lm    17    Letter, Modifier
Lo    18    Letter, Other
Pc    19    Punctuation, Connector
Pd    20    Punctuation, Dash
Ps    21    Punctuation, Open
Pe    22    Punctuation, Close
Po    23    Punctuation, Other
Sm    24    Symbol, Math
Sc    25    Symbol, Currency
Sk    26    Symbol, Modifier
So    27    Symbol, Other
L     28    Left-To-Right
R     29    Right-To-Left
EN    30    European Number
ES    31    European Number Separator
ET    32    European Number Terminator
AN    33    Arabic Number
CS    34    Common Number Separator
B     35    Block Separator
S     36    Segment Separator
WS    37    Whitespace
ON    38    Other Neutrals
Pi    47    Punctuation, Initial
Pf    48    Punctuation, Final
#
# Implementation specific properties.
#
Cm    39    Composite
Nb    40    Non-Breaking
Sy    41    Symmetric (characters which are part of open/close pairs)
Hd    42    Hex Digit
Qm    43    Quote Mark
Mr    44    Mirroring
Ss    45    Space, Other (controls viewed as spaces in ctype isspace())
Cp    46    Defined character

The actual binary data is formatted as follows:

    Assumptions: unsigned short is at least 16 bits in size and unsigned long
                 is at least 32 bits in size.

    unsigned short ByteOrderMark
    unsigned short OffsetArraySize
    unsigned long  Bytes
    unsigned short Offsets[OffsetArraySize + 1]
    unsigned long  Ranges[N], N = value of Offsets[OffsetArraySize]

The Bytes field provides the total byte count used for the Offsets[] and
Ranges[] arrays. The Offsets[] array is aligned on a 4-byte boundary and
there is always one extra entry on the end to hold the final index of the
Ranges[] array. The Ranges[] array contains pairs of 4-byte values, each
pair representing a range of Unicode characters. The pairs are arranged in
increasing order by the first character code in the range.

Determining whether a particular character has a given property requires a
simple binary search over the ranges stored for that property.
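
The search can be sketched in C as follows. This is only an illustration,
not the package's own code: the function name has_property() is made up
here, and the sketch assumes that Offsets[prop] is the index in Ranges[] of
the first value stored for property code prop, that Offsets[prop + 1] is one
past its last value, and that both ends of each range are inclusive.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Returns 1 if `code` lies inside one of the ranges stored for property
 * `prop`, 0 otherwise.
 *
 * Assumptions (not spelled out in this file): Offsets[prop] indexes the
 * first Ranges[] value for `prop`, Offsets[prop + 1] is one past the last
 * value, and each (start, end) pair is inclusive at both ends.
 */
static int has_property(const uint16_t *offsets, const uint32_t *ranges,
                        unsigned prop, uint32_t code)
{
    size_t lo = offsets[prop] / 2;      /* first pair for this property */
    size_t hi = offsets[prop + 1] / 2;  /* one past the last pair */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (code < ranges[2 * mid])
            hi = mid;                   /* code is before this range */
        else if (code > ranges[2 * mid + 1])
            lo = mid + 1;               /* code is after this range */
        else
            return 1;                   /* code is inside [start, end] */
    }
    return 0;
}
```

A property with no ranges simply has equal start and end offsets under these
assumptions, so the loop body never runs and the function returns 0.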

If the ByteOrderMark is equal to 0xFFFE, then the data was generated on a
machine with a different endian order and the values must be byte-swapped.

To swap a 16-bit value:
    c = (c >> 8) | ((c & 0xff) << 8)

To swap a 32-bit value:
    c = ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
        (((c >> 16) & 0xff) << 8) | (c >> 24)
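
As a rough sketch, the two swaps and the ByteOrderMark check might be
combined when reading the fixed part of the "ctype.dat" header. The names
swap16(), swap32(), struct ctype_header and read_ctype_header() are
hypothetical, and the sketch assumes the first 8 bytes of the file are
already in memory and that the fields are exactly 16 and 32 bits wide.

```c
#include <stdint.h>
#include <string.h>

/* Swap the two bytes of a 16-bit value. */
static uint16_t swap16(uint16_t c)
{
    return (uint16_t)((c >> 8) | ((c & 0xff) << 8));
}

/* Reverse the four bytes of a 32-bit value. */
static uint32_t swap32(uint32_t c)
{
    return ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
           (((c >> 16) & 0xff) << 8) | (c >> 24);
}

/* Hypothetical in-memory view of the fixed header fields. */
struct ctype_header {
    uint16_t offset_array_size;  /* OffsetArraySize */
    uint32_t bytes;              /* Bytes */
    int      swapped;            /* 1 if the file needed byte swapping */
};

/*
 * Reads ByteOrderMark, OffsetArraySize and Bytes from the first 8 bytes of
 * `buf` and converts them to host order when the mark reads as 0xFFFE.
 */
static void read_ctype_header(const unsigned char *buf,
                              struct ctype_header *h)
{
    uint16_t bom, size;
    uint32_t bytes;

    memcpy(&bom, buf, sizeof bom);
    memcpy(&size, buf + 2, sizeof size);
    memcpy(&bytes, buf + 4, sizeof bytes);

    h->swapped = (bom == 0xFFFE);
    h->offset_array_size = h->swapped ? swap16(size) : size;
    h->bytes = h->swapped ? swap32(bytes) : bytes;
}
```

When swapped is set, every element of the Offsets[] and Ranges[] arrays
needs the same treatment before it is used.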

CASE MAPPINGS
=============

The next data file is called "case.dat" and contains three case mapping tables
in the following order: upper, lower, and title case. Each table is in
increasing order by character code and each mapping contains 3 unsigned longs
which represent the possible mappings.

The format for the binary form of these tables is:

    unsigned short ByteOrderMark
    unsigned short NumMappingNodes,   count of all mapping nodes
    unsigned short CaseTableSizes[2], upper and lower mapping node counts
    unsigned long  CaseTables[NumMappingNodes * 3]

The starting indexes of the case tables are calculated as follows:

    UpperIndex = 0;
    LowerIndex = CaseTableSizes[0] * 3;
    TitleIndex = LowerIndex + CaseTableSizes[1] * 3;

The order of the fields for the three tables is:

    Upper case
    ----------
    unsigned long upper;
    unsigned long lower;
    unsigned long title;

    Lower case
    ----------
    unsigned long lower;
    unsigned long upper;
    unsigned long title;

    Title case
    ----------
    unsigned long title;
    unsigned long upper;
    unsigned long lower;

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

Because the tables are in increasing order by character code, locating a
mapping requires a simple binary search on one of the 3 codes that make up
each node.
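
A minimal sketch of such a lookup follows. The names enum case_table and
find_case_node() are invented for illustration. The sketch assumes that
every node occupies three consecutive values in CaseTables[], that each
table is sorted by the first value of its nodes, and that the title table
holds whatever nodes remain after the upper and lower tables. It works in
node units, so the element indexes given above are the node indexes
multiplied by 3.

```c
#include <stddef.h>
#include <stdint.h>

enum case_table { CASE_UPPER = 0, CASE_LOWER = 1, CASE_TITLE = 2 };

/*
 * Searches one of the three tables for `code` and returns a pointer to the
 * three values of the matching node, or NULL if there is none.
 *
 * tables    : CaseTables[]
 * num_nodes : NumMappingNodes
 * sizes     : CaseTableSizes[2] (upper and lower node counts)
 *
 * Assumption: each table is sorted by the first value of its nodes.
 */
static const uint32_t *find_case_node(const uint32_t *tables,
                                      uint16_t num_nodes,
                                      const uint16_t sizes[2],
                                      enum case_table which,
                                      uint32_t code)
{
    size_t first, count, lo, hi;   /* all measured in nodes */

    switch (which) {
    case CASE_UPPER: first = 0;        count = sizes[0]; break;
    case CASE_LOWER: first = sizes[0]; count = sizes[1]; break;
    default:
        first = (size_t)sizes[0] + sizes[1];
        count = num_nodes - first;     /* title table takes the rest */
        break;
    }

    lo = first;
    hi = first + count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const uint32_t *node = tables + 3 * mid;

        if (code < node[0])
            hi = mid;
        else if (code > node[0])
            lo = mid + 1;
        else
            return node;
    }
    return NULL;
}
```

For a node found in the upper case table, node[1] is the lower case mapping
and node[2] the title case mapping, following the field order listed above.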

It is important to note that NumMappingNodes is a 16-bit count, so at most
65535 mapping nodes can be stored. Divided evenly among the 3 tables, that
allows 21845 nodes for each case mapping table. An individual table may hold
more or fewer than 21845 nodes, but the total cannot exceed 65535.

DECOMPOSITIONS
==============

The next data file is called "decomp.dat" and contains the decomposition data
for all characters whose decompositions contain more than one character and
are *not* compatibility decompositions. Compatibility decompositions are
signaled in the UCDB format by the use of the <compat> tag in the
decomposition field. Each list of character codes represents a full
decomposition of a composite character. The nodes are arranged in increasing
order by character code.

The format for the binary form of this table is:

    unsigned short ByteOrderMark
    unsigned short NumDecompNodes, count of all decomposition nodes
    unsigned long  Bytes
    unsigned long  DecompNodes[(NumDecompNodes * 2) + 1]
    unsigned long  Decomp[N], N = sum of all counts in DecompNodes[]

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

The DecompNodes[] array consists of pairs of unsigned longs: the first is
the character code and the second is the initial index of the list of
character codes representing the decomposition.

Locating the decomposition of a composite character requires a binary search
for a character code in the DecompNodes[] array and using its index to
locate the start of the decomposition. The length of the decomposition list
is the index stored in the following node of DecompNodes[] minus the current
index.
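
The lookup might be sketched like this. The function name find_decomp() is
hypothetical. This file does not spell out how the extra trailing element of
DecompNodes[] encodes the final index, so the sketch sidesteps the question
by taking the total length of Decomp[] as a separate parameter and using it
as the end of the last node's list.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Looks up `code` in the (code, index) pairs of `nodes` and returns a
 * pointer to its decomposition in `decomp`, storing the number of
 * character codes in *len.  Returns NULL and sets *len to 0 when the
 * character has no entry.
 *
 * num_nodes    : NumDecompNodes (number of pairs)
 * decomp_total : total number of values in Decomp[]; an assumption of this
 *                sketch, used as the end index for the last node
 */
static const uint32_t *find_decomp(const uint32_t *nodes, size_t num_nodes,
                                   const uint32_t *decomp,
                                   size_t decomp_total,
                                   uint32_t code, size_t *len)
{
    size_t lo = 0, hi = num_nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        uint32_t key = nodes[2 * mid];

        if (code < key) {
            hi = mid;
        } else if (code > key) {
            lo = mid + 1;
        } else {
            size_t start = nodes[2 * mid + 1];
            /* End of this list is the start of the next node's list. */
            size_t end = (mid + 1 < num_nodes) ? nodes[2 * (mid + 1) + 1]
                                               : decomp_total;
            *len = end - start;
            return decomp + start;
        }
    }
    *len = 0;
    return NULL;
}
```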

COMBINING CLASSES
=================

The fourth data file is called "cmbcl.dat" and contains the characters with
non-zero combining classes.

The format for the binary form of this table is:

    unsigned short ByteOrderMark
    unsigned short NumCCLNodes
    unsigned long  Bytes
    unsigned long  CCLNodes[NumCCLNodes * 3]

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

The CCLNodes[] array consists of groups of three unsigned longs. The first
and second are the beginning and ending of a range and the third is the
combining class of that range.

If a character is not found in this table, then the combining class is
assumed to be 0.
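
A possible lookup, shown only as a sketch: the name combining_class() is
made up, and the code assumes the ranges are sorted by their first character
code, do not overlap, and are inclusive at both ends. None of this is stated
above for this table.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Returns the combining class of `code`, or 0 when no range in `ccl`
 * contains it.
 *
 * ccl       : CCLNodes[], groups of (first, last, class)
 * num_nodes : NumCCLNodes
 *
 * Assumption: ranges are sorted, non-overlapping and inclusive.
 */
static uint32_t combining_class(const uint32_t *ccl, size_t num_nodes,
                                uint32_t code)
{
    size_t lo = 0, hi = num_nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const uint32_t *node = ccl + 3 * mid;

        if (code < node[0])
            hi = mid;           /* before this range */
        else if (code > node[1])
            lo = mid + 1;       /* after this range */
        else
            return node[2];     /* inside the range: its class */
    }
    return 0;                   /* not listed: class 0 */
}
```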

It is important to note that at most 65535 distinct ranges, each with its
combining class, can be specified because NumCCLNodes is usually a 16-bit
number.

NUMBER TABLE
============

The final data file is called "num.dat" and contains the characters that have
a numeric value associated with them.

The format for the binary form of the table is:

    unsigned short ByteOrderMark
    unsigned short NumNumberNodes
    unsigned long  Bytes
    unsigned long  NumberNodes[NumNumberNodes]
    unsigned short ValueNodes[(Bytes - (NumNumberNodes * sizeof(unsigned long)))
                              / sizeof(short)]

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

The NumberNodes array contains pairs of values, the first of which is the
character code and the second an index into the ValueNodes array. The
ValueNodes array contains pairs of integers which represent the numerator
and denominator of the numeric value of the character. If the character
happens to map to an integer, both the values in ValueNodes will be the
same.
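
The lookup described above could be sketched as follows. The names
struct ucnumber, find_number() and is_integer_value() are hypothetical. The
sketch assumes that NumNumberNodes counts the individual values in
NumberNodes[] (so there are half as many pairs), that the pairs are sorted
by character code, and that the stored index points at the numerator within
ValueNodes[].

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: the stored numerator and denominator. */
struct ucnumber {
    int numerator;
    int denominator;
};

/*
 * Looks up the numeric value of `code`.  Returns 1 and fills *out when the
 * character is listed, 0 otherwise.
 *
 * nodes     : NumberNodes[], pairs of (character code, ValueNodes index)
 * num_elems : NumNumberNodes, assumed to count values, not pairs
 * values    : ValueNodes[], pairs of (numerator, denominator)
 */
static int find_number(const uint32_t *nodes, size_t num_elems,
                       const uint16_t *values, uint32_t code,
                       struct ucnumber *out)
{
    size_t lo = 0, hi = num_elems / 2;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        uint32_t key = nodes[2 * mid];

        if (code < key) {
            hi = mid;
        } else if (code > key) {
            lo = mid + 1;
        } else {
            size_t vi = nodes[2 * mid + 1];
            out->numerator = values[vi];
            out->denominator = values[vi + 1];
            return 1;
        }
    }
    return 0;
}

/* Per the text above, an integer is stored with both values equal. */
static int is_integer_value(const struct ucnumber *n)
{
    return n->numerator == n->denominator;
}
```

A caller would test is_integer_value() first and use the numerator alone as
the value in that case, dividing numerator by denominator otherwise.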