#
# $Id: format.txt,v 1.1 1999/01/08 00:19:20 ftang%netscape.com Exp $
#

CHARACTER DATA
==============

This package generates some data files that contain character properties useful
for text processing.

CHARACTER PROPERTIES
====================

The first data file is called "ctype.dat" and contains a compressed form of
the character properties found in the Unicode Character Database (UCDB).
Additional properties can be specified in limited UCDB format in another file
to avoid modifying the original UCDB.

The following is a property name and code table to be used with the character
data:

NAME  CODE  DESCRIPTION
-----------------------
Mn    0     Mark, Non-Spacing
Mc    1     Mark, Spacing Combining
Me    2     Mark, Enclosing
Nd    3     Number, Decimal Digit
Nl    4     Number, Letter
No    5     Number, Other
Zs    6     Separator, Space
Zl    7     Separator, Line
Zp    8     Separator, Paragraph
Cc    9     Other, Control
Cf    10    Other, Format
Cs    11    Other, Surrogate
Co    12    Other, Private Use
Cn    13    Other, Not Assigned
Lu    14    Letter, Uppercase
Ll    15    Letter, Lowercase
Lt    16    Letter, Titlecase
Lm    17    Letter, Modifier
Lo    18    Letter, Other
Pc    19    Punctuation, Connector
Pd    20    Punctuation, Dash
Ps    21    Punctuation, Open
Pe    22    Punctuation, Close
Po    23    Punctuation, Other
Sm    24    Symbol, Math
Sc    25    Symbol, Currency
Sk    26    Symbol, Modifier
So    27    Symbol, Other
L     28    Left-To-Right
R     29    Right-To-Left
EN    30    European Number
ES    31    European Number Separator
ET    32    European Number Terminator
AN    33    Arabic Number
CS    34    Common Number Separator
B     35    Block Separator
S     36    Segment Separator
WS    37    Whitespace
ON    38    Other Neutrals
Pi    47    Punctuation, Initial
Pf    48    Punctuation, Final
#
# Implementation specific properties.
#
Cm    39    Composite
Nb    40    Non-Breaking
Sy    41    Symmetric (characters which are part of open/close pairs)
Hd    42    Hex Digit
Qm    43    Quote Mark
Mr    44    Mirroring
Ss    45    Space, Other (controls viewed as spaces in ctype isspace())
Cp    46    Defined character
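
As an illustration only, the table above can be carried into C as an array
indexed by property code.  The names prop_names, NUM_PROPS, and prop_code
are hypothetical and are not part of this package; the sketch simply
transcribes the 49 name/code pairs listed above.

```c
#include <stddef.h>
#include <string.h>

/* Property names indexed by code, transcribed from the table above.
 * (Hypothetical helper; not part of the package itself.) */
static const char *const prop_names[] = {
    "Mn", "Mc", "Me", "Nd", "Nl", "No", "Zs", "Zl", "Zp", "Cc",  /*  0-9  */
    "Cf", "Cs", "Co", "Cn", "Lu", "Ll", "Lt", "Lm", "Lo", "Pc",  /* 10-19 */
    "Pd", "Ps", "Pe", "Po", "Sm", "Sc", "Sk", "So", "L",  "R",   /* 20-29 */
    "EN", "ES", "ET", "AN", "CS", "B",  "S",  "WS", "ON", "Cm",  /* 30-39 */
    "Nb", "Sy", "Hd", "Qm", "Mr", "Ss", "Cp", "Pi", "Pf"         /* 40-48 */
};

#define NUM_PROPS (sizeof(prop_names) / sizeof(prop_names[0]))

/* Return the numeric code for a property name, or -1 if it is unknown.
 * The comparison is case-sensitive: "Cs" (11) and "CS" (34) differ. */
static int prop_code(const char *name)
{
    size_t i;

    for (i = 0; i < NUM_PROPS; i++)
        if (strcmp(prop_names[i], name) == 0)
            return (int)i;
    return -1;
}
```

Because several names differ only by letter case, any such lookup has to
compare names exactly.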

The actual binary data is formatted as follows:

    Assumptions: unsigned short is at least 16 bits in size and unsigned long
                 is at least 32 bits in size.

    unsigned short ByteOrderMark
    unsigned short OffsetArraySize
    unsigned long  Bytes
    unsigned short Offsets[OffsetArraySize + 1]
    unsigned long  Ranges[N], N = value of Offsets[OffsetArraySize]

The Bytes field provides the total byte count used for the Offsets[] and
Ranges[] arrays. The Offsets[] array is aligned on a 4-byte boundary and
there is always one extra node on the end to hold the final index of the
Ranges[] array. The Ranges[] array contains pairs of 4-byte values, each
pair representing a range of Unicode characters. The pairs are arranged in
increasing order by the first character code in the range.

Determining whether a character has a particular property requires a simple
binary search over the ranges for that property.
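
The search described above might be sketched as follows.  This is a minimal
illustration, not the package's own code: the function name prop_lookup is
hypothetical, and it assumes that the ranges for property p occupy
Ranges[Offsets[p]] up to, but not including, Ranges[Offsets[p + 1]], that
offsets count individual values rather than pairs, and that both arrays have
already been byte-swapped if necessary.

```c
#include <stddef.h>

/* Return 1 if `code` lies in one of the ranges for property `prop`,
 * otherwise 0.  ranges[] holds (first, last) pairs, both inclusive.
 * Assumption: offsets[prop] and offsets[prop + 1] delimit the slice of
 * ranges[] that belongs to `prop`. */
static int prop_lookup(const unsigned short *offsets,
                       const unsigned long *ranges,
                       unsigned short prop, unsigned long code)
{
    /* Work in units of pairs: pair i is ranges[2*i], ranges[2*i + 1]. */
    size_t lo = offsets[prop] / 2;
    size_t hi = offsets[prop + 1] / 2;   /* one past the last pair */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (code < ranges[2 * mid])
            hi = mid;                    /* code is below this range */
        else if (code > ranges[2 * mid + 1])
            lo = mid + 1;                /* code is above this range */
        else
            return 1;                    /* first <= code <= last */
    }
    return 0;
}
```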

If the ByteOrderMark is equal to 0xFFFE, then the data was generated on a
machine with a different endian order and the values must be byte-swapped.

To swap a 16-bit value:
    c = (c >> 8) | ((c & 0xff) << 8)

To swap a 32-bit value:
    c = ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
        (((c >> 16) & 0xff) << 8) | (c >> 24)
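
The two swap expressions can be wrapped in small C functions.  The sketch
below uses the fixed-width types from <stdint.h>, which is an assumption made
here so that the shifts cannot carry stray high bits; the text itself only
requires unsigned short and unsigned long to be "at least" 16 and 32 bits.
The function names are hypothetical.

```c
#include <stdint.h>

/* Swap the two bytes of a 16-bit value. */
static uint16_t swap16(uint16_t c)
{
    return (uint16_t)((c >> 8) | ((c & 0xff) << 8));
}

/* Reverse the four bytes of a 32-bit value. */
static uint32_t swap32(uint32_t c)
{
    return ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
           (((c >> 16) & 0xff) << 8) | (c >> 24);
}

/* Return 1 if the ByteOrderMark read from a header indicates that the
 * rest of the file must be byte-swapped. */
static int needs_swap(uint16_t byte_order_mark)
{
    return byte_order_mark == 0xFFFE;
}
```

A reader would test the first header field with needs_swap() and, if it
returns 1, pass every remaining 16-bit field through swap16() and every
32-bit field through swap32() before use.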

CASE MAPPINGS
=============

The next data file is called "case.dat" and contains three case mapping tables
in the following order: upper, lower, and title case. Each table is in
increasing order by character code and each mapping contains 3 unsigned longs
which represent the possible mappings.

The format for the binary form of these tables is:

    unsigned short ByteOrderMark
    unsigned short NumMappingNodes,   count of all mapping nodes
    unsigned short CaseTableSizes[2], upper and lower mapping node counts
    unsigned long  CaseTables[NumMappingNodes * 3]

The starting indexes of the case tables are calculated as follows:

    UpperIndex = 0;
    LowerIndex = CaseTableSizes[0] * 3;
    TitleIndex = LowerIndex + CaseTableSizes[1] * 3;

The order of the fields for the three tables is:

    Upper case
    ----------
    unsigned long upper;
    unsigned long lower;
    unsigned long title;

    Lower case
    ----------
    unsigned long lower;
    unsigned long upper;
    unsigned long title;

    Title case
    ----------
    unsigned long title;
    unsigned long upper;
    unsigned long lower;

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

Because the tables are in increasing order by character code, locating a
mapping requires a simple binary search on one of the 3 codes that make up
each node.
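
A lookup over one of the three tables might look like the sketch below.  It
is an illustration under stated assumptions rather than the package's own
code: the name case_lookup is hypothetical, the first field of each node is
assumed to be the sort key (upper in the upper case table, lower in the lower
case table, title in the title case table), and a character with no node is
assumed to map to itself.

```c
#include <stddef.h>

/* Search one case table for `code` and return the value stored in
 * `field` (1 or 2) of the matching node, or `code` itself if no node
 * matches.  `start` is UpperIndex, LowerIndex, or TitleIndex and `count`
 * is the number of 3-value nodes in that table.
 * Assumption: field 0 of every node is the key the table is sorted by. */
static unsigned long case_lookup(const unsigned long *tables, size_t start,
                                 size_t count, unsigned long code, int field)
{
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const unsigned long *node = tables + start + mid * 3;

        if (code < node[0])
            hi = mid;
        else if (code > node[0])
            lo = mid + 1;
        else
            return node[field];
    }
    return code;
}
```

For example, under these assumptions the upper case form of a lower case
character would be case_lookup(CaseTables, LowerIndex, CaseTableSizes[1],
code, 1).  The title case table has no entry in CaseTableSizes[], so its
node count would presumably be NumMappingNodes minus the other two counts.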

It is important to note that NumMappingNodes is stored as an unsigned short,
which is only guaranteed to hold 16 bits, so there can be at most 65535
mapping nodes in total. Divided evenly into 3 portions, that allows 21845
nodes for each case mapping table. An individual table may hold more or
fewer than 21845 nodes, as long as the total does not exceed 65535.

DECOMPOSITIONS
==============

The next data file is called "decomp.dat" and contains the decomposition data
for all characters whose decompositions contain more than one character and
are *not* compatibility decompositions. Compatibility decompositions are
signaled in the UCDB format by the use of the <compat> tag in the
decomposition field. Each list of character codes represents a full
decomposition of a composite character. The nodes are arranged in increasing
order by character code.

The format for the binary form of this table is:

    unsigned short ByteOrderMark
    unsigned short NumDecompNodes, count of all decomposition nodes
    unsigned long  Bytes
    unsigned long  DecompNodes[(NumDecompNodes * 2) + 1]
    unsigned long  Decomp[N], N = sum of all counts in DecompNodes[]

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

The DecompNodes[] array consists of pairs of unsigned longs, the first of
which is the character code and the second is the initial index of the list
of character codes representing the decomposition.

Locating the decomposition of a composite character requires a binary search
for a character code in the DecompNodes[] array and using its index to
locate the start of the decomposition. The length of the decomposition list
is the index in the following element of DecompNodes[] minus the current
index.
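
The lookup and length calculation described above might be sketched as
follows.  The function name decomp_lookup is hypothetical, and the sketch
assumes that the one extra value at the end of DecompNodes[] holds the end
index of the last decomposition list, so that the length rule also works for
the final node.

```c
#include <stddef.h>

/* Look up the decomposition of `code`.
 *   nodes[]  : DecompNodes[], (code, start index) pairs followed by one
 *              trailing value assumed to be the end index of the last list.
 *   decomp[] : Decomp[], the concatenated decomposition lists.
 * On success, store the list length in *len and return a pointer to the
 * first code of the list.  Return NULL if `code` has no node. */
static const unsigned long *decomp_lookup(const unsigned long *nodes,
                                          size_t num_nodes,
                                          const unsigned long *decomp,
                                          unsigned long code, size_t *len)
{
    size_t lo = 0, hi = num_nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (code < nodes[2 * mid]) {
            hi = mid;
        } else if (code > nodes[2 * mid]) {
            lo = mid + 1;
        } else {
            size_t start = nodes[2 * mid + 1];
            /* Start index of the following node; for the last node this
             * is the trailing value at nodes[2 * num_nodes]. */
            size_t end = (mid + 1 < num_nodes) ? nodes[2 * mid + 3]
                                               : nodes[2 * num_nodes];
            *len = end - start;
            return decomp + start;
        }
    }
    return NULL;
}
```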

COMBINING CLASSES
=================

The fourth data file is called "cmbcl.dat" and contains the characters with
non-zero combining classes.

The format for the binary form of this table is:

    unsigned short ByteOrderMark
    unsigned short NumCCLNodes
    unsigned long  Bytes
    unsigned long  CCLNodes[NumCCLNodes * 3]

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

The CCLNodes[] array consists of groups of three unsigned longs. The first
and second are the beginning and ending of a range and the third is the
combining class of that range.

If a character is not found in this table, then the combining class is
assumed to be 0.
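
A combining class lookup over these triples might be sketched as follows.
The name combining_class is hypothetical, and the sketch assumes that the
triples are sorted by the beginning of each range and do not overlap, which
the text above does not state explicitly.

```c
#include <stddef.h>

/* Return the combining class of `code`, or 0 if no range contains it.
 * nodes[] is CCLNodes[]: (first, last, class) triples, with both ends of
 * the range inclusive.
 * Assumption: triples are sorted by `first` and do not overlap. */
static unsigned long combining_class(const unsigned long *nodes,
                                     size_t num_nodes, unsigned long code)
{
    size_t lo = 0, hi = num_nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const unsigned long *node = nodes + mid * 3;

        if (code < node[0])
            hi = mid;
        else if (code > node[1])
            lo = mid + 1;
        else
            return node[2];
    }
    return 0;   /* not in the table: combining class 0 */
}
```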

It is important to note that at most 65535 distinct ranges, each with its
combining class, can be specified because NumCCLNodes is usually a 16-bit
number.

NUMBER TABLE
============

The final data file is called "num.dat" and contains the characters that have
a numeric value associated with them.

The format for the binary form of the table is:

    unsigned short ByteOrderMark
    unsigned short NumNumberNodes
    unsigned long  Bytes
    unsigned long  NumberNodes[NumNumberNodes]
    unsigned short ValueNodes[(Bytes - (NumNumberNodes * sizeof(unsigned long)))
                              / sizeof(short)]

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

The NumberNodes[] array contains pairs of values, the first of which is the
character code and the second an index into the ValueNodes[] array. The
ValueNodes[] array contains pairs of integers which represent the numerator
and denominator of the numeric value of the character. If the character
happens to map to an integer, both values in its ValueNodes[] pair will be
the same.
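
A numeric value lookup might be sketched as follows.  Several points are
assumptions made for this illustration: NumNumberNodes is read as the total
number of values in NumberNodes[] (two per character), the pairs are assumed
to be sorted by character code, and the stored index is assumed to count
individual ValueNodes[] entries.  The names number_value and number_lookup
are hypothetical.

```c
#include <stddef.h>

/* Hypothetical result type: the value is numerator / denominator. */
struct number_value {
    unsigned short numerator;
    unsigned short denominator;
};

/* Look up the numeric value of `code`.
 *   nodes[]   : NumberNodes[], (code, value index) pairs.
 *   num_longs : NumNumberNodes, assumed to count values, not pairs.
 *   vals[]    : ValueNodes[], (numerator, denominator) pairs.
 * Return 1 and fill *out if `code` is found, otherwise return 0. */
static int number_lookup(const unsigned long *nodes, size_t num_longs,
                         const unsigned short *vals, unsigned long code,
                         struct number_value *out)
{
    size_t lo = 0, hi = num_longs / 2;   /* search in units of pairs */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (code < nodes[2 * mid]) {
            hi = mid;
        } else if (code > nodes[2 * mid]) {
            lo = mid + 1;
        } else {
            size_t vi = nodes[2 * mid + 1];
            out->numerator = vals[vi];
            out->denominator = vals[vi + 1];
            return 1;
        }
    }
    return 0;
}
```

Following the rule stated above, a caller can treat the result as an integer
whenever numerator and denominator are equal.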