--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/intl/unicharutil/tools/format.txt Wed Dec 31 06:09:35 2014 +0100
@@ -0,0 +1,243 @@
+#
+# $Id: format.txt,v 1.1 1999/01/08 00:19:20 ftang%netscape.com Exp $
+#
+
+CHARACTER DATA
+==============
+
+This package generates some data files that contain character properties useful
+for text processing.
+
+CHARACTER PROPERTIES
+====================
+
+The first data file is called "ctype.dat" and contains a compressed form of
+the character properties found in the Unicode Character Database (UCDB).
+Additional properties can be specified in limited UCDB format in another file
+to avoid modifying the original UCDB.
+
+The following is a property name and code table to be used with the character
+data:
+
+NAME CODE DESCRIPTION
+---------------------
+Mn   0    Mark, Non-Spacing
+Mc   1    Mark, Spacing Combining
+Me   2    Mark, Enclosing
+Nd   3    Number, Decimal Digit
+Nl   4    Number, Letter
+No   5    Number, Other
+Zs   6    Separator, Space
+Zl   7    Separator, Line
+Zp   8    Separator, Paragraph
+Cc   9    Other, Control
+Cf   10   Other, Format
+Cs   11   Other, Surrogate
+Co   12   Other, Private Use
+Cn   13   Other, Not Assigned
+Lu   14   Letter, Uppercase
+Ll   15   Letter, Lowercase
+Lt   16   Letter, Titlecase
+Lm   17   Letter, Modifier
+Lo   18   Letter, Other
+Pc   19   Punctuation, Connector
+Pd   20   Punctuation, Dash
+Ps   21   Punctuation, Open
+Pe   22   Punctuation, Close
+Po   23   Punctuation, Other
+Sm   24   Symbol, Math
+Sc   25   Symbol, Currency
+Sk   26   Symbol, Modifier
+So   27   Symbol, Other
+L    28   Left-To-Right
+R    29   Right-To-Left
+EN   30   European Number
+ES   31   European Number Separator
+ET   32   European Number Terminator
+AN   33   Arabic Number
+CS   34   Common Number Separator
+B    35   Block Separator
+S    36   Segment Separator
+WS   37   Whitespace
+ON   38   Other Neutrals
+Pi   47   Punctuation, Initial
+Pf   48   Punctuation, Final
+#
+# Implementation specific properties.
+#
+Cm   39   Composite
+Nb   40   Non-Breaking
+Sy   41   Symmetric (characters which are part of open/close pairs)
+Hd   42   Hex Digit
+Qm   43   Quote Mark
+Mr   44   Mirroring
+Ss   45   Space, Other (controls viewed as spaces in ctype isspace())
+Cp   46   Defined character
+
+The actual binary data is formatted as follows:
+
+  Assumptions: unsigned short is at least 16-bits in size and unsigned long
+               is at least 32-bits in size.
+
+    unsigned short ByteOrderMark
+    unsigned short OffsetArraySize
+    unsigned long  Bytes
+    unsigned short Offsets[OffsetArraySize + 1]
+    unsigned long  Ranges[N], N = value of Offsets[OffsetArraySize]
+
+  The Bytes field provides the total byte count used for the Offsets[] and
+  Ranges[] arrays.  The Offsets[] array is aligned on a 4-byte boundary and
+  there is always one extra node on the end to hold the final index of the
+  Ranges[] array.  The Ranges[] array contains pairs of 4-byte values
+  representing a range of Unicode characters.  The pairs are arranged in
+  increasing order by the first character code in the range.
+
+  Determining if a particular character is in the property list requires a
+  simple binary search to determine if a character is in any of the ranges
+  for the property.
+
+  If the ByteOrderMark is equal to 0xFFFE, then the data was generated on a
+  machine with a different endian order and the values must be byte-swapped.
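The range search described above for "ctype.dat" could be sketched roughly
as follows. This is only an illustration, not the package's own code: the
helper name in_property_ranges is hypothetical, uint32_t stands in for the
4-byte values of Ranges[], and lo/hi are assumed to be element indexes
taken from Offsets[prop] and Offsets[prop + 1].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if `code` lies inside one of the
 * [start, end] pairs stored in ranges[lo .. hi), otherwise 0.
 * Assumption: lo and hi are element indexes into ranges[] (for example
 * Offsets[prop] and Offsets[prop + 1]) and are therefore both even. */
static int in_property_ranges(const uint32_t *ranges, size_t lo, size_t hi,
                              uint32_t code)
{
    size_t left, right;

    assert(lo % 2 == 0 && hi % 2 == 0 && lo <= hi);

    left = lo / 2;   /* index of the first pair */
    right = hi / 2;  /* one past the last pair  */

    while (left < right) {
        size_t mid = left + (right - left) / 2;
        uint32_t start = ranges[mid * 2];
        uint32_t end = ranges[mid * 2 + 1];

        if (code < start)
            right = mid;        /* look in the lower half */
        else if (code > end)
            left = mid + 1;     /* look in the upper half */
        else
            return 1;           /* start <= code <= end   */
    }
    return 0;
}
```

Because the pairs are sorted by their first code, comparing against both
ends of the middle pair is enough to decide which half to keep.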
+
+  To swap a 16-bit value:
+     c = (c >> 8) | ((c & 0xff) << 8)
+
+  To swap a 32-bit value:
+     c = ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
+         (((c >> 16) & 0xff) << 8) | (c >> 24)
+
+CASE MAPPINGS
+=============
+
+The next data file is called "case.dat" and contains three case mapping tables
+in the following order: upper, lower, and title case.  Each table is in
+increasing order by character code and each mapping contains 3 unsigned longs
+which represent the possible mappings.
+
+The format for the binary form of these tables is:
+
+  unsigned short ByteOrderMark
+  unsigned short NumMappingNodes, count of all mapping nodes
+  unsigned short CaseTableSizes[2], upper and lower mapping node counts
+  unsigned long  CaseTables[NumMappingNodes]
+
+  The starting indexes of the case tables are calculated as follows:
+
+    UpperIndex = 0;
+    LowerIndex = CaseTableSizes[0] * 3;
+    TitleIndex = LowerIndex + CaseTableSizes[1] * 3;
+
+  The order of the fields for the three tables is:
+
+    Upper case
+    ----------
+    unsigned long upper;
+    unsigned long lower;
+    unsigned long title;
+
+    Lower case
+    ----------
+    unsigned long lower;
+    unsigned long upper;
+    unsigned long title;
+
+    Title case
+    ----------
+    unsigned long title;
+    unsigned long upper;
+    unsigned long lower;
+
+  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
+  same way as described in the CHARACTER PROPERTIES section.
+
+  Because the tables are in increasing order by character code, locating a
+  mapping requires a simple binary search on one of the 3 codes that make up
+  each node.
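As a rough illustration of the swap formulas and the case table layout
described above, a lookup might be sketched like this. The names swap16,
swap32, find_case_node and to_upper are hypothetical, uint32_t and
uint16_t stand in for the unsigned long and unsigned short fields, and
the search key is assumed to be the first field of each 3-field node.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte-swap helpers that follow the formulas given in the text. */
static uint16_t swap16(uint16_t c)
{
    return (uint16_t)((c >> 8) | ((c & 0xff) << 8));
}

static uint32_t swap32(uint32_t c)
{
    return ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
           (((c >> 16) & 0xff) << 8) | (c >> 24);
}

/* Hypothetical helper: binary search of one case table.  `table` points
 * at the first node of that table and `count` is its number of 3-field
 * nodes.  Returns the matching node, or NULL if `code` has no mapping. */
static const uint32_t *find_case_node(const uint32_t *table, size_t count,
                                      uint32_t code)
{
    size_t left = 0, right = count;

    assert(table != NULL);
    while (left < right) {
        size_t mid = left + (right - left) / 2;
        const uint32_t *node = table + mid * 3;

        if (code < node[0])
            right = mid;
        else if (code > node[0])
            left = mid + 1;
        else
            return node;
    }
    return NULL;
}

/* Hypothetical helper: upper case form of `code`, found through the
 * lower case table (fields: lower, upper, title).  Characters without a
 * mapping are returned unchanged. */
static uint32_t to_upper(const uint32_t *case_tables,
                         const uint16_t sizes[2], uint32_t code)
{
    size_t lower_index = (size_t)sizes[0] * 3;   /* LowerIndex */
    const uint32_t *node =
        find_case_node(case_tables + lower_index, sizes[1], code);

    return node != NULL ? node[1] : code;
}
```

A loader would apply swap16 and swap32 to every header field and table
entry once, right after reading, whenever ByteOrderMark reads as 0xFFFE.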
+
+  It is important to note that there can only be 65536 mapping nodes, which,
+  divided into 3 portions, allows 21845 nodes for each case mapping table.  The
+  distribution of mappings may be more or less than 21845 per table, but only
+  65536 are allowed.
+
+DECOMPOSITIONS
+==============
+
+The next data file is called "decomp.dat" and contains the decomposition data
+for all characters whose decompositions contain more than one character and
+are *not* compatibility decompositions.  Compatibility decompositions are
+signaled in the UCDB format by the use of the <compat> tag in the
+decomposition field.  Each list of character codes represents a full
+decomposition of a composite character.  The nodes are arranged in increasing
+order by character code.
+
+The format for the binary form of this table is:
+
+  unsigned short ByteOrderMark
+  unsigned short NumDecompNodes, count of all decomposition nodes
+  unsigned long  Bytes
+  unsigned long  DecompNodes[(NumDecompNodes * 2) + 1]
+  unsigned long  Decomp[N], N = sum of all counts in DecompNodes[]
+
+  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
+  same way as described in the CHARACTER PROPERTIES section.
+
+  The DecompNodes[] array consists of pairs of unsigned longs, the first of
+  which is the character code and the second is the initial index of the list
+  of character codes representing the decomposition.
+
+  Locating the decomposition of a composite character requires a binary search
+  for a character code in the DecompNodes[] array and using its index to
+  locate the start of the decomposition.  The length of the decomposition list
+  is the index in the following element in DecompNodes[] minus the current
+  index.
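The decomposition lookup described above might be sketched as follows.
This is a minimal illustration under stated assumptions, not the
package's implementation: find_decomp is a hypothetical name, uint32_t
stands in for unsigned long, and the single extra element at the end of
DecompNodes[] is assumed to hold the end index of the last list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: find the decomposition of `code`.
 * nodes[]  holds n (code, start index) pairs plus one trailing element,
 *          assumed here to be the end index of the final list.
 * decomp[] holds the concatenated decomposition lists.
 * On success, returns a pointer to the list and stores its length in
 * *len.  Returns NULL and sets *len to 0 if `code` is not in the table. */
static const uint32_t *find_decomp(const uint32_t *nodes, size_t n,
                                   const uint32_t *decomp, uint32_t code,
                                   size_t *len)
{
    size_t left = 0, right = n;

    assert(len != NULL);
    while (left < right) {
        size_t mid = left + (right - left) / 2;
        uint32_t key = nodes[mid * 2];

        if (code < key) {
            right = mid;
        } else if (code > key) {
            left = mid + 1;
        } else {
            uint32_t start = nodes[mid * 2 + 1];
            /* The next node's index, or the trailing element for the
             * last node, marks the end of this list. */
            uint32_t end = (mid + 1 < n) ? nodes[(mid + 1) * 2 + 1]
                                         : nodes[n * 2];
            *len = (size_t)(end - start);
            return decomp + start;
        }
    }
    *len = 0;
    return NULL;
}
```

Storing only start indexes keeps each node at two values; the length is
recovered by subtracting neighbouring indexes, which is why the array
needs one element more than twice the node count.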
+
+COMBINING CLASSES
+=================
+
+The fourth data file is called "cmbcl.dat" and contains the characters with
+non-zero combining classes.
+
+The format for the binary form of this table is:
+
+  unsigned short ByteOrderMark
+  unsigned short NumCCLNodes
+  unsigned long  Bytes
+  unsigned long  CCLNodes[NumCCLNodes * 3]
+
+  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
+  same way as described in the CHARACTER PROPERTIES section.
+
+  The CCLNodes[] array consists of groups of three unsigned longs.  The first
+  and second are the beginning and ending of a range and the third is the
+  combining class of that range.
+
+  If a character is not found in this table, then the combining class is
+  assumed to be 0.
+
+  It is important to note that only 65536 distinct ranges plus combining class
+  can be specified because the NumCCLNodes is usually a 16-bit number.
+
+NUMBER TABLE
+============
+
+The final data file is called "num.dat" and contains the characters that have
+a numeric value associated with them.
+
+The format for the binary form of the table is:
+
+  unsigned short ByteOrderMark
+  unsigned short NumNumberNodes
+  unsigned long  Bytes
+  unsigned long  NumberNodes[NumNumberNodes]
+  unsigned short ValueNodes[(Bytes - (NumNumberNodes * sizeof(unsigned long)))
+                            / sizeof(short)]
+
+  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
+  same way as described in the CHARACTER PROPERTIES section.
+
+  The NumberNodes array contains pairs of values, the first of which is the
+  character code and the second an index into the ValueNodes array.  The
+  ValueNodes array contains pairs of integers which represent the numerator
+  and denominator of the numeric value of the character.  If the character
+  happens to map to an integer, both the values in ValueNodes will be the
+  same.
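A numeric value lookup over the two arrays described above could look
roughly like this. The names number_value and find_number are
hypothetical, uint32_t and uint16_t stand in for the unsigned long and
unsigned short entries, and the index stored in NumberNodes is assumed
to be the element index of the numerator in ValueNodes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for a numeric value lookup. */
typedef struct {
    unsigned numerator;
    unsigned denominator;
} number_value;

/* Hypothetical helper: look up the numeric value of `code`.
 * nodes[]   holds (code, index) pairs; node_len is the total element
 *           count (NumNumberNodes), so there are node_len / 2 pairs.
 * values[]  holds (numerator, denominator) pairs; the index from nodes[]
 *           is assumed to point at the numerator.
 * Returns 1 and fills *out on success, 0 if `code` has no numeric value. */
static int find_number(const uint32_t *nodes, size_t node_len,
                       const uint16_t *values, uint32_t code,
                       number_value *out)
{
    size_t left = 0, right = node_len / 2;

    assert(node_len % 2 == 0 && out != NULL);
    while (left < right) {
        size_t mid = left + (right - left) / 2;
        uint32_t key = nodes[mid * 2];

        if (code < key) {
            right = mid;
        } else if (code > key) {
            left = mid + 1;
        } else {
            const uint16_t *vp = values + nodes[mid * 2 + 1];
            out->numerator = vp[0];
            out->denominator = vp[1];
            return 1;
        }
    }
    return 0;
}
```

Following the text, a caller can treat equal numerator and denominator
as the marker for an integer value and read the integer from the
numerator.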