#
# $Id: format.txt,v 1.1 1999/01/08 00:19:20 ftang%netscape.com Exp $
#

CHARACTER DATA
==============

This package generates several data files that contain character properties
useful for text processing.

CHARACTER PROPERTIES
====================

The first data file is called "ctype.dat" and contains a compressed form of
the character properties found in the Unicode Character Database (UCDB).
Additional properties can be specified in limited UCDB format in another file
to avoid modifying the original UCDB.

The following is a property name and code table to be used with the character
data:

NAME CODE DESCRIPTION
---------------------
Mn   0    Mark, Non-Spacing
Mc   1    Mark, Spacing Combining
Me   2    Mark, Enclosing
Nd   3    Number, Decimal Digit
Nl   4    Number, Letter
No   5    Number, Other
Zs   6    Separator, Space
Zl   7    Separator, Line
Zp   8    Separator, Paragraph
Cc   9    Other, Control
Cf   10   Other, Format
Cs   11   Other, Surrogate
Co   12   Other, Private Use
Cn   13   Other, Not Assigned
Lu   14   Letter, Uppercase
Ll   15   Letter, Lowercase
Lt   16   Letter, Titlecase
Lm   17   Letter, Modifier
Lo   18   Letter, Other
Pc   19   Punctuation, Connector
Pd   20   Punctuation, Dash
Ps   21   Punctuation, Open
Pe   22   Punctuation, Close
Po   23   Punctuation, Other
Sm   24   Symbol, Math
Sc   25   Symbol, Currency
Sk   26   Symbol, Modifier
So   27   Symbol, Other
L    28   Left-To-Right
R    29   Right-To-Left
EN   30   European Number
ES   31   European Number Separator
ET   32   European Number Terminator
AN   33   Arabic Number
CS   34   Common Number Separator
B    35   Block Separator
S    36   Segment Separator
WS   37   Whitespace
ON   38   Other Neutrals
Pi   47   Punctuation, Initial
Pf   48   Punctuation, Final
#
# Implementation specific properties.
#
Cm   39   Composite
Nb   40   Non-Breaking
Sy   41   Symmetric (characters which are part of open/close pairs)
Hd   42   Hex Digit
Qm   43   Quote Mark
Mr   44   Mirroring
Ss   45   Space, Other (controls viewed as spaces in ctype isspace())
Cp   46   Defined character

The actual binary data is formatted as follows:

  Assumptions: unsigned short is at least 16 bits in size and unsigned long
               is at least 32 bits in size.

    unsigned short ByteOrderMark
    unsigned short OffsetArraySize
    unsigned long  Bytes
    unsigned short Offsets[OffsetArraySize + 1]
    unsigned long  Ranges[N], N = value of Offsets[OffsetArraySize]

  The Bytes field provides the total byte count used for the Offsets[] and
  Ranges[] arrays.  The Offsets[] array is aligned on a 4-byte boundary and
  there is always one extra node on the end to hold the final index of the
  Ranges[] array.  The Ranges[] array contains pairs of 4-byte values, each
  pair representing a range of Unicode characters.  The pairs are arranged in
  increasing order by the first character code in the range.

  Determining whether a particular character has a given property requires a
  simple binary search over the ranges stored for that property.
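
  The range search described above can be sketched in C as follows.  This is
  only an illustration, not the package's own code: the function name
  in_ranges() and its parameters are hypothetical, the data is assumed to be
  loaded and already byte-swapped if necessary, and it is assumed that two
  consecutive Offsets[] entries give the start and end indexes (both even)
  of one property's values within Ranges[], with each pair holding the first
  and last character code of a range.

```c
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if `code` lies inside one of the ranges
 * stored in ranges[start .. end-1], and 0 otherwise.
 *
 * Assumptions (not stated by the format itself):
 *   - `start` and `end` are even indexes into ranges[], e.g. taken from
 *     two consecutive Offsets[] entries;
 *   - each pair is (first code, last code), both inclusive;
 *   - pairs are sorted by first code and do not overlap.
 */
static int
in_ranges(unsigned long code, const unsigned long *ranges,
          size_t start, size_t end)
{
    size_t lo = start / 2;   /* index of the first pair           */
    size_t hi = end / 2;     /* one past the index of the last pair */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (code < ranges[mid * 2])
            hi = mid;            /* code sorts before this range */
        else if (code > ranges[mid * 2 + 1])
            lo = mid + 1;        /* code sorts after this range  */
        else
            return 1;            /* first <= code <= last        */
    }
    return 0;
}
```

  Working in units of pairs rather than raw array indexes keeps the midpoint
  on a range boundary, so no extra adjustment is needed inside the loop.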

  If the ByteOrderMark is equal to 0xFFFE, then the data was generated on a
  machine with a different endian order and the values must be byte-swapped.

  To swap a 16-bit value:
     c = (c >> 8) | ((c & 0xff) << 8)

  To swap a 32-bit value:
     c = ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
         (((c >> 16) & 0xff) << 8) | (c >> 24)

CASE MAPPINGS
=============

The next data file is called "case.dat" and contains three case mapping tables
in the following order: upper, lower, and title case.  Each table is in
increasing order by character code and each mapping contains 3 unsigned longs
which represent the possible mappings.

The format for the binary form of these tables is:

  unsigned short ByteOrderMark
  unsigned short NumMappingNodes, count of all mapping nodes
  unsigned short CaseTableSizes[2], upper and lower mapping node counts
  unsigned long  CaseTables[NumMappingNodes]

  The starting indexes of the case tables are calculated as follows:

    UpperIndex = 0;
    LowerIndex = CaseTableSizes[0] * 3;
    TitleIndex = LowerIndex + CaseTableSizes[1] * 3;

  The order of the fields for the three tables is:

    Upper case
    ----------
    unsigned long upper;
    unsigned long lower;
    unsigned long title;

    Lower case
    ----------
    unsigned long lower;
    unsigned long upper;
    unsigned long title;

    Title case
    ----------
    unsigned long title;
    unsigned long upper;
    unsigned long lower;

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  Because the tables are in increasing order by character code, locating a
  mapping requires a simple binary search on the first of the 3 codes that
  make up each node.

  It is important to note that there can only be 65536 mapping nodes, which,
  divided into 3 portions, allows 21845 nodes for each case mapping table.
  The number of mappings in any one table may be more or less than 21845, but
  only 65536 are allowed in total.

DECOMPOSITIONS
==============

The next data file is called "decomp.dat" and contains the decomposition data
for all characters whose decompositions contain more than one character and
are *not* compatibility decompositions.  Compatibility decompositions are
signaled in the UCDB format by the use of the <compat> tag in the
decomposition field.  Each list of character codes represents a full
decomposition of a composite character.  The nodes are arranged in increasing
order by character code.

The format for the binary form of this table is:

  unsigned short ByteOrderMark
  unsigned short NumDecompNodes, count of all decomposition nodes
  unsigned long  Bytes
  unsigned long  DecompNodes[(NumDecompNodes * 2) + 1]
  unsigned long  Decomp[N], N = sum of all counts in DecompNodes[]

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  The DecompNodes[] array consists of pairs of unsigned longs, the first of
  which is the character code and the second is the initial index of the list
  of character codes representing the decomposition.

  Locating the decomposition of a composite character requires a binary search
  for a character code in the DecompNodes[] array and using its index to
  locate the start of the decomposition.  The length of the decomposition list
  is the index in the following element in DecompNodes[] minus the current
  index.

COMBINING CLASSES
=================

The fourth data file is called "cmbcl.dat" and contains the characters with
non-zero combining classes.

The format for the binary form of this table is:

  unsigned short ByteOrderMark
  unsigned short NumCCLNodes
  unsigned long  Bytes
  unsigned long  CCLNodes[NumCCLNodes * 3]

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  The CCLNodes[] array consists of groups of three unsigned longs.  The first
  and second are the beginning and ending of a range and the third is the
  combining class of that range.

  If a character is not found in this table, then the combining class is
  assumed to be 0.

  It is important to note that only 65536 distinct ranges plus combining class
  can be specified because the NumCCLNodes is usually a 16-bit number.

NUMBER TABLE
============

The final data file is called "num.dat" and contains the characters that have
a numeric value associated with them.

The format for the binary form of the table is:

  unsigned short ByteOrderMark
  unsigned short NumNumberNodes
  unsigned long  Bytes
  unsigned long  NumberNodes[NumNumberNodes]
  unsigned short ValueNodes[(Bytes - (NumNumberNodes * sizeof(unsigned long)))
                            / sizeof(short)]

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  The NumberNodes array contains pairs of values, the first of which is the
  character code and the second an index into the ValueNodes array.  The
  ValueNodes array contains pairs of integers which represent the numerator
  and denominator of the numeric value of the character.  If the character
  happens to map to an integer, both the values in ValueNodes will be the
  same.
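
  As a rough illustration of the two-array layout above, a lookup might be
  written along the following lines.  The function name number_lookup() and
  its parameters are hypothetical.  The sketch assumes that the data is
  already loaded and byte-swapped if necessary, that NumberNodes is sorted in
  increasing order by character code like the other tables, that n_longs is
  the number of unsigned longs in NumberNodes (two per character), and that
  the stored index points at the numerator in ValueNodes with the denominator
  immediately after it.

```c
#include <stddef.h>

/*
 * Hypothetical helper: look up the numeric value of `code`.
 * Returns 1 and fills in *numerator and *denominator when the character
 * is present in the table, and returns 0 otherwise.
 *
 * Assumptions (not stated by the format itself):
 *   - nodes[] holds (character code, value index) pairs sorted by code;
 *   - n_longs is the total number of unsigned longs in nodes[];
 *   - values[index] is the numerator and values[index + 1] the
 *     denominator.
 */
static int
number_lookup(unsigned long code,
              const unsigned long *nodes, size_t n_longs,
              const unsigned short *values,
              unsigned short *numerator, unsigned short *denominator)
{
    size_t lo = 0;
    size_t hi = n_longs / 2;     /* number of (code, index) pairs */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        unsigned long c = nodes[mid * 2];

        if (code < c) {
            hi = mid;
        } else if (code > c) {
            lo = mid + 1;
        } else {
            size_t v = (size_t) nodes[mid * 2 + 1];

            *numerator = values[v];
            *denominator = values[v + 1];
            return 1;
        }
    }
    return 0;
}
```

  The values are returned exactly as stored; a caller that needs a single
  number still has to divide the numerator by the denominator itself.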