michael@0: #
michael@0: # $Id: format.txt,v 1.1 1999/01/08 00:19:20 ftang%netscape.com Exp $
michael@0: #
michael@0: 
michael@0: CHARACTER DATA
michael@0: ==============
michael@0: 
michael@0: This package generates some data files that contain character properties useful
michael@0: for text processing.
michael@0: 
michael@0: CHARACTER PROPERTIES
michael@0: ====================
michael@0: 
michael@0: The first data file is called "ctype.dat" and contains a compressed form of
michael@0: the character properties found in the Unicode Character Database (UCDB).
michael@0: Additional properties can be specified in limited UCDB format in another file
michael@0: to avoid modifying the original UCDB.
michael@0: 
michael@0: The following is a property name and code table to be used with the character
michael@0: data:
michael@0: 
michael@0: NAME CODE DESCRIPTION
michael@0: ---------------------
michael@0: Mn   0    Mark, Non-Spacing
michael@0: Mc   1    Mark, Spacing Combining
michael@0: Me   2    Mark, Enclosing
michael@0: Nd   3    Number, Decimal Digit
michael@0: Nl   4    Number, Letter
michael@0: No   5    Number, Other
michael@0: Zs   6    Separator, Space
michael@0: Zl   7    Separator, Line
michael@0: Zp   8    Separator, Paragraph
michael@0: Cc   9    Other, Control
michael@0: Cf   10   Other, Format
michael@0: Cs   11   Other, Surrogate
michael@0: Co   12   Other, Private Use
michael@0: Cn   13   Other, Not Assigned
michael@0: Lu   14   Letter, Uppercase
michael@0: Ll   15   Letter, Lowercase
michael@0: Lt   16   Letter, Titlecase
michael@0: Lm   17   Letter, Modifier
michael@0: Lo   18   Letter, Other
michael@0: Pc   19   Punctuation, Connector
michael@0: Pd   20   Punctuation, Dash
michael@0: Ps   21   Punctuation, Open
michael@0: Pe   22   Punctuation, Close
michael@0: Po   23   Punctuation, Other
michael@0: Sm   24   Symbol, Math
michael@0: Sc   25   Symbol, Currency
michael@0: Sk   26   Symbol, Modifier
michael@0: So   27   Symbol, Other
michael@0: L    28   Left-To-Right
michael@0: R    29   Right-To-Left
michael@0: EN   30   European Number
michael@0: ES   31   European Number Separator
michael@0: ET   32   European Number Terminator
michael@0: AN   33   Arabic Number
michael@0: CS   34   Common Number Separator
michael@0: B    35   Block Separator
michael@0: S    36   Segment Separator
michael@0: WS   37   Whitespace
michael@0: ON   38   Other Neutrals
michael@0: Pi   47   Punctuation, Initial
michael@0: Pf   48   Punctuation, Final
michael@0: #
michael@0: # Implementation-specific properties.
michael@0: #
michael@0: Cm   39   Composite
michael@0: Nb   40   Non-Breaking
michael@0: Sy   41   Symmetric (characters which are part of open/close pairs)
michael@0: Hd   42   Hex Digit
michael@0: Qm   43   Quote Mark
michael@0: Mr   44   Mirroring
michael@0: Ss   45   Space, Other (controls viewed as spaces in ctype isspace())
michael@0: Cp   46   Defined character
michael@0: 
michael@0: The actual binary data is formatted as follows:
michael@0: 
michael@0:   Assumptions: unsigned short is at least 16 bits in size and unsigned long
michael@0:                is at least 32 bits in size.
michael@0: 
michael@0:     unsigned short ByteOrderMark
michael@0:     unsigned short OffsetArraySize
michael@0:     unsigned long  Bytes
michael@0:     unsigned short Offsets[OffsetArraySize + 1]
michael@0:     unsigned long  Ranges[N], N = value of Offsets[OffsetArraySize]
michael@0: 
michael@0:   The Bytes field provides the total byte count used for the Offsets[] and
michael@0:   Ranges[] arrays.  The Offsets[] array is aligned on a 4-byte boundary and
michael@0:   there is always one extra node on the end to hold the final index of the
michael@0:   Ranges[] array.  The Ranges[] array contains pairs of 4-byte values
michael@0:   representing a range of Unicode characters.  The pairs are arranged in
michael@0:   increasing order by the first character code in the range.
michael@0: 
michael@0:   Determining whether a particular character has a given property requires
michael@0:   a simple binary search over the ranges recorded for that property.
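
  The search above can be sketched in C as follows.  This is only a sketch
  under stated assumptions: Offsets[p] and Offsets[p + 1] are taken to
  delimit the slice of Ranges[] belonging to property code p (counted in
  unsigned longs, two per range), and both arrays are assumed to be loaded
  and already byte-swapped if necessary.  The name prop_lookup and its
  parameters are hypothetical and not part of the package.

```c
#include <stddef.h>

/*
 * Return 1 if `code` lies in one of the ranges stored for property `prop`,
 * 0 otherwise.  Assumption: offsets[prop] is the first index of the
 * property's slice of ranges[] and offsets[prop + 1] is one past its end.
 */
static int
prop_lookup(unsigned long code, unsigned short prop,
            const unsigned short *offsets, const unsigned long *ranges)
{
    const unsigned long *base = ranges + offsets[prop];
    size_t lo = 0;
    size_t hi = (size_t)(offsets[prop + 1] - offsets[prop]) / 2; /* pairs */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (code < base[2 * mid])
            hi = mid;               /* code is before this range */
        else if (code > base[2 * mid + 1])
            lo = mid + 1;           /* code is after this range */
        else
            return 1;               /* start <= code <= end */
    }
    return 0;
}
```

  A property with no ranges is represented here by two equal consecutive
  offsets, which is also an assumption of the sketch.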
michael@0: 
michael@0:   If the ByteOrderMark is equal to 0xFFFE, then the data was generated on a
michael@0:   machine with a different endian order and the values must be byte-swapped.
michael@0: 
michael@0:   To swap a 16-bit value:
michael@0:      c = (c >> 8) | ((c & 0xff) << 8)
michael@0: 
michael@0:   To swap a 32-bit value:
michael@0:      c = ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
michael@0:          (((c >> 16) & 0xff) << 8) | (c >> 24)
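
  As a sketch, the two swaps above can be wrapped in small C helpers and
  applied only when the byte order mark reads 0xFFFE.  The helper names
  (swap16, swap32, fix_longs) are hypothetical.  The extra masks keep the
  results within 16 and 32 bits on platforms where the types are wider.

```c
#include <stddef.h>

/* Byte order mark value that signals swapped data, as described above. */
#define BOM_SWAPPED 0xFFFEu

/* Swap the two low-order bytes of a 16-bit value. */
static unsigned short
swap16(unsigned short c)
{
    return (unsigned short)(((c >> 8) & 0xff) | ((c & 0xff) << 8));
}

/* Swap the four low-order bytes of a 32-bit value. */
static unsigned long
swap32(unsigned long c)
{
    return ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
           (((c >> 16) & 0xff) << 8) | ((c >> 24) & 0xff);
}

/* Swap n 32-bit values in place, but only if the mark says so. */
static void
fix_longs(unsigned short bom, unsigned long *v, size_t n)
{
    size_t i;

    if (bom != BOM_SWAPPED)
        return;
    for (i = 0; i < n; i++)
        v[i] = swap32(v[i]);
}
```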
michael@0: 
michael@0: CASE MAPPINGS
michael@0: =============
michael@0: 
michael@0: The next data file is called "case.dat" and contains three case mapping tables
michael@0: in the following order: upper, lower, and title case.  Each table is in
michael@0: increasing order by character code and each mapping contains 3 unsigned longs
michael@0: which represent the possible mappings.
michael@0: 
michael@0: The format for the binary form of these tables is:
michael@0: 
michael@0:   unsigned short ByteOrderMark
michael@0:   unsigned short NumMappingNodes, count of all mapping nodes
michael@0:   unsigned short CaseTableSizes[2], upper and lower mapping node counts
michael@0:   unsigned long  CaseTables[NumMappingNodes * 3]
michael@0: 
michael@0:   The starting indexes of the case tables are calculated as follows:
michael@0: 
michael@0:     UpperIndex = 0;
michael@0:     LowerIndex = CaseTableSizes[0] * 3;
michael@0:     TitleIndex = LowerIndex + CaseTableSizes[1] * 3;
michael@0: 
michael@0:   The order of the fields for the three tables is:
michael@0: 
michael@0:     Upper case
michael@0:     ----------
michael@0:     unsigned long upper;
michael@0:     unsigned long lower;
michael@0:     unsigned long title;
michael@0: 
michael@0:     Lower case
michael@0:     ----------
michael@0:     unsigned long lower;
michael@0:     unsigned long upper;
michael@0:     unsigned long title;
michael@0: 
michael@0:     Title case
michael@0:     ----------
michael@0:     unsigned long title;
michael@0:     unsigned long upper;
michael@0:     unsigned long lower;
michael@0: 
michael@0:   If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
michael@0:   same way as described in the CHARACTER PROPERTIES section.
michael@0: 
michael@0:   Because each table is in increasing order by character code, locating a
michael@0:   mapping requires a simple binary search on the first of the 3 codes that
michael@0:   make up each node of the table being searched.
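
  The index calculation and search can be sketched together in C.  The
  sketch assumes each node occupies 3 consecutive unsigned longs in
  CaseTables[], that the first code of a node is the search key, and that
  the title table holds whatever nodes remain after the upper and lower
  tables.  The function name, the enum, and the parameter names are
  hypothetical.

```c
#include <stddef.h>

enum { CASE_UPPER_TABLE, CASE_LOWER_TABLE, CASE_TITLE_TABLE };

/*
 * Look up `code` in one of the three tables and return field `field`
 * (0, 1 or 2, in the per-table field order listed above) of its node.
 * Returns `code` itself when no mapping node exists.
 */
static unsigned long
case_lookup(unsigned long code, const unsigned long *tables,
            unsigned short num_nodes, const unsigned short sizes[2],
            int which, int field)
{
    size_t start, count, lo, hi;

    if (which == CASE_UPPER_TABLE) {
        start = 0;
        count = sizes[0];
    } else if (which == CASE_LOWER_TABLE) {
        start = (size_t)sizes[0] * 3;
        count = sizes[1];
    } else {
        start = ((size_t)sizes[0] + sizes[1]) * 3;
        count = (size_t)num_nodes - sizes[0] - sizes[1];
    }

    lo = 0;
    hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const unsigned long *node = tables + start + mid * 3;

        if (code < node[0])
            hi = mid;
        else if (code > node[0])
            lo = mid + 1;
        else
            return node[field];
    }
    return code;
}
```

  For example, passing CASE_UPPER_TABLE with field 1 would return the lower
  case form of an upper case character, following the field order above.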
michael@0: 
michael@0:   It is important to note that NumMappingNodes is an unsigned short,
michael@0:   typically 16 bits, so at most 65535 mapping nodes can be stored in total.
michael@0:   Divided evenly, that allows 21845 nodes for each case mapping table.  An
michael@0:   individual table may hold more or fewer than 21845 nodes, provided the
michael@0:   total does not exceed 65535.
michael@0: 
michael@0: DECOMPOSITIONS
michael@0: ==============
michael@0: 
michael@0: The next data file is called "decomp.dat" and contains the decomposition data
michael@0: for all characters whose decompositions contain more than one character and
michael@0: are *not* compatibility decompositions.  Compatibility decompositions are
michael@0: signaled in the UCDB format by the use of the <compat> tag in the
michael@0: decomposition field.  Each list of character codes represents a full
michael@0: decomposition of a composite character.  The nodes are arranged in increasing
michael@0: order by character code.
michael@0: 
michael@0: The format for the binary form of this table is:
michael@0: 
michael@0:   unsigned short ByteOrderMark
michael@0:   unsigned short NumDecompNodes, count of all decomposition nodes
michael@0:   unsigned long  Bytes
michael@0:   unsigned long  DecompNodes[(NumDecompNodes * 2) + 1]
michael@0:   unsigned long  Decomp[N], N = total length of all decomposition lists
michael@0: 
michael@0:   If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
michael@0:   same way as described in the CHARACTER PROPERTIES section.
michael@0: 
michael@0:   The DecompNodes[] array consists of pairs of unsigned longs, the first of
michael@0:   which is the character code and the second is the initial index of the list
michael@0:   of character codes representing the decomposition.  The one extra element
michael@0:   at the end of the array holds the final index into Decomp[].
michael@0: 
michael@0:   Locating the decomposition of a composite character requires a binary search
michael@0:   for a character code in the DecompNodes[] array and using its index to
michael@0:   locate the start of the decomposition.  The length of the decomposition list
michael@0:   is the index stored in the following node of DecompNodes[] (or, for the
michael@0:   last node, in the extra final element) minus the current node's index.
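
  A minimal C sketch of this lookup follows.  It assumes the indexes stored
  in DecompNodes[] are relative to the start of Decomp[] and that the single
  extra element at the end of DecompNodes[] supplies the end index for the
  last node.  The name decomp_lookup and its parameters are hypothetical.

```c
#include <stddef.h>

/*
 * Find the decomposition of `code`.  On success, store its length in *len
 * and return a pointer to its first character code inside decomp[].
 * Returns NULL (and sets *len to 0) if the character has no entry.
 */
static const unsigned long *
decomp_lookup(unsigned long code, const unsigned long *nodes,
              unsigned short num_nodes, const unsigned long *decomp,
              unsigned long *len)
{
    size_t lo = 0, hi = num_nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (code < nodes[2 * mid]) {
            hi = mid;
        } else if (code > nodes[2 * mid]) {
            lo = mid + 1;
        } else {
            unsigned long first = nodes[2 * mid + 1];
            /* End index: next node's index, or the final extra element. */
            unsigned long end = (mid + 1 < num_nodes)
                                    ? nodes[2 * mid + 3]
                                    : nodes[2 * (size_t)num_nodes];

            *len = end - first;
            return decomp + first;
        }
    }
    *len = 0;
    return NULL;
}
```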
michael@0: 
michael@0: COMBINING CLASSES
michael@0: =================
michael@0: 
michael@0: The fourth data file is called "cmbcl.dat" and contains the characters with
michael@0: non-zero combining classes.
michael@0: 
michael@0: The format for the binary form of this table is:
michael@0: 
michael@0:   unsigned short ByteOrderMark
michael@0:   unsigned short NumCCLNodes
michael@0:   unsigned long  Bytes
michael@0:   unsigned long  CCLNodes[NumCCLNodes * 3]
michael@0: 
michael@0:   If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
michael@0:   same way as described in the CHARACTER PROPERTIES section.
michael@0: 
michael@0:   The CCLNodes[] array consists of groups of three unsigned longs.  The first
michael@0:   and second are the beginning and ending of a range and the third is the
michael@0:   combining class of that range.
michael@0: 
michael@0:   If a character is not found in this table, then the combining class is
michael@0:   assumed to be 0.
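
  A lookup over the CCLNodes[] triples might be sketched in C as below.  The
  format description does not state the ordering of the triples, so the
  sketch assumes they are sorted by range start and do not overlap, which
  is what makes a binary search possible.  The name cclass_lookup and its
  parameters are hypothetical.

```c
#include <stddef.h>

/*
 * Return the combining class of `code`, or 0 if it is in no range.
 * Each node is three unsigned longs: range start, range end, class.
 * Assumption: nodes are sorted by range start and do not overlap.
 */
static unsigned long
cclass_lookup(unsigned long code, const unsigned long *nodes,
              unsigned short num_nodes)
{
    size_t lo = 0, hi = num_nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const unsigned long *node = nodes + mid * 3;

        if (code < node[0])
            hi = mid;
        else if (code > node[1])
            lo = mid + 1;
        else
            return node[2];
    }
    return 0;   /* not found: combining class is assumed to be 0 */
}
```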
michael@0: 
michael@0:   It is important to note that at most 65535 ranges, each with its combining
michael@0:   class, can be specified because NumCCLNodes is usually a 16-bit number.
michael@0: 
michael@0: NUMBER TABLE
michael@0: ============
michael@0: 
michael@0: The final data file is called "num.dat" and contains the characters that have
michael@0: a numeric value associated with them.
michael@0: 
michael@0: The format for the binary form of the table is:
michael@0: 
michael@0:   unsigned short ByteOrderMark
michael@0:   unsigned short NumNumberNodes
michael@0:   unsigned long  Bytes
michael@0:   unsigned long  NumberNodes[NumNumberNodes]
michael@0:   unsigned short ValueNodes[(Bytes - (NumNumberNodes * sizeof(unsigned long)))
michael@0:                             / sizeof(short)]
michael@0: 
michael@0:   If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
michael@0:   same way as described in the CHARACTER PROPERTIES section.
michael@0: 
michael@0:   The NumberNodes array contains pairs of values, the first of which is the
michael@0:   character code and the second an index into the ValueNodes array.  The
michael@0:   ValueNodes array contains pairs of integers which represent the numerator
michael@0:   and denominator of the numeric value of the character.  If the character
michael@0:   maps to an integer, both values of its pair in ValueNodes will be the
michael@0:   same.
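
  The lookup can be sketched in C as follows.  Several points are
  assumptions rather than statements of the format: that NumNumberNodes
  counts the unsigned longs in NumberNodes[] (so there are half as many
  pairs), that the pairs are sorted by character code, and that the stored
  index points at the numerator with the denominator immediately after it.
  The name num_lookup and its parameters are hypothetical.

```c
#include <stddef.h>

/*
 * Find the numeric value of `code`.  On success, store the numerator and
 * denominator and return 1; return 0 if the character has no entry.
 * Assumption: num_longs is the length of nodes[] in unsigned longs.
 */
static int
num_lookup(unsigned long code, const unsigned long *nodes,
           unsigned short num_longs, const unsigned short *vals,
           unsigned short *numer, unsigned short *denom)
{
    size_t lo = 0, hi = (size_t)num_longs / 2;   /* count of pairs */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (code < nodes[2 * mid]) {
            hi = mid;
        } else if (code > nodes[2 * mid]) {
            lo = mid + 1;
        } else {
            const unsigned short *v = vals + nodes[2 * mid + 1];

            *numer = v[0];
            *denom = v[1];
            return 1;
        }
    }
    return 0;
}
```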