intl/unicharutil/tools/format.txt

changeset 0
6474c204b198
     1.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     1.2 +++ b/intl/unicharutil/tools/format.txt	Wed Dec 31 06:09:35 2014 +0100
     1.3 @@ -0,0 +1,243 @@
     1.4 +#
     1.5 +# $Id: format.txt,v 1.1 1999/01/08 00:19:20 ftang%netscape.com Exp $
     1.6 +#
     1.7 +
     1.8 +CHARACTER DATA
     1.9 +==============
    1.10 +
    1.11 +This package generates some data files that contain character properties useful
    1.12 +for text processing.
    1.13 +
    1.14 +CHARACTER PROPERTIES
    1.15 +====================
    1.16 +
    1.17 +The first data file is called "ctype.dat" and contains a compressed form of
    1.18 +the character properties found in the Unicode Character Database (UCDB).
    1.19 +Additional properties can be specified in limited UCDB format in another file
    1.20 +to avoid modifying the original UCDB.
    1.21 +
    1.22 +The following is a property name and code table to be used with the character
    1.23 +data:
    1.24 +
    1.25 +NAME CODE DESCRIPTION
    1.26 +---------------------
    1.27 +Mn   0    Mark, Non-Spacing
    1.28 +Mc   1    Mark, Spacing Combining
    1.29 +Me   2    Mark, Enclosing
    1.30 +Nd   3    Number, Decimal Digit
    1.31 +Nl   4    Number, Letter
    1.32 +No   5    Number, Other
    1.33 +Zs   6    Separator, Space
    1.34 +Zl   7    Separator, Line
    1.35 +Zp   8    Separator, Paragraph
    1.36 +Cc   9    Other, Control
    1.37 +Cf   10   Other, Format
    1.38 +Cs   11   Other, Surrogate
    1.39 +Co   12   Other, Private Use
    1.40 +Cn   13   Other, Not Assigned
    1.41 +Lu   14   Letter, Uppercase
    1.42 +Ll   15   Letter, Lowercase
    1.43 +Lt   16   Letter, Titlecase
    1.44 +Lm   17   Letter, Modifier
    1.45 +Lo   18   Letter, Other
    1.46 +Pc   19   Punctuation, Connector
    1.47 +Pd   20   Punctuation, Dash
    1.48 +Ps   21   Punctuation, Open
    1.49 +Pe   22   Punctuation, Close
    1.50 +Po   23   Punctuation, Other
    1.51 +Sm   24   Symbol, Math
    1.52 +Sc   25   Symbol, Currency
    1.53 +Sk   26   Symbol, Modifier
    1.54 +So   27   Symbol, Other
    1.55 +L    28   Left-To-Right
    1.56 +R    29   Right-To-Left
    1.57 +EN   30   European Number
    1.58 +ES   31   European Number Separator
    1.59 +ET   32   European Number Terminator
    1.60 +AN   33   Arabic Number
    1.61 +CS   34   Common Number Separator
    1.62 +B    35   Block Separator
    1.63 +S    36   Segment Separator
    1.64 +WS   37   Whitespace
    1.65 +ON   38   Other Neutrals
    1.66 +Pi   47   Punctuation, Initial
    1.67 +Pf   48   Punctuation, Final
    1.68 +#
    1.69 +# Implementation specific properties.
    1.70 +#
    1.71 +Cm   39   Composite
    1.72 +Nb   40   Non-Breaking
    1.73 +Sy   41   Symmetric (characters which are part of open/close pairs)
    1.74 +Hd   42   Hex Digit
    1.75 +Qm   43   Quote Mark
    1.76 +Mr   44   Mirroring
    1.77 +Ss   45   Space, Other (controls viewed as spaces in ctype isspace())
    1.78 +Cp   46   Defined character
    1.79 +
    1.80 +The actual binary data is formatted as follows:
    1.81 +
    1.82 +  Assumptions: unsigned short is at least 16 bits in size and unsigned long
    1.83 +               is at least 32 bits in size.
    1.84 +
    1.85 +    unsigned short ByteOrderMark
    1.86 +    unsigned short OffsetArraySize
    1.87 +    unsigned long  Bytes
    1.88 +    unsigned short Offsets[OffsetArraySize + 1]
    1.89 +    unsigned long  Ranges[N], N = value of Offsets[OffsetArraySize]
    1.90 +
    1.91 +  The Bytes field provides the total byte count used for the Offsets[] and
    1.92 +  Ranges[] arrays.  The Offsets[] array is aligned on a 4-byte boundary and
    1.93 +  there is always one extra node on the end to hold the final index of the
    1.94 +  Ranges[] array.  The Ranges[] array contains pairs of 4-byte values
    1.95 +  representing a range of Unicode characters.  The pairs are arranged in
    1.96 +  increasing order by the first character code in the range.
    1.97 +
    1.98 +  Determining whether a character has a given property requires a simple
    1.99 +  binary search to check whether the character falls within any of the
   1.100 +  ranges for that property.
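The lookup described above might be sketched as follows. This is a minimal illustration, not the package's own code: the function name prop_lookup and its parameters are hypothetical, and it assumes that Offsets[p] and Offsets[p + 1] delimit the slice of Ranges[] belonging to property code p, and that the data has already been converted to host byte order.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if `code` lies in one of the ranges recorded for property
 * `prop`, 0 otherwise.
 * Assumption: offsets[prop] .. offsets[prop + 1] bound that property's
 * slice of ranges[], which holds (first, last) pairs in increasing order. */
int prop_lookup(unsigned long code, unsigned short prop,
                const unsigned short *offsets, unsigned short offset_count,
                const unsigned long *ranges)
{
    const unsigned long *base;
    size_t l, r;

    assert(offsets != NULL && ranges != NULL);

    if (prop >= offset_count)
        return 0;

    base = ranges + offsets[prop];
    l = 0;
    r = (size_t)(offsets[prop + 1] - offsets[prop]) / 2; /* number of pairs */

    /* Binary search over pairs: base[2m] is the first code, base[2m+1] the last. */
    while (l < r) {
        size_t m = l + (r - l) / 2;

        if (code < base[2 * m])
            r = m;
        else if (code > base[2 * m + 1])
            l = m + 1;
        else
            return 1;
    }
    return 0;
}
```

A caller would pass the property code from the table above (for example 3 for Nd) together with the arrays read from "ctype.dat".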
   1.101 +
   1.102 +  If the ByteOrderMark is equal to 0xFFFE, then the data was generated on a
   1.103 +  machine with a different endian order and the values must be byte-swapped.
   1.104 +
   1.105 +  To swap a 16-bit value:
   1.106 +     c = (c >> 8) | ((c & 0xff) << 8)
   1.107 +
   1.108 +  To swap a 32-bit value:
   1.109 +     c = ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
   1.110 +         (((c >> 16) & 0xff) << 8) | (c >> 24)
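The two swap expressions above can be wrapped in small helpers. The sketch below is illustrative only; the names swap16, swap32, needs_swap and swap32_array are hypothetical. The extra masks are an addition so the result stays within 16 or 32 bits on platforms where unsigned short or unsigned long is wider than the minimum size.

```c
#include <assert.h>
#include <stddef.h>

/* Swap the two bytes of a 16-bit value. */
unsigned short swap16(unsigned short c)
{
    return (unsigned short)(((c >> 8) & 0xff) | ((c & 0xff) << 8));
}

/* Reverse the four bytes of a 32-bit value. */
unsigned long swap32(unsigned long c)
{
    return ((c & 0xffUL) << 24) | (((c >> 8) & 0xffUL) << 16) |
           (((c >> 16) & 0xffUL) << 8) | ((c >> 24) & 0xffUL);
}

/* A ByteOrderMark read as 0xFFFE means the file has the opposite byte order. */
int needs_swap(unsigned short bom)
{
    return bom == 0xFFFE;
}

/* Swap every element of a 32-bit array in place, e.g. a Ranges[] array. */
void swap32_array(unsigned long *v, size_t n)
{
    size_t i;

    assert(v != NULL || n == 0);
    for (i = 0; i < n; i++)
        v[i] = swap32(v[i]);
}
```

A reader would typically test needs_swap() once on the header and, if it returns 1, swap the remaining header fields and each array before using them.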
   1.111 +
   1.112 +CASE MAPPINGS
   1.113 +=============
   1.114 +
   1.115 +The next data file is called "case.dat" and contains three case mapping tables
   1.116 +in the following order: upper, lower, and title case.  Each table is in
   1.117 +increasing order by character code and each mapping contains 3 unsigned longs
   1.118 +which represent the possible mappings.
   1.119 +
   1.120 +The format for the binary form of these tables is:
   1.121 +
   1.122 +  unsigned short ByteOrderMark
   1.123 +  unsigned short NumMappingNodes, count of all mapping nodes
   1.124 +  unsigned short CaseTableSizes[2], upper and lower mapping node counts
   1.125 +  unsigned long  CaseTables[NumMappingNodes * 3]
   1.126 +
   1.127 +  The starting indexes of the case tables are calculated as follows:
   1.128 +
   1.129 +    UpperIndex = 0;
   1.130 +    LowerIndex = CaseTableSizes[0] * 3;
   1.131 +    TitleIndex = LowerIndex + CaseTableSizes[1] * 3;
   1.132 +
   1.133 +  The order of the fields for the three tables is:
   1.134 +
   1.135 +    Upper case
   1.136 +    ----------
   1.137 +    unsigned long upper;
   1.138 +    unsigned long lower;
   1.139 +    unsigned long title;
   1.140 +
   1.141 +    Lower case
   1.142 +    ----------
   1.143 +    unsigned long lower;
   1.144 +    unsigned long upper;
   1.145 +    unsigned long title;
   1.146 +
   1.147 +    Title case
   1.148 +    ----------
   1.149 +    unsigned long title;
   1.150 +    unsigned long upper;
   1.151 +    unsigned long lower;
   1.152 +
   1.153 +  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
   1.154 +  same way as described in the CHARACTER PROPERTIES section.
   1.155 +
   1.156 +  Because the tables are in increasing order by character code, locating a
   1.157 +  mapping requires a simple binary search on the first of the 3 codes that
   1.158 +  make up each node.
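A case lookup along these lines might look like the sketch below. The function name case_find and its parameters are hypothetical. It assumes each table is keyed on the first field of its nodes (the field matching the table's own case) and that the data is already in host byte order. The title table's node count is not stored directly; the sketch assumes the caller derives it as NumMappingNodes minus the two CaseTableSizes[] entries.

```c
#include <assert.h>
#include <stddef.h>

/* Searches one case table for the node whose first field equals `code`.
 * `tables` points at CaseTables[], `start` is UpperIndex, LowerIndex or
 * TitleIndex, and `count` is the number of 3-field nodes in that table.
 * Returns a pointer to the node's three fields, or NULL if not found. */
const unsigned long *case_find(const unsigned long *tables, size_t start,
                               size_t count, unsigned long code)
{
    const unsigned long *base;
    size_t l = 0, r = count;

    assert(tables != NULL);
    base = tables + start;

    while (l < r) {
        size_t m = l + (r - l) / 2;
        const unsigned long *node = base + (3 * m);

        if (code < node[0])
            r = m;
        else if (code > node[0])
            l = m + 1;
        else
            return node;
    }
    return NULL;
}
```

With the field orders listed above, the lower-case form of an upper-case character is field 1 of its node in the upper table, and the lower-case form of a title-case character is field 2 of its node in the title table.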
   1.159 +
   1.160 +  It is important to note that there can be at most 65535 mapping nodes,
   1.161 +  because NumMappingNodes is a 16-bit count.  Divided evenly into 3 portions,
   1.162 +  that allows 21845 nodes for each case mapping table.  The distribution of
   1.163 +  mappings may be more or less than 21845 per table, but only 65535 in total.
   1.164 +
   1.165 +DECOMPOSITIONS
   1.166 +==============
   1.167 +
   1.168 +The next data file is called "decomp.dat" and contains the decomposition data
   1.169 +for all characters whose decompositions contain more than one character and
   1.170 +are *not* compatibility decompositions.  Compatibility decompositions are
   1.171 +signaled in the UCDB format by the use of the <compat> tag in the
   1.172 +decomposition field.  Each list of character codes represents a full
   1.173 +decomposition of a composite character.  The nodes are arranged in increasing
   1.174 +order by character code.
   1.175 +
   1.176 +The format for the binary form of this table is:
   1.177 +
   1.178 +  unsigned short ByteOrderMark
   1.179 +  unsigned short NumDecompNodes, count of all decomposition nodes
   1.180 +  unsigned long  Bytes
   1.181 +  unsigned long  DecompNodes[(NumDecompNodes * 2) + 1]
   1.182 +  unsigned long  Decomp[N], N = sum of all counts in DecompNodes[]
   1.183 +
   1.184 +  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
   1.185 +  same way as described in the CHARACTER PROPERTIES section.
   1.186 +
   1.187 +  The DecompNodes[] array consists of pairs of unsigned longs, the first of
   1.188 +  which is the character code and the second is the initial index of the list
   1.189 +  of character codes representing the decomposition.
   1.190 +
   1.191 +  Locating the decomposition of a composite character requires a binary search
   1.192 +  for a character code in the DecompNodes[] array and using its index to
   1.193 +  locate the start of the decomposition.  The length of the decomposition list
   1.194 +  is the index in the following element in DecompNodes[] minus the current
   1.195 +  index.
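The decomposition lookup might be sketched as below. The function name decomp_find and its parameters are hypothetical. The sketch assumes that the single extra element at the end of DecompNodes[] holds the end index for the last node, so every node has a "following" index to subtract from, and that the data is already in host byte order.

```c
#include <assert.h>
#include <stddef.h>

/* Looks up `code` in nodes[] (pairs of character code and start index,
 * followed by one trailing end index) and reports its decomposition.
 * On success stores a pointer into decomp[] and the list length, returns 1.
 * Returns 0 if `code` has no entry. */
int decomp_find(const unsigned long *nodes, size_t num_nodes,
                const unsigned long *decomp, unsigned long code,
                const unsigned long **out, size_t *out_len)
{
    size_t l = 0, r = num_nodes;

    assert(nodes != NULL && decomp != NULL);
    assert(out != NULL && out_len != NULL);

    while (l < r) {
        size_t m = l + (r - l) / 2;
        unsigned long c = nodes[2 * m];

        if (code < c) {
            r = m;
        } else if (code > c) {
            l = m + 1;
        } else {
            unsigned long begin = nodes[2 * m + 1];
            /* Assumption: the last node ends at the trailing element. */
            unsigned long end = (m + 1 < num_nodes) ? nodes[2 * m + 3]
                                                    : nodes[2 * num_nodes];
            *out = decomp + begin;
            *out_len = (size_t)(end - begin);
            return 1;
        }
    }
    return 0;
}
```

Here num_nodes corresponds to NumDecompNodes, and decomp points at the Decomp[] array that follows DecompNodes[] in the file.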
   1.196 +
   1.197 +COMBINING CLASSES
   1.198 +=================
   1.199 +
   1.200 +The fourth data file is called "cmbcl.dat" and contains the characters with
   1.201 +non-zero combining classes.
   1.202 +
   1.203 +The format for the binary form of this table is:
   1.204 +
   1.205 +  unsigned short ByteOrderMark
   1.206 +  unsigned short NumCCLNodes
   1.207 +  unsigned long  Bytes
   1.208 +  unsigned long  CCLNodes[NumCCLNodes * 3]
   1.209 +
   1.210 +  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
   1.211 +  same way as described in the CHARACTER PROPERTIES section.
   1.212 +
   1.213 +  The CCLNodes[] array consists of groups of three unsigned longs.  The first
   1.214 +  and second are the beginning and ending of a range and the third is the
   1.215 +  combining class of that range.
   1.216 +
   1.217 +  If a character is not found in this table, then the combining class is
   1.218 +  assumed to be 0.
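A combining class lookup over CCLNodes[] could be sketched as follows. The function name combining_class is hypothetical, and the binary search assumes the ranges are stored in increasing order and do not overlap, which the text above does not state explicitly. The data is assumed to be in host byte order already.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the combining class of `code`, or 0 if no range contains it.
 * nodes[] holds num_nodes groups of (first, last, class).
 * Assumption: groups are sorted by `first` and do not overlap. */
unsigned long combining_class(const unsigned long *nodes, size_t num_nodes,
                              unsigned long code)
{
    size_t l = 0, r = num_nodes;

    assert(nodes != NULL || num_nodes == 0);

    while (l < r) {
        size_t m = l + (r - l) / 2;
        const unsigned long *node = nodes + (3 * m);

        if (code < node[0])
            r = m;
        else if (code > node[1])
            l = m + 1;
        else
            return node[2];
    }
    return 0; /* not listed: combining class 0 */
}
```

Here num_nodes corresponds to NumCCLNodes from the header.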
   1.219 +
   1.220 +  It is important to note that at most 65535 distinct ranges plus combining
   1.221 +  class can be specified because NumCCLNodes is usually a 16-bit number.
   1.222 +
   1.223 +NUMBER TABLE
   1.224 +============
   1.225 +
   1.226 +The final data file is called "num.dat" and contains the characters that have
   1.227 +a numeric value associated with them.
   1.228 +
   1.229 +The format for the binary form of the table is:
   1.230 +
   1.231 +  unsigned short ByteOrderMark
   1.232 +  unsigned short NumNumberNodes
   1.233 +  unsigned long  Bytes
   1.234 +  unsigned long  NumberNodes[NumNumberNodes]
   1.235 +  unsigned short ValueNodes[(Bytes - (NumNumberNodes * sizeof(unsigned long)))
   1.236 +                            / sizeof(short)]
   1.237 +
   1.238 +  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
   1.239 +  same way as described in the CHARACTER PROPERTIES section.
   1.240 +
   1.241 +  The NumberNodes array contains pairs of values, the first of which is the
   1.242 +  character code and the second an index into the ValueNodes array.  The
   1.243 +  ValueNodes array contains pairs of integers which represent the numerator
   1.244 +  and denominator of the numeric value of the character.  If the character
   1.245 +  happens to map to an integer, both the values in ValueNodes will be the
   1.246 +  same.
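
A numeric value lookup might be sketched as below. The function name number_find and its parameters are hypothetical. The sketch assumes NumberNodes[] is sorted by character code so that a binary search applies, that the stored index counts ValueNodes[] elements rather than bytes or pairs, and that the data is already in host byte order. Since NumberNodes[] has NumNumberNodes elements arranged in pairs, the number of pairs is taken to be NumNumberNodes / 2.

```c
#include <assert.h>
#include <stddef.h>

/* Looks up the numeric value of `code`.
 * nodes[] holds num_elems values arranged as (character code, index) pairs;
 * values[] holds (numerator, denominator) pairs.
 * Assumption: the index counts elements of values[].
 * Returns 1 and fills the outputs on success, 0 if `code` is not listed. */
int number_find(const unsigned long *nodes, size_t num_elems,
                const unsigned short *values, unsigned long code,
                unsigned short *numerator, unsigned short *denominator)
{
    size_t l = 0, r = num_elems / 2; /* number of pairs */

    assert(nodes != NULL && values != NULL);
    assert(numerator != NULL && denominator != NULL);

    while (l < r) {
        size_t m = l + (r - l) / 2;
        unsigned long c = nodes[2 * m];

        if (code < c) {
            r = m;
        } else if (code > c) {
            l = m + 1;
        } else {
            const unsigned short *v = values + nodes[2 * m + 1];

            *numerator = v[0];
            *denominator = v[1];
            return 1;
        }
    }
    return 0;
}
```

Following the convention described above, a caller can treat the character as an integer when the two returned values are equal, in which case the numerator is the integer value.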
