The Tor Browser: intl/unicharutil/tools/format.txt@fc2d59ddac77

#
# $Id: format.txt,v 1.1 1999/01/08 00:19:20 ftang%netscape.com Exp $
#

CHARACTER DATA
==============

This package generates some data files that contain character properties useful
for text processing.

CHARACTER PROPERTIES
====================

The first data file is called "ctype.dat" and contains a compressed form of
the character properties found in the Unicode Character Database (UCDB).
Additional properties can be specified in limited UCDB format in another file
to avoid modifying the original UCDB.

The following is a property name and code table to be used with the character
data:

NAME CODE DESCRIPTION
---------------------
Mn   0    Mark, Non-Spacing
Mc   1    Mark, Spacing Combining
Me   2    Mark, Enclosing
Nd   3    Number, Decimal Digit
Nl   4    Number, Letter
No   5    Number, Other
Zs   6    Separator, Space
Zl   7    Separator, Line
Zp   8    Separator, Paragraph
Cc   9    Other, Control
Cf   10   Other, Format
Cs   11   Other, Surrogate
Co   12   Other, Private Use
Cn   13   Other, Not Assigned
Lu   14   Letter, Uppercase
Ll   15   Letter, Lowercase
Lt   16   Letter, Titlecase
Lm   17   Letter, Modifier
Lo   18   Letter, Other
Pc   19   Punctuation, Connector
Pd   20   Punctuation, Dash
Ps   21   Punctuation, Open
Pe   22   Punctuation, Close
Po   23   Punctuation, Other
Sm   24   Symbol, Math
Sc   25   Symbol, Currency
Sk   26   Symbol, Modifier
So   27   Symbol, Other
L    28   Left-To-Right
R    29   Right-To-Left
EN   30   European Number
ES   31   European Number Separator
ET   32   European Number Terminator
AN   33   Arabic Number
CS   34   Common Number Separator
B    35   Block Separator
S    36   Segment Separator
WS   37   Whitespace
ON   38   Other Neutrals
Pi   47   Punctuation, Initial
Pf   48   Punctuation, Final
#
# Implementation specific properties.
#
Cm   39   Composite
Nb   40   Non-Breaking
Sy   41   Symmetric (characters which are part of open/close pairs)
Hd   42   Hex Digit
Qm   43   Quote Mark
Mr   44   Mirroring
Ss   45   Space, Other (controls viewed as spaces in ctype isspace())
Cp   46   Defined character

The actual binary data is formatted as follows:

  Assumptions: unsigned short is at least 16-bits in size and unsigned long
               is at least 32-bits in size.

    unsigned short ByteOrderMark
    unsigned short OffsetArraySize
    unsigned long  Bytes
    unsigned short Offsets[OffsetArraySize + 1]
    unsigned long  Ranges[N], N = value of Offsets[OffsetArraySize]

  The Bytes field provides the total byte count used for the Offsets[] and
  Ranges[] arrays.  The Offsets[] array is aligned on a 4-byte boundary and
  there is always one extra node on the end to hold the final index of the
  Ranges[] array.  The Ranges[] array contains pairs of 4-byte values
  representing a range of Unicode characters.  The pairs are arranged in
  increasing order by the first character code in the range.

  Determining whether a particular character has a property requires a
  simple binary search to see if the character falls in any of the ranges
  for that property.
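
  The search can be sketched in C as follows.  This is only an illustration:
  the function name prop_lookup is hypothetical, and it assumes (the text
  above does not say so explicitly) that the ranges for property p occupy
  Ranges[Offsets[p]] through Ranges[Offsets[p + 1] - 1] and that both ends
  of each range are inclusive.

```c
#include <stddef.h>

/*
 * Sketch: return 1 if `code` falls in any range of property `prop`.
 * Assumption: the ranges for property p are stored as (first, last)
 * pairs in ranges[offsets[p]] .. ranges[offsets[p + 1] - 1], with
 * inclusive bounds, sorted by the first code of each pair.
 */
static int prop_lookup(const unsigned short *offsets,
                       const unsigned long *ranges,
                       unsigned prop, unsigned long code)
{
    size_t lo = offsets[prop] / 2;      /* index of the first pair */
    size_t hi = offsets[prop + 1] / 2;  /* one past the last pair */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (code < ranges[2 * mid])
            hi = mid;                   /* before this range */
        else if (code > ranges[2 * mid + 1])
            lo = mid + 1;               /* after this range */
        else
            return 1;                   /* inside this range */
    }
    return 0;
}
```

  With the property codes from the table above, a call such as
  prop_lookup(offsets, ranges, 3, code) would test for Nd (decimal digit).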

  If the ByteOrderMark is equal to 0xFFFE, then the data was generated on a
  machine with a different endian order and the values must be byte-swapped.

  To swap a 16-bit value:
     c = (c >> 8) | ((c & 0xff) << 8)

  To swap a 32-bit value:
     c = ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
         (((c >> 16) & 0xff) << 8) | (c >> 24)
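
  As a sketch, the two swaps can be wrapped as C helpers.  The names swap16,
  swap32 and needs_swap are illustrative and not part of this package.  The
  extra "& 0xff" masks are an addition: they keep the result within 16 or 32
  bits on platforms where unsigned short or unsigned long is wider than that.

```c
/* Swap the two bytes of a 16-bit value. */
static unsigned short swap16(unsigned short c)
{
    return (unsigned short)(((c >> 8) & 0xff) | ((c & 0xff) << 8));
}

/* Reverse the four bytes of a 32-bit value held in an unsigned long. */
static unsigned long swap32(unsigned long c)
{
    return ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
           (((c >> 16) & 0xff) << 8) | ((c >> 24) & 0xff);
}

/* Return 1 if a file whose ByteOrderMark reads as `bom` must be swapped. */
static int needs_swap(unsigned short bom)
{
    return bom == 0xFFFE;
}
```

  A reader would check needs_swap() once on the header and, if it returns 1,
  apply swap16() to every unsigned short field and swap32() to every
  unsigned long field before using them.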

CASE MAPPINGS
=============

The next data file is called "case.dat" and contains three case mapping tables
in the following order: upper, lower, and title case.  Each table is in
increasing order by character code and each mapping contains 3 unsigned longs
which represent the possible mappings.

The format for the binary form of these tables is:

  unsigned short ByteOrderMark
  unsigned short NumMappingNodes, count of all mapping nodes
  unsigned short CaseTableSizes[2], upper and lower mapping node counts
  unsigned long  CaseTables[NumMappingNodes * 3], 3 values per mapping node

  The starting indexes of the case tables are calculated as follows:

    UpperIndex = 0;
    LowerIndex = CaseTableSizes[0] * 3;
    TitleIndex = LowerIndex + CaseTableSizes[1] * 3;

  The order of the fields for the three tables is:

    Upper case
    ----------
    unsigned long upper;
    unsigned long lower;
    unsigned long title;

    Lower case
    ----------
    unsigned long lower;
    unsigned long upper;
    unsigned long title;

    Title case
    ----------
    unsigned long title;
    unsigned long upper;
    unsigned long lower;

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  Because the tables are in increasing order by character code, locating a
  mapping requires a simple binary search on one of the 3 codes that make up
  each node.
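
  A minimal sketch of such a lookup is shown here.  The names case_find and
  to_upper are hypothetical, and the sketch assumes that each table is sorted
  and searched on the first field of its nodes (the text only says the search
  uses one of the three codes).  It uses the LowerIndex calculation and the
  lower-case field order (lower, upper, title) given above.

```c
#include <stddef.h>

/*
 * Sketch: find the 3-value node whose first field equals `code`.
 * `table` points at the first node of one case table and `count` is
 * the number of nodes in that table.  Assumption: the table is sorted
 * by the first field of each node.  Returns NULL if there is no node.
 */
static const unsigned long *case_find(const unsigned long *table,
                                      size_t count, unsigned long code)
{
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const unsigned long *node = table + mid * 3;

        if (code < node[0])
            hi = mid;
        else if (code > node[0])
            lo = mid + 1;
        else
            return node;
    }
    return NULL;
}

/* Map a lower-case code to upper case; unmapped codes come back as-is. */
static unsigned long to_upper(const unsigned long *case_tables,
                              const unsigned short sizes[2],
                              unsigned long code)
{
    size_t lower_index = (size_t)sizes[0] * 3;   /* LowerIndex */
    const unsigned long *node =
        case_find(case_tables + lower_index, sizes[1], code);

    /* Lower-case node fields are: lower, upper, title. */
    return node ? node[1] : code;
}
```

  The same case_find() helper would serve the upper and title tables by
  passing UpperIndex or TitleIndex and the matching node count.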

  Note that NumMappingNodes is an unsigned short, nominally 16 bits, so there
  can be at most 65535 mapping nodes in total.  Divided evenly, that allows
  21845 nodes for each case mapping table.  An individual table may hold more
  or fewer than 21845 nodes, as long as the total stays within the limit.

DECOMPOSITIONS
==============

The next data file is called "decomp.dat" and contains the decomposition data
for all characters whose decompositions contain more than one character and
are *not* compatibility decompositions.  Compatibility decompositions are
signaled in the UCDB format by the use of the <compat> tag in the
decomposition field.  Each list of character codes represents a full
decomposition of a composite character.  The nodes are arranged in increasing
order by character code.

The format for the binary form of this table is:

  unsigned short ByteOrderMark
  unsigned short NumDecompNodes, count of all decomposition nodes
  unsigned long  Bytes
  unsigned long  DecompNodes[(NumDecompNodes * 2) + 1]
  unsigned long  Decomp[N], N = sum of all counts in DecompNodes[]

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  The DecompNodes[] array consists of pairs of unsigned longs, the first of
  which is the character code and the second is the initial index of the list
  of character codes representing the decomposition.

  Locating the decomposition of a composite character requires a binary search
  for a character code in the DecompNodes[] array and using its index to
  locate the start of the decomposition.  The length of the decomposition list
  is the index stored in the following node of DecompNodes[] minus the
  current index.
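
  The lookup might be sketched as below.  The name decomp_find is
  hypothetical.  The sketch also makes an assumption the text does not spell
  out: that the one extra element at the end of DecompNodes[] holds the total
  number of values in Decomp[], so it can serve as the end index for the last
  node.

```c
#include <stddef.h>

/*
 * Sketch: find the decomposition of `code`.
 * `nodes` is DecompNodes[] (num_nodes pairs plus one trailing element),
 * `decomp` is Decomp[].  On success, returns a pointer to the first code
 * of the decomposition and stores its length in *len.  Returns NULL and
 * sets *len to 0 if `code` has no entry.
 * Assumption: nodes[num_nodes * 2] is the end index of the last list.
 */
static const unsigned long *decomp_find(const unsigned long *nodes,
                                        size_t num_nodes,
                                        const unsigned long *decomp,
                                        unsigned long code, size_t *len)
{
    size_t lo = 0, hi = num_nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        unsigned long key = nodes[mid * 2];

        if (code < key) {
            hi = mid;
        } else if (code > key) {
            lo = mid + 1;
        } else {
            unsigned long start = nodes[mid * 2 + 1];
            unsigned long end = (mid + 1 < num_nodes)
                                    ? nodes[(mid + 1) * 2 + 1]
                                    : nodes[num_nodes * 2];

            *len = (size_t)(end - start);
            return decomp + start;
        }
    }
    *len = 0;
    return NULL;
}
```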

COMBINING CLASSES
=================

The fourth data file is called "cmbcl.dat" and contains the characters with
non-zero combining classes.

The format for the binary form of this table is:

  unsigned short ByteOrderMark
  unsigned short NumCCLNodes
  unsigned long  Bytes
  unsigned long  CCLNodes[NumCCLNodes * 3]

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  The CCLNodes[] array consists of groups of three unsigned longs.  The first
  and second are the beginning and ending of a range and the third is the
  combining class of that range.

  If a character is not found in this table, then the combining class is
  assumed to be 0.
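
  A sketch of the lookup, including the default of 0, is given here.  The
  name combining_class is hypothetical, and the sketch assumes that the nodes
  are sorted by the start of their range and that both ends of a range are
  inclusive; neither point is stated in this section.

```c
#include <stddef.h>

/*
 * Sketch: return the combining class of `code`, or 0 if no node in
 * CCLNodes[] covers it.  Each node is (first, last, class).
 * Assumption: nodes are sorted by `first` and ranges do not overlap.
 */
static unsigned long combining_class(const unsigned long *ccl_nodes,
                                     size_t num_nodes, unsigned long code)
{
    size_t lo = 0, hi = num_nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const unsigned long *node = ccl_nodes + mid * 3;

        if (code < node[0])
            hi = mid;           /* before this range */
        else if (code > node[1])
            lo = mid + 1;       /* after this range */
        else
            return node[2];     /* inside: third value is the class */
    }
    return 0;                   /* not found: class is assumed to be 0 */
}
```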

  Note that at most 65535 nodes (a range plus its combining class) can be
  specified, because NumCCLNodes is usually a 16-bit number.

NUMBER TABLE
============

The final data file is called "num.dat" and contains the characters that have
a numeric value associated with them.

The format for the binary form of the table is:

  unsigned short ByteOrderMark
  unsigned short NumNumberNodes
  unsigned long  Bytes
  unsigned long  NumberNodes[NumNumberNodes]
  unsigned short ValueNodes[(Bytes - (NumNumberNodes * sizeof(unsigned long)))
                            / sizeof(short)]

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  The NumberNodes array contains pairs of values, the first of which is the
  character code and the second an index into the ValueNodes array.  The
  ValueNodes array contains pairs of integers which represent the numerator
  and denominator of the numeric value of the character.  If the character
  happens to map to an integer, both the values in ValueNodes will be the
  same.
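
  A lookup over these two arrays could be sketched as follows.  The name
  number_lookup is hypothetical.  The sketch assumes that NumberNodes[] is
  sorted by character code, that NumNumberNodes counts array elements (so
  there are NumNumberNodes / 2 pairs, as the declaration above suggests), and
  that the stored index points at the first value of a pair in ValueNodes[].

```c
#include <stddef.h>

/*
 * Sketch: look up the numeric value of `code`.
 * `number_nodes` holds (code, index) pairs in `num_nodes` elements;
 * `value_nodes` holds (numerator, denominator) pairs.
 * Returns 1 and fills in both outputs on success, 0 if not found.
 * Assumption: pairs are sorted by character code.
 */
static int number_lookup(const unsigned long *number_nodes, size_t num_nodes,
                         const unsigned short *value_nodes,
                         unsigned long code,
                         int *numerator, int *denominator)
{
    size_t lo = 0, hi = num_nodes / 2;   /* search over pairs */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        unsigned long key = number_nodes[mid * 2];

        if (code < key) {
            hi = mid;
        } else if (code > key) {
            lo = mid + 1;
        } else {
            const unsigned short *v = value_nodes + number_nodes[mid * 2 + 1];

            *numerator = v[0];
            *denominator = v[1];
            return 1;
        }
    }
    return 0;
}
```

  Following the convention described above, a caller can treat a result with
  numerator equal to denominator as an integer with that value.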
