The Tor Browser: intl/unicharutil/tools/format.txt@6474c204b198 (annotated)

author: Michael Schloh von Bennewitz <michael@schloh.com>
date: Wed, 31 Dec 2014 06:09:35 +0100
changeset 0: 6474c204b198
permissions: -rw-r--r--

Cloned upstream origin tor-browser at tor-browser-31.3.0esr-4.5-1-build1
revision ID fc1c9ff7c1b2defdbc039f12214767608f46423f for hacking purpose.

 #
 # $Id: format.txt,v 1.1 1999/01/08 00:19:20 ftang%netscape.com Exp $
 #

 CHARACTER DATA
 ==============
 This package generates some data files that contain character properties useful
 for text processing.

 CHARACTER PROPERTIES
 ====================
 The first data file is called "ctype.dat" and contains a compressed form of
 the character properties found in the Unicode Character Database (UCDB).
 Additional properties can be specified in limited UCDB format in another file
 to avoid modifying the original UCDB.
 The following is a property name and code table to be used with the character
 data:
 NAME CODE DESCRIPTION
 ---------------------
 Mn   0    Mark, Non-Spacing
 Mc   1    Mark, Spacing Combining
 Me   2    Mark, Enclosing
 Nd   3    Number, Decimal Digit
 Nl   4    Number, Letter
 No   5    Number, Other
 Zs   6    Separator, Space
 Zl   7    Separator, Line
 Zp   8    Separator, Paragraph
 Cc   9    Other, Control
 Cf   10   Other, Format
 Cs   11   Other, Surrogate
 Co   12   Other, Private Use
 Cn   13   Other, Not Assigned
 Lu   14   Letter, Uppercase
 Ll   15   Letter, Lowercase
 Lt   16   Letter, Titlecase
 Lm   17   Letter, Modifier
 Lo   18   Letter, Other
 Pc   19   Punctuation, Connector
 Pd   20   Punctuation, Dash
 Ps   21   Punctuation, Open
 Pe   22   Punctuation, Close
 Po   23   Punctuation, Other
 Sm   24   Symbol, Math
 Sc   25   Symbol, Currency
 Sk   26   Symbol, Modifier
 So   27   Symbol, Other
 L    28   Left-To-Right
 R    29   Right-To-Left
 EN   30   European Number
 ES   31   European Number Separator
 ET   32   European Number Terminator
 AN   33   Arabic Number
 CS   34   Common Number Separator
 B    35   Block Separator
 S    36   Segment Separator
 WS   37   Whitespace
 ON   38   Other Neutrals
 Pi   47   Punctuation, Initial
 Pf   48   Punctuation, Final
 #
 # Implementation specific properties.
 #
 Cm   39   Composite
 Nb   40   Non-Breaking
 Sy   41   Symmetric (characters which are part of open/close pairs)
 Hd   42   Hex Digit
 Qm   43   Quote Mark
 Mr   44   Mirroring
 Ss   45   Space, Other (controls viewed as spaces in ctype isspace())
 Cp   46   Defined character
 The actual binary data is formatted as follows:
   Assumptions: unsigned short is at least 16 bits in size and unsigned long
                is at least 32 bits in size.
     unsigned short ByteOrderMark
     unsigned short OffsetArraySize
     unsigned long  Bytes
     unsigned short Offsets[OffsetArraySize + 1]
     unsigned long  Ranges[N], N = value of Offsets[OffsetArraySize]
   The Bytes field provides the total byte count used for the Offsets[] and
   Ranges[] arrays.  The Offsets[] array is aligned on a 4-byte boundary and
   there is always one extra node on the end to hold the final index of the
   Ranges[] array.  The Ranges[] array contains pairs of 4-byte values
   representing a range of Unicode characters.  The pairs are arranged in
   increasing order by the first character code in the range.
   Determining if a particular character is in the property list requires a
   simple binary search to determine if a character is in any of the ranges
   for the property.
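
   The binary search described above can be sketched as follows.  This is a
   minimal illustration, not the package's own code.  It assumes the caller
   has already used Offsets[] to pick out the slice of Ranges[] that belongs
   to one property, and that each pair is an inclusive [first, last] range.
   The name range_contains and its parameters are hypothetical.

```c
#include <stddef.h>

/*
 * Returns 1 if code lies inside one of the num_pairs ranges, 0 otherwise.
 * ranges[] holds num_pairs (first, last) pairs sorted by first code.
 * ASSUMPTION: both ends of a pair are inclusive.
 */
static int range_contains(const unsigned long *ranges, size_t num_pairs,
                          unsigned long code)
{
    size_t lo = 0, hi = num_pairs;      /* half-open interval of pairs */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (code < ranges[2 * mid])
            hi = mid;                   /* code is below this range */
        else if (code > ranges[2 * mid + 1])
            lo = mid + 1;               /* code is above this range */
        else
            return 1;                   /* first <= code <= last */
    }
    return 0;
}
```

   Searching by pair index rather than by array index keeps the midpoint on
   the first element of a pair without any extra adjustment.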
   If the ByteOrderMark is equal to 0xFFFE, then the data was generated on a
   machine with a different endian order and the values must be byte-swapped.
   To swap a 16-bit value:
      c = (c >> 8) | ((c & 0xff) << 8)
   To swap a 32-bit value:
      c = ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
          (((c >> 16) & 0xff) << 8) | (c >> 24)
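
   The two swap expressions can be wrapped in small helpers, as in the sketch
   below.  The function names are hypothetical.  The extra masks are an
   addition so that the result stays correct on platforms where unsigned
   short or unsigned long is wider than the 16 or 32 bits assumed above.

```c
/* Swap the two low-order bytes of a 16-bit value. */
static unsigned short swap16(unsigned short c)
{
    return (unsigned short)(((c >> 8) & 0xff) | ((c & 0xff) << 8));
}

/* Reverse the four low-order bytes of a 32-bit value. */
static unsigned long swap32(unsigned long c)
{
    return ((c & 0xffUL) << 24) | (((c >> 8) & 0xffUL) << 16) |
           (((c >> 16) & 0xffUL) << 8) | ((c >> 24) & 0xffUL);
}

/* Returns 1 when the ByteOrderMark read from a file calls for swapping. */
static int needs_swap(unsigned short bom)
{
    return bom == 0xFFFE;
}
```

   A reader would test needs_swap() once on the header and, if it returns 1,
   pass every 16-bit field through swap16() and every 32-bit field through
   swap32() before use.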

 CASE MAPPINGS
 =============
 The next data file is called "case.dat" and contains three case mapping tables
 in the following order: upper, lower, and title case.  Each table is in
 increasing order by character code and each mapping contains 3 unsigned longs
 which represent the possible mappings.
 The format for the binary form of these tables is:
   unsigned short ByteOrderMark
   unsigned short NumMappingNodes, count of all mapping nodes
   unsigned short CaseTableSizes[2], upper and lower mapping node counts
   unsigned long  CaseTables[NumMappingNodes]
   The starting indexes of the case tables are calculated as follows:
     UpperIndex = 0;
     LowerIndex = CaseTableSizes[0] * 3;
     TitleIndex = LowerIndex + CaseTableSizes[1] * 3;
   The order of the fields for the three tables is:
     Upper case
     ----------
     unsigned long upper;
     unsigned long lower;
     unsigned long title;
     Lower case
     ----------
     unsigned long lower;
     unsigned long upper;
     unsigned long title;
     Title case
     ----------
     unsigned long title;
     unsigned long upper;
     unsigned long lower;
   If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
   same way as described in the CHARACTER PROPERTIES section.
   Because the tables are in increasing order by character code, locating a
   mapping requires a simple binary search on one of the 3 codes that make up
   each node.
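
   The index arithmetic and the search can be sketched together as below.
   This is only an illustration under stated assumptions: the search key is
   taken to be the first field of each node (the code the table is sorted
   by), and the names case_index, case_table_starts and case_find are
   hypothetical.

```c
#include <stddef.h>

/* Start indexes of the three tables inside CaseTables[]. */
struct case_index { size_t upper, lower, title; };

/* Applies the UpperIndex/LowerIndex/TitleIndex formulas to CaseTableSizes. */
static struct case_index case_table_starts(const unsigned short sizes[2])
{
    struct case_index ix;

    ix.upper = 0;
    ix.lower = (size_t)sizes[0] * 3;
    ix.title = ix.lower + (size_t)sizes[1] * 3;
    return ix;
}

/*
 * Binary search over num_nodes nodes of 3 unsigned longs each.
 * ASSUMPTION: node[0] is the key the table is sorted by.
 * Returns a pointer to the matching node, or NULL if there is no mapping.
 */
static const unsigned long *case_find(const unsigned long *table,
                                      size_t num_nodes, unsigned long code)
{
    size_t lo = 0, hi = num_nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const unsigned long *node = table + 3 * mid;

        if (code < node[0])
            hi = mid;
        else if (code > node[0])
            lo = mid + 1;
        else
            return node;
    }
    return NULL;
}
```

   To lowercase an upper case character, for example, a caller would search
   the upper case table starting at CaseTables + ix.upper and read the second
   field of the returned node.  The title table's node count is not stored
   directly; presumably it is NumMappingNodes minus the two stored counts.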
   It is important to note that there can only be 65536 mapping nodes, which
   divided into 3 portions allows 21845 nodes for each case mapping table.  The
   distribution of mappings may be more or less than 21845 per table, but only
   65536 are allowed in total.

 DECOMPOSITIONS
 ==============
 The next data file is called "decomp.dat" and contains the decomposition data
 for all characters whose decompositions contain more than one character and
 are *not* compatibility decompositions.  Compatibility decompositions are
 signaled in the UCDB format by the use of the <compat> tag in the
 decomposition field.  Each list of character codes represents a full
 decomposition of a composite character.  The nodes are arranged in increasing
 order by character code.
 The format for the binary form of this table is:
   unsigned short ByteOrderMark
   unsigned short NumDecompNodes, count of all decomposition nodes
   unsigned long  Bytes
   unsigned long  DecompNodes[(NumDecompNodes * 2) + 1]
   unsigned long  Decomp[N], N = sum of all counts in DecompNodes[]
   If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
   same way as described in the CHARACTER PROPERTIES section.
   The DecompNodes[] array consists of pairs of unsigned longs, the first of
   which is the character code and the second is the initial index of the list
   of character codes representing the decomposition.
   Locating the decomposition of a composite character requires a binary search
   for a character code in the DecompNodes[] array and using its index to
   locate the start of the decomposition.  The length of the decomposition list
   is the index stored in the following element of DecompNodes[] minus the
   current index.
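
   A lookup along these lines might look like the sketch below.  It is not
   the package's own routine.  It assumes that the one extra element at the
   end of DecompNodes[] holds the end index of the last decomposition list,
   which the text above does not state explicitly.  The name decomp_find and
   its parameters are hypothetical.

```c
#include <stddef.h>

/*
 * nodes[] holds num_nodes (code, start index) pairs followed by one extra
 * element; decomp[] holds the decomposition lists back to back.
 * On success stores the list length in *len and returns a pointer to the
 * first character code of the list; returns NULL if code is not present.
 */
static const unsigned long *decomp_find(const unsigned long *nodes,
                                        size_t num_nodes,
                                        const unsigned long *decomp,
                                        unsigned long code, size_t *len)
{
    size_t lo = 0, hi = num_nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (code < nodes[2 * mid]) {
            hi = mid;
        } else if (code > nodes[2 * mid]) {
            lo = mid + 1;
        } else {
            unsigned long start = nodes[2 * mid + 1];
            /* ASSUMPTION: the trailing element ends the last list. */
            unsigned long end = (mid + 1 < num_nodes)
                                    ? nodes[2 * mid + 3]
                                    : nodes[2 * num_nodes];

            *len = (size_t)(end - start);
            return decomp + start;
        }
    }
    return NULL;
}
```

   The special case for the last node keeps the read inside the
   (NumDecompNodes * 2) + 1 elements that the format declares.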

 COMBINING CLASSES
 =================
 The fourth data file is called "cmbcl.dat" and contains the characters with
 non-zero combining classes.
 The format for the binary form of this table is:
   unsigned short ByteOrderMark
   unsigned short NumCCLNodes
   unsigned long  Bytes
   unsigned long  CCLNodes[NumCCLNodes * 3]
   If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
   same way as described in the CHARACTER PROPERTIES section.
   The CCLNodes[] array consists of groups of three unsigned longs.  The first
   and second are the beginning and ending of a range and the third is the
   combining class of that range.
   If a character is not found in this table, then the combining class is
   assumed to be 0.
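
   The lookup with its default of 0 can be sketched as follows.  The sketch
   assumes the triples are sorted by the start of their range and that both
   ends of a range are inclusive; neither is spelled out above.  The name
   combining_class is hypothetical.

```c
#include <stddef.h>

/*
 * ccl[] holds num_nodes triples: first code, last code, combining class.
 * ASSUMPTION: triples are sorted by first code and do not overlap.
 * Returns the combining class of code, or 0 when no triple covers it.
 */
static unsigned long combining_class(const unsigned long *ccl,
                                     size_t num_nodes, unsigned long code)
{
    size_t lo = 0, hi = num_nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const unsigned long *node = ccl + 3 * mid;

        if (code < node[0])
            hi = mid;
        else if (code > node[1])
            lo = mid + 1;
        else
            return node[2];
    }
    return 0;   /* not in the table: combining class 0 */
}
```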
   It is important to note that at most 65536 nodes (a range plus its combining
   class) can be specified because NumCCLNodes is usually a 16-bit number.

 NUMBER TABLE
 ============
 The final data file is called "num.dat" and contains the characters that have
 a numeric value associated with them.
 The format for the binary form of the table is:
   unsigned short ByteOrderMark
   unsigned short NumNumberNodes
   unsigned long  Bytes
   unsigned long  NumberNodes[NumNumberNodes]
   unsigned short ValueNodes[(Bytes - (NumNumberNodes * sizeof(unsigned long)))
                             / sizeof(short)]
   If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
   same way as described in the CHARACTER PROPERTIES section.
   The NumberNodes array contains pairs of values, the first of which is the
   character code and the second an index into the ValueNodes array.  The
   ValueNodes array contains pairs of integers which represent the numerator
   and denominator of the numeric value of the character.  If the character
   happens to map to an integer, both the values in ValueNodes will be the
   same.
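
   A lookup in these two arrays might be sketched as below.  Two readings of
   the format are assumed here and should be checked against real data:
   NumNumberNodes counts unsigned longs (as the ValueNodes size formula
   suggests), so there are NumNumberNodes / 2 pairs, and the stored index is
   an element index into ValueNodes rather than a pair index.  The names
   ucd_number and number_find are hypothetical.

```c
#include <stddef.h>

/* Numeric value of a character as a numerator/denominator pair. */
struct ucd_number {
    unsigned short numerator;
    unsigned short denominator;
};

/*
 * nodes[] holds num_longs unsigned longs arranged as (code, index) pairs
 * sorted by code; values[] holds (numerator, denominator) pairs.
 * Returns 1 and fills *out if code has a numeric value, 0 otherwise.
 */
static int number_find(const unsigned long *nodes, size_t num_longs,
                       const unsigned short *values, unsigned long code,
                       struct ucd_number *out)
{
    size_t lo = 0, hi = num_longs / 2;  /* number of (code, index) pairs */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (code < nodes[2 * mid]) {
            hi = mid;
        } else if (code > nodes[2 * mid]) {
            lo = mid + 1;
        } else {
            /* ASSUMPTION: the index counts elements of values[]. */
            const unsigned short *v = values + nodes[2 * mid + 1];

            out->numerator = v[0];
            out->denominator = v[1];
            return 1;
        }
    }
    return 0;
}
```

   Several characters can share one ValueNodes pair, since each node only
   stores an index into that array.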
