The Tor Browser: comparison intl/unicharutil/tools/format.txt

--1:000000000000
+:324fd6b23a08
+#
+# $Id: format.txt,v 1.1 1999/01/08 00:19:20 ftang%netscape.com Exp $
+#
+CHARACTER DATA
+==============
+This package generates some data files that contain character properties useful
+for text processing.
+CHARACTER PROPERTIES
+====================
+The first data file is called "ctype.dat" and contains a compressed form of
+the character properties found in the Unicode Character Database (UCDB).
+Additional properties can be specified in limited UCDB format in another file
+to avoid modifying the original UCDB.
+The following is a property name and code table to be used with the character
+data:
+NAME CODE DESCRIPTION
+---------------------
+Mn   0    Mark, Non-Spacing
+Mc   1    Mark, Spacing Combining
+Me   2    Mark, Enclosing
+Nd   3    Number, Decimal Digit
+Nl   4    Number, Letter
+No   5    Number, Other
+Zs   6    Separator, Space
+Zl   7    Separator, Line
+Zp   8    Separator, Paragraph
+Cc   9    Other, Control
+Cf   10   Other, Format
+Cs   11   Other, Surrogate
+Co   12   Other, Private Use
+Cn   13   Other, Not Assigned
+Lu   14   Letter, Uppercase
+Ll   15   Letter, Lowercase
+Lt   16   Letter, Titlecase
+Lm   17   Letter, Modifier
+Lo   18   Letter, Other
+Pc   19   Punctuation, Connector
+Pd   20   Punctuation, Dash
+Ps   21   Punctuation, Open
+Pe   22   Punctuation, Close
+Po   23   Punctuation, Other
+Sm   24   Symbol, Math
+Sc   25   Symbol, Currency
+Sk   26   Symbol, Modifier
+So   27   Symbol, Other
+L    28   Left-To-Right
+R    29   Right-To-Left
+EN   30   European Number
+ES   31   European Number Separator
+ET   32   European Number Terminator
+AN   33   Arabic Number
+CS   34   Common Number Separator
+B    35   Block Separator
+S    36   Segment Separator
+WS   37   Whitespace
+ON   38   Other Neutrals
+Pi   47   Punctuation, Initial
+Pf   48   Punctuation, Final
+#
+# Implementation specific properties.
+#
+Cm   39   Composite
+Nb   40   Non-Breaking
+Sy   41   Symmetric (characters which are part of open/close pairs)
+Hd   42   Hex Digit
+Qm   43   Quote Mark
+Mr   44   Mirroring
+Ss   45   Space, Other (controls viewed as spaces in ctype isspace())
+Cp   46   Defined character
+The actual binary data is formatted as follows:
+Assumptions: unsigned short is at least 16 bits in size and unsigned long
+is at least 32 bits in size.
+unsigned short ByteOrderMark
+unsigned short OffsetArraySize
+unsigned long  Bytes
+unsigned short Offsets[OffsetArraySize + 1]
+unsigned long  Ranges[N], N = value of Offsets[OffsetArraySize]
+The Bytes field provides the total byte count used for the Offsets[] and
+Ranges[] arrays.  The Offsets[] array is aligned on a 4-byte boundary and
+there is always one extra node on the end to hold the final index of the
+Ranges[] array.  The Ranges[] array contains pairs of 4-byte values
+representing a range of Unicode characters.  The pairs are arranged in
+increasing order by the first character code in the range.
+Determining whether a character has a particular property requires a
+simple binary search of the ranges stored for that property.
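The range search described above can be sketched in C as follows. This is a minimal sketch, not the package's own code: the function name `prop_lookup` is hypothetical, and it assumes that the ranges for property code p occupy Ranges[Offsets[p]] up to (but not including) Ranges[Offsets[p + 1]], with the data already byte-swapped if necessary.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 if `code` lies in one of the ranges stored for property
 * `prop`, 0 otherwise.
 * Assumption: offsets[p] and offsets[p + 1] delimit the slice of
 * ranges[] that belongs to property p, and each range is a
 * (first, last) pair sorted by its first code.
 */
static int prop_lookup(uint32_t code, unsigned prop,
                       const uint16_t *offsets, const uint32_t *ranges)
{
    size_t lo = offsets[prop] / 2;      /* index of the first pair   */
    size_t hi = offsets[prop + 1] / 2;  /* one past the last pair    */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (code < ranges[mid * 2])
            hi = mid;                   /* below this range          */
        else if (code > ranges[mid * 2 + 1])
            lo = mid + 1;               /* above this range          */
        else
            return 1;                   /* inside [first, last]      */
    }
    return 0;
}
```

A property with no ranges has equal neighbouring offsets, so the loop body never runs and the function returns 0.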
+If the ByteOrderMark is equal to 0xFFFE, then the data was generated on a
+machine with a different endian order and the values must be byte-swapped.
+To swap a 16-bit value:
+c = (c >> 8) | ((c & 0xff) << 8)
+To swap a 32-bit value:
+c = ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
+(((c >> 16) & 0xff) << 8) | (c >> 24)
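The two swap expressions above can be wrapped in small helpers. The sketch below uses the fixed-width types from <stdint.h> rather than unsigned short and unsigned long; the function names `swap16`, `swap32`, and `needs_swap` are hypothetical and not part of the package.

```c
#include <stdint.h>

/* Swap the two bytes of a 16-bit value. */
static uint16_t swap16(uint16_t c)
{
    return (uint16_t)((c >> 8) | ((c & 0xff) << 8));
}

/* Swap the four bytes of a 32-bit value. */
static uint32_t swap32(uint32_t c)
{
    return ((c & 0xffu) << 24) | (((c >> 8) & 0xffu) << 16) |
           (((c >> 16) & 0xffu) << 8) | (c >> 24);
}

/* Return 1 if the ByteOrderMark read from a file calls for swapping. */
static int needs_swap(uint16_t byte_order_mark)
{
    return byte_order_mark == 0xFFFE;
}
```

A loader would read the ByteOrderMark first, call `needs_swap` on it, and then pass every remaining 16-bit and 32-bit field through the matching helper.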
+CASE MAPPINGS
+=============
+The next data file is called "case.dat" and contains three case mapping tables
+in the following order: upper, lower, and title case.  Each table is in
+increasing order by character code and each mapping contains 3 unsigned longs
+which represent the possible mappings.
+The format for the binary form of these tables is:
+unsigned short ByteOrderMark
+unsigned short NumMappingNodes, count of all mapping nodes
+unsigned short CaseTableSizes[2], upper and lower mapping node counts
+unsigned long  CaseTables[NumMappingNodes]
+The starting indexes of the case tables are calculated as follows:
+UpperIndex = 0;
+LowerIndex = CaseTableSizes[0] * 3;
+TitleIndex = LowerIndex + CaseTableSizes[1] * 3;
+The order of the fields for the three tables is:
+Upper case
+----------
+unsigned long upper;
+unsigned long lower;
+unsigned long title;
+Lower case
+----------
+unsigned long lower;
+unsigned long upper;
+unsigned long title;
+Title case
+----------
+unsigned long title;
+unsigned long upper;
+unsigned long lower;
+If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
+same way as described in the CHARACTER PROPERTIES section.
+Because the tables are in increasing order by character code, locating a
+mapping requires a simple binary search on one of the 3 codes that make up
+each node.
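As an illustration of the index arithmetic and the search, here is a minimal C sketch. The names `case_find`, `to_lower`, and `to_upper` are hypothetical. It assumes each table is keyed on the first field of its nodes and that a character with no node maps to itself; neither point is stated explicitly above.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Binary-search `count` three-field nodes that start at tbl[start]
 * for a node whose first field equals `code`.
 * Returns a pointer to the node, or NULL if there is none.
 */
static const uint32_t *case_find(const uint32_t *tbl, size_t start,
                                 size_t count, uint32_t code)
{
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const uint32_t *node = tbl + start + mid * 3;
        if (code < node[0])
            hi = mid;
        else if (code > node[0])
            lo = mid + 1;
        else
            return node;
    }
    return NULL;
}

/* Upper table (fields: upper, lower, title) starts at index 0. */
static uint32_t to_lower(const uint32_t *tbl, const uint16_t *sizes,
                         uint32_t code)
{
    const uint32_t *node = case_find(tbl, 0, sizes[0], code);
    return node ? node[1] : code;
}

/* Lower table (fields: lower, upper, title) starts at sizes[0] * 3. */
static uint32_t to_upper(const uint32_t *tbl, const uint16_t *sizes,
                         uint32_t code)
{
    size_t lower_index = (size_t)sizes[0] * 3;
    const uint32_t *node = case_find(tbl, lower_index, sizes[1], code);
    return node ? node[1] : code;
}
```

Here `tbl` stands for CaseTables[] and `sizes` for CaseTableSizes[]. A title-case lookup would use the same helper with a start index of (sizes[0] + sizes[1]) * 3.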
+It is important to note that there can only be 65536 mapping nodes in total,
+which, divided into 3 equal portions, allows 21845 nodes for each case mapping
+table.  Any one table may hold more or fewer than 21845 nodes, but only 65536
+nodes are allowed overall.
+DECOMPOSITIONS
+==============
+The next data file is called "decomp.dat" and contains the decomposition data
+for all characters whose decompositions contain more than one character and
+are *not* compatibility decompositions.  Compatibility decompositions are
+signaled in the UCDB format by the use of the <compat> tag in the
+decomposition field.  Each list of character codes represents a full
+decomposition of a composite character.  The nodes are arranged in increasing
+order by character code.
+The format for the binary form of this table is:
+unsigned short ByteOrderMark
+unsigned short NumDecompNodes, count of all decomposition nodes
+unsigned long  Bytes
+unsigned long  DecompNodes[(NumDecompNodes * 2) + 1]
+unsigned long  Decomp[N], N = sum of all counts in DecompNodes[]
+If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
+same way as described in the CHARACTER PROPERTIES section.
+The DecompNodes[] array consists of pairs of unsigned longs, the first of
+which is the character code and the second is the initial index of the list
+of character codes representing the decomposition.
+Locating the decomposition of a composite character requires a binary search
+for a character code in the DecompNodes[] array and using its index to
+locate the start of the decomposition.  The length of the decomposition list
+is the index stored in the following node of DecompNodes[] minus the current
+index.
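The lookup above might be written as in the following C sketch. The name `decomp_lookup` is hypothetical. It assumes the single extra element at the end of DecompNodes[] holds the end index for the last node, which is one reading of the (NumDecompNodes * 2) + 1 array size rather than something the text states.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Look up the decomposition of `code`.  On success, point *out at the
 * first code of the decomposition and return its length; return 0 if
 * `code` has no entry.
 * Assumption: nodes[num_nodes * 2] is the end index of the last list.
 */
static size_t decomp_lookup(uint32_t code, const uint32_t *nodes,
                            size_t num_nodes, const uint32_t *decomp,
                            const uint32_t **out)
{
    size_t lo = 0, hi = num_nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        uint32_t key = nodes[mid * 2];
        if (code < key) {
            hi = mid;
        } else if (code > key) {
            lo = mid + 1;
        } else {
            uint32_t begin = nodes[mid * 2 + 1];
            /* Next node's index, or the trailing element for the last node. */
            uint32_t end = (mid + 1 < num_nodes) ? nodes[mid * 2 + 3]
                                                 : nodes[num_nodes * 2];
            *out = decomp + begin;
            return (size_t)(end - begin);
        }
    }
    return 0;
}
```

Treating the last node separately keeps every read inside the (NumDecompNodes * 2) + 1 elements of the array.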
+COMBINING CLASSES
+=================
+The fourth data file is called "cmbcl.dat" and contains the characters with
+non-zero combining classes.
+The format for the binary form of this table is:
+unsigned short ByteOrderMark
+unsigned short NumCCLNodes
+unsigned long  Bytes
+unsigned long  CCLNodes[NumCCLNodes * 3]
+If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
+same way as described in the CHARACTER PROPERTIES section.
+The CCLNodes[] array consists of groups of three unsigned longs.  The first
+and second are the beginning and ending of a range and the third is the
+combining class of that range.
+If a character is not found in this table, then the combining class is
+assumed to be 0.
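A combining class lookup over these three-field nodes could look like the sketch below. The name `comb_class` is hypothetical, and the binary search assumes the ranges are sorted in increasing order and do not overlap, which this section does not state explicitly.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the combining class of `code`, or 0 if it falls in no range.
 * Each node is (range start, range end, combining class).
 * Assumption: nodes are sorted by range start and do not overlap.
 */
static uint32_t comb_class(uint32_t code, const uint32_t *nodes,
                           size_t num_nodes)
{
    size_t lo = 0, hi = num_nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const uint32_t *n = nodes + mid * 3;
        if (code < n[0])
            hi = mid;
        else if (code > n[1])
            lo = mid + 1;
        else
            return n[2];
    }
    return 0;   /* not in the table: combining class 0 */
}
```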
+It is important to note that at most 65536 nodes (a range plus its combining
+class) can be specified because NumCCLNodes is usually a 16-bit number.
+NUMBER TABLE
+============
+The final data file is called "num.dat" and contains the characters that have
+a numeric value associated with them.
+The format for the binary form of the table is:
+unsigned short ByteOrderMark
+unsigned short NumNumberNodes
+unsigned long  Bytes
+unsigned long  NumberNodes[NumNumberNodes]
+unsigned short ValueNodes[(Bytes - (NumNumberNodes * sizeof(unsigned long)))
+/ sizeof(short)]
+If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
+same way as described in the CHARACTER PROPERTIES section.
+The NumberNodes array contains pairs of values, the first of which is the
+character code and the second an index into the ValueNodes array.  The
+ValueNodes array contains pairs of integers which represent the numerator
+and denominator of the numeric value of the character.  If the character
+happens to map to an integer, both the values in ValueNodes will be the
+same.
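The number lookup can be sketched in C as follows. The names `num_value` and `number_lookup` are hypothetical. The sketch assumes that NumNumberNodes counts array elements (so the number of pairs is half of it), that the pairs are sorted by character code, and that the stored index points at the numerator with the denominator in the next slot.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t numerator;
    uint16_t denominator;
} num_value;

/*
 * Find the numeric value of `code`.  Returns 1 and fills *out on
 * success, 0 if `code` has no entry.
 * Assumption: nodes[] holds num_elems / 2 (code, index) pairs sorted
 * by code, and values[index], values[index + 1] form one value pair.
 */
static int number_lookup(uint32_t code, const uint32_t *nodes,
                         size_t num_elems, const uint16_t *values,
                         num_value *out)
{
    size_t lo = 0, hi = num_elems / 2;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        uint32_t key = nodes[mid * 2];
        if (code < key) {
            hi = mid;
        } else if (code > key) {
            lo = mid + 1;
        } else {
            const uint16_t *v = values + nodes[mid * 2 + 1];
            out->numerator = v[0];
            out->denominator = v[1];
            return 1;
        }
    }
    return 0;
}
```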
