other-licenses/snappy/src/framing_format.txt

changeset 0
6474c204b198
     1.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     1.2 +++ b/other-licenses/snappy/src/framing_format.txt	Wed Dec 31 06:09:35 2014 +0100
     1.3 @@ -0,0 +1,124 @@
     1.4 +Snappy framing format description
     1.5 +Last revised: 2011-12-15
     1.6 +
     1.7 +This document describes a framing format for Snappy, allowing compression to
     1.8 +files or streams that can then more easily be decompressed without having
     1.9 +to hold the entire stream in memory. It also provides data checksums to
    1.10 +help verify integrity. It does not provide metadata checksums, so it does
    1.11 +not protect against, e.g., all forms of truncation.
    1.12 +
    1.13 +Implementation of the framing format is optional for Snappy compressors and
    1.14 +decompressors; it is not part of the Snappy core specification.
    1.15 +
    1.16 +
    1.17 +1. General structure
    1.18 +
    1.19 +The file consists solely of chunks, lying back-to-back with no padding
    1.20 +in between. Each chunk consists first of a single byte of chunk identifier,
    1.21 +then a two-byte little-endian length of the chunk in bytes (from 0 to 65535,
    1.22 +inclusive), and then the data, if any. The three bytes of chunk header are not
    1.23 +counted in the data length.
    1.24 +
    1.25 +The different chunk types are listed below. The first chunk must always
    1.26 +be the stream identifier chunk (see section 4.1, below). The stream
    1.27 +ends when the file ends -- there is no explicit end-of-file marker.
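
As an illustration of the chunk layout described in this section, the following
is a minimal sketch of reading one chunk header. The struct and function names
are hypothetical and are not defined by this format.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type; the format itself only defines the byte layout. */
struct chunk_header {
  uint8_t type;     /* chunk identifier byte */
  uint16_t length;  /* data length in bytes, excluding the 3 header bytes */
};

/* Parses the 3-byte chunk header at buf. Returns 0 on success, or -1 if
   fewer than 3 bytes are available. */
int parse_chunk_header(const uint8_t *buf, size_t avail,
                       struct chunk_header *out) {
  if (avail < 3) return -1;
  out->type = buf[0];
  /* Two-byte little-endian length: low byte first. */
  out->length = (uint16_t)(buf[1] | (buf[2] << 8));
  return 0;
}
```

A reader would then consume `length` data bytes and repeat until the file ends.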
    1.28 +
    1.29 +
    1.30 +2. File type identification
    1.31 +
    1.32 +The following identifiers for this format are recommended where appropriate.
    1.33 +However, note that none have been registered officially, so this is only to
    1.34 +be taken as a guideline. We use "Snappy framed" to distinguish between this
    1.35 +format and raw Snappy data.
    1.36 +
    1.37 +  File extension:         .sz
    1.38 +  MIME type:              application/x-snappy-framed
    1.39 +  HTTP Content-Encoding:  x-snappy-framed
    1.40 +
    1.41 +
    1.42 +3. Checksum format
    1.43 +
    1.44 +Some chunks have data protected by a checksum (the ones that do will say so
    1.45 +explicitly). The checksums are always masked CRC-32Cs.
    1.46 +
    1.47 +A description of CRC-32C can be found in RFC 3720, section 12.1, with
    1.48 +examples in section B.4.
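
For readers without the RFC at hand, the following is a minimal bitwise sketch
of CRC-32C. It assumes the usual reflected formulation (polynomial constant
0x82f63b78, initial value and final XOR of 0xffffffff); the function name is
hypothetical, and production code would more likely use a table-driven or
hardware-assisted variant.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise (slow but simple) CRC-32C over len bytes of data. */
uint32_t crc32c(const uint8_t *data, size_t len) {
  uint32_t crc = 0xffffffffu;
  for (size_t i = 0; i < len; i++) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; bit++) {
      /* Shift right; apply the reflected polynomial if a 1 bit fell out. */
      crc = (crc & 1u) ? (crc >> 1) ^ 0x82f63b78u : (crc >> 1);
    }
  }
  return crc ^ 0xffffffffu;
}
```

With this formulation, the ASCII string "123456789" yields 0xe3069283, the
commonly cited CRC-32C check value.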
    1.49 +
    1.50 +Checksums are not stored directly, but masked, as checksumming data and
    1.51 +then its own checksum can be problematic. The masking is the same as used
    1.52 +in Apache Hadoop: Rotate the checksum right by 15 bits, then add the constant
    1.53 +0xa282ead8 (using wraparound as normal for unsigned integers). This is
    1.54 +equivalent to the following C code:
    1.55 +
    1.56 +  uint32_t mask_checksum(uint32_t x) {
    1.57 +    return ((x >> 15) | (x << 17)) + 0xa282ead8;
    1.58 +  }
    1.59 +
    1.60 +Note that the masking is reversible.
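
To illustrate the reversibility, the following sketch pairs the masking function
with an inverse. The name unmask_checksum is hypothetical; it simply undoes the
two masking steps in the opposite order.

```c
#include <stdint.h>

/* Masking as given in the text: rotate right by 15 bits, add the constant. */
uint32_t mask_checksum(uint32_t x) {
  return ((x >> 15) | (x << 17)) + 0xa282ead8u;
}

/* Hypothetical inverse: subtract the constant (with unsigned wraparound),
   then rotate left by 15 bits. */
uint32_t unmask_checksum(uint32_t masked) {
  uint32_t rot = masked - 0xa282ead8u;
  return (rot << 15) | (rot >> 17);
}
```

A decoder can either unmask the stored value and compare it with a freshly
computed CRC-32C, or mask the fresh CRC-32C and compare the masked values.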
    1.61 +
    1.62 +The checksum is always stored as a four-byte integer, in little-endian order.
    1.63 +
    1.64 +
    1.65 +4. Chunk types
    1.66 +
    1.67 +The currently supported chunk types are described below. The list may
    1.68 +be extended in the future.
    1.69 +
    1.70 +
    1.71 +4.1. Stream identifier (chunk type 0xff)
    1.72 +
    1.73 +The stream identifier is always the first element in the stream.
    1.74 +It is exactly six bytes long and contains "sNaPpY" in ASCII. This means that
    1.75 +a valid Snappy framed stream always starts with the bytes
    1.76 +
    1.77 +  0xff 0x06 0x00 0x73 0x4e 0x61 0x50 0x70 0x59
    1.78 +
    1.79 +The stream identifier chunk can also appear again later in the stream;
    1.80 +if such a chunk shows up, it should simply be ignored, assuming
    1.81 +it has the right length and contents. This allows for easy concatenation of
    1.82 +compressed files without the need for re-framing.
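
A reader might check for the nine leading bytes as in the following sketch. The
array and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Chunk type 0xff, little-endian length 6, then "sNaPpY" in ASCII. */
static const uint8_t kStreamIdentifier[9] = {
  0xff, 0x06, 0x00, 0x73, 0x4e, 0x61, 0x50, 0x70, 0x59
};

/* Returns 1 if buf begins with the stream identifier chunk, 0 otherwise. */
int starts_with_stream_identifier(const uint8_t *buf, size_t len) {
  return len >= sizeof kStreamIdentifier &&
         memcmp(buf, kStreamIdentifier, sizeof kStreamIdentifier) == 0;
}
```

The same comparison can be reused for identifier chunks that appear later in a
concatenated stream, before ignoring them.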
    1.83 +
    1.84 +
    1.85 +4.2. Compressed data (chunk type 0x00)
    1.86 +
    1.87 +Compressed data chunks contain a normal Snappy compressed bitstream;
    1.88 +see the compressed format specification. The compressed data is preceded by
    1.89 +the masked CRC-32C (see section 3) of the _uncompressed_ data.
    1.90 +
    1.91 +Note that the data portion of the chunk, i.e., the compressed contents,
    1.92 +can be at most 65531 bytes (2^16 - 1, minus the checksum).
    1.93 +However, we place an additional restriction that the uncompressed data
    1.94 +in a chunk must be no longer than 32768 bytes. This allows consumers to
    1.95 +easily use small fixed-size buffers.
    1.96 +
    1.97 +
    1.98 +4.3. Uncompressed data (chunk type 0x01)
    1.99 +
   1.100 +Uncompressed data chunks allow a compressor to send uncompressed,
   1.101 +raw data; this is useful if, for instance, incompressible or
   1.102 +near-incompressible data is detected, and faster decompression is desired.
   1.103 +
   1.104 +As in the compressed chunks, the data is preceded by its own masked
   1.105 +CRC-32C (see section 3).
   1.106 +
   1.107 +An uncompressed data chunk, like compressed data chunks, should contain
   1.108 +no more than 32768 data bytes, so the maximum legal chunk length with the
   1.109 +checksum is 32772.
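
Putting the header, checksum, and length rules together, the following sketch
writes one uncompressed data chunk. All function names are hypothetical, and the
CRC-32C helper assumes the usual reflected bitwise formulation (constant
0x82f63b78, initial value and final XOR of 0xffffffff).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_CHUNK_DATA 32768u  /* limit on data bytes per chunk */

static uint32_t crc32c(const uint8_t *data, size_t len) {
  uint32_t crc = 0xffffffffu;
  for (size_t i = 0; i < len; i++) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; bit++)
      crc = (crc & 1u) ? (crc >> 1) ^ 0x82f63b78u : (crc >> 1);
  }
  return crc ^ 0xffffffffu;
}

static uint32_t mask_checksum(uint32_t x) {
  return ((x >> 15) | (x << 17)) + 0xa282ead8u;
}

/* Writes a type 0x01 chunk into out, which must have room for len + 7
   bytes. Returns the number of bytes written, or 0 if len is too large. */
size_t write_uncompressed_chunk(const uint8_t *data, size_t len,
                                uint8_t *out) {
  if (len > MAX_CHUNK_DATA) return 0;
  uint32_t chunk_len = (uint32_t)len + 4;  /* checksum + data */
  uint32_t sum = mask_checksum(crc32c(data, len));
  out[0] = 0x01;                           /* chunk type */
  out[1] = (uint8_t)(chunk_len & 0xff);    /* length, little-endian */
  out[2] = (uint8_t)(chunk_len >> 8);
  out[3] = (uint8_t)(sum & 0xff);          /* masked CRC-32C, little-endian */
  out[4] = (uint8_t)((sum >> 8) & 0xff);
  out[5] = (uint8_t)((sum >> 16) & 0xff);
  out[6] = (uint8_t)(sum >> 24);
  memcpy(out + 7, data, len);
  return len + 7;
}
```

Longer inputs would be split by the caller into pieces of at most 32768 bytes,
one chunk per piece.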
   1.110 +
   1.111 +
   1.112 +4.4. Reserved unskippable chunks (chunk types 0x02-0x7f)
   1.113 +
   1.114 +These are reserved for future expansion. A decoder that sees such a chunk
   1.115 +should immediately return an error, as it must assume it cannot decode the
   1.116 +stream correctly.
   1.117 +
   1.118 +Future versions of this specification may define meanings for these chunks.
   1.119 +
   1.120 +
   1.121 +4.5. Reserved skippable chunks (chunk types 0x80-0xfe)
   1.122 +
   1.123 +These are also reserved for future expansion, but unlike the chunks
   1.124 +described in 4.4, a decoder seeing these must skip them and continue
   1.125 +decoding.
   1.126 +
   1.127 +Future versions of this specification may define meanings for these chunks.
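
The handling rules for all chunk types in section 4 can be summarized as a
small dispatch function. This is only a sketch; the enum and function names are
hypothetical.

```c
#include <stdint.h>

enum chunk_action {
  CHUNK_STREAM_IDENTIFIER,  /* 0xff: verify length and contents */
  CHUNK_COMPRESSED,         /* 0x00: checksum + Snappy-compressed data */
  CHUNK_UNCOMPRESSED,       /* 0x01: checksum + raw data */
  CHUNK_ERROR,              /* 0x02-0x7f: reserved, unskippable */
  CHUNK_SKIP                /* 0x80-0xfe: reserved, skippable */
};

/* Maps a chunk type byte to the action a decoder should take. */
enum chunk_action classify_chunk(uint8_t type) {
  if (type == 0xff) return CHUNK_STREAM_IDENTIFIER;
  if (type == 0x00) return CHUNK_COMPRESSED;
  if (type == 0x01) return CHUNK_UNCOMPRESSED;
  if (type <= 0x7f) return CHUNK_ERROR;
  return CHUNK_SKIP;
}
```

For CHUNK_SKIP, a decoder would read the chunk length from the header and
advance past that many data bytes without interpreting them.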
