other-licenses/snappy/src/format_description.txt

Tue, 06 Jan 2015 21:39:09 +0100

author
Michael Schloh von Bennewitz <michael@schloh.com>
date
Tue, 06 Jan 2015 21:39:09 +0100
branch
TOR_BUG_9701
changeset 8
97036ab72558
permissions
-rw-r--r--

Conditionally force memory storage according to privacy.thirdparty.isolate;
This solves Tor bug #9701, complying with disk avoidance documented in
https://www.torproject.org/projects/torbrowser/design/#disk-avoidance.

michael@0 1 Snappy compressed format description
michael@0 2 Last revised: 2011-10-05
michael@0 3
michael@0 4
michael@0 5 This is not a formal specification, but should suffice to explain most
michael@0 6 relevant parts of how the Snappy format works. It is originally based on
michael@0 7 text by Zeev Tarantov.
michael@0 8
michael@0 9 Snappy is a LZ77-type compressor with a fixed, byte-oriented encoding.
michael@0 10 There is no entropy encoder backend nor framing layer -- the latter is
michael@0 11 assumed to be handled by other parts of the system.
michael@0 12
michael@0 13 This document only describes the format, not how the Snappy compressor nor
michael@0 14 decompressor actually works. The correctness of the decompressor should not
michael@0 15 depend on implementation details of the compressor, and vice versa.
michael@0 16
michael@0 17
michael@0 18 1. Preamble
michael@0 19
michael@0 20 The stream starts with the uncompressed length (up to a maximum of 2^32 - 1),
michael@0 21 stored as a little-endian varint. Varints consist of a series of bytes,
michael@0 22 where the lower 7 bits are data and the upper bit is set iff there are
michael@0 23 more bytes to be read. In other words, an uncompressed length of 64 would
michael@0 24 be stored as 0x40, and an uncompressed length of 2097150 (0x1FFFFE)
michael@0 25 would be stored as 0xFE 0xFF 0x7F.
michael@0 26
michael@0 27
michael@0 28 2. The compressed stream itself
michael@0 29
michael@0 30 There are two types of elements in a Snappy stream: Literals and
michael@0 31 copies (backreferences). There is no restriction on the order of elements,
michael@0 32 except that the stream naturally cannot start with a copy. (Having
michael@0 33 two literals in a row is never optimal from a compression point of
michael@0 34 view, but nevertheless fully permitted.) Each element starts with a tag byte,
michael@0 35 and the lower two bits of this tag byte signal what type of element will
michael@0 36 follow:
michael@0 37
michael@0 38 00: Literal
michael@0 39 01: Copy with 1-byte offset
michael@0 40 10: Copy with 2-byte offset
michael@0 41 11: Copy with 4-byte offset
michael@0 42
michael@0 43 The interpretation of the upper six bits are element-dependent.
michael@0 44
michael@0 45
michael@0 46 2.1. Literals (00)
michael@0 47
michael@0 48 Literals are uncompressed data stored directly in the byte stream.
michael@0 49 The literal length is stored differently depending on the length
michael@0 50 of the literal:
michael@0 51
michael@0 52 - For literals up to and including 60 bytes in length, the upper
michael@0 53 six bits of the tag byte contain (len-1). The literal follows
michael@0 54 immediately thereafter in the bytestream.
michael@0 55 - For longer literals, the (len-1) value is stored after the tag byte,
michael@0 56 little-endian. The upper six bits of the tag byte describe how
michael@0 57 many bytes are used for the length; 60, 61, 62 or 63 for
michael@0 58 1-4 bytes, respectively. The literal itself follows after the
michael@0 59 length.
michael@0 60
michael@0 61
michael@0 62 2.2. Copies
michael@0 63
michael@0 64 Copies are references back into previous decompressed data, telling
michael@0 65 the decompressor to reuse data it has previously decoded.
michael@0 66 They encode two values: The _offset_, saying how many bytes back
michael@0 67 from the current position to read, and the _length_, how many bytes
michael@0 68 to copy. Offsets of zero can be encoded, but are not legal;
michael@0 69 similarly, it is possible to encode backreferences that would
michael@0 70 go past the end of the block (offset > current decompressed position),
michael@0 71 which is also nonsensical and thus not allowed.
michael@0 72
michael@0 73 As in most LZ77-based compressors, the length can be larger than the offset,
michael@0 74 yielding a form of run-length encoding (RLE). For instance,
michael@0 75 "xababab" could be encoded as
michael@0 76
michael@0 77 <literal: "xab"> <copy: offset=2 length=4>
michael@0 78
michael@0 79 Note that since the current Snappy compressor works in 32 kB
michael@0 80 blocks and does not do matching across blocks, it will never produce
michael@0 81 a bitstream with offsets larger than about 32768. However, the
michael@0 82 decompressor should not rely on this, as it may change in the future.
michael@0 83
michael@0 84 There are several different kinds of copy elements, depending on
michael@0 85 the amount of bytes to be copied (length), and how far back the
michael@0 86 data to be copied is (offset).
michael@0 87
michael@0 88
michael@0 89 2.2.1. Copy with 1-byte offset (01)
michael@0 90
michael@0 91 These elements can encode lengths between [4..11] bytes and offsets
michael@0 92 between [0..2047] bytes. (len-4) occupies three bits and is stored
michael@0 93 in bits [2..4] of the tag byte. The offset occupies 11 bits, of which the
michael@0 94 upper three are stored in the upper three bits ([5..7]) of the tag byte,
michael@0 95 and the lower eight are stored in a byte following the tag byte.
michael@0 96
michael@0 97
michael@0 98 2.2.2. Copy with 2-byte offset (10)
michael@0 99
michael@0 100 These elements can encode lengths between [1..64] and offsets from
michael@0 101 [0..65535]. (len-1) occupies six bits and is stored in the upper
michael@0 102 six bits ([2..7]) of the tag byte. The offset is stored as a
michael@0 103 little-endian 16-bit integer in the two bytes following the tag byte.
michael@0 104
michael@0 105
michael@0 106 2.2.3. Copy with 4-byte offset (11)
michael@0 107
michael@0 108 These are like the copies with 2-byte offsets (see previous subsection),
michael@0 109 except that the offset is stored as a 32-bit integer instead of a
michael@0 110 16-bit integer (and thus will occupy four bytes).

mercurial