Tue, 06 Jan 2015 21:39:09 +0100
Conditionally force memory storage according to privacy.thirdparty.isolate;
This solves Tor bug #9701, complying with disk avoidance documented in
https://www.torproject.org/projects/torbrowser/design/#disk-avoidance.
michael@0 | 1 | Snappy compressed format description |
michael@0 | 2 | Last revised: 2011-10-05 |
michael@0 | 3 | |
michael@0 | 4 | |
michael@0 | 5 | This is not a formal specification, but should suffice to explain most |
michael@0 | 6 | relevant parts of how the Snappy format works. It is originally based on |
michael@0 | 7 | text by Zeev Tarantov. |
michael@0 | 8 | |
michael@0 | 9 | Snappy is a LZ77-type compressor with a fixed, byte-oriented encoding. |
michael@0 | 10 | There is no entropy encoder backend nor framing layer -- the latter is |
michael@0 | 11 | assumed to be handled by other parts of the system. |
michael@0 | 12 | |
michael@0 | 13 | This document only describes the format, not how the Snappy compressor nor |
michael@0 | 14 | decompressor actually works. The correctness of the decompressor should not |
michael@0 | 15 | depend on implementation details of the compressor, and vice versa. |
michael@0 | 16 | |
michael@0 | 17 | |
michael@0 | 18 | 1. Preamble |
michael@0 | 19 | |
michael@0 | 20 | The stream starts with the uncompressed length (up to a maximum of 2^32 - 1), |
michael@0 | 21 | stored as a little-endian varint. Varints consist of a series of bytes, |
michael@0 | 22 | where the lower 7 bits are data and the upper bit is set iff there are |
michael@0 | 23 | more bytes to be read. In other words, an uncompressed length of 64 would |
michael@0 | 24 | be stored as 0x40, and an uncompressed length of 2097150 (0x1FFFFE) |
michael@0 | 25 | would be stored as 0xFE 0xFF 0x7F. |
michael@0 | 26 | |
michael@0 | 27 | |
michael@0 | 28 | 2. The compressed stream itself |
michael@0 | 29 | |
michael@0 | 30 | There are two types of elements in a Snappy stream: Literals and |
michael@0 | 31 | copies (backreferences). There is no restriction on the order of elements, |
michael@0 | 32 | except that the stream naturally cannot start with a copy. (Having |
michael@0 | 33 | two literals in a row is never optimal from a compression point of |
michael@0 | 34 | view, but nevertheless fully permitted.) Each element starts with a tag byte, |
michael@0 | 35 | and the lower two bits of this tag byte signal what type of element will |
michael@0 | 36 | follow: |
michael@0 | 37 | |
michael@0 | 38 | 00: Literal |
michael@0 | 39 | 01: Copy with 1-byte offset |
michael@0 | 40 | 10: Copy with 2-byte offset |
michael@0 | 41 | 11: Copy with 4-byte offset |
michael@0 | 42 | |
michael@0 | 43 | The interpretation of the upper six bits are element-dependent. |
michael@0 | 44 | |
michael@0 | 45 | |
michael@0 | 46 | 2.1. Literals (00) |
michael@0 | 47 | |
michael@0 | 48 | Literals are uncompressed data stored directly in the byte stream. |
michael@0 | 49 | The literal length is stored differently depending on the length |
michael@0 | 50 | of the literal: |
michael@0 | 51 | |
michael@0 | 52 | - For literals up to and including 60 bytes in length, the upper |
michael@0 | 53 | six bits of the tag byte contain (len-1). The literal follows |
michael@0 | 54 | immediately thereafter in the bytestream. |
michael@0 | 55 | - For longer literals, the (len-1) value is stored after the tag byte, |
michael@0 | 56 | little-endian. The upper six bits of the tag byte describe how |
michael@0 | 57 | many bytes are used for the length; 60, 61, 62 or 63 for |
michael@0 | 58 | 1-4 bytes, respectively. The literal itself follows after the |
michael@0 | 59 | length. |
michael@0 | 60 | |
michael@0 | 61 | |
michael@0 | 62 | 2.2. Copies |
michael@0 | 63 | |
michael@0 | 64 | Copies are references back into previous decompressed data, telling |
michael@0 | 65 | the decompressor to reuse data it has previously decoded. |
michael@0 | 66 | They encode two values: The _offset_, saying how many bytes back |
michael@0 | 67 | from the current position to read, and the _length_, how many bytes |
michael@0 | 68 | to copy. Offsets of zero can be encoded, but are not legal; |
michael@0 | 69 | similarly, it is possible to encode backreferences that would |
michael@0 | 70 | go past the end of the block (offset > current decompressed position), |
michael@0 | 71 | which is also nonsensical and thus not allowed. |
michael@0 | 72 | |
michael@0 | 73 | As in most LZ77-based compressors, the length can be larger than the offset, |
michael@0 | 74 | yielding a form of run-length encoding (RLE). For instance, |
michael@0 | 75 | "xababab" could be encoded as |
michael@0 | 76 | |
michael@0 | 77 | <literal: "xab"> <copy: offset=2 length=4> |
michael@0 | 78 | |
michael@0 | 79 | Note that since the current Snappy compressor works in 32 kB |
michael@0 | 80 | blocks and does not do matching across blocks, it will never produce |
michael@0 | 81 | a bitstream with offsets larger than about 32768. However, the |
michael@0 | 82 | decompressor should not rely on this, as it may change in the future. |
michael@0 | 83 | |
michael@0 | 84 | There are several different kinds of copy elements, depending on |
michael@0 | 85 | the amount of bytes to be copied (length), and how far back the |
michael@0 | 86 | data to be copied is (offset). |
michael@0 | 87 | |
michael@0 | 88 | |
michael@0 | 89 | 2.2.1. Copy with 1-byte offset (01) |
michael@0 | 90 | |
michael@0 | 91 | These elements can encode lengths between [4..11] bytes and offsets |
michael@0 | 92 | between [0..2047] bytes. (len-4) occupies three bits and is stored |
michael@0 | 93 | in bits [2..4] of the tag byte. The offset occupies 11 bits, of which the |
michael@0 | 94 | upper three are stored in the upper three bits ([5..7]) of the tag byte, |
michael@0 | 95 | and the lower eight are stored in a byte following the tag byte. |
michael@0 | 96 | |
michael@0 | 97 | |
michael@0 | 98 | 2.2.2. Copy with 2-byte offset (10) |
michael@0 | 99 | |
michael@0 | 100 | These elements can encode lengths between [1..64] and offsets from |
michael@0 | 101 | [0..65535]. (len-1) occupies six bits and is stored in the upper |
michael@0 | 102 | six bits ([2..7]) of the tag byte. The offset is stored as a |
michael@0 | 103 | little-endian 16-bit integer in the two bytes following the tag byte. |
michael@0 | 104 | |
michael@0 | 105 | |
michael@0 | 106 | 2.2.3. Copy with 4-byte offset (11) |
michael@0 | 107 | |
michael@0 | 108 | These are like the copies with 2-byte offsets (see previous subsection), |
michael@0 | 109 | except that the offset is stored as a 32-bit integer instead of a |
michael@0 | 110 | 16-bit integer (and thus will occupy four bytes). |