michael@0: Revised: 03/01/1999 michael@0: michael@0: Disclaimer michael@0: ---------- michael@0: michael@0: Although PKWARE will attempt to supply current and accurate michael@0: information relating to its file formats, algorithms, and the michael@0: subject programs, the possibility of error can not be eliminated. michael@0: PKWARE therefore expressly disclaims any warranty that the michael@0: information contained in the associated materials relating to the michael@0: subject programs and/or the format of the files created or michael@0: accessed by the subject programs and/or the algorithms used by michael@0: the subject programs, or any other matter, is current, correct or michael@0: accurate as delivered. Any risk of damage due to any possible michael@0: inaccurate information is assumed by the user of the information. michael@0: Furthermore, the information relating to the subject programs michael@0: and/or the file formats created or accessed by the subject michael@0: programs and/or the algorithms used by the subject programs is michael@0: subject to change without notice. michael@0: michael@0: General Format of a ZIP file michael@0: ---------------------------- michael@0: michael@0: Files stored in arbitrary order. Large zipfiles can span multiple michael@0: diskette media. michael@0: michael@0: Overall zipfile format: michael@0: michael@0: [local file header + file data + data_descriptor] . . . michael@0: [central directory] end of central directory record michael@0: michael@0: michael@0: A. Local file header: michael@0: michael@0: local file header signature 4 bytes (0x04034b50) michael@0: version needed to extract 2 bytes michael@0: general purpose bit flag 2 bytes michael@0: compression method 2 bytes michael@0: last mod file time 2 bytes michael@0: last mod file date 2 bytes michael@0: crc-32 4 bytes michael@0: compressed size 4 bytes michael@0: uncompressed size 4 bytes michael@0: filename length 2 bytes michael@0: extra field length 2 bytes michael@0: michael@0: filename (variable size) michael@0: extra field (variable size) michael@0: michael@0: B. Data descriptor: michael@0: michael@0: crc-32 4 bytes michael@0: compressed size 4 bytes michael@0: uncompressed size 4 bytes michael@0: michael@0: This descriptor exists only if bit 3 of the general michael@0: purpose bit flag is set (see below). It is byte aligned michael@0: and immediately follows the last byte of compressed data. michael@0: This descriptor is used only when it was not possible to michael@0: seek in the output zip file, e.g., when the output zip file michael@0: was standard output or a non seekable device. michael@0: michael@0: C. Central directory structure: michael@0: michael@0: [file header] . . . end of central dir record michael@0: michael@0: File header: michael@0: michael@0: central file header signature 4 bytes (0x02014b50) michael@0: version made by 2 bytes michael@0: version needed to extract 2 bytes michael@0: general purpose bit flag 2 bytes michael@0: compression method 2 bytes michael@0: last mod file time 2 bytes michael@0: last mod file date 2 bytes michael@0: crc-32 4 bytes michael@0: compressed size 4 bytes michael@0: uncompressed size 4 bytes michael@0: filename length 2 bytes michael@0: extra field length 2 bytes michael@0: file comment length 2 bytes michael@0: disk number start 2 bytes michael@0: internal file attributes 2 bytes michael@0: external file attributes 4 bytes michael@0: relative offset of local header 4 bytes michael@0: michael@0: filename (variable size) michael@0: extra field (variable size) michael@0: file comment (variable size) michael@0: michael@0: End of central dir record: michael@0: michael@0: end of central dir signature 4 bytes (0x06054b50) michael@0: number of this disk 2 bytes michael@0: number of the disk with the michael@0: start of the central directory 2 bytes michael@0: total number of entries in michael@0: the central dir on this disk 2 bytes michael@0: total number of entries in michael@0: the central dir 2 bytes michael@0: size of the central directory 4 bytes michael@0: offset of start of central michael@0: directory with respect to michael@0: the starting disk number 4 bytes michael@0: zipfile comment length 2 bytes michael@0: zipfile comment (variable size) michael@0: michael@0: D. Explanation of fields: michael@0: michael@0: version made by (2 bytes) michael@0: michael@0: The upper byte indicates the compatibility of the file michael@0: attribute information. If the external file attributes michael@0: are compatible with MS-DOS and can be read by PKZIP for michael@0: DOS version 2.04g then this value will be zero. If these michael@0: attributes are not compatible, then this value will michael@0: identify the host system on which the attributes are michael@0: compatible. Software can use this information to determine michael@0: the line record format for text files etc. The current michael@0: mappings are: michael@0: michael@0: 0 - MS-DOS and OS/2 (FAT / VFAT / FAT32 file systems) michael@0: 1 - Amiga 2 - VAX/VMS michael@0: 3 - Unix 4 - VM/CMS michael@0: 5 - Atari ST 6 - OS/2 H.P.F.S. michael@0: 7 - Macintosh 8 - Z-System michael@0: 9 - CP/M 10 - Windows NTFS michael@0: 11 thru 255 - unused michael@0: michael@0: The lower byte indicates the version number of the michael@0: software used to encode the file. The value/10 michael@0: indicates the major version number, and the value michael@0: mod 10 is the minor version number. michael@0: michael@0: version needed to extract (2 bytes) michael@0: michael@0: The minimum software version needed to extract the michael@0: file, mapped as above. michael@0: michael@0: general purpose bit flag: (2 bytes) michael@0: michael@0: Bit 0: If set, indicates that the file is encrypted. michael@0: michael@0: (For Method 6 - Imploding) michael@0: Bit 1: If the compression method used was type 6, michael@0: Imploding, then this bit, if set, indicates michael@0: an 8K sliding dictionary was used. If clear, michael@0: then a 4K sliding dictionary was used. michael@0: Bit 2: If the compression method used was type 6, michael@0: Imploding, then this bit, if set, indicates michael@0: 3 Shannon-Fano trees were used to encode the michael@0: sliding dictionary output. If clear, then 2 michael@0: Shannon-Fano trees were used. michael@0: michael@0: (For Method 8 - Deflating) michael@0: Bit 2 Bit 1 michael@0: 0 0 Normal (-en) compression option was used. michael@0: 0 1 Maximum (-ex) compression option was used. michael@0: 1 0 Fast (-ef) compression option was used. michael@0: 1 1 Super Fast (-es) compression option was used. michael@0: michael@0: Note: Bits 1 and 2 are undefined if the compression michael@0: method is any other. michael@0: michael@0: Bit 3: If this bit is set, the fields crc-32, compressed michael@0: size and uncompressed size are set to zero in the michael@0: local header. The correct values are put in the michael@0: data descriptor immediately following the compressed michael@0: data. (Note: PKZIP version 2.04g for DOS only michael@0: recognizes this bit for method 8 compression, newer michael@0: versions of PKZIP recognize this bit for any michael@0: compression method.) michael@0: michael@0: Bit 4: Reserved for use with method 8, for enhanced michael@0: deflating. michael@0: michael@0: Bit 5: If this bit is set, this indicates that the file is michael@0: compressed patched data. (Note: Requires PKZIP michael@0: version 2.70 or greater) michael@0: michael@0: Bit 6: Currently unused. michael@0: michael@0: Bit 7: Currently unused. michael@0: michael@0: Bit 8: Currently unused. michael@0: michael@0: Bit 9: Currently unused. michael@0: michael@0: Bit 10: Currently unused. michael@0: michael@0: Bit 11: Currently unused. michael@0: michael@0: Bit 12: Reserved by PKWARE for enhanced compression. michael@0: michael@0: Bit 13: Reserved by PKWARE. michael@0: michael@0: Bit 14: Reserved by PKWARE. michael@0: michael@0: Bit 15: Reserved by PKWARE. michael@0: michael@0: compression method: (2 bytes) michael@0: michael@0: (see accompanying documentation for algorithm michael@0: descriptions) michael@0: michael@0: 0 - The file is stored (no compression) michael@0: 1 - The file is Shrunk michael@0: 2 - The file is Reduced with compression factor 1 michael@0: 3 - The file is Reduced with compression factor 2 michael@0: 4 - The file is Reduced with compression factor 3 michael@0: 5 - The file is Reduced with compression factor 4 michael@0: 6 - The file is Imploded michael@0: 7 - Reserved for Tokenizing compression algorithm michael@0: 8 - The file is Deflated michael@0: 9 - Reserved for enhanced Deflating michael@0: 10 - PKWARE Date Compression Library Imploding michael@0: michael@0: date and time fields: (2 bytes each) michael@0: michael@0: The date and time are encoded in standard MS-DOS format. michael@0: If input came from standard input, the date and time are michael@0: those at which compression was started for this data. michael@0: michael@0: CRC-32: (4 bytes) michael@0: michael@0: The CRC-32 algorithm was generously contributed by michael@0: David Schwaderer and can be found in his excellent michael@0: book "C Programmers Guide to NetBIOS" published by michael@0: Howard W. Sams & Co. Inc. The 'magic number' for michael@0: the CRC is 0xdebb20e3. The proper CRC pre and post michael@0: conditioning is used, meaning that the CRC register michael@0: is pre-conditioned with all ones (a starting value michael@0: of 0xffffffff) and the value is post-conditioned by michael@0: taking the one's complement of the CRC residual. michael@0: If bit 3 of the general purpose flag is set, this michael@0: field is set to zero in the local header and the correct michael@0: value is put in the data descriptor and in the central michael@0: directory. michael@0: michael@0: compressed size: (4 bytes) michael@0: uncompressed size: (4 bytes) michael@0: michael@0: The size of the file compressed and uncompressed, michael@0: respectively. If bit 3 of the general purpose bit flag michael@0: is set, these fields are set to zero in the local header michael@0: and the correct values are put in the data descriptor and michael@0: in the central directory. michael@0: michael@0: filename length: (2 bytes) michael@0: extra field length: (2 bytes) michael@0: file comment length: (2 bytes) michael@0: michael@0: The length of the filename, extra field, and comment michael@0: fields respectively. The combined length of any michael@0: directory record and these three fields should not michael@0: generally exceed 65,535 bytes. If input came from standard michael@0: input, the filename length is set to zero. michael@0: michael@0: disk number start: (2 bytes) michael@0: michael@0: The number of the disk on which this file begins. michael@0: michael@0: internal file attributes: (2 bytes) michael@0: michael@0: The lowest bit of this field indicates, if set, that michael@0: the file is apparently an ASCII or text file. If not michael@0: set, that the file apparently contains binary data. michael@0: The remaining bits are unused in version 1.0. michael@0: michael@0: Bits 1 and 2 are reserved for use by PKWARE. michael@0: michael@0: external file attributes: (4 bytes) michael@0: michael@0: The mapping of the external attributes is michael@0: host-system dependent (see 'version made by'). For michael@0: MS-DOS, the low order byte is the MS-DOS directory michael@0: attribute byte. If input came from standard input, this michael@0: field is set to zero. michael@0: michael@0: relative offset of local header: (4 bytes) michael@0: michael@0: This is the offset from the start of the first disk on michael@0: which this file appears, to where the local header should michael@0: be found. michael@0: michael@0: filename: (Variable) michael@0: michael@0: The name of the file, with optional relative path. michael@0: The path stored should not contain a drive or michael@0: device letter, or a leading slash. All slashes michael@0: should be forward slashes '/' as opposed to michael@0: backwards slashes '\' for compatibility with Amiga michael@0: and Unix file systems etc. If input came from standard michael@0: input, there is no filename field. michael@0: michael@0: extra field: (Variable) michael@0: michael@0: This is for future expansion. If additional information michael@0: needs to be stored in the future, it should be stored michael@0: here. Earlier versions of the software can then safely michael@0: skip this file, and find the next file or header. This michael@0: field will be 0 length in version 1.0. michael@0: michael@0: In order to allow different programs and different types michael@0: of information to be stored in the 'extra' field in .ZIP michael@0: files, the following structure should be used for all michael@0: programs storing data in this field: michael@0: michael@0: header1+data1 + header2+data2 . . . michael@0: michael@0: Each header should consist of: michael@0: michael@0: Header ID - 2 bytes michael@0: Data Size - 2 bytes michael@0: michael@0: Note: all fields stored in Intel low-byte/high-byte order. michael@0: michael@0: The Header ID field indicates the type of data that is in michael@0: the following data block. michael@0: michael@0: Header ID's of 0 thru 31 are reserved for use by PKWARE. michael@0: The remaining ID's can be used by third party vendors for michael@0: proprietary usage. michael@0: michael@0: The current Header ID mappings defined by PKWARE are: michael@0: michael@0: 0x0007 AV Info michael@0: 0x0009 OS/2 michael@0: 0x000a NTFS michael@0: 0x000c VAX/VMS michael@0: 0x000d Unix michael@0: 0x000f Patch Descriptor michael@0: michael@0: Several third party mappings commonly used are: michael@0: michael@0: 0x4b46 FWKCS MD5 (see below) michael@0: 0x07c8 Macintosh michael@0: 0x4341 Acorn/SparkFS michael@0: 0x4453 Windows NT security descriptor (binary ACL) michael@0: 0x4704 VM/CMS michael@0: 0x470f MVS michael@0: 0x4c41 OS/2 access control list (text ACL) michael@0: 0x4d49 Info-ZIP VMS (VAX or Alpha) michael@0: 0x5455 extended timestamp michael@0: 0x5855 Info-ZIP Unix (original, also OS/2, NT, etc) michael@0: 0x6542 BeOS/BeBox michael@0: 0x756e ASi Unix michael@0: 0x7855 Info-ZIP Unix (new) michael@0: 0xfd4a SMS/QDOS michael@0: michael@0: The Data Size field indicates the size of the following michael@0: data block. Programs can use this value to skip to the michael@0: next header block, passing over any data blocks that are michael@0: not of interest. michael@0: michael@0: Note: As stated above, the size of the entire .ZIP file michael@0: header, including the filename, comment, and extra michael@0: field should not exceed 64K in size. michael@0: michael@0: In case two different programs should appropriate the same michael@0: Header ID value, it is strongly recommended that each michael@0: program place a unique signature of at least two bytes in michael@0: size (and preferably 4 bytes or bigger) at the start of michael@0: each data area. Every program should verify that its michael@0: unique signature is present, in addition to the Header ID michael@0: value being correct, before assuming that it is a block of michael@0: known type. michael@0: michael@0: -OS/2 Extra Field: michael@0: michael@0: The following is the layout of the OS/2 attributes "extra" michael@0: block. (Last Revision 09/05/95) michael@0: michael@0: Note: all fields stored in Intel low-byte/high-byte order. michael@0: michael@0: Value Size Description michael@0: ----- ---- ----------- michael@0: (OS/2) 0x0009 2 bytes Tag for this "extra" block type michael@0: TSize 2 bytes Size for the following data block michael@0: BSize 4 bytes Uncompressed Block Size michael@0: CType 2 bytes Compression type michael@0: EACRC 4 bytes CRC value for uncompress block michael@0: (var) variable Compressed block michael@0: michael@0: The OS/2 extended attribute structure (FEA2LIST) is michael@0: compressed and then stored in it's entirety within this michael@0: structure. There will only ever be one "block" of data in michael@0: VarFields[]. michael@0: michael@0: -UNIX Extra Field: michael@0: michael@0: The following is the layout of the Unix "extra" block. michael@0: Note: all fields are stored in Intel low-byte/high-byte michael@0: order. michael@0: michael@0: Value Size Description michael@0: ----- ---- ----------- michael@0: (UNIX) 0x000d 2 bytes Tag for this "extra" block type michael@0: TSize 2 bytes Size for the following data block michael@0: Atime 4 bytes File last access time michael@0: Mtime 4 bytes File last modification time michael@0: Uid 2 bytes File user ID michael@0: Gid 2 bytes File group ID michael@0: (var) variable Variable length data field michael@0: michael@0: The variable length data field will contain file type michael@0: specific data. Currently the only values allowed are michael@0: the original "linked to" file names for hard or symbolic michael@0: links. michael@0: michael@0: -VAX/VMS Extra Field: michael@0: michael@0: The following is the layout of the VAX/VMS attributes michael@0: "extra" block. michael@0: michael@0: Note: all fields stored in Intel low-byte/high-byte order. michael@0: michael@0: Value Size Description michael@0: ----- ---- ----------- michael@0: (VMS) 0x000c 2 bytes Tag for this "extra" block type michael@0: TSize 2 bytes Size of the total "extra" block michael@0: CRC 4 bytes 32-bit CRC for remainder of the block michael@0: Tag1 2 bytes VMS attribute tag value #1 michael@0: Size1 2 bytes Size of attribute #1, in bytes michael@0: (var.) Size1 Attribute #1 data michael@0: . michael@0: . michael@0: . michael@0: TagN 2 bytes VMS attribute tage value #N michael@0: SizeN 2 bytes Size of attribute #N, in bytes michael@0: (var.) SizeN Attribute #N data michael@0: michael@0: Rules: michael@0: michael@0: 1. There will be one or more of attributes present, which michael@0: will each be preceded by the above TagX & SizeX values. michael@0: These values are identical to the ATR$C_XXXX and michael@0: ATR$S_XXXX constants which are defined in ATR.H under michael@0: VMS C. Neither of these values will ever be zero. michael@0: michael@0: 2. No word alignment or padding is performed. michael@0: michael@0: 3. A well-behaved PKZIP/VMS program should never produce michael@0: more than one sub-block with the same TagX value. Also, michael@0: there will never be more than one "extra" block of type michael@0: 0x000c in a particular directory record. michael@0: michael@0: -NTFS Extra Field: michael@0: michael@0: The following is the layout of the NTFS attributes michael@0: "extra" block. michael@0: michael@0: Note: all fields stored in Intel low-byte/high-byte order. michael@0: michael@0: Value Size Description michael@0: ----- ---- ----------- michael@0: (NTFS) 0x000a 2 bytes Tag for this "extra" block type michael@0: TSize 2 bytes Size of the total "extra" block michael@0: Reserved 4 bytes Reserved for future use michael@0: Tag1 2 bytes NTFS attribute tag value #1 michael@0: Size1 2 bytes Size of attribute #1, in bytes michael@0: (var.) Size1 Attribute #1 data michael@0: . michael@0: . michael@0: . michael@0: TagN 2 bytes NTFS attribute tage value #N michael@0: SizeN 2 bytes Size of attribute #N, in bytes michael@0: (var.) SizeN Attribute #N data michael@0: michael@0: For NTFS, values for Tag1 through TagN are as follows: michael@0: (currently only one set of attributes is defined for NTFS) michael@0: michael@0: Tag Size Description michael@0: ----- ---- ----------- michael@0: 0x0001 2 bytes Tag for attribute #1 michael@0: Size1 2 bytes Size of attribute #1, in bytes michael@0: Mtime 8 bytes File last modification time michael@0: Atime 8 bytes File last access time michael@0: Ctime 8 bytes File creation time michael@0: michael@0: -PATCH Descriptor Extra Field: michael@0: michael@0: The following is the layout of the Patch Descriptor "extra" michael@0: block. michael@0: michael@0: Note: all fields stored in Intel low-byte/high-byte order. michael@0: michael@0: Value Size Description michael@0: ----- ---- ----------- michael@0: (Patch) 0x000f 2 bytes Tag for this "extra" block type michael@0: TSize 2 bytes Size of the total "extra" block michael@0: Version 2 bytes Version of the descriptor michael@0: Flags 4 bytes Actions and reactions (see below) michael@0: OldSize 4 bytes Size of the file about to be patched michael@0: OldCRC 4 bytes 32-bit CRC of the file to be patched michael@0: NewSize 4 bytes Size of the resulting file michael@0: NewCRC 4 bytes 32-bit CRC of the resulting file michael@0: michael@0: Actions and reactions michael@0: michael@0: Bits Description michael@0: ---- ---------------- michael@0: 0 Use for autodetection michael@0: 1 Treat as selfpatch michael@0: 2-3 RESERVED michael@0: 4-5 Action (see below) michael@0: 6-7 RESERVED michael@0: 8-9 Reaction (see below) to absent file michael@0: 10-11 Reaction (see below) to newer file michael@0: 12-13 Reaction (see below) to unknown file michael@0: 14-15 RESERVED michael@0: 16-31 RESERVED michael@0: michael@0: Actions michael@0: michael@0: Action Value michael@0: ------ ----- michael@0: none 0 michael@0: add 1 michael@0: delete 2 michael@0: patch 3 michael@0: michael@0: Reactions michael@0: michael@0: Reaction Value michael@0: -------- ----- michael@0: ask 0 michael@0: skip 1 michael@0: ignore 2 michael@0: fail 3 michael@0: michael@0: - FWKCS MD5 Extra Field: michael@0: michael@0: The FWKCS Contents_Signature System, used in michael@0: automatically identifying files independent of filename, michael@0: optionally adds and uses an extra field to support the michael@0: rapid creation of an enhanced contents_signature: michael@0: michael@0: Header ID = 0x4b46 michael@0: Data Size = 0x0013 michael@0: Preface = 'M','D','5' michael@0: followed by 16 bytes containing the uncompressed file's michael@0: 128_bit MD5 hash(1), low byte first. michael@0: michael@0: When FWKCS revises a zipfile central directory to add michael@0: this extra field for a file, it also replaces the michael@0: central directory entry for that file's uncompressed michael@0: filelength with a measured value. michael@0: michael@0: FWKCS provides an option to strip this extra field, if michael@0: present, from a zipfile central directory. In adding michael@0: this extra field, FWKCS preserves Zipfile Authenticity michael@0: Verification; if stripping this extra field, FWKCS michael@0: preserves all versions of AV through PKZIP version 2.04g. michael@0: michael@0: FWKCS, and FWKCS Contents_Signature System, are michael@0: trademarks of Frederick W. Kantor. michael@0: michael@0: (1) R. Rivest, RFC1321.TXT, MIT Laboratory for Computer michael@0: Science and RSA Data Security, Inc., April 1992. michael@0: ll.76-77: "The MD5 algorithm is being placed in the michael@0: public domain for review and possible adoption as a michael@0: standard." michael@0: michael@0: file comment: (Variable) michael@0: michael@0: The comment for this file. michael@0: michael@0: number of this disk: (2 bytes) michael@0: michael@0: The number of this disk, which contains central michael@0: directory end record. michael@0: michael@0: number of the disk with the start of the central michael@0: directory: (2 bytes) michael@0: michael@0: The number of the disk on which the central michael@0: directory starts. michael@0: michael@0: total number of entries in the central dir on michael@0: this disk: (2 bytes) michael@0: michael@0: The number of central directory entries on this disk. michael@0: michael@0: total number of entries in the central dir: (2 bytes) michael@0: michael@0: The total number of files in the zipfile. michael@0: michael@0: size of the central directory: (4 bytes) michael@0: michael@0: The size (in bytes) of the entire central directory. michael@0: michael@0: offset of start of central directory with respect to michael@0: the starting disk number: (4 bytes) michael@0: michael@0: Offset of the start of the central directory on the michael@0: disk on which the central directory starts. michael@0: michael@0: zipfile comment length: (2 bytes) michael@0: michael@0: The length of the comment for this zipfile. michael@0: michael@0: zipfile comment: (Variable) michael@0: michael@0: The comment for this zipfile. michael@0: michael@0: D. General notes: michael@0: michael@0: 1) All fields unless otherwise noted are unsigned and stored michael@0: in Intel low-byte:high-byte, low-word:high-word order. michael@0: michael@0: 2) String fields are not null terminated, since the michael@0: length is given explicitly. michael@0: michael@0: 3) Local headers should not span disk boundaries. Also, even michael@0: though the central directory can span disk boundaries, no michael@0: single record in the central directory should be split michael@0: across disks. michael@0: michael@0: 4) The entries in the central directory may not necessarily michael@0: be in the same order that files appear in the zipfile. michael@0: michael@0: UnShrinking - Method 1 michael@0: ---------------------- michael@0: michael@0: Shrinking is a Dynamic Ziv-Lempel-Welch compression algorithm michael@0: with partial clearing. The initial code size is 9 bits, and michael@0: the maximum code size is 13 bits. Shrinking differs from michael@0: conventional Dynamic Ziv-Lempel-Welch implementations in several michael@0: respects: michael@0: michael@0: 1) The code size is controlled by the compressor, and is not michael@0: automatically increased when codes larger than the current michael@0: code size are created (but not necessarily used). When michael@0: the decompressor encounters the code sequence 256 michael@0: (decimal) followed by 1, it should increase the code size michael@0: read from the input stream to the next bit size. No michael@0: blocking of the codes is performed, so the next code at michael@0: the increased size should be read from the input stream michael@0: immediately after where the previous code at the smaller michael@0: bit size was read. Again, the decompressor should not michael@0: increase the code size used until the sequence 256,1 is michael@0: encountered. michael@0: michael@0: 2) When the table becomes full, total clearing is not michael@0: performed. Rather, when the compressor emits the code michael@0: sequence 256,2 (decimal), the decompressor should clear michael@0: all leaf nodes from the Ziv-Lempel tree, and continue to michael@0: use the current code size. The nodes that are cleared michael@0: from the Ziv-Lempel tree are then re-used, with the lowest michael@0: code value re-used first, and the highest code value michael@0: re-used last. The compressor can emit the sequence 256,2 michael@0: at any time. michael@0: michael@0: Expanding - Methods 2-5 michael@0: ----------------------- michael@0: michael@0: The Reducing algorithm is actually a combination of two michael@0: distinct algorithms. The first algorithm compresses repeated michael@0: byte sequences, and the second algorithm takes the compressed michael@0: stream from the first algorithm and applies a probabilistic michael@0: compression method. michael@0: michael@0: The probabilistic compression stores an array of 'follower michael@0: sets' S(j), for j=0 to 255, corresponding to each possible michael@0: ASCII character. Each set contains between 0 and 32 michael@0: characters, to be denoted as S(j)[0],...,S(j)[m], where m<32. michael@0: The sets are stored at the beginning of the data area for a michael@0: Reduced file, in reverse order, with S(255) first, and S(0) michael@0: last. michael@0: michael@0: The sets are encoded as { N(j), S(j)[0],...,S(j)[N(j)-1] }, michael@0: where N(j) is the size of set S(j). N(j) can be 0, in which michael@0: case the follower set for S(j) is empty. Each N(j) value is michael@0: encoded in 6 bits, followed by N(j) eight bit character values michael@0: corresponding to S(j)[0] to S(j)[N(j)-1] respectively. If michael@0: N(j) is 0, then no values for S(j) are stored, and the value michael@0: for N(j-1) immediately follows. michael@0: michael@0: Immediately after the follower sets, is the compressed data michael@0: stream. The compressed data stream can be interpreted for the michael@0: probabilistic decompression as follows: michael@0: michael@0: let Last-Character <- 0. michael@0: loop until done michael@0: if the follower set S(Last-Character) is empty then michael@0: read 8 bits from the input stream, and copy this michael@0: value to the output stream. michael@0: otherwise if the follower set S(Last-Character) is non-empty then michael@0: read 1 bit from the input stream. michael@0: if this bit is not zero then michael@0: read 8 bits from the input stream, and copy this michael@0: value to the output stream. michael@0: otherwise if this bit is zero then michael@0: read B(N(Last-Character)) bits from the input michael@0: stream, and assign this value to I. michael@0: Copy the value of S(Last-Character)[I] to the michael@0: output stream. michael@0: michael@0: assign the last value placed on the output stream to michael@0: Last-Character. michael@0: end loop michael@0: michael@0: B(N(j)) is defined as the minimal number of bits required to michael@0: encode the value N(j)-1. michael@0: michael@0: The decompressed stream from above can then be expanded to michael@0: re-create the original file as follows: michael@0: michael@0: let State <- 0. michael@0: michael@0: loop until done michael@0: read 8 bits from the input stream into C. michael@0: case State of michael@0: 0: if C is not equal to DLE (144 decimal) then michael@0: copy C to the output stream. michael@0: otherwise if C is equal to DLE then michael@0: let State <- 1. michael@0: michael@0: 1: if C is non-zero then michael@0: let V <- C. michael@0: let Len <- L(V) michael@0: let State <- F(Len). michael@0: otherwise if C is zero then michael@0: copy the value 144 (decimal) to the output stream. michael@0: let State <- 0 michael@0: michael@0: 2: let Len <- Len + C michael@0: let State <- 3. michael@0: michael@0: 3: move backwards D(V,C) bytes in the output stream michael@0: (if this position is before the start of the output michael@0: stream, then assume that all the data before the michael@0: start of the output stream is filled with zeros). michael@0: copy Len+3 bytes from this position to the output stream. michael@0: let State <- 0. michael@0: end case michael@0: end loop michael@0: michael@0: The functions F,L, and D are dependent on the 'compression michael@0: factor', 1 through 4, and are defined as follows: michael@0: michael@0: For compression factor 1: michael@0: L(X) equals the lower 7 bits of X. michael@0: F(X) equals 2 if X equals 127 otherwise F(X) equals 3. michael@0: D(X,Y) equals the (upper 1 bit of X) * 256 + Y + 1. michael@0: For compression factor 2: michael@0: L(X) equals the lower 6 bits of X. michael@0: F(X) equals 2 if X equals 63 otherwise F(X) equals 3. michael@0: D(X,Y) equals the (upper 2 bits of X) * 256 + Y + 1. michael@0: For compression factor 3: michael@0: L(X) equals the lower 5 bits of X. michael@0: F(X) equals 2 if X equals 31 otherwise F(X) equals 3. michael@0: D(X,Y) equals the (upper 3 bits of X) * 256 + Y + 1. michael@0: For compression factor 4: michael@0: L(X) equals the lower 4 bits of X. michael@0: F(X) equals 2 if X equals 15 otherwise F(X) equals 3. michael@0: D(X,Y) equals the (upper 4 bits of X) * 256 + Y + 1. michael@0: michael@0: Imploding - Method 6 michael@0: -------------------- michael@0: michael@0: The Imploding algorithm is actually a combination of two distinct michael@0: algorithms. The first algorithm compresses repeated byte michael@0: sequences using a sliding dictionary. The second algorithm is michael@0: used to compress the encoding of the sliding dictionary output, michael@0: using multiple Shannon-Fano trees. michael@0: michael@0: The Imploding algorithm can use a 4K or 8K sliding dictionary michael@0: size. The dictionary size used can be determined by bit 1 in the michael@0: general purpose flag word; a 0 bit indicates a 4K dictionary michael@0: while a 1 bit indicates an 8K dictionary. michael@0: michael@0: The Shannon-Fano trees are stored at the start of the compressed michael@0: file. The number of trees stored is defined by bit 2 in the michael@0: general purpose flag word; a 0 bit indicates two trees stored, a michael@0: 1 bit indicates three trees are stored. If 3 trees are stored, michael@0: the first Shannon-Fano tree represents the encoding of the michael@0: Literal characters, the second tree represents the encoding of michael@0: the Length information, the third represents the encoding of the michael@0: Distance information. When 2 Shannon-Fano trees are stored, the michael@0: Length tree is stored first, followed by the Distance tree. michael@0: michael@0: The Literal Shannon-Fano tree, if present is used to represent michael@0: the entire ASCII character set, and contains 256 values. This michael@0: tree is used to compress any data not compressed by the sliding michael@0: dictionary algorithm. When this tree is present, the Minimum michael@0: Match Length for the sliding dictionary is 3. If this tree is michael@0: not present, the Minimum Match Length is 2. michael@0: michael@0: The Length Shannon-Fano tree is used to compress the Length part michael@0: of the (length,distance) pairs from the sliding dictionary michael@0: output. The Length tree contains 64 values, ranging from the michael@0: Minimum Match Length, to 63 plus the Minimum Match Length. michael@0: michael@0: The Distance Shannon-Fano tree is used to compress the Distance michael@0: part of the (length,distance) pairs from the sliding dictionary michael@0: output. The Distance tree contains 64 values, ranging from 0 to michael@0: 63, representing the upper 6 bits of the distance value. The michael@0: distance values themselves will be between 0 and the sliding michael@0: dictionary size, either 4K or 8K. michael@0: michael@0: The Shannon-Fano trees themselves are stored in a compressed michael@0: format. The first byte of the tree data represents the number of michael@0: bytes of data representing the (compressed) Shannon-Fano tree michael@0: minus 1. The remaining bytes represent the Shannon-Fano tree michael@0: data encoded as: michael@0: michael@0: High 4 bits: Number of values at this bit length + 1. (1 - 16) michael@0: Low 4 bits: Bit Length needed to represent value + 1. (1 - 16) michael@0: michael@0: The Shannon-Fano codes can be constructed from the bit lengths michael@0: using the following algorithm: michael@0: michael@0: 1) Sort the Bit Lengths in ascending order, while retaining the michael@0: order of the original lengths stored in the file. michael@0: michael@0: 2) Generate the Shannon-Fano trees: michael@0: michael@0: Code <- 0 michael@0: CodeIncrement <- 0 michael@0: LastBitLength <- 0 michael@0: i <- number of Shannon-Fano codes - 1 (either 255 or 63) michael@0: michael@0: loop while i >= 0 michael@0: Code = Code + CodeIncrement michael@0: if BitLength(i) <> LastBitLength then michael@0: LastBitLength=BitLength(i) michael@0: CodeIncrement = 1 shifted left (16 - LastBitLength) michael@0: ShannonCode(i) = Code michael@0: i <- i - 1 michael@0: end loop michael@0: michael@0: 3) Reverse the order of all the bits in the above ShannonCode() michael@0: vector, so that the most significant bit becomes the least michael@0: significant bit. For example, the value 0x1234 (hex) would michael@0: become 0x2C48 (hex). michael@0: michael@0: 4) Restore the order of Shannon-Fano codes as originally stored michael@0: within the file. michael@0: michael@0: Example: michael@0: michael@0: This example will show the encoding of a Shannon-Fano tree michael@0: of size 8. Notice that the actual Shannon-Fano trees used michael@0: for Imploding are either 64 or 256 entries in size. michael@0: michael@0: Example: 0x02, 0x42, 0x01, 0x13 michael@0: michael@0: The first byte indicates 3 values in this table. Decoding the michael@0: bytes: michael@0: 0x42 = 5 codes of 3 bits long michael@0: 0x01 = 1 code of 2 bits long michael@0: 0x13 = 2 codes of 4 bits long michael@0: michael@0: This would generate the original bit length array of: michael@0: (3, 3, 3, 3, 3, 2, 4, 4) michael@0: michael@0: There are 8 codes in this table for the values 0 thru 7. Using michael@0: the algorithm to obtain the Shannon-Fano codes produces: michael@0: michael@0: Reversed Order Original michael@0: Val Sorted Constructed Code Value Restored Length michael@0: --- ------ ----------------- -------- -------- ------ michael@0: 0: 2 1100000000000000 11 101 3 michael@0: 1: 3 1010000000000000 101 001 3 michael@0: 2: 3 1000000000000000 001 110 3 michael@0: 3: 3 0110000000000000 110 010 3 michael@0: 4: 3 0100000000000000 010 100 3 michael@0: 5: 3 0010000000000000 100 11 2 michael@0: 6: 4 0001000000000000 1000 1000 4 michael@0: 7: 4 0000000000000000 0000 0000 4 michael@0: michael@0: The values in the Val, Order Restored and Original Length columns michael@0: now represent the Shannon-Fano encoding tree that can be used for michael@0: decoding the Shannon-Fano encoded data. How to parse the michael@0: variable length Shannon-Fano values from the data stream is beyond michael@0: the scope of this document. (See the references listed at the end of michael@0: this document for more information.) However, traditional decoding michael@0: schemes used for Huffman variable length decoding, such as the michael@0: Greenlaw algorithm, can be successfully applied. michael@0: michael@0: The compressed data stream begins immediately after the michael@0: compressed Shannon-Fano data. The compressed data stream can be michael@0: interpreted as follows: michael@0: michael@0: loop until done michael@0: read 1 bit from input stream. michael@0: michael@0: if this bit is non-zero then (encoded data is literal data) michael@0: if Literal Shannon-Fano tree is present michael@0: read and decode character using Literal Shannon-Fano tree. michael@0: otherwise michael@0: read 8 bits from input stream. michael@0: copy character to the output stream. michael@0: otherwise (encoded data is sliding dictionary match) michael@0: if 8K dictionary size michael@0: read 7 bits for offset Distance (lower 7 bits of offset). michael@0: otherwise michael@0: read 6 bits for offset Distance (lower 6 bits of offset). michael@0: michael@0: using the Distance Shannon-Fano tree, read and decode the michael@0: upper 6 bits of the Distance value. michael@0: michael@0: using the Length Shannon-Fano tree, read and decode michael@0: the Length value. michael@0: michael@0: Length <- Length + Minimum Match Length michael@0: michael@0: if Length = 63 + Minimum Match Length michael@0: read 8 bits from the input stream, michael@0: add this value to Length. michael@0: michael@0: move backwards Distance+1 bytes in the output stream, and michael@0: copy Length characters from this position to the output michael@0: stream. (if this position is before the start of the output michael@0: stream, then assume that all the data before the start of michael@0: the output stream is filled with zeros). michael@0: end loop michael@0: michael@0: Tokenizing - Method 7 michael@0: -------------------- michael@0: michael@0: This method is not used by PKZIP. michael@0: michael@0: Deflating - Method 8 michael@0: ----------------- michael@0: michael@0: The Deflate algorithm is similar to the Implode algorithm using michael@0: a sliding dictionary of up to 32K with secondary compression michael@0: from Huffman/Shannon-Fano codes. michael@0: michael@0: The compressed data is stored in blocks with a header describing michael@0: the block and the Huffman codes used in the data block. The header michael@0: format is as follows: michael@0: michael@0: Bit 0: Last Block bit This bit is set to 1 if this is the last michael@0: compressed block in the data. michael@0: Bits 1-2: Block type michael@0: 00 (0) - Block is stored - All stored data is byte aligned. michael@0: Skip bits until next byte, then next word = block michael@0: length, followed by the ones compliment of the block michael@0: length word. Remaining data in block is the stored michael@0: data. michael@0: michael@0: 01 (1) - Use fixed Huffman codes for literal and distance codes. michael@0: Lit Code Bits Dist Code Bits michael@0: --------- ---- --------- ---- michael@0: 0 - 143 8 0 - 31 5 michael@0: 144 - 255 9 michael@0: 256 - 279 7 michael@0: 280 - 287 8 michael@0: michael@0: Literal codes 286-287 and distance codes 30-31 are michael@0: never used but participate in the huffman construction. michael@0: michael@0: 10 (2) - Dynamic Huffman codes. (See expanding Huffman codes) michael@0: michael@0: 11 (3) - Reserved - Flag a "Error in compressed data" if seen. michael@0: michael@0: Expanding Huffman Codes michael@0: ----------------------- michael@0: If the data block is stored with dynamic Huffman codes, the Huffman michael@0: codes are sent in the following compressed format: michael@0: michael@0: 5 Bits: # of Literal codes sent - 256 (256 - 286) michael@0: All other codes are never sent. michael@0: 5 Bits: # of Dist codes - 1 (1 - 32) michael@0: 4 Bits: # of Bit Length codes - 3 (3 - 19) michael@0: michael@0: The Huffman codes are sent as bit lengths and the codes are built as michael@0: described in the implode algorithm. The bit lengths themselves are michael@0: compressed with Huffman codes. There are 19 bit length codes: michael@0: michael@0: 0 - 15: Represent bit lengths of 0 - 15 michael@0: 16: Copy the previous bit length 3 - 6 times. michael@0: The next 2 bits indicate repeat length (0 = 3, ... ,3 = 6) michael@0: Example: Codes 8, 16 (+2 bits 11), 16 (+2 bits 10) will michael@0: expand to 12 bit lengths of 8 (1 + 6 + 5) michael@0: 17: Repeat a bit length of 0 for 3 - 10 times. (3 bits of length) michael@0: 18: Repeat a bit length of 0 for 11 - 138 times (7 bits of length) michael@0: michael@0: The lengths of the bit length codes are sent packed 3 bits per value michael@0: (0 - 7) in the following order: michael@0: michael@0: 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15 michael@0: michael@0: The Huffman codes should be built as described in the Implode algorithm michael@0: except codes are assigned starting at the shortest bit length, i.e. the michael@0: shortest code should be all 0's rather than all 1's. Also, codes with michael@0: a bit length of zero do not participate in the tree construction. The michael@0: codes are then used to decode the bit lengths for the literal and michael@0: distance tables. michael@0: michael@0: The bit lengths for the literal tables are sent first with the number michael@0: of entries sent described by the 5 bits sent earlier. There are up michael@0: to 286 literal characters; the first 256 represent the respective 8 michael@0: bit character, code 256 represents the End-Of-Block code, the remaining michael@0: 29 codes represent copy lengths of 3 thru 258. There are up to 30 michael@0: distance codes representing distances from 1 thru 32k as described michael@0: below. michael@0: michael@0: Length Codes michael@0: ------------ michael@0: Extra Extra Extra Extra michael@0: Code Bits Length Code Bits Lengths Code Bits Lengths Code Bits Length(s) michael@0: ---- ---- ------ ---- ---- ------- ---- ---- ------- ---- ---- --------- michael@0: 257 0 3 265 1 11,12 273 3 35-42 281 5 131-162 michael@0: 258 0 4 266 1 13,14 274 3 43-50 282 5 163-194 michael@0: 259 0 5 267 1 15,16 275 3 51-58 283 5 195-226 michael@0: 260 0 6 268 1 17,18 276 3 59-66 284 5 227-257 michael@0: 261 0 7 269 2 19-22 277 4 67-82 285 0 258 michael@0: 262 0 8 270 2 23-26 278 4 83-98 michael@0: 263 0 9 271 2 27-30 279 4 99-114 michael@0: 264 0 10 272 2 31-34 280 4 115-130 michael@0: michael@0: Distance Codes michael@0: -------------- michael@0: Extra Extra Extra Extra michael@0: Code Bits Dist Code Bits Dist Code Bits Distance Code Bits Distance michael@0: ---- ---- ---- ---- ---- ------ ---- ---- -------- ---- ---- -------- michael@0: 0 0 1 8 3 17-24 16 7 257-384 24 11 4097-6144 michael@0: 1 0 2 9 3 25-32 17 7 385-512 25 11 6145-8192 michael@0: 2 0 3 10 4 33-48 18 8 513-768 26 12 8193-12288 michael@0: 3 0 4 11 4 49-64 19 8 769-1024 27 12 12289-16384 michael@0: 4 1 5,6 12 5 65-96 20 9 1025-1536 28 13 16385-24576 michael@0: 5 1 7,8 13 5 97-128 21 9 1537-2048 29 13 24577-32768 michael@0: 6 2 9-12 14 6 129-192 22 10 2049-3072 michael@0: 7 2 13-16 15 6 193-256 23 10 3073-4096 michael@0: michael@0: The compressed data stream begins immediately after the michael@0: compressed header data. The compressed data stream can be michael@0: interpreted as follows: michael@0: michael@0: do michael@0: read header from input stream. michael@0: michael@0: if stored block michael@0: skip bits until byte aligned michael@0: read count and 1's compliment of count michael@0: copy count bytes data block michael@0: otherwise michael@0: loop until end of block code sent michael@0: decode literal character from input stream michael@0: if literal < 256 michael@0: copy character to the output stream michael@0: otherwise michael@0: if literal = end of block michael@0: break from loop michael@0: otherwise michael@0: decode distance from input stream michael@0: michael@0: move backwards distance bytes in the output stream, and michael@0: copy length characters from this position to the output michael@0: stream. michael@0: end loop michael@0: while not last block michael@0: michael@0: if data descriptor exists michael@0: skip bits until byte aligned michael@0: read crc and sizes michael@0: endif michael@0: michael@0: Decryption michael@0: ---------- michael@0: michael@0: The encryption used in PKZIP was generously supplied by Roger michael@0: Schlafly. PKWARE is grateful to Mr. Schlafly for his expert michael@0: help and advice in the field of data encryption. michael@0: michael@0: PKZIP encrypts the compressed data stream. Encrypted files must michael@0: be decrypted before they can be extracted. michael@0: michael@0: Each encrypted file has an extra 12 bytes stored at the start of michael@0: the data area defining the encryption header for that file. The michael@0: encryption header is originally set to random values, and then michael@0: itself encrypted, using three, 32-bit keys. The key values are michael@0: initialized using the supplied encryption password. After each byte michael@0: is encrypted, the keys are then updated using pseudo-random number michael@0: generation techniques in combination with the same CRC-32 algorithm michael@0: used in PKZIP and described elsewhere in this document. michael@0: michael@0: The following is the basic steps required to decrypt a file: michael@0: michael@0: 1) Initialize the three 32-bit keys with the password. michael@0: 2) Read and decrypt the 12-byte encryption header, further michael@0: initializing the encryption keys. michael@0: 3) Read and decrypt the compressed data stream using the michael@0: encryption keys. michael@0: michael@0: Step 1 - Initializing the encryption keys michael@0: ----------------------------------------- michael@0: michael@0: Key(0) <- 305419896 michael@0: Key(1) <- 591751049 michael@0: Key(2) <- 878082192 michael@0: michael@0: loop for i <- 0 to length(password)-1 michael@0: update_keys(password(i)) michael@0: end loop michael@0: michael@0: Where update_keys() is defined as: michael@0: michael@0: update_keys(char): michael@0: Key(0) <- crc32(key(0),char) michael@0: Key(1) <- Key(1) + (Key(0) & 000000ffH) michael@0: Key(1) <- Key(1) * 134775813 + 1 michael@0: Key(2) <- crc32(key(2),key(1) >> 24) michael@0: end update_keys michael@0: michael@0: Where crc32(old_crc,char) is a routine that given a CRC value and a michael@0: character, returns an updated CRC value after applying the CRC-32 michael@0: algorithm described elsewhere in this document. michael@0: michael@0: Step 2 - Decrypting the encryption header michael@0: ----------------------------------------- michael@0: michael@0: The purpose of this step is to further initialize the encryption michael@0: keys, based on random data, to render a plaintext attack on the michael@0: data ineffective. michael@0: michael@0: Read the 12-byte encryption header into Buffer, in locations michael@0: Buffer(0) thru Buffer(11). michael@0: michael@0: loop for i <- 0 to 11 michael@0: C <- buffer(i) ^ decrypt_byte() michael@0: update_keys(C) michael@0: buffer(i) <- C michael@0: end loop michael@0: michael@0: Where decrypt_byte() is defined as: michael@0: michael@0: unsigned char decrypt_byte() michael@0: local unsigned short temp michael@0: temp <- Key(2) | 2 michael@0: decrypt_byte <- (temp * (temp ^ 1)) >> 8 michael@0: end decrypt_byte michael@0: michael@0: After the header is decrypted, the last 1 or 2 bytes in Buffer michael@0: should be the high-order word/byte of the CRC for the file being michael@0: decrypted, stored in Intel low-byte/high-byte order. Versions of michael@0: PKZIP prior to 2.0 used a 2 byte CRC check; a 1 byte CRC check is michael@0: used on versions after 2.0. This can be used to test if the password michael@0: supplied is correct or not. michael@0: michael@0: Step 3 - Decrypting the compressed data stream michael@0: ---------------------------------------------- michael@0: michael@0: The compressed data stream can be decrypted as follows: michael@0: michael@0: loop until done michael@0: read a character into C michael@0: Temp <- C ^ decrypt_byte() michael@0: update_keys(temp) michael@0: output Temp michael@0: end loop michael@0: michael@0: In addition to the above mentioned contributors to PKZIP and PKUNZIP, michael@0: I would like to extend special thanks to Robert Mahoney for suggesting michael@0: the extension .ZIP for this software. michael@0: michael@0: References: michael@0: michael@0: Fiala, Edward R., and Greene, Daniel H., "Data compression with michael@0: finite windows", Communications of the ACM, Volume 32, Number 4, michael@0: April 1989, pages 490-505. michael@0: michael@0: Held, Gilbert, "Data Compression, Techniques and Applications, michael@0: Hardware and Software Considerations", John Wiley & Sons, 1987. michael@0: michael@0: Huffman, D.A., "A method for the construction of minimum-redundancy michael@0: codes", Proceedings of the IRE, Volume 40, Number 9, September 1952, michael@0: pages 1098-1101. michael@0: michael@0: Nelson, Mark, "LZW Data Compression", Dr. Dobbs Journal, Volume 14, michael@0: Number 10, October 1989, pages 29-37. michael@0: michael@0: Nelson, Mark, "The Data Compression Book", M&T Books, 1991. michael@0: michael@0: Storer, James A., "Data Compression, Methods and Theory", michael@0: Computer Science Press, 1988 michael@0: michael@0: Welch, Terry, "A Technique for High-Performance Data Compression", michael@0: IEEE Computer, Volume 17, Number 6, June 1984, pages 8-19. michael@0: michael@0: Ziv, J. and Lempel, A., "A universal algorithm for sequential data michael@0: compression", Communications of the ACM, Volume 30, Number 6, michael@0: June 1987, pages 520-540. michael@0: michael@0: Ziv, J. and Lempel, A., "Compression of individual sequences via michael@0: variable-rate coding", IEEE Transactions on Information Theory, michael@0: Volume 24, Number 5, September 1978, pages 530-536.