modules/libjar/appnote.txt

Thu, 22 Jan 2015 13:21:57 +0100

author
Michael Schloh von Bennewitz <michael@schloh.com>
date
Thu, 22 Jan 2015 13:21:57 +0100
branch
TOR_BUG_9701
changeset 15
b8a032363ba2
permissions
-rw-r--r--

Incorporate requested changes from Mozilla in review:
https://bugzilla.mozilla.org/show_bug.cgi?id=1123480#c6

michael@0 1 Revised: 03/01/1999
michael@0 2
michael@0 3 Disclaimer
michael@0 4 ----------
michael@0 5
michael@0 6 Although PKWARE will attempt to supply current and accurate
michael@0 7 information relating to its file formats, algorithms, and the
michael@0 8 subject programs, the possibility of error can not be eliminated.
michael@0 9 PKWARE therefore expressly disclaims any warranty that the
michael@0 10 information contained in the associated materials relating to the
michael@0 11 subject programs and/or the format of the files created or
michael@0 12 accessed by the subject programs and/or the algorithms used by
michael@0 13 the subject programs, or any other matter, is current, correct or
michael@0 14 accurate as delivered. Any risk of damage due to any possible
michael@0 15 inaccurate information is assumed by the user of the information.
michael@0 16 Furthermore, the information relating to the subject programs
michael@0 17 and/or the file formats created or accessed by the subject
michael@0 18 programs and/or the algorithms used by the subject programs is
michael@0 19 subject to change without notice.
michael@0 20
michael@0 21 General Format of a ZIP file
michael@0 22 ----------------------------
michael@0 23
michael@0 24 Files stored in arbitrary order. Large zipfiles can span multiple
michael@0 25 diskette media.
michael@0 26
michael@0 27 Overall zipfile format:
michael@0 28
michael@0 29 [local file header + file data + data_descriptor] . . .
michael@0 30 [central directory] end of central directory record
michael@0 31
michael@0 32
michael@0 33 A. Local file header:
michael@0 34
michael@0 35 local file header signature 4 bytes (0x04034b50)
michael@0 36 version needed to extract 2 bytes
michael@0 37 general purpose bit flag 2 bytes
michael@0 38 compression method 2 bytes
michael@0 39 last mod file time 2 bytes
michael@0 40 last mod file date 2 bytes
michael@0 41 crc-32 4 bytes
michael@0 42 compressed size 4 bytes
michael@0 43 uncompressed size 4 bytes
michael@0 44 filename length 2 bytes
michael@0 45 extra field length 2 bytes
michael@0 46
michael@0 47 filename (variable size)
michael@0 48 extra field (variable size)
michael@0 49
michael@0 50 B. Data descriptor:
michael@0 51
michael@0 52 crc-32 4 bytes
michael@0 53 compressed size 4 bytes
michael@0 54 uncompressed size 4 bytes
michael@0 55
michael@0 56 This descriptor exists only if bit 3 of the general
michael@0 57 purpose bit flag is set (see below). It is byte aligned
michael@0 58 and immediately follows the last byte of compressed data.
michael@0 59 This descriptor is used only when it was not possible to
michael@0 60 seek in the output zip file, e.g., when the output zip file
michael@0 61 was standard output or a non seekable device.
michael@0 62
michael@0 63 C. Central directory structure:
michael@0 64
michael@0 65 [file header] . . . end of central dir record
michael@0 66
michael@0 67 File header:
michael@0 68
michael@0 69 central file header signature 4 bytes (0x02014b50)
michael@0 70 version made by 2 bytes
michael@0 71 version needed to extract 2 bytes
michael@0 72 general purpose bit flag 2 bytes
michael@0 73 compression method 2 bytes
michael@0 74 last mod file time 2 bytes
michael@0 75 last mod file date 2 bytes
michael@0 76 crc-32 4 bytes
michael@0 77 compressed size 4 bytes
michael@0 78 uncompressed size 4 bytes
michael@0 79 filename length 2 bytes
michael@0 80 extra field length 2 bytes
michael@0 81 file comment length 2 bytes
michael@0 82 disk number start 2 bytes
michael@0 83 internal file attributes 2 bytes
michael@0 84 external file attributes 4 bytes
michael@0 85 relative offset of local header 4 bytes
michael@0 86
michael@0 87 filename (variable size)
michael@0 88 extra field (variable size)
michael@0 89 file comment (variable size)
michael@0 90
michael@0 91 End of central dir record:
michael@0 92
michael@0 93 end of central dir signature 4 bytes (0x06054b50)
michael@0 94 number of this disk 2 bytes
michael@0 95 number of the disk with the
michael@0 96 start of the central directory 2 bytes
michael@0 97 total number of entries in
michael@0 98 the central dir on this disk 2 bytes
michael@0 99 total number of entries in
michael@0 100 the central dir 2 bytes
michael@0 101 size of the central directory 4 bytes
michael@0 102 offset of start of central
michael@0 103 directory with respect to
michael@0 104 the starting disk number 4 bytes
michael@0 105 zipfile comment length 2 bytes
michael@0 106 zipfile comment (variable size)
michael@0 107
michael@0 108 D. Explanation of fields:
michael@0 109
michael@0 110 version made by (2 bytes)
michael@0 111
michael@0 112 The upper byte indicates the compatibility of the file
michael@0 113 attribute information. If the external file attributes
michael@0 114 are compatible with MS-DOS and can be read by PKZIP for
michael@0 115 DOS version 2.04g then this value will be zero. If these
michael@0 116 attributes are not compatible, then this value will
michael@0 117 identify the host system on which the attributes are
michael@0 118 compatible. Software can use this information to determine
michael@0 119 the line record format for text files etc. The current
michael@0 120 mappings are:
michael@0 121
michael@0 122 0 - MS-DOS and OS/2 (FAT / VFAT / FAT32 file systems)
michael@0 123 1 - Amiga 2 - VAX/VMS
michael@0 124 3 - Unix 4 - VM/CMS
michael@0 125 5 - Atari ST 6 - OS/2 H.P.F.S.
michael@0 126 7 - Macintosh 8 - Z-System
michael@0 127 9 - CP/M 10 - Windows NTFS
michael@0 128 11 thru 255 - unused
michael@0 129
michael@0 130 The lower byte indicates the version number of the
michael@0 131 software used to encode the file. The value/10
michael@0 132 indicates the major version number, and the value
michael@0 133 mod 10 is the minor version number.
michael@0 134
michael@0 135 version needed to extract (2 bytes)
michael@0 136
michael@0 137 The minimum software version needed to extract the
michael@0 138 file, mapped as above.
michael@0 139
michael@0 140 general purpose bit flag: (2 bytes)
michael@0 141
michael@0 142 Bit 0: If set, indicates that the file is encrypted.
michael@0 143
michael@0 144 (For Method 6 - Imploding)
michael@0 145 Bit 1: If the compression method used was type 6,
michael@0 146 Imploding, then this bit, if set, indicates
michael@0 147 an 8K sliding dictionary was used. If clear,
michael@0 148 then a 4K sliding dictionary was used.
michael@0 149 Bit 2: If the compression method used was type 6,
michael@0 150 Imploding, then this bit, if set, indicates
michael@0 151 3 Shannon-Fano trees were used to encode the
michael@0 152 sliding dictionary output. If clear, then 2
michael@0 153 Shannon-Fano trees were used.
michael@0 154
michael@0 155 (For Method 8 - Deflating)
michael@0 156 Bit 2 Bit 1
michael@0 157 0 0 Normal (-en) compression option was used.
michael@0 158 0 1 Maximum (-ex) compression option was used.
michael@0 159 1 0 Fast (-ef) compression option was used.
michael@0 160 1 1 Super Fast (-es) compression option was used.
michael@0 161
michael@0 162 Note: Bits 1 and 2 are undefined if the compression
michael@0 163 method is any other.
michael@0 164
michael@0 165 Bit 3: If this bit is set, the fields crc-32, compressed
michael@0 166 size and uncompressed size are set to zero in the
michael@0 167 local header. The correct values are put in the
michael@0 168 data descriptor immediately following the compressed
michael@0 169 data. (Note: PKZIP version 2.04g for DOS only
michael@0 170 recognizes this bit for method 8 compression, newer
michael@0 171 versions of PKZIP recognize this bit for any
michael@0 172 compression method.)
michael@0 173
michael@0 174 Bit 4: Reserved for use with method 8, for enhanced
michael@0 175 deflating.
michael@0 176
michael@0 177 Bit 5: If this bit is set, this indicates that the file is
michael@0 178 compressed patched data. (Note: Requires PKZIP
michael@0 179 version 2.70 or greater)
michael@0 180
michael@0 181 Bit 6: Currently unused.
michael@0 182
michael@0 183 Bit 7: Currently unused.
michael@0 184
michael@0 185 Bit 8: Currently unused.
michael@0 186
michael@0 187 Bit 9: Currently unused.
michael@0 188
michael@0 189 Bit 10: Currently unused.
michael@0 190
michael@0 191 Bit 11: Currently unused.
michael@0 192
michael@0 193 Bit 12: Reserved by PKWARE for enhanced compression.
michael@0 194
michael@0 195 Bit 13: Reserved by PKWARE.
michael@0 196
michael@0 197 Bit 14: Reserved by PKWARE.
michael@0 198
michael@0 199 Bit 15: Reserved by PKWARE.
michael@0 200
michael@0 201 compression method: (2 bytes)
michael@0 202
michael@0 203 (see accompanying documentation for algorithm
michael@0 204 descriptions)
michael@0 205
michael@0 206 0 - The file is stored (no compression)
michael@0 207 1 - The file is Shrunk
michael@0 208 2 - The file is Reduced with compression factor 1
michael@0 209 3 - The file is Reduced with compression factor 2
michael@0 210 4 - The file is Reduced with compression factor 3
michael@0 211 5 - The file is Reduced with compression factor 4
michael@0 212 6 - The file is Imploded
michael@0 213 7 - Reserved for Tokenizing compression algorithm
michael@0 214 8 - The file is Deflated
michael@0 215 9 - Reserved for enhanced Deflating
michael@0 216 10 - PKWARE Date Compression Library Imploding
michael@0 217
michael@0 218 date and time fields: (2 bytes each)
michael@0 219
michael@0 220 The date and time are encoded in standard MS-DOS format.
michael@0 221 If input came from standard input, the date and time are
michael@0 222 those at which compression was started for this data.
michael@0 223
michael@0 224 CRC-32: (4 bytes)
michael@0 225
michael@0 226 The CRC-32 algorithm was generously contributed by
michael@0 227 David Schwaderer and can be found in his excellent
michael@0 228 book "C Programmers Guide to NetBIOS" published by
michael@0 229 Howard W. Sams & Co. Inc. The 'magic number' for
michael@0 230 the CRC is 0xdebb20e3. The proper CRC pre and post
michael@0 231 conditioning is used, meaning that the CRC register
michael@0 232 is pre-conditioned with all ones (a starting value
michael@0 233 of 0xffffffff) and the value is post-conditioned by
michael@0 234 taking the one's complement of the CRC residual.
michael@0 235 If bit 3 of the general purpose flag is set, this
michael@0 236 field is set to zero in the local header and the correct
michael@0 237 value is put in the data descriptor and in the central
michael@0 238 directory.
michael@0 239
michael@0 240 compressed size: (4 bytes)
michael@0 241 uncompressed size: (4 bytes)
michael@0 242
michael@0 243 The size of the file compressed and uncompressed,
michael@0 244 respectively. If bit 3 of the general purpose bit flag
michael@0 245 is set, these fields are set to zero in the local header
michael@0 246 and the correct values are put in the data descriptor and
michael@0 247 in the central directory.
michael@0 248
michael@0 249 filename length: (2 bytes)
michael@0 250 extra field length: (2 bytes)
michael@0 251 file comment length: (2 bytes)
michael@0 252
michael@0 253 The length of the filename, extra field, and comment
michael@0 254 fields respectively. The combined length of any
michael@0 255 directory record and these three fields should not
michael@0 256 generally exceed 65,535 bytes. If input came from standard
michael@0 257 input, the filename length is set to zero.
michael@0 258
michael@0 259 disk number start: (2 bytes)
michael@0 260
michael@0 261 The number of the disk on which this file begins.
michael@0 262
michael@0 263 internal file attributes: (2 bytes)
michael@0 264
michael@0 265 The lowest bit of this field indicates, if set, that
michael@0 266 the file is apparently an ASCII or text file. If not
michael@0 267 set, that the file apparently contains binary data.
michael@0 268 The remaining bits are unused in version 1.0.
michael@0 269
michael@0 270 Bits 1 and 2 are reserved for use by PKWARE.
michael@0 271
michael@0 272 external file attributes: (4 bytes)
michael@0 273
michael@0 274 The mapping of the external attributes is
michael@0 275 host-system dependent (see 'version made by'). For
michael@0 276 MS-DOS, the low order byte is the MS-DOS directory
michael@0 277 attribute byte. If input came from standard input, this
michael@0 278 field is set to zero.
michael@0 279
michael@0 280 relative offset of local header: (4 bytes)
michael@0 281
michael@0 282 This is the offset from the start of the first disk on
michael@0 283 which this file appears, to where the local header should
michael@0 284 be found.
michael@0 285
michael@0 286 filename: (Variable)
michael@0 287
michael@0 288 The name of the file, with optional relative path.
michael@0 289 The path stored should not contain a drive or
michael@0 290 device letter, or a leading slash. All slashes
michael@0 291 should be forward slashes '/' as opposed to
michael@0 292 backwards slashes '\' for compatibility with Amiga
michael@0 293 and Unix file systems etc. If input came from standard
michael@0 294 input, there is no filename field.
michael@0 295
michael@0 296 extra field: (Variable)
michael@0 297
michael@0 298 This is for future expansion. If additional information
michael@0 299 needs to be stored in the future, it should be stored
michael@0 300 here. Earlier versions of the software can then safely
michael@0 301 skip this file, and find the next file or header. This
michael@0 302 field will be 0 length in version 1.0.
michael@0 303
michael@0 304 In order to allow different programs and different types
michael@0 305 of information to be stored in the 'extra' field in .ZIP
michael@0 306 files, the following structure should be used for all
michael@0 307 programs storing data in this field:
michael@0 308
michael@0 309 header1+data1 + header2+data2 . . .
michael@0 310
michael@0 311 Each header should consist of:
michael@0 312
michael@0 313 Header ID - 2 bytes
michael@0 314 Data Size - 2 bytes
michael@0 315
michael@0 316 Note: all fields stored in Intel low-byte/high-byte order.
michael@0 317
michael@0 318 The Header ID field indicates the type of data that is in
michael@0 319 the following data block.
michael@0 320
michael@0 321 Header ID's of 0 thru 31 are reserved for use by PKWARE.
michael@0 322 The remaining ID's can be used by third party vendors for
michael@0 323 proprietary usage.
michael@0 324
michael@0 325 The current Header ID mappings defined by PKWARE are:
michael@0 326
michael@0 327 0x0007 AV Info
michael@0 328 0x0009 OS/2
michael@0 329 0x000a NTFS
michael@0 330 0x000c VAX/VMS
michael@0 331 0x000d Unix
michael@0 332 0x000f Patch Descriptor
michael@0 333
michael@0 334 Several third party mappings commonly used are:
michael@0 335
michael@0 336 0x4b46 FWKCS MD5 (see below)
michael@0 337 0x07c8 Macintosh
michael@0 338 0x4341 Acorn/SparkFS
michael@0 339 0x4453 Windows NT security descriptor (binary ACL)
michael@0 340 0x4704 VM/CMS
michael@0 341 0x470f MVS
michael@0 342 0x4c41 OS/2 access control list (text ACL)
michael@0 343 0x4d49 Info-ZIP VMS (VAX or Alpha)
michael@0 344 0x5455 extended timestamp
michael@0 345 0x5855 Info-ZIP Unix (original, also OS/2, NT, etc)
michael@0 346 0x6542 BeOS/BeBox
michael@0 347 0x756e ASi Unix
michael@0 348 0x7855 Info-ZIP Unix (new)
michael@0 349 0xfd4a SMS/QDOS
michael@0 350
michael@0 351 The Data Size field indicates the size of the following
michael@0 352 data block. Programs can use this value to skip to the
michael@0 353 next header block, passing over any data blocks that are
michael@0 354 not of interest.
michael@0 355
michael@0 356 Note: As stated above, the size of the entire .ZIP file
michael@0 357 header, including the filename, comment, and extra
michael@0 358 field should not exceed 64K in size.
michael@0 359
michael@0 360 In case two different programs should appropriate the same
michael@0 361 Header ID value, it is strongly recommended that each
michael@0 362 program place a unique signature of at least two bytes in
michael@0 363 size (and preferably 4 bytes or bigger) at the start of
michael@0 364 each data area. Every program should verify that its
michael@0 365 unique signature is present, in addition to the Header ID
michael@0 366 value being correct, before assuming that it is a block of
michael@0 367 known type.
michael@0 368
michael@0 369 -OS/2 Extra Field:
michael@0 370
michael@0 371 The following is the layout of the OS/2 attributes "extra"
michael@0 372 block. (Last Revision 09/05/95)
michael@0 373
michael@0 374 Note: all fields stored in Intel low-byte/high-byte order.
michael@0 375
michael@0 376 Value Size Description
michael@0 377 ----- ---- -----------
michael@0 378 (OS/2) 0x0009 2 bytes Tag for this "extra" block type
michael@0 379 TSize 2 bytes Size for the following data block
michael@0 380 BSize 4 bytes Uncompressed Block Size
michael@0 381 CType 2 bytes Compression type
michael@0 382 EACRC 4 bytes CRC value for uncompress block
michael@0 383 (var) variable Compressed block
michael@0 384
michael@0 385 The OS/2 extended attribute structure (FEA2LIST) is
michael@0 386 compressed and then stored in it's entirety within this
michael@0 387 structure. There will only ever be one "block" of data in
michael@0 388 VarFields[].
michael@0 389
michael@0 390 -UNIX Extra Field:
michael@0 391
michael@0 392 The following is the layout of the Unix "extra" block.
michael@0 393 Note: all fields are stored in Intel low-byte/high-byte
michael@0 394 order.
michael@0 395
michael@0 396 Value Size Description
michael@0 397 ----- ---- -----------
michael@0 398 (UNIX) 0x000d 2 bytes Tag for this "extra" block type
michael@0 399 TSize 2 bytes Size for the following data block
michael@0 400 Atime 4 bytes File last access time
michael@0 401 Mtime 4 bytes File last modification time
michael@0 402 Uid 2 bytes File user ID
michael@0 403 Gid 2 bytes File group ID
michael@0 404 (var) variable Variable length data field
michael@0 405
michael@0 406 The variable length data field will contain file type
michael@0 407 specific data. Currently the only values allowed are
michael@0 408 the original "linked to" file names for hard or symbolic
michael@0 409 links.
michael@0 410
michael@0 411 -VAX/VMS Extra Field:
michael@0 412
michael@0 413 The following is the layout of the VAX/VMS attributes
michael@0 414 "extra" block.
michael@0 415
michael@0 416 Note: all fields stored in Intel low-byte/high-byte order.
michael@0 417
michael@0 418 Value Size Description
michael@0 419 ----- ---- -----------
michael@0 420 (VMS) 0x000c 2 bytes Tag for this "extra" block type
michael@0 421 TSize 2 bytes Size of the total "extra" block
michael@0 422 CRC 4 bytes 32-bit CRC for remainder of the block
michael@0 423 Tag1 2 bytes VMS attribute tag value #1
michael@0 424 Size1 2 bytes Size of attribute #1, in bytes
michael@0 425 (var.) Size1 Attribute #1 data
michael@0 426 .
michael@0 427 .
michael@0 428 .
michael@0 429 TagN 2 bytes VMS attribute tage value #N
michael@0 430 SizeN 2 bytes Size of attribute #N, in bytes
michael@0 431 (var.) SizeN Attribute #N data
michael@0 432
michael@0 433 Rules:
michael@0 434
michael@0 435 1. There will be one or more of attributes present, which
michael@0 436 will each be preceded by the above TagX & SizeX values.
michael@0 437 These values are identical to the ATR$C_XXXX and
michael@0 438 ATR$S_XXXX constants which are defined in ATR.H under
michael@0 439 VMS C. Neither of these values will ever be zero.
michael@0 440
michael@0 441 2. No word alignment or padding is performed.
michael@0 442
michael@0 443 3. A well-behaved PKZIP/VMS program should never produce
michael@0 444 more than one sub-block with the same TagX value. Also,
michael@0 445 there will never be more than one "extra" block of type
michael@0 446 0x000c in a particular directory record.
michael@0 447
michael@0 448 -NTFS Extra Field:
michael@0 449
michael@0 450 The following is the layout of the NTFS attributes
michael@0 451 "extra" block.
michael@0 452
michael@0 453 Note: all fields stored in Intel low-byte/high-byte order.
michael@0 454
michael@0 455 Value Size Description
michael@0 456 ----- ---- -----------
michael@0 457 (NTFS) 0x000a 2 bytes Tag for this "extra" block type
michael@0 458 TSize 2 bytes Size of the total "extra" block
michael@0 459 Reserved 4 bytes Reserved for future use
michael@0 460 Tag1 2 bytes NTFS attribute tag value #1
michael@0 461 Size1 2 bytes Size of attribute #1, in bytes
michael@0 462 (var.) Size1 Attribute #1 data
michael@0 463 .
michael@0 464 .
michael@0 465 .
michael@0 466 TagN 2 bytes NTFS attribute tage value #N
michael@0 467 SizeN 2 bytes Size of attribute #N, in bytes
michael@0 468 (var.) SizeN Attribute #N data
michael@0 469
michael@0 470 For NTFS, values for Tag1 through TagN are as follows:
michael@0 471 (currently only one set of attributes is defined for NTFS)
michael@0 472
michael@0 473 Tag Size Description
michael@0 474 ----- ---- -----------
michael@0 475 0x0001 2 bytes Tag for attribute #1
michael@0 476 Size1 2 bytes Size of attribute #1, in bytes
michael@0 477 Mtime 8 bytes File last modification time
michael@0 478 Atime 8 bytes File last access time
michael@0 479 Ctime 8 bytes File creation time
michael@0 480
michael@0 481 -PATCH Descriptor Extra Field:
michael@0 482
michael@0 483 The following is the layout of the Patch Descriptor "extra"
michael@0 484 block.
michael@0 485
michael@0 486 Note: all fields stored in Intel low-byte/high-byte order.
michael@0 487
michael@0 488 Value Size Description
michael@0 489 ----- ---- -----------
michael@0 490 (Patch) 0x000f 2 bytes Tag for this "extra" block type
michael@0 491 TSize 2 bytes Size of the total "extra" block
michael@0 492 Version 2 bytes Version of the descriptor
michael@0 493 Flags 4 bytes Actions and reactions (see below)
michael@0 494 OldSize 4 bytes Size of the file about to be patched
michael@0 495 OldCRC 4 bytes 32-bit CRC of the file to be patched
michael@0 496 NewSize 4 bytes Size of the resulting file
michael@0 497 NewCRC 4 bytes 32-bit CRC of the resulting file
michael@0 498
michael@0 499 Actions and reactions
michael@0 500
michael@0 501 Bits Description
michael@0 502 ---- ----------------
michael@0 503 0 Use for autodetection
michael@0 504 1 Treat as selfpatch
michael@0 505 2-3 RESERVED
michael@0 506 4-5 Action (see below)
michael@0 507 6-7 RESERVED
michael@0 508 8-9 Reaction (see below) to absent file
michael@0 509 10-11 Reaction (see below) to newer file
michael@0 510 12-13 Reaction (see below) to unknown file
michael@0 511 14-15 RESERVED
michael@0 512 16-31 RESERVED
michael@0 513
michael@0 514 Actions
michael@0 515
michael@0 516 Action Value
michael@0 517 ------ -----
michael@0 518 none 0
michael@0 519 add 1
michael@0 520 delete 2
michael@0 521 patch 3
michael@0 522
michael@0 523 Reactions
michael@0 524
michael@0 525 Reaction Value
michael@0 526 -------- -----
michael@0 527 ask 0
michael@0 528 skip 1
michael@0 529 ignore 2
michael@0 530 fail 3
michael@0 531
michael@0 532 - FWKCS MD5 Extra Field:
michael@0 533
michael@0 534 The FWKCS Contents_Signature System, used in
michael@0 535 automatically identifying files independent of filename,
michael@0 536 optionally adds and uses an extra field to support the
michael@0 537 rapid creation of an enhanced contents_signature:
michael@0 538
michael@0 539 Header ID = 0x4b46
michael@0 540 Data Size = 0x0013
michael@0 541 Preface = 'M','D','5'
michael@0 542 followed by 16 bytes containing the uncompressed file's
michael@0 543 128_bit MD5 hash(1), low byte first.
michael@0 544
michael@0 545 When FWKCS revises a zipfile central directory to add
michael@0 546 this extra field for a file, it also replaces the
michael@0 547 central directory entry for that file's uncompressed
michael@0 548 filelength with a measured value.
michael@0 549
michael@0 550 FWKCS provides an option to strip this extra field, if
michael@0 551 present, from a zipfile central directory. In adding
michael@0 552 this extra field, FWKCS preserves Zipfile Authenticity
michael@0 553 Verification; if stripping this extra field, FWKCS
michael@0 554 preserves all versions of AV through PKZIP version 2.04g.
michael@0 555
michael@0 556 FWKCS, and FWKCS Contents_Signature System, are
michael@0 557 trademarks of Frederick W. Kantor.
michael@0 558
michael@0 559 (1) R. Rivest, RFC1321.TXT, MIT Laboratory for Computer
michael@0 560 Science and RSA Data Security, Inc., April 1992.
michael@0 561 ll.76-77: "The MD5 algorithm is being placed in the
michael@0 562 public domain for review and possible adoption as a
michael@0 563 standard."
michael@0 564
michael@0 565 file comment: (Variable)
michael@0 566
michael@0 567 The comment for this file.
michael@0 568
michael@0 569 number of this disk: (2 bytes)
michael@0 570
michael@0 571 The number of this disk, which contains central
michael@0 572 directory end record.
michael@0 573
michael@0 574 number of the disk with the start of the central
michael@0 575 directory: (2 bytes)
michael@0 576
michael@0 577 The number of the disk on which the central
michael@0 578 directory starts.
michael@0 579
michael@0 580 total number of entries in the central dir on
michael@0 581 this disk: (2 bytes)
michael@0 582
michael@0 583 The number of central directory entries on this disk.
michael@0 584
michael@0 585 total number of entries in the central dir: (2 bytes)
michael@0 586
michael@0 587 The total number of files in the zipfile.
michael@0 588
michael@0 589 size of the central directory: (4 bytes)
michael@0 590
michael@0 591 The size (in bytes) of the entire central directory.
michael@0 592
michael@0 593 offset of start of central directory with respect to
michael@0 594 the starting disk number: (4 bytes)
michael@0 595
michael@0 596 Offset of the start of the central directory on the
michael@0 597 disk on which the central directory starts.
michael@0 598
michael@0 599 zipfile comment length: (2 bytes)
michael@0 600
michael@0 601 The length of the comment for this zipfile.
michael@0 602
michael@0 603 zipfile comment: (Variable)
michael@0 604
michael@0 605 The comment for this zipfile.
michael@0 606
michael@0 607 D. General notes:
michael@0 608
michael@0 609 1) All fields unless otherwise noted are unsigned and stored
michael@0 610 in Intel low-byte:high-byte, low-word:high-word order.
michael@0 611
michael@0 612 2) String fields are not null terminated, since the
michael@0 613 length is given explicitly.
michael@0 614
michael@0 615 3) Local headers should not span disk boundaries. Also, even
michael@0 616 though the central directory can span disk boundaries, no
michael@0 617 single record in the central directory should be split
michael@0 618 across disks.
michael@0 619
michael@0 620 4) The entries in the central directory may not necessarily
michael@0 621 be in the same order that files appear in the zipfile.
michael@0 622
michael@0 623 UnShrinking - Method 1
michael@0 624 ----------------------
michael@0 625
michael@0 626 Shrinking is a Dynamic Ziv-Lempel-Welch compression algorithm
michael@0 627 with partial clearing. The initial code size is 9 bits, and
michael@0 628 the maximum code size is 13 bits. Shrinking differs from
michael@0 629 conventional Dynamic Ziv-Lempel-Welch implementations in several
michael@0 630 respects:
michael@0 631
michael@0 632 1) The code size is controlled by the compressor, and is not
michael@0 633 automatically increased when codes larger than the current
michael@0 634 code size are created (but not necessarily used). When
michael@0 635 the decompressor encounters the code sequence 256
michael@0 636 (decimal) followed by 1, it should increase the code size
michael@0 637 read from the input stream to the next bit size. No
michael@0 638 blocking of the codes is performed, so the next code at
michael@0 639 the increased size should be read from the input stream
michael@0 640 immediately after where the previous code at the smaller
michael@0 641 bit size was read. Again, the decompressor should not
michael@0 642 increase the code size used until the sequence 256,1 is
michael@0 643 encountered.
michael@0 644
michael@0 645 2) When the table becomes full, total clearing is not
michael@0 646 performed. Rather, when the compressor emits the code
michael@0 647 sequence 256,2 (decimal), the decompressor should clear
michael@0 648 all leaf nodes from the Ziv-Lempel tree, and continue to
michael@0 649 use the current code size. The nodes that are cleared
michael@0 650 from the Ziv-Lempel tree are then re-used, with the lowest
michael@0 651 code value re-used first, and the highest code value
michael@0 652 re-used last. The compressor can emit the sequence 256,2
michael@0 653 at any time.
michael@0 654
michael@0 655 Expanding - Methods 2-5
michael@0 656 -----------------------
michael@0 657
michael@0 658 The Reducing algorithm is actually a combination of two
michael@0 659 distinct algorithms. The first algorithm compresses repeated
michael@0 660 byte sequences, and the second algorithm takes the compressed
michael@0 661 stream from the first algorithm and applies a probabilistic
michael@0 662 compression method.
michael@0 663
michael@0 664 The probabilistic compression stores an array of 'follower
michael@0 665 sets' S(j), for j=0 to 255, corresponding to each possible
michael@0 666 ASCII character. Each set contains between 0 and 32
michael@0 667 characters, to be denoted as S(j)[0],...,S(j)[m], where m<32.
michael@0 668 The sets are stored at the beginning of the data area for a
michael@0 669 Reduced file, in reverse order, with S(255) first, and S(0)
michael@0 670 last.
michael@0 671
michael@0 672 The sets are encoded as { N(j), S(j)[0],...,S(j)[N(j)-1] },
michael@0 673 where N(j) is the size of set S(j). N(j) can be 0, in which
michael@0 674 case the follower set for S(j) is empty. Each N(j) value is
michael@0 675 encoded in 6 bits, followed by N(j) eight bit character values
michael@0 676 corresponding to S(j)[0] to S(j)[N(j)-1] respectively. If
michael@0 677 N(j) is 0, then no values for S(j) are stored, and the value
michael@0 678 for N(j-1) immediately follows.
michael@0 679
michael@0 680 Immediately after the follower sets, is the compressed data
michael@0 681 stream. The compressed data stream can be interpreted for the
michael@0 682 probabilistic decompression as follows:
michael@0 683
michael@0 684 let Last-Character <- 0.
michael@0 685 loop until done
michael@0 686 if the follower set S(Last-Character) is empty then
michael@0 687 read 8 bits from the input stream, and copy this
michael@0 688 value to the output stream.
michael@0 689 otherwise if the follower set S(Last-Character) is non-empty then
michael@0 690 read 1 bit from the input stream.
michael@0 691 if this bit is not zero then
michael@0 692 read 8 bits from the input stream, and copy this
michael@0 693 value to the output stream.
michael@0 694 otherwise if this bit is zero then
michael@0 695 read B(N(Last-Character)) bits from the input
michael@0 696 stream, and assign this value to I.
michael@0 697 Copy the value of S(Last-Character)[I] to the
michael@0 698 output stream.
michael@0 699
michael@0 700 assign the last value placed on the output stream to
michael@0 701 Last-Character.
michael@0 702 end loop
michael@0 703
michael@0 704 B(N(j)) is defined as the minimal number of bits required to
michael@0 705 encode the value N(j)-1.
michael@0 706
michael@0 707 The decompressed stream from above can then be expanded to
michael@0 708 re-create the original file as follows:
michael@0 709
michael@0 710 let State <- 0.
michael@0 711
michael@0 712 loop until done
michael@0 713 read 8 bits from the input stream into C.
michael@0 714 case State of
michael@0 715 0: if C is not equal to DLE (144 decimal) then
michael@0 716 copy C to the output stream.
michael@0 717 otherwise if C is equal to DLE then
michael@0 718 let State <- 1.
michael@0 719
michael@0 720 1: if C is non-zero then
michael@0 721 let V <- C.
michael@0 722 let Len <- L(V)
michael@0 723 let State <- F(Len).
michael@0 724 otherwise if C is zero then
michael@0 725 copy the value 144 (decimal) to the output stream.
michael@0 726 let State <- 0
michael@0 727
michael@0 728 2: let Len <- Len + C
michael@0 729 let State <- 3.
michael@0 730
michael@0 731 3: move backwards D(V,C) bytes in the output stream
michael@0 732 (if this position is before the start of the output
michael@0 733 stream, then assume that all the data before the
michael@0 734 start of the output stream is filled with zeros).
michael@0 735 copy Len+3 bytes from this position to the output stream.
michael@0 736 let State <- 0.
michael@0 737 end case
michael@0 738 end loop
michael@0 739
michael@0 740 The functions F,L, and D are dependent on the 'compression
michael@0 741 factor', 1 through 4, and are defined as follows:
michael@0 742
michael@0 743 For compression factor 1:
michael@0 744 L(X) equals the lower 7 bits of X.
michael@0 745 F(X) equals 2 if X equals 127 otherwise F(X) equals 3.
michael@0 746 D(X,Y) equals the (upper 1 bit of X) * 256 + Y + 1.
michael@0 747 For compression factor 2:
michael@0 748 L(X) equals the lower 6 bits of X.
michael@0 749 F(X) equals 2 if X equals 63 otherwise F(X) equals 3.
michael@0 750 D(X,Y) equals the (upper 2 bits of X) * 256 + Y + 1.
michael@0 751 For compression factor 3:
michael@0 752 L(X) equals the lower 5 bits of X.
michael@0 753 F(X) equals 2 if X equals 31 otherwise F(X) equals 3.
michael@0 754 D(X,Y) equals the (upper 3 bits of X) * 256 + Y + 1.
michael@0 755 For compression factor 4:
michael@0 756 L(X) equals the lower 4 bits of X.
michael@0 757 F(X) equals 2 if X equals 15 otherwise F(X) equals 3.
michael@0 758 D(X,Y) equals the (upper 4 bits of X) * 256 + Y + 1.
michael@0 759
michael@0 760 Imploding - Method 6
michael@0 761 --------------------
michael@0 762
michael@0 763 The Imploding algorithm is actually a combination of two distinct
michael@0 764 algorithms. The first algorithm compresses repeated byte
michael@0 765 sequences using a sliding dictionary. The second algorithm is
michael@0 766 used to compress the encoding of the sliding dictionary output,
michael@0 767 using multiple Shannon-Fano trees.
michael@0 768
michael@0 769 The Imploding algorithm can use a 4K or 8K sliding dictionary
michael@0 770 size. The dictionary size used can be determined by bit 1 in the
michael@0 771 general purpose flag word; a 0 bit indicates a 4K dictionary
michael@0 772 while a 1 bit indicates an 8K dictionary.
michael@0 773
michael@0 774 The Shannon-Fano trees are stored at the start of the compressed
michael@0 775 file. The number of trees stored is defined by bit 2 in the
michael@0 776 general purpose flag word; a 0 bit indicates two trees stored, a
michael@0 777 1 bit indicates three trees are stored. If 3 trees are stored,
michael@0 778 the first Shannon-Fano tree represents the encoding of the
michael@0 779 Literal characters, the second tree represents the encoding of
michael@0 780 the Length information, the third represents the encoding of the
michael@0 781 Distance information. When 2 Shannon-Fano trees are stored, the
michael@0 782 Length tree is stored first, followed by the Distance tree.
michael@0 783
michael@0 784 The Literal Shannon-Fano tree, if present is used to represent
michael@0 785 the entire ASCII character set, and contains 256 values. This
michael@0 786 tree is used to compress any data not compressed by the sliding
michael@0 787 dictionary algorithm. When this tree is present, the Minimum
michael@0 788 Match Length for the sliding dictionary is 3. If this tree is
michael@0 789 not present, the Minimum Match Length is 2.
michael@0 790
michael@0 791 The Length Shannon-Fano tree is used to compress the Length part
michael@0 792 of the (length,distance) pairs from the sliding dictionary
michael@0 793 output. The Length tree contains 64 values, ranging from the
michael@0 794 Minimum Match Length, to 63 plus the Minimum Match Length.
michael@0 795
michael@0 796 The Distance Shannon-Fano tree is used to compress the Distance
michael@0 797 part of the (length,distance) pairs from the sliding dictionary
michael@0 798 output. The Distance tree contains 64 values, ranging from 0 to
michael@0 799 63, representing the upper 6 bits of the distance value. The
michael@0 800 distance values themselves will be between 0 and the sliding
michael@0 801 dictionary size, either 4K or 8K.
michael@0 802
michael@0 803 The Shannon-Fano trees themselves are stored in a compressed
michael@0 804 format. The first byte of the tree data represents the number of
michael@0 805 bytes of data representing the (compressed) Shannon-Fano tree
michael@0 806 minus 1. The remaining bytes represent the Shannon-Fano tree
michael@0 807 data encoded as:
michael@0 808
michael@0 809 High 4 bits: Number of values at this bit length + 1. (1 - 16)
michael@0 810 Low 4 bits: Bit Length needed to represent value + 1. (1 - 16)
michael@0 811
michael@0 812 The Shannon-Fano codes can be constructed from the bit lengths
michael@0 813 using the following algorithm:
michael@0 814
michael@0 815 1) Sort the Bit Lengths in ascending order, while retaining the
michael@0 816 order of the original lengths stored in the file.
michael@0 817
michael@0 818 2) Generate the Shannon-Fano trees:
michael@0 819
michael@0 820 Code <- 0
michael@0 821 CodeIncrement <- 0
michael@0 822 LastBitLength <- 0
michael@0 823 i <- number of Shannon-Fano codes - 1 (either 255 or 63)
michael@0 824
michael@0 825 loop while i >= 0
michael@0 826 Code = Code + CodeIncrement
michael@0 827 if BitLength(i) <> LastBitLength then
michael@0 828 LastBitLength=BitLength(i)
michael@0 829 CodeIncrement = 1 shifted left (16 - LastBitLength)
michael@0 830 ShannonCode(i) = Code
michael@0 831 i <- i - 1
michael@0 832 end loop
michael@0 833
michael@0 834 3) Reverse the order of all the bits in the above ShannonCode()
michael@0 835 vector, so that the most significant bit becomes the least
michael@0 836 significant bit. For example, the value 0x1234 (hex) would
michael@0 837 become 0x2C48 (hex).
michael@0 838
michael@0 839 4) Restore the order of Shannon-Fano codes as originally stored
michael@0 840 within the file.
michael@0 841
michael@0 842 Example:
michael@0 843
michael@0 844 This example will show the encoding of a Shannon-Fano tree
michael@0 845 of size 8. Notice that the actual Shannon-Fano trees used
michael@0 846 for Imploding are either 64 or 256 entries in size.
michael@0 847
michael@0 848 Example: 0x02, 0x42, 0x01, 0x13
michael@0 849
michael@0 850 The first byte indicates 3 values in this table. Decoding the
michael@0 851 bytes:
michael@0 852 0x42 = 5 codes of 3 bits long
michael@0 853 0x01 = 1 code of 2 bits long
michael@0 854 0x13 = 2 codes of 4 bits long
michael@0 855
michael@0 856 This would generate the original bit length array of:
michael@0 857 (3, 3, 3, 3, 3, 2, 4, 4)
michael@0 858
michael@0 859 There are 8 codes in this table for the values 0 thru 7. Using
michael@0 860 the algorithm to obtain the Shannon-Fano codes produces:
michael@0 861
michael@0 862 Reversed Order Original
michael@0 863 Val Sorted Constructed Code Value Restored Length
michael@0 864 --- ------ ----------------- -------- -------- ------
michael@0 865 0: 2 1100000000000000 11 101 3
michael@0 866 1: 3 1010000000000000 101 001 3
michael@0 867 2: 3 1000000000000000 001 110 3
michael@0 868 3: 3 0110000000000000 110 010 3
michael@0 869 4: 3 0100000000000000 010 100 3
michael@0 870 5: 3 0010000000000000 100 11 2
michael@0 871 6: 4 0001000000000000 1000 1000 4
michael@0 872 7: 4 0000000000000000 0000 0000 4
michael@0 873
michael@0 874 The values in the Val, Order Restored and Original Length columns
michael@0 875 now represent the Shannon-Fano encoding tree that can be used for
michael@0 876 decoding the Shannon-Fano encoded data. How to parse the
michael@0 877 variable length Shannon-Fano values from the data stream is beyond
michael@0 878 the scope of this document. (See the references listed at the end of
michael@0 879 this document for more information.) However, traditional decoding
michael@0 880 schemes used for Huffman variable length decoding, such as the
michael@0 881 Greenlaw algorithm, can be successfully applied.
michael@0 882
michael@0 883 The compressed data stream begins immediately after the
michael@0 884 compressed Shannon-Fano data. The compressed data stream can be
michael@0 885 interpreted as follows:
michael@0 886
michael@0 887 loop until done
michael@0 888 read 1 bit from input stream.
michael@0 889
michael@0 890 if this bit is non-zero then (encoded data is literal data)
michael@0 891 if Literal Shannon-Fano tree is present
michael@0 892 read and decode character using Literal Shannon-Fano tree.
michael@0 893 otherwise
michael@0 894 read 8 bits from input stream.
michael@0 895 copy character to the output stream.
michael@0 896 otherwise (encoded data is sliding dictionary match)
michael@0 897 if 8K dictionary size
michael@0 898 read 7 bits for offset Distance (lower 7 bits of offset).
michael@0 899 otherwise
michael@0 900 read 6 bits for offset Distance (lower 6 bits of offset).
michael@0 901
michael@0 902 using the Distance Shannon-Fano tree, read and decode the
michael@0 903 upper 6 bits of the Distance value.
michael@0 904
michael@0 905 using the Length Shannon-Fano tree, read and decode
michael@0 906 the Length value.
michael@0 907
michael@0 908 Length <- Length + Minimum Match Length
michael@0 909
michael@0 910 if Length = 63 + Minimum Match Length
michael@0 911 read 8 bits from the input stream,
michael@0 912 add this value to Length.
michael@0 913
michael@0 914 move backwards Distance+1 bytes in the output stream, and
michael@0 915 copy Length characters from this position to the output
michael@0 916 stream. (if this position is before the start of the output
michael@0 917 stream, then assume that all the data before the start of
michael@0 918 the output stream is filled with zeros).
michael@0 919 end loop
michael@0 920
michael@0 921 Tokenizing - Method 7
michael@0 922 --------------------
michael@0 923
michael@0 924 This method is not used by PKZIP.
michael@0 925
michael@0 926 Deflating - Method 8
michael@0 927 -----------------
michael@0 928
michael@0 929 The Deflate algorithm is similar to the Implode algorithm using
michael@0 930 a sliding dictionary of up to 32K with secondary compression
michael@0 931 from Huffman/Shannon-Fano codes.
michael@0 932
michael@0 933 The compressed data is stored in blocks with a header describing
michael@0 934 the block and the Huffman codes used in the data block. The header
michael@0 935 format is as follows:
michael@0 936
michael@0 937 Bit 0: Last Block bit This bit is set to 1 if this is the last
michael@0 938 compressed block in the data.
michael@0 939 Bits 1-2: Block type
michael@0 940 00 (0) - Block is stored - All stored data is byte aligned.
michael@0 941 Skip bits until next byte, then next word = block
michael@0 942 length, followed by the ones compliment of the block
michael@0 943 length word. Remaining data in block is the stored
michael@0 944 data.
michael@0 945
michael@0 946 01 (1) - Use fixed Huffman codes for literal and distance codes.
michael@0 947 Lit Code Bits Dist Code Bits
michael@0 948 --------- ---- --------- ----
michael@0 949 0 - 143 8 0 - 31 5
michael@0 950 144 - 255 9
michael@0 951 256 - 279 7
michael@0 952 280 - 287 8
michael@0 953
michael@0 954 Literal codes 286-287 and distance codes 30-31 are
michael@0 955 never used but participate in the huffman construction.
michael@0 956
michael@0 957 10 (2) - Dynamic Huffman codes. (See expanding Huffman codes)
michael@0 958
michael@0 959 11 (3) - Reserved - Flag a "Error in compressed data" if seen.
michael@0 960
michael@0 961 Expanding Huffman Codes
michael@0 962 -----------------------
michael@0 963 If the data block is stored with dynamic Huffman codes, the Huffman
michael@0 964 codes are sent in the following compressed format:
michael@0 965
michael@0 966 5 Bits: # of Literal codes sent - 256 (256 - 286)
michael@0 967 All other codes are never sent.
michael@0 968 5 Bits: # of Dist codes - 1 (1 - 32)
michael@0 969 4 Bits: # of Bit Length codes - 3 (3 - 19)
michael@0 970
michael@0 971 The Huffman codes are sent as bit lengths and the codes are built as
michael@0 972 described in the implode algorithm. The bit lengths themselves are
michael@0 973 compressed with Huffman codes. There are 19 bit length codes:
michael@0 974
michael@0 975 0 - 15: Represent bit lengths of 0 - 15
michael@0 976 16: Copy the previous bit length 3 - 6 times.
michael@0 977 The next 2 bits indicate repeat length (0 = 3, ... ,3 = 6)
michael@0 978 Example: Codes 8, 16 (+2 bits 11), 16 (+2 bits 10) will
michael@0 979 expand to 12 bit lengths of 8 (1 + 6 + 5)
michael@0 980 17: Repeat a bit length of 0 for 3 - 10 times. (3 bits of length)
michael@0 981 18: Repeat a bit length of 0 for 11 - 138 times (7 bits of length)
michael@0 982
michael@0 983 The lengths of the bit length codes are sent packed 3 bits per value
michael@0 984 (0 - 7) in the following order:
michael@0 985
michael@0 986 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15
michael@0 987
michael@0 988 The Huffman codes should be built as described in the Implode algorithm
michael@0 989 except codes are assigned starting at the shortest bit length, i.e. the
michael@0 990 shortest code should be all 0's rather than all 1's. Also, codes with
michael@0 991 a bit length of zero do not participate in the tree construction. The
michael@0 992 codes are then used to decode the bit lengths for the literal and
michael@0 993 distance tables.
michael@0 994
michael@0 995 The bit lengths for the literal tables are sent first with the number
michael@0 996 of entries sent described by the 5 bits sent earlier. There are up
michael@0 997 to 286 literal characters; the first 256 represent the respective 8
michael@0 998 bit character, code 256 represents the End-Of-Block code, the remaining
michael@0 999 29 codes represent copy lengths of 3 thru 258. There are up to 30
michael@0 1000 distance codes representing distances from 1 thru 32k as described
michael@0 1001 below.
michael@0 1002
michael@0 1003 Length Codes
michael@0 1004 ------------
michael@0 1005 Extra Extra Extra Extra
michael@0 1006 Code Bits Length Code Bits Lengths Code Bits Lengths Code Bits Length(s)
michael@0 1007 ---- ---- ------ ---- ---- ------- ---- ---- ------- ---- ---- ---------
michael@0 1008 257 0 3 265 1 11,12 273 3 35-42 281 5 131-162
michael@0 1009 258 0 4 266 1 13,14 274 3 43-50 282 5 163-194
michael@0 1010 259 0 5 267 1 15,16 275 3 51-58 283 5 195-226
michael@0 1011 260 0 6 268 1 17,18 276 3 59-66 284 5 227-257
michael@0 1012 261 0 7 269 2 19-22 277 4 67-82 285 0 258
michael@0 1013 262 0 8 270 2 23-26 278 4 83-98
michael@0 1014 263 0 9 271 2 27-30 279 4 99-114
michael@0 1015 264 0 10 272 2 31-34 280 4 115-130
michael@0 1016
michael@0 1017 Distance Codes
michael@0 1018 --------------
michael@0 1019 Extra Extra Extra Extra
michael@0 1020 Code Bits Dist Code Bits Dist Code Bits Distance Code Bits Distance
michael@0 1021 ---- ---- ---- ---- ---- ------ ---- ---- -------- ---- ---- --------
michael@0 1022 0 0 1 8 3 17-24 16 7 257-384 24 11 4097-6144
michael@0 1023 1 0 2 9 3 25-32 17 7 385-512 25 11 6145-8192
michael@0 1024 2 0 3 10 4 33-48 18 8 513-768 26 12 8193-12288
michael@0 1025 3 0 4 11 4 49-64 19 8 769-1024 27 12 12289-16384
michael@0 1026 4 1 5,6 12 5 65-96 20 9 1025-1536 28 13 16385-24576
michael@0 1027 5 1 7,8 13 5 97-128 21 9 1537-2048 29 13 24577-32768
michael@0 1028 6 2 9-12 14 6 129-192 22 10 2049-3072
michael@0 1029 7 2 13-16 15 6 193-256 23 10 3073-4096
michael@0 1030
michael@0 1031 The compressed data stream begins immediately after the
michael@0 1032 compressed header data. The compressed data stream can be
michael@0 1033 interpreted as follows:
michael@0 1034
michael@0 1035 do
michael@0 1036 read header from input stream.
michael@0 1037
michael@0 1038 if stored block
michael@0 1039 skip bits until byte aligned
michael@0 1040 read count and 1's compliment of count
michael@0 1041 copy count bytes data block
michael@0 1042 otherwise
michael@0 1043 loop until end of block code sent
michael@0 1044 decode literal character from input stream
michael@0 1045 if literal < 256
michael@0 1046 copy character to the output stream
michael@0 1047 otherwise
michael@0 1048 if literal = end of block
michael@0 1049 break from loop
michael@0 1050 otherwise
michael@0 1051 decode distance from input stream
michael@0 1052
michael@0 1053 move backwards distance bytes in the output stream, and
michael@0 1054 copy length characters from this position to the output
michael@0 1055 stream.
michael@0 1056 end loop
michael@0 1057 while not last block
michael@0 1058
michael@0 1059 if data descriptor exists
michael@0 1060 skip bits until byte aligned
michael@0 1061 read crc and sizes
michael@0 1062 endif
michael@0 1063
michael@0 1064 Decryption
michael@0 1065 ----------
michael@0 1066
michael@0 1067 The encryption used in PKZIP was generously supplied by Roger
michael@0 1068 Schlafly. PKWARE is grateful to Mr. Schlafly for his expert
michael@0 1069 help and advice in the field of data encryption.
michael@0 1070
michael@0 1071 PKZIP encrypts the compressed data stream. Encrypted files must
michael@0 1072 be decrypted before they can be extracted.
michael@0 1073
michael@0 1074 Each encrypted file has an extra 12 bytes stored at the start of
michael@0 1075 the data area defining the encryption header for that file. The
michael@0 1076 encryption header is originally set to random values, and then
michael@0 1077 itself encrypted, using three, 32-bit keys. The key values are
michael@0 1078 initialized using the supplied encryption password. After each byte
michael@0 1079 is encrypted, the keys are then updated using pseudo-random number
michael@0 1080 generation techniques in combination with the same CRC-32 algorithm
michael@0 1081 used in PKZIP and described elsewhere in this document.
michael@0 1082
michael@0 1083 The following is the basic steps required to decrypt a file:
michael@0 1084
michael@0 1085 1) Initialize the three 32-bit keys with the password.
michael@0 1086 2) Read and decrypt the 12-byte encryption header, further
michael@0 1087 initializing the encryption keys.
michael@0 1088 3) Read and decrypt the compressed data stream using the
michael@0 1089 encryption keys.
michael@0 1090
michael@0 1091 Step 1 - Initializing the encryption keys
michael@0 1092 -----------------------------------------
michael@0 1093
michael@0 1094 Key(0) <- 305419896
michael@0 1095 Key(1) <- 591751049
michael@0 1096 Key(2) <- 878082192
michael@0 1097
michael@0 1098 loop for i <- 0 to length(password)-1
michael@0 1099 update_keys(password(i))
michael@0 1100 end loop
michael@0 1101
michael@0 1102 Where update_keys() is defined as:
michael@0 1103
michael@0 1104 update_keys(char):
michael@0 1105 Key(0) <- crc32(key(0),char)
michael@0 1106 Key(1) <- Key(1) + (Key(0) & 000000ffH)
michael@0 1107 Key(1) <- Key(1) * 134775813 + 1
michael@0 1108 Key(2) <- crc32(key(2),key(1) >> 24)
michael@0 1109 end update_keys
michael@0 1110
michael@0 1111 Where crc32(old_crc,char) is a routine that given a CRC value and a
michael@0 1112 character, returns an updated CRC value after applying the CRC-32
michael@0 1113 algorithm described elsewhere in this document.
michael@0 1114
michael@0 1115 Step 2 - Decrypting the encryption header
michael@0 1116 -----------------------------------------
michael@0 1117
michael@0 1118 The purpose of this step is to further initialize the encryption
michael@0 1119 keys, based on random data, to render a plaintext attack on the
michael@0 1120 data ineffective.
michael@0 1121
michael@0 1122 Read the 12-byte encryption header into Buffer, in locations
michael@0 1123 Buffer(0) thru Buffer(11).
michael@0 1124
michael@0 1125 loop for i <- 0 to 11
michael@0 1126 C <- buffer(i) ^ decrypt_byte()
michael@0 1127 update_keys(C)
michael@0 1128 buffer(i) <- C
michael@0 1129 end loop
michael@0 1130
michael@0 1131 Where decrypt_byte() is defined as:
michael@0 1132
michael@0 1133 unsigned char decrypt_byte()
michael@0 1134 local unsigned short temp
michael@0 1135 temp <- Key(2) | 2
michael@0 1136 decrypt_byte <- (temp * (temp ^ 1)) >> 8
michael@0 1137 end decrypt_byte
michael@0 1138
michael@0 1139 After the header is decrypted, the last 1 or 2 bytes in Buffer
michael@0 1140 should be the high-order word/byte of the CRC for the file being
michael@0 1141 decrypted, stored in Intel low-byte/high-byte order. Versions of
michael@0 1142 PKZIP prior to 2.0 used a 2 byte CRC check; a 1 byte CRC check is
michael@0 1143 used on versions after 2.0. This can be used to test if the password
michael@0 1144 supplied is correct or not.
michael@0 1145
michael@0 1146 Step 3 - Decrypting the compressed data stream
michael@0 1147 ----------------------------------------------
michael@0 1148
michael@0 1149 The compressed data stream can be decrypted as follows:
michael@0 1150
michael@0 1151 loop until done
michael@0 1152 read a character into C
michael@0 1153 Temp <- C ^ decrypt_byte()
michael@0 1154 update_keys(temp)
michael@0 1155 output Temp
michael@0 1156 end loop
michael@0 1157
michael@0 1158 In addition to the above mentioned contributors to PKZIP and PKUNZIP,
michael@0 1159 I would like to extend special thanks to Robert Mahoney for suggesting
michael@0 1160 the extension .ZIP for this software.
michael@0 1161
michael@0 1162 References:
michael@0 1163
michael@0 1164 Fiala, Edward R., and Greene, Daniel H., "Data compression with
michael@0 1165 finite windows", Communications of the ACM, Volume 32, Number 4,
michael@0 1166 April 1989, pages 490-505.
michael@0 1167
michael@0 1168 Held, Gilbert, "Data Compression, Techniques and Applications,
michael@0 1169 Hardware and Software Considerations", John Wiley & Sons, 1987.
michael@0 1170
michael@0 1171 Huffman, D.A., "A method for the construction of minimum-redundancy
michael@0 1172 codes", Proceedings of the IRE, Volume 40, Number 9, September 1952,
michael@0 1173 pages 1098-1101.
michael@0 1174
michael@0 1175 Nelson, Mark, "LZW Data Compression", Dr. Dobbs Journal, Volume 14,
michael@0 1176 Number 10, October 1989, pages 29-37.
michael@0 1177
michael@0 1178 Nelson, Mark, "The Data Compression Book", M&T Books, 1991.
michael@0 1179
michael@0 1180 Storer, James A., "Data Compression, Methods and Theory",
michael@0 1181 Computer Science Press, 1988
michael@0 1182
michael@0 1183 Welch, Terry, "A Technique for High-Performance Data Compression",
michael@0 1184 IEEE Computer, Volume 17, Number 6, June 1984, pages 8-19.
michael@0 1185
michael@0 1186 Ziv, J. and Lempel, A., "A universal algorithm for sequential data
michael@0 1187 compression", Communications of the ACM, Volume 30, Number 6,
michael@0 1188 June 1987, pages 520-540.
michael@0 1189
michael@0 1190 Ziv, J. and Lempel, A., "Compression of individual sequences via
michael@0 1191 variable-rate coding", IEEE Transactions on Information Theory,
michael@0 1192 Volume 24, Number 5, September 1978, pages 530-536.

mercurial