Thu, 22 Jan 2015 13:21:57 +0100
Incorporate requested changes from Mozilla in review:
https://bugzilla.mozilla.org/show_bug.cgi?id=1123480#c6
michael@0 | 1 | Revised: 03/01/1999 |
michael@0 | 2 | |
michael@0 | 3 | Disclaimer |
michael@0 | 4 | ---------- |
michael@0 | 5 | |
michael@0 | 6 | Although PKWARE will attempt to supply current and accurate |
michael@0 | 7 | information relating to its file formats, algorithms, and the |
michael@0 | 8 | subject programs, the possibility of error can not be eliminated. |
michael@0 | 9 | PKWARE therefore expressly disclaims any warranty that the |
michael@0 | 10 | information contained in the associated materials relating to the |
michael@0 | 11 | subject programs and/or the format of the files created or |
michael@0 | 12 | accessed by the subject programs and/or the algorithms used by |
michael@0 | 13 | the subject programs, or any other matter, is current, correct or |
michael@0 | 14 | accurate as delivered. Any risk of damage due to any possible |
michael@0 | 15 | inaccurate information is assumed by the user of the information. |
michael@0 | 16 | Furthermore, the information relating to the subject programs |
michael@0 | 17 | and/or the file formats created or accessed by the subject |
michael@0 | 18 | programs and/or the algorithms used by the subject programs is |
michael@0 | 19 | subject to change without notice. |
michael@0 | 20 | |
michael@0 | 21 | General Format of a ZIP file |
michael@0 | 22 | ---------------------------- |
michael@0 | 23 | |
michael@0 | 24 | Files stored in arbitrary order. Large zipfiles can span multiple |
michael@0 | 25 | diskette media. |
michael@0 | 26 | |
michael@0 | 27 | Overall zipfile format: |
michael@0 | 28 | |
michael@0 | 29 | [local file header + file data + data_descriptor] . . . |
michael@0 | 30 | [central directory] end of central directory record |
michael@0 | 31 | |
michael@0 | 32 | |
michael@0 | 33 | A. Local file header: |
michael@0 | 34 | |
michael@0 | 35 | local file header signature 4 bytes (0x04034b50) |
michael@0 | 36 | version needed to extract 2 bytes |
michael@0 | 37 | general purpose bit flag 2 bytes |
michael@0 | 38 | compression method 2 bytes |
michael@0 | 39 | last mod file time 2 bytes |
michael@0 | 40 | last mod file date 2 bytes |
michael@0 | 41 | crc-32 4 bytes |
michael@0 | 42 | compressed size 4 bytes |
michael@0 | 43 | uncompressed size 4 bytes |
michael@0 | 44 | filename length 2 bytes |
michael@0 | 45 | extra field length 2 bytes |
michael@0 | 46 | |
michael@0 | 47 | filename (variable size) |
michael@0 | 48 | extra field (variable size) |
michael@0 | 49 | |
michael@0 | 50 | B. Data descriptor: |
michael@0 | 51 | |
michael@0 | 52 | crc-32 4 bytes |
michael@0 | 53 | compressed size 4 bytes |
michael@0 | 54 | uncompressed size 4 bytes |
michael@0 | 55 | |
michael@0 | 56 | This descriptor exists only if bit 3 of the general |
michael@0 | 57 | purpose bit flag is set (see below). It is byte aligned |
michael@0 | 58 | and immediately follows the last byte of compressed data. |
michael@0 | 59 | This descriptor is used only when it was not possible to |
michael@0 | 60 | seek in the output zip file, e.g., when the output zip file |
michael@0 | 61 | was standard output or a non seekable device. |
michael@0 | 62 | |
michael@0 | 63 | C. Central directory structure: |
michael@0 | 64 | |
michael@0 | 65 | [file header] . . . end of central dir record |
michael@0 | 66 | |
michael@0 | 67 | File header: |
michael@0 | 68 | |
michael@0 | 69 | central file header signature 4 bytes (0x02014b50) |
michael@0 | 70 | version made by 2 bytes |
michael@0 | 71 | version needed to extract 2 bytes |
michael@0 | 72 | general purpose bit flag 2 bytes |
michael@0 | 73 | compression method 2 bytes |
michael@0 | 74 | last mod file time 2 bytes |
michael@0 | 75 | last mod file date 2 bytes |
michael@0 | 76 | crc-32 4 bytes |
michael@0 | 77 | compressed size 4 bytes |
michael@0 | 78 | uncompressed size 4 bytes |
michael@0 | 79 | filename length 2 bytes |
michael@0 | 80 | extra field length 2 bytes |
michael@0 | 81 | file comment length 2 bytes |
michael@0 | 82 | disk number start 2 bytes |
michael@0 | 83 | internal file attributes 2 bytes |
michael@0 | 84 | external file attributes 4 bytes |
michael@0 | 85 | relative offset of local header 4 bytes |
michael@0 | 86 | |
michael@0 | 87 | filename (variable size) |
michael@0 | 88 | extra field (variable size) |
michael@0 | 89 | file comment (variable size) |
michael@0 | 90 | |
michael@0 | 91 | End of central dir record: |
michael@0 | 92 | |
michael@0 | 93 | end of central dir signature 4 bytes (0x06054b50) |
michael@0 | 94 | number of this disk 2 bytes |
michael@0 | 95 | number of the disk with the |
michael@0 | 96 | start of the central directory 2 bytes |
michael@0 | 97 | total number of entries in |
michael@0 | 98 | the central dir on this disk 2 bytes |
michael@0 | 99 | total number of entries in |
michael@0 | 100 | the central dir 2 bytes |
michael@0 | 101 | size of the central directory 4 bytes |
michael@0 | 102 | offset of start of central |
michael@0 | 103 | directory with respect to |
michael@0 | 104 | the starting disk number 4 bytes |
michael@0 | 105 | zipfile comment length 2 bytes |
michael@0 | 106 | zipfile comment (variable size) |
michael@0 | 107 | |
michael@0 | 108 | D. Explanation of fields: |
michael@0 | 109 | |
michael@0 | 110 | version made by (2 bytes) |
michael@0 | 111 | |
michael@0 | 112 | The upper byte indicates the compatibility of the file |
michael@0 | 113 | attribute information. If the external file attributes |
michael@0 | 114 | are compatible with MS-DOS and can be read by PKZIP for |
michael@0 | 115 | DOS version 2.04g then this value will be zero. If these |
michael@0 | 116 | attributes are not compatible, then this value will |
michael@0 | 117 | identify the host system on which the attributes are |
michael@0 | 118 | compatible. Software can use this information to determine |
michael@0 | 119 | the line record format for text files etc. The current |
michael@0 | 120 | mappings are: |
michael@0 | 121 | |
michael@0 | 122 | 0 - MS-DOS and OS/2 (FAT / VFAT / FAT32 file systems) |
michael@0 | 123 | 1 - Amiga 2 - VAX/VMS |
michael@0 | 124 | 3 - Unix 4 - VM/CMS |
michael@0 | 125 | 5 - Atari ST 6 - OS/2 H.P.F.S. |
michael@0 | 126 | 7 - Macintosh 8 - Z-System |
michael@0 | 127 | 9 - CP/M 10 - Windows NTFS |
michael@0 | 128 | 11 thru 255 - unused |
michael@0 | 129 | |
michael@0 | 130 | The lower byte indicates the version number of the |
michael@0 | 131 | software used to encode the file. The value/10 |
michael@0 | 132 | indicates the major version number, and the value |
michael@0 | 133 | mod 10 is the minor version number. |
michael@0 | 134 | |
michael@0 | 135 | version needed to extract (2 bytes) |
michael@0 | 136 | |
michael@0 | 137 | The minimum software version needed to extract the |
michael@0 | 138 | file, mapped as above. |
michael@0 | 139 | |
michael@0 | 140 | general purpose bit flag: (2 bytes) |
michael@0 | 141 | |
michael@0 | 142 | Bit 0: If set, indicates that the file is encrypted. |
michael@0 | 143 | |
michael@0 | 144 | (For Method 6 - Imploding) |
michael@0 | 145 | Bit 1: If the compression method used was type 6, |
michael@0 | 146 | Imploding, then this bit, if set, indicates |
michael@0 | 147 | an 8K sliding dictionary was used. If clear, |
michael@0 | 148 | then a 4K sliding dictionary was used. |
michael@0 | 149 | Bit 2: If the compression method used was type 6, |
michael@0 | 150 | Imploding, then this bit, if set, indicates |
michael@0 | 151 | 3 Shannon-Fano trees were used to encode the |
michael@0 | 152 | sliding dictionary output. If clear, then 2 |
michael@0 | 153 | Shannon-Fano trees were used. |
michael@0 | 154 | |
michael@0 | 155 | (For Method 8 - Deflating) |
michael@0 | 156 | Bit 2 Bit 1 |
michael@0 | 157 | 0 0 Normal (-en) compression option was used. |
michael@0 | 158 | 0 1 Maximum (-ex) compression option was used. |
michael@0 | 159 | 1 0 Fast (-ef) compression option was used. |
michael@0 | 160 | 1 1 Super Fast (-es) compression option was used. |
michael@0 | 161 | |
michael@0 | 162 | Note: Bits 1 and 2 are undefined if the compression |
michael@0 | 163 | method is any other. |
michael@0 | 164 | |
michael@0 | 165 | Bit 3: If this bit is set, the fields crc-32, compressed |
michael@0 | 166 | size and uncompressed size are set to zero in the |
michael@0 | 167 | local header. The correct values are put in the |
michael@0 | 168 | data descriptor immediately following the compressed |
michael@0 | 169 | data. (Note: PKZIP version 2.04g for DOS only |
michael@0 | 170 | recognizes this bit for method 8 compression, newer |
michael@0 | 171 | versions of PKZIP recognize this bit for any |
michael@0 | 172 | compression method.) |
michael@0 | 173 | |
michael@0 | 174 | Bit 4: Reserved for use with method 8, for enhanced |
michael@0 | 175 | deflating. |
michael@0 | 176 | |
michael@0 | 177 | Bit 5: If this bit is set, this indicates that the file is |
michael@0 | 178 | compressed patched data. (Note: Requires PKZIP |
michael@0 | 179 | version 2.70 or greater) |
michael@0 | 180 | |
michael@0 | 181 | Bit 6: Currently unused. |
michael@0 | 182 | |
michael@0 | 183 | Bit 7: Currently unused. |
michael@0 | 184 | |
michael@0 | 185 | Bit 8: Currently unused. |
michael@0 | 186 | |
michael@0 | 187 | Bit 9: Currently unused. |
michael@0 | 188 | |
michael@0 | 189 | Bit 10: Currently unused. |
michael@0 | 190 | |
michael@0 | 191 | Bit 11: Currently unused. |
michael@0 | 192 | |
michael@0 | 193 | Bit 12: Reserved by PKWARE for enhanced compression. |
michael@0 | 194 | |
michael@0 | 195 | Bit 13: Reserved by PKWARE. |
michael@0 | 196 | |
michael@0 | 197 | Bit 14: Reserved by PKWARE. |
michael@0 | 198 | |
michael@0 | 199 | Bit 15: Reserved by PKWARE. |
michael@0 | 200 | |
michael@0 | 201 | compression method: (2 bytes) |
michael@0 | 202 | |
michael@0 | 203 | (see accompanying documentation for algorithm |
michael@0 | 204 | descriptions) |
michael@0 | 205 | |
michael@0 | 206 | 0 - The file is stored (no compression) |
michael@0 | 207 | 1 - The file is Shrunk |
michael@0 | 208 | 2 - The file is Reduced with compression factor 1 |
michael@0 | 209 | 3 - The file is Reduced with compression factor 2 |
michael@0 | 210 | 4 - The file is Reduced with compression factor 3 |
michael@0 | 211 | 5 - The file is Reduced with compression factor 4 |
michael@0 | 212 | 6 - The file is Imploded |
michael@0 | 213 | 7 - Reserved for Tokenizing compression algorithm |
michael@0 | 214 | 8 - The file is Deflated |
michael@0 | 215 | 9 - Reserved for enhanced Deflating |
michael@0 | 216 | 10 - PKWARE Date Compression Library Imploding |
michael@0 | 217 | |
michael@0 | 218 | date and time fields: (2 bytes each) |
michael@0 | 219 | |
michael@0 | 220 | The date and time are encoded in standard MS-DOS format. |
michael@0 | 221 | If input came from standard input, the date and time are |
michael@0 | 222 | those at which compression was started for this data. |
michael@0 | 223 | |
michael@0 | 224 | CRC-32: (4 bytes) |
michael@0 | 225 | |
michael@0 | 226 | The CRC-32 algorithm was generously contributed by |
michael@0 | 227 | David Schwaderer and can be found in his excellent |
michael@0 | 228 | book "C Programmers Guide to NetBIOS" published by |
michael@0 | 229 | Howard W. Sams & Co. Inc. The 'magic number' for |
michael@0 | 230 | the CRC is 0xdebb20e3. The proper CRC pre and post |
michael@0 | 231 | conditioning is used, meaning that the CRC register |
michael@0 | 232 | is pre-conditioned with all ones (a starting value |
michael@0 | 233 | of 0xffffffff) and the value is post-conditioned by |
michael@0 | 234 | taking the one's complement of the CRC residual. |
michael@0 | 235 | If bit 3 of the general purpose flag is set, this |
michael@0 | 236 | field is set to zero in the local header and the correct |
michael@0 | 237 | value is put in the data descriptor and in the central |
michael@0 | 238 | directory. |
michael@0 | 239 | |
michael@0 | 240 | compressed size: (4 bytes) |
michael@0 | 241 | uncompressed size: (4 bytes) |
michael@0 | 242 | |
michael@0 | 243 | The size of the file compressed and uncompressed, |
michael@0 | 244 | respectively. If bit 3 of the general purpose bit flag |
michael@0 | 245 | is set, these fields are set to zero in the local header |
michael@0 | 246 | and the correct values are put in the data descriptor and |
michael@0 | 247 | in the central directory. |
michael@0 | 248 | |
michael@0 | 249 | filename length: (2 bytes) |
michael@0 | 250 | extra field length: (2 bytes) |
michael@0 | 251 | file comment length: (2 bytes) |
michael@0 | 252 | |
michael@0 | 253 | The length of the filename, extra field, and comment |
michael@0 | 254 | fields respectively. The combined length of any |
michael@0 | 255 | directory record and these three fields should not |
michael@0 | 256 | generally exceed 65,535 bytes. If input came from standard |
michael@0 | 257 | input, the filename length is set to zero. |
michael@0 | 258 | |
michael@0 | 259 | disk number start: (2 bytes) |
michael@0 | 260 | |
michael@0 | 261 | The number of the disk on which this file begins. |
michael@0 | 262 | |
michael@0 | 263 | internal file attributes: (2 bytes) |
michael@0 | 264 | |
michael@0 | 265 | The lowest bit of this field indicates, if set, that |
michael@0 | 266 | the file is apparently an ASCII or text file. If not |
michael@0 | 267 | set, that the file apparently contains binary data. |
michael@0 | 268 | The remaining bits are unused in version 1.0. |
michael@0 | 269 | |
michael@0 | 270 | Bits 1 and 2 are reserved for use by PKWARE. |
michael@0 | 271 | |
michael@0 | 272 | external file attributes: (4 bytes) |
michael@0 | 273 | |
michael@0 | 274 | The mapping of the external attributes is |
michael@0 | 275 | host-system dependent (see 'version made by'). For |
michael@0 | 276 | MS-DOS, the low order byte is the MS-DOS directory |
michael@0 | 277 | attribute byte. If input came from standard input, this |
michael@0 | 278 | field is set to zero. |
michael@0 | 279 | |
michael@0 | 280 | relative offset of local header: (4 bytes) |
michael@0 | 281 | |
michael@0 | 282 | This is the offset from the start of the first disk on |
michael@0 | 283 | which this file appears, to where the local header should |
michael@0 | 284 | be found. |
michael@0 | 285 | |
michael@0 | 286 | filename: (Variable) |
michael@0 | 287 | |
michael@0 | 288 | The name of the file, with optional relative path. |
michael@0 | 289 | The path stored should not contain a drive or |
michael@0 | 290 | device letter, or a leading slash. All slashes |
michael@0 | 291 | should be forward slashes '/' as opposed to |
michael@0 | 292 | backwards slashes '\' for compatibility with Amiga |
michael@0 | 293 | and Unix file systems etc. If input came from standard |
michael@0 | 294 | input, there is no filename field. |
michael@0 | 295 | |
michael@0 | 296 | extra field: (Variable) |
michael@0 | 297 | |
michael@0 | 298 | This is for future expansion. If additional information |
michael@0 | 299 | needs to be stored in the future, it should be stored |
michael@0 | 300 | here. Earlier versions of the software can then safely |
michael@0 | 301 | skip this file, and find the next file or header. This |
michael@0 | 302 | field will be 0 length in version 1.0. |
michael@0 | 303 | |
michael@0 | 304 | In order to allow different programs and different types |
michael@0 | 305 | of information to be stored in the 'extra' field in .ZIP |
michael@0 | 306 | files, the following structure should be used for all |
michael@0 | 307 | programs storing data in this field: |
michael@0 | 308 | |
michael@0 | 309 | header1+data1 + header2+data2 . . . |
michael@0 | 310 | |
michael@0 | 311 | Each header should consist of: |
michael@0 | 312 | |
michael@0 | 313 | Header ID - 2 bytes |
michael@0 | 314 | Data Size - 2 bytes |
michael@0 | 315 | |
michael@0 | 316 | Note: all fields stored in Intel low-byte/high-byte order. |
michael@0 | 317 | |
michael@0 | 318 | The Header ID field indicates the type of data that is in |
michael@0 | 319 | the following data block. |
michael@0 | 320 | |
michael@0 | 321 | Header ID's of 0 thru 31 are reserved for use by PKWARE. |
michael@0 | 322 | The remaining ID's can be used by third party vendors for |
michael@0 | 323 | proprietary usage. |
michael@0 | 324 | |
michael@0 | 325 | The current Header ID mappings defined by PKWARE are: |
michael@0 | 326 | |
michael@0 | 327 | 0x0007 AV Info |
michael@0 | 328 | 0x0009 OS/2 |
michael@0 | 329 | 0x000a NTFS |
michael@0 | 330 | 0x000c VAX/VMS |
michael@0 | 331 | 0x000d Unix |
michael@0 | 332 | 0x000f Patch Descriptor |
michael@0 | 333 | |
michael@0 | 334 | Several third party mappings commonly used are: |
michael@0 | 335 | |
michael@0 | 336 | 0x4b46 FWKCS MD5 (see below) |
michael@0 | 337 | 0x07c8 Macintosh |
michael@0 | 338 | 0x4341 Acorn/SparkFS |
michael@0 | 339 | 0x4453 Windows NT security descriptor (binary ACL) |
michael@0 | 340 | 0x4704 VM/CMS |
michael@0 | 341 | 0x470f MVS |
michael@0 | 342 | 0x4c41 OS/2 access control list (text ACL) |
michael@0 | 343 | 0x4d49 Info-ZIP VMS (VAX or Alpha) |
michael@0 | 344 | 0x5455 extended timestamp |
michael@0 | 345 | 0x5855 Info-ZIP Unix (original, also OS/2, NT, etc) |
michael@0 | 346 | 0x6542 BeOS/BeBox |
michael@0 | 347 | 0x756e ASi Unix |
michael@0 | 348 | 0x7855 Info-ZIP Unix (new) |
michael@0 | 349 | 0xfd4a SMS/QDOS |
michael@0 | 350 | |
michael@0 | 351 | The Data Size field indicates the size of the following |
michael@0 | 352 | data block. Programs can use this value to skip to the |
michael@0 | 353 | next header block, passing over any data blocks that are |
michael@0 | 354 | not of interest. |
michael@0 | 355 | |
michael@0 | 356 | Note: As stated above, the size of the entire .ZIP file |
michael@0 | 357 | header, including the filename, comment, and extra |
michael@0 | 358 | field should not exceed 64K in size. |
michael@0 | 359 | |
michael@0 | 360 | In case two different programs should appropriate the same |
michael@0 | 361 | Header ID value, it is strongly recommended that each |
michael@0 | 362 | program place a unique signature of at least two bytes in |
michael@0 | 363 | size (and preferably 4 bytes or bigger) at the start of |
michael@0 | 364 | each data area. Every program should verify that its |
michael@0 | 365 | unique signature is present, in addition to the Header ID |
michael@0 | 366 | value being correct, before assuming that it is a block of |
michael@0 | 367 | known type. |
michael@0 | 368 | |
michael@0 | 369 | -OS/2 Extra Field: |
michael@0 | 370 | |
michael@0 | 371 | The following is the layout of the OS/2 attributes "extra" |
michael@0 | 372 | block. (Last Revision 09/05/95) |
michael@0 | 373 | |
michael@0 | 374 | Note: all fields stored in Intel low-byte/high-byte order. |
michael@0 | 375 | |
michael@0 | 376 | Value Size Description |
michael@0 | 377 | ----- ---- ----------- |
michael@0 | 378 | (OS/2) 0x0009 2 bytes Tag for this "extra" block type |
michael@0 | 379 | TSize 2 bytes Size for the following data block |
michael@0 | 380 | BSize 4 bytes Uncompressed Block Size |
michael@0 | 381 | CType 2 bytes Compression type |
michael@0 | 382 | EACRC 4 bytes CRC value for uncompress block |
michael@0 | 383 | (var) variable Compressed block |
michael@0 | 384 | |
michael@0 | 385 | The OS/2 extended attribute structure (FEA2LIST) is |
michael@0 | 386 | compressed and then stored in it's entirety within this |
michael@0 | 387 | structure. There will only ever be one "block" of data in |
michael@0 | 388 | VarFields[]. |
michael@0 | 389 | |
michael@0 | 390 | -UNIX Extra Field: |
michael@0 | 391 | |
michael@0 | 392 | The following is the layout of the Unix "extra" block. |
michael@0 | 393 | Note: all fields are stored in Intel low-byte/high-byte |
michael@0 | 394 | order. |
michael@0 | 395 | |
michael@0 | 396 | Value Size Description |
michael@0 | 397 | ----- ---- ----------- |
michael@0 | 398 | (UNIX) 0x000d 2 bytes Tag for this "extra" block type |
michael@0 | 399 | TSize 2 bytes Size for the following data block |
michael@0 | 400 | Atime 4 bytes File last access time |
michael@0 | 401 | Mtime 4 bytes File last modification time |
michael@0 | 402 | Uid 2 bytes File user ID |
michael@0 | 403 | Gid 2 bytes File group ID |
michael@0 | 404 | (var) variable Variable length data field |
michael@0 | 405 | |
michael@0 | 406 | The variable length data field will contain file type |
michael@0 | 407 | specific data. Currently the only values allowed are |
michael@0 | 408 | the original "linked to" file names for hard or symbolic |
michael@0 | 409 | links. |
michael@0 | 410 | |
michael@0 | 411 | -VAX/VMS Extra Field: |
michael@0 | 412 | |
michael@0 | 413 | The following is the layout of the VAX/VMS attributes |
michael@0 | 414 | "extra" block. |
michael@0 | 415 | |
michael@0 | 416 | Note: all fields stored in Intel low-byte/high-byte order. |
michael@0 | 417 | |
michael@0 | 418 | Value Size Description |
michael@0 | 419 | ----- ---- ----------- |
michael@0 | 420 | (VMS) 0x000c 2 bytes Tag for this "extra" block type |
michael@0 | 421 | TSize 2 bytes Size of the total "extra" block |
michael@0 | 422 | CRC 4 bytes 32-bit CRC for remainder of the block |
michael@0 | 423 | Tag1 2 bytes VMS attribute tag value #1 |
michael@0 | 424 | Size1 2 bytes Size of attribute #1, in bytes |
michael@0 | 425 | (var.) Size1 Attribute #1 data |
michael@0 | 426 | . |
michael@0 | 427 | . |
michael@0 | 428 | . |
michael@0 | 429 | TagN 2 bytes VMS attribute tage value #N |
michael@0 | 430 | SizeN 2 bytes Size of attribute #N, in bytes |
michael@0 | 431 | (var.) SizeN Attribute #N data |
michael@0 | 432 | |
michael@0 | 433 | Rules: |
michael@0 | 434 | |
michael@0 | 435 | 1. There will be one or more of attributes present, which |
michael@0 | 436 | will each be preceded by the above TagX & SizeX values. |
michael@0 | 437 | These values are identical to the ATR$C_XXXX and |
michael@0 | 438 | ATR$S_XXXX constants which are defined in ATR.H under |
michael@0 | 439 | VMS C. Neither of these values will ever be zero. |
michael@0 | 440 | |
michael@0 | 441 | 2. No word alignment or padding is performed. |
michael@0 | 442 | |
michael@0 | 443 | 3. A well-behaved PKZIP/VMS program should never produce |
michael@0 | 444 | more than one sub-block with the same TagX value. Also, |
michael@0 | 445 | there will never be more than one "extra" block of type |
michael@0 | 446 | 0x000c in a particular directory record. |
michael@0 | 447 | |
michael@0 | 448 | -NTFS Extra Field: |
michael@0 | 449 | |
michael@0 | 450 | The following is the layout of the NTFS attributes |
michael@0 | 451 | "extra" block. |
michael@0 | 452 | |
michael@0 | 453 | Note: all fields stored in Intel low-byte/high-byte order. |
michael@0 | 454 | |
michael@0 | 455 | Value Size Description |
michael@0 | 456 | ----- ---- ----------- |
michael@0 | 457 | (NTFS) 0x000a 2 bytes Tag for this "extra" block type |
michael@0 | 458 | TSize 2 bytes Size of the total "extra" block |
michael@0 | 459 | Reserved 4 bytes Reserved for future use |
michael@0 | 460 | Tag1 2 bytes NTFS attribute tag value #1 |
michael@0 | 461 | Size1 2 bytes Size of attribute #1, in bytes |
michael@0 | 462 | (var.) Size1 Attribute #1 data |
michael@0 | 463 | . |
michael@0 | 464 | . |
michael@0 | 465 | . |
michael@0 | 466 | TagN 2 bytes NTFS attribute tage value #N |
michael@0 | 467 | SizeN 2 bytes Size of attribute #N, in bytes |
michael@0 | 468 | (var.) SizeN Attribute #N data |
michael@0 | 469 | |
michael@0 | 470 | For NTFS, values for Tag1 through TagN are as follows: |
michael@0 | 471 | (currently only one set of attributes is defined for NTFS) |
michael@0 | 472 | |
michael@0 | 473 | Tag Size Description |
michael@0 | 474 | ----- ---- ----------- |
michael@0 | 475 | 0x0001 2 bytes Tag for attribute #1 |
michael@0 | 476 | Size1 2 bytes Size of attribute #1, in bytes |
michael@0 | 477 | Mtime 8 bytes File last modification time |
michael@0 | 478 | Atime 8 bytes File last access time |
michael@0 | 479 | Ctime 8 bytes File creation time |
michael@0 | 480 | |
michael@0 | 481 | -PATCH Descriptor Extra Field: |
michael@0 | 482 | |
michael@0 | 483 | The following is the layout of the Patch Descriptor "extra" |
michael@0 | 484 | block. |
michael@0 | 485 | |
michael@0 | 486 | Note: all fields stored in Intel low-byte/high-byte order. |
michael@0 | 487 | |
michael@0 | 488 | Value Size Description |
michael@0 | 489 | ----- ---- ----------- |
michael@0 | 490 | (Patch) 0x000f 2 bytes Tag for this "extra" block type |
michael@0 | 491 | TSize 2 bytes Size of the total "extra" block |
michael@0 | 492 | Version 2 bytes Version of the descriptor |
michael@0 | 493 | Flags 4 bytes Actions and reactions (see below) |
michael@0 | 494 | OldSize 4 bytes Size of the file about to be patched |
michael@0 | 495 | OldCRC 4 bytes 32-bit CRC of the file to be patched |
michael@0 | 496 | NewSize 4 bytes Size of the resulting file |
michael@0 | 497 | NewCRC 4 bytes 32-bit CRC of the resulting file |
michael@0 | 498 | |
michael@0 | 499 | Actions and reactions |
michael@0 | 500 | |
michael@0 | 501 | Bits Description |
michael@0 | 502 | ---- ---------------- |
michael@0 | 503 | 0 Use for autodetection |
michael@0 | 504 | 1 Treat as selfpatch |
michael@0 | 505 | 2-3 RESERVED |
michael@0 | 506 | 4-5 Action (see below) |
michael@0 | 507 | 6-7 RESERVED |
michael@0 | 508 | 8-9 Reaction (see below) to absent file |
michael@0 | 509 | 10-11 Reaction (see below) to newer file |
michael@0 | 510 | 12-13 Reaction (see below) to unknown file |
michael@0 | 511 | 14-15 RESERVED |
michael@0 | 512 | 16-31 RESERVED |
michael@0 | 513 | |
michael@0 | 514 | Actions |
michael@0 | 515 | |
michael@0 | 516 | Action Value |
michael@0 | 517 | ------ ----- |
michael@0 | 518 | none 0 |
michael@0 | 519 | add 1 |
michael@0 | 520 | delete 2 |
michael@0 | 521 | patch 3 |
michael@0 | 522 | |
michael@0 | 523 | Reactions |
michael@0 | 524 | |
michael@0 | 525 | Reaction Value |
michael@0 | 526 | -------- ----- |
michael@0 | 527 | ask 0 |
michael@0 | 528 | skip 1 |
michael@0 | 529 | ignore 2 |
michael@0 | 530 | fail 3 |
michael@0 | 531 | |
michael@0 | 532 | - FWKCS MD5 Extra Field: |
michael@0 | 533 | |
michael@0 | 534 | The FWKCS Contents_Signature System, used in |
michael@0 | 535 | automatically identifying files independent of filename, |
michael@0 | 536 | optionally adds and uses an extra field to support the |
michael@0 | 537 | rapid creation of an enhanced contents_signature: |
michael@0 | 538 | |
michael@0 | 539 | Header ID = 0x4b46 |
michael@0 | 540 | Data Size = 0x0013 |
michael@0 | 541 | Preface = 'M','D','5' |
michael@0 | 542 | followed by 16 bytes containing the uncompressed file's |
michael@0 | 543 | 128_bit MD5 hash(1), low byte first. |
michael@0 | 544 | |
michael@0 | 545 | When FWKCS revises a zipfile central directory to add |
michael@0 | 546 | this extra field for a file, it also replaces the |
michael@0 | 547 | central directory entry for that file's uncompressed |
michael@0 | 548 | filelength with a measured value. |
michael@0 | 549 | |
michael@0 | 550 | FWKCS provides an option to strip this extra field, if |
michael@0 | 551 | present, from a zipfile central directory. In adding |
michael@0 | 552 | this extra field, FWKCS preserves Zipfile Authenticity |
michael@0 | 553 | Verification; if stripping this extra field, FWKCS |
michael@0 | 554 | preserves all versions of AV through PKZIP version 2.04g. |
michael@0 | 555 | |
michael@0 | 556 | FWKCS, and FWKCS Contents_Signature System, are |
michael@0 | 557 | trademarks of Frederick W. Kantor. |
michael@0 | 558 | |
michael@0 | 559 | (1) R. Rivest, RFC1321.TXT, MIT Laboratory for Computer |
michael@0 | 560 | Science and RSA Data Security, Inc., April 1992. |
michael@0 | 561 | ll.76-77: "The MD5 algorithm is being placed in the |
michael@0 | 562 | public domain for review and possible adoption as a |
michael@0 | 563 | standard." |
michael@0 | 564 | |
michael@0 | 565 | file comment: (Variable) |
michael@0 | 566 | |
michael@0 | 567 | The comment for this file. |
michael@0 | 568 | |
michael@0 | 569 | number of this disk: (2 bytes) |
michael@0 | 570 | |
michael@0 | 571 | The number of this disk, which contains central |
michael@0 | 572 | directory end record. |
michael@0 | 573 | |
michael@0 | 574 | number of the disk with the start of the central |
michael@0 | 575 | directory: (2 bytes) |
michael@0 | 576 | |
michael@0 | 577 | The number of the disk on which the central |
michael@0 | 578 | directory starts. |
michael@0 | 579 | |
michael@0 | 580 | total number of entries in the central dir on |
michael@0 | 581 | this disk: (2 bytes) |
michael@0 | 582 | |
michael@0 | 583 | The number of central directory entries on this disk. |
michael@0 | 584 | |
michael@0 | 585 | total number of entries in the central dir: (2 bytes) |
michael@0 | 586 | |
michael@0 | 587 | The total number of files in the zipfile. |
michael@0 | 588 | |
michael@0 | 589 | size of the central directory: (4 bytes) |
michael@0 | 590 | |
michael@0 | 591 | The size (in bytes) of the entire central directory. |
michael@0 | 592 | |
michael@0 | 593 | offset of start of central directory with respect to |
michael@0 | 594 | the starting disk number: (4 bytes) |
michael@0 | 595 | |
michael@0 | 596 | Offset of the start of the central directory on the |
michael@0 | 597 | disk on which the central directory starts. |
michael@0 | 598 | |
michael@0 | 599 | zipfile comment length: (2 bytes) |
michael@0 | 600 | |
michael@0 | 601 | The length of the comment for this zipfile. |
michael@0 | 602 | |
michael@0 | 603 | zipfile comment: (Variable) |
michael@0 | 604 | |
michael@0 | 605 | The comment for this zipfile. |
michael@0 | 606 | |
michael@0 | 607 | D. General notes: |
michael@0 | 608 | |
michael@0 | 609 | 1) All fields unless otherwise noted are unsigned and stored |
michael@0 | 610 | in Intel low-byte:high-byte, low-word:high-word order. |
michael@0 | 611 | |
michael@0 | 612 | 2) String fields are not null terminated, since the |
michael@0 | 613 | length is given explicitly. |
michael@0 | 614 | |
michael@0 | 615 | 3) Local headers should not span disk boundaries. Also, even |
michael@0 | 616 | though the central directory can span disk boundaries, no |
michael@0 | 617 | single record in the central directory should be split |
michael@0 | 618 | across disks. |
michael@0 | 619 | |
michael@0 | 620 | 4) The entries in the central directory may not necessarily |
michael@0 | 621 | be in the same order that files appear in the zipfile. |
michael@0 | 622 | |
michael@0 | 623 | UnShrinking - Method 1 |
michael@0 | 624 | ---------------------- |
michael@0 | 625 | |
michael@0 | 626 | Shrinking is a Dynamic Ziv-Lempel-Welch compression algorithm |
michael@0 | 627 | with partial clearing. The initial code size is 9 bits, and |
michael@0 | 628 | the maximum code size is 13 bits. Shrinking differs from |
michael@0 | 629 | conventional Dynamic Ziv-Lempel-Welch implementations in several |
michael@0 | 630 | respects: |
michael@0 | 631 | |
michael@0 | 632 | 1) The code size is controlled by the compressor, and is not |
michael@0 | 633 | automatically increased when codes larger than the current |
michael@0 | 634 | code size are created (but not necessarily used). When |
michael@0 | 635 | the decompressor encounters the code sequence 256 |
michael@0 | 636 | (decimal) followed by 1, it should increase the code size |
michael@0 | 637 | read from the input stream to the next bit size. No |
michael@0 | 638 | blocking of the codes is performed, so the next code at |
michael@0 | 639 | the increased size should be read from the input stream |
michael@0 | 640 | immediately after where the previous code at the smaller |
michael@0 | 641 | bit size was read. Again, the decompressor should not |
michael@0 | 642 | increase the code size used until the sequence 256,1 is |
michael@0 | 643 | encountered. |
michael@0 | 644 | |
michael@0 | 645 | 2) When the table becomes full, total clearing is not |
michael@0 | 646 | performed. Rather, when the compressor emits the code |
michael@0 | 647 | sequence 256,2 (decimal), the decompressor should clear |
michael@0 | 648 | all leaf nodes from the Ziv-Lempel tree, and continue to |
michael@0 | 649 | use the current code size. The nodes that are cleared |
michael@0 | 650 | from the Ziv-Lempel tree are then re-used, with the lowest |
michael@0 | 651 | code value re-used first, and the highest code value |
michael@0 | 652 | re-used last. The compressor can emit the sequence 256,2 |
michael@0 | 653 | at any time. |
michael@0 | 654 | |
michael@0 | 655 | Expanding - Methods 2-5 |
michael@0 | 656 | ----------------------- |
michael@0 | 657 | |
michael@0 | 658 | The Reducing algorithm is actually a combination of two |
michael@0 | 659 | distinct algorithms. The first algorithm compresses repeated |
michael@0 | 660 | byte sequences, and the second algorithm takes the compressed |
michael@0 | 661 | stream from the first algorithm and applies a probabilistic |
michael@0 | 662 | compression method. |
michael@0 | 663 | |
michael@0 | 664 | The probabilistic compression stores an array of 'follower |
michael@0 | 665 | sets' S(j), for j=0 to 255, corresponding to each possible |
michael@0 | 666 | ASCII character. Each set contains between 0 and 32 |
michael@0 | 667 | characters, to be denoted as S(j)[0],...,S(j)[m], where m<32. |
michael@0 | 668 | The sets are stored at the beginning of the data area for a |
michael@0 | 669 | Reduced file, in reverse order, with S(255) first, and S(0) |
michael@0 | 670 | last. |
michael@0 | 671 | |
michael@0 | 672 | The sets are encoded as { N(j), S(j)[0],...,S(j)[N(j)-1] }, |
michael@0 | 673 | where N(j) is the size of set S(j). N(j) can be 0, in which |
michael@0 | 674 | case the follower set for S(j) is empty. Each N(j) value is |
michael@0 | 675 | encoded in 6 bits, followed by N(j) eight bit character values |
michael@0 | 676 | corresponding to S(j)[0] to S(j)[N(j)-1] respectively. If |
michael@0 | 677 | N(j) is 0, then no values for S(j) are stored, and the value |
michael@0 | 678 | for N(j-1) immediately follows. |
michael@0 | 679 | |
michael@0 | 680 | Immediately after the follower sets, is the compressed data |
michael@0 | 681 | stream. The compressed data stream can be interpreted for the |
michael@0 | 682 | probabilistic decompression as follows: |
michael@0 | 683 | |
michael@0 | 684 | let Last-Character <- 0. |
michael@0 | 685 | loop until done |
michael@0 | 686 | if the follower set S(Last-Character) is empty then |
michael@0 | 687 | read 8 bits from the input stream, and copy this |
michael@0 | 688 | value to the output stream. |
michael@0 | 689 | otherwise if the follower set S(Last-Character) is non-empty then |
michael@0 | 690 | read 1 bit from the input stream. |
michael@0 | 691 | if this bit is not zero then |
michael@0 | 692 | read 8 bits from the input stream, and copy this |
michael@0 | 693 | value to the output stream. |
michael@0 | 694 | otherwise if this bit is zero then |
michael@0 | 695 | read B(N(Last-Character)) bits from the input |
michael@0 | 696 | stream, and assign this value to I. |
michael@0 | 697 | Copy the value of S(Last-Character)[I] to the |
michael@0 | 698 | output stream. |
michael@0 | 699 | |
michael@0 | 700 | assign the last value placed on the output stream to |
michael@0 | 701 | Last-Character. |
michael@0 | 702 | end loop |
michael@0 | 703 | |
michael@0 | 704 | B(N(j)) is defined as the minimal number of bits required to |
michael@0 | 705 | encode the value N(j)-1. |
michael@0 | 706 | |
michael@0 | 707 | The decompressed stream from above can then be expanded to |
michael@0 | 708 | re-create the original file as follows: |
michael@0 | 709 | |
michael@0 | 710 | let State <- 0. |
michael@0 | 711 | |
michael@0 | 712 | loop until done |
michael@0 | 713 | read 8 bits from the input stream into C. |
michael@0 | 714 | case State of |
michael@0 | 715 | 0: if C is not equal to DLE (144 decimal) then |
michael@0 | 716 | copy C to the output stream. |
michael@0 | 717 | otherwise if C is equal to DLE then |
michael@0 | 718 | let State <- 1. |
michael@0 | 719 | |
michael@0 | 720 | 1: if C is non-zero then |
michael@0 | 721 | let V <- C. |
michael@0 | 722 | let Len <- L(V) |
michael@0 | 723 | let State <- F(Len). |
michael@0 | 724 | otherwise if C is zero then |
michael@0 | 725 | copy the value 144 (decimal) to the output stream. |
michael@0 | 726 | let State <- 0 |
michael@0 | 727 | |
michael@0 | 728 | 2: let Len <- Len + C |
michael@0 | 729 | let State <- 3. |
michael@0 | 730 | |
michael@0 | 731 | 3: move backwards D(V,C) bytes in the output stream |
michael@0 | 732 | (if this position is before the start of the output |
michael@0 | 733 | stream, then assume that all the data before the |
michael@0 | 734 | start of the output stream is filled with zeros). |
michael@0 | 735 | copy Len+3 bytes from this position to the output stream. |
michael@0 | 736 | let State <- 0. |
michael@0 | 737 | end case |
michael@0 | 738 | end loop |
michael@0 | 739 | |
michael@0 | 740 | The functions F,L, and D are dependent on the 'compression |
michael@0 | 741 | factor', 1 through 4, and are defined as follows: |
michael@0 | 742 | |
michael@0 | 743 | For compression factor 1: |
michael@0 | 744 | L(X) equals the lower 7 bits of X. |
michael@0 | 745 | F(X) equals 2 if X equals 127 otherwise F(X) equals 3. |
michael@0 | 746 | D(X,Y) equals the (upper 1 bit of X) * 256 + Y + 1. |
michael@0 | 747 | For compression factor 2: |
michael@0 | 748 | L(X) equals the lower 6 bits of X. |
michael@0 | 749 | F(X) equals 2 if X equals 63 otherwise F(X) equals 3. |
michael@0 | 750 | D(X,Y) equals the (upper 2 bits of X) * 256 + Y + 1. |
michael@0 | 751 | For compression factor 3: |
michael@0 | 752 | L(X) equals the lower 5 bits of X. |
michael@0 | 753 | F(X) equals 2 if X equals 31 otherwise F(X) equals 3. |
michael@0 | 754 | D(X,Y) equals the (upper 3 bits of X) * 256 + Y + 1. |
michael@0 | 755 | For compression factor 4: |
michael@0 | 756 | L(X) equals the lower 4 bits of X. |
michael@0 | 757 | F(X) equals 2 if X equals 15 otherwise F(X) equals 3. |
michael@0 | 758 | D(X,Y) equals the (upper 4 bits of X) * 256 + Y + 1. |
michael@0 | 759 | |
michael@0 | 760 | Imploding - Method 6 |
michael@0 | 761 | -------------------- |
michael@0 | 762 | |
michael@0 | 763 | The Imploding algorithm is actually a combination of two distinct |
michael@0 | 764 | algorithms. The first algorithm compresses repeated byte |
michael@0 | 765 | sequences using a sliding dictionary. The second algorithm is |
michael@0 | 766 | used to compress the encoding of the sliding dictionary output, |
michael@0 | 767 | using multiple Shannon-Fano trees. |
michael@0 | 768 | |
michael@0 | 769 | The Imploding algorithm can use a 4K or 8K sliding dictionary |
michael@0 | 770 | size. The dictionary size used can be determined by bit 1 in the |
michael@0 | 771 | general purpose flag word; a 0 bit indicates a 4K dictionary |
michael@0 | 772 | while a 1 bit indicates an 8K dictionary. |
michael@0 | 773 | |
michael@0 | 774 | The Shannon-Fano trees are stored at the start of the compressed |
michael@0 | 775 | file. The number of trees stored is defined by bit 2 in the |
michael@0 | 776 | general purpose flag word; a 0 bit indicates two trees stored, a |
michael@0 | 777 | 1 bit indicates three trees are stored. If 3 trees are stored, |
michael@0 | 778 | the first Shannon-Fano tree represents the encoding of the |
michael@0 | 779 | Literal characters, the second tree represents the encoding of |
michael@0 | 780 | the Length information, the third represents the encoding of the |
michael@0 | 781 | Distance information. When 2 Shannon-Fano trees are stored, the |
michael@0 | 782 | Length tree is stored first, followed by the Distance tree. |
michael@0 | 783 | |
michael@0 | 784 | The Literal Shannon-Fano tree, if present is used to represent |
michael@0 | 785 | the entire ASCII character set, and contains 256 values. This |
michael@0 | 786 | tree is used to compress any data not compressed by the sliding |
michael@0 | 787 | dictionary algorithm. When this tree is present, the Minimum |
michael@0 | 788 | Match Length for the sliding dictionary is 3. If this tree is |
michael@0 | 789 | not present, the Minimum Match Length is 2. |
michael@0 | 790 | |
michael@0 | 791 | The Length Shannon-Fano tree is used to compress the Length part |
michael@0 | 792 | of the (length,distance) pairs from the sliding dictionary |
michael@0 | 793 | output. The Length tree contains 64 values, ranging from the |
michael@0 | 794 | Minimum Match Length, to 63 plus the Minimum Match Length. |
michael@0 | 795 | |
michael@0 | 796 | The Distance Shannon-Fano tree is used to compress the Distance |
michael@0 | 797 | part of the (length,distance) pairs from the sliding dictionary |
michael@0 | 798 | output. The Distance tree contains 64 values, ranging from 0 to |
michael@0 | 799 | 63, representing the upper 6 bits of the distance value. The |
michael@0 | 800 | distance values themselves will be between 0 and the sliding |
michael@0 | 801 | dictionary size, either 4K or 8K. |
michael@0 | 802 | |
michael@0 | 803 | The Shannon-Fano trees themselves are stored in a compressed |
michael@0 | 804 | format. The first byte of the tree data represents the number of |
michael@0 | 805 | bytes of data representing the (compressed) Shannon-Fano tree |
michael@0 | 806 | minus 1. The remaining bytes represent the Shannon-Fano tree |
michael@0 | 807 | data encoded as: |
michael@0 | 808 | |
michael@0 | 809 | High 4 bits: Number of values at this bit length + 1. (1 - 16) |
michael@0 | 810 | Low 4 bits: Bit Length needed to represent value + 1. (1 - 16) |
michael@0 | 811 | |
michael@0 | 812 | The Shannon-Fano codes can be constructed from the bit lengths |
michael@0 | 813 | using the following algorithm: |
michael@0 | 814 | |
michael@0 | 815 | 1) Sort the Bit Lengths in ascending order, while retaining the |
michael@0 | 816 | order of the original lengths stored in the file. |
michael@0 | 817 | |
michael@0 | 818 | 2) Generate the Shannon-Fano trees: |
michael@0 | 819 | |
michael@0 | 820 | Code <- 0 |
michael@0 | 821 | CodeIncrement <- 0 |
michael@0 | 822 | LastBitLength <- 0 |
michael@0 | 823 | i <- number of Shannon-Fano codes - 1 (either 255 or 63) |
michael@0 | 824 | |
michael@0 | 825 | loop while i >= 0 |
michael@0 | 826 | Code = Code + CodeIncrement |
michael@0 | 827 | if BitLength(i) <> LastBitLength then |
michael@0 | 828 | LastBitLength=BitLength(i) |
michael@0 | 829 | CodeIncrement = 1 shifted left (16 - LastBitLength) |
michael@0 | 830 | ShannonCode(i) = Code |
michael@0 | 831 | i <- i - 1 |
michael@0 | 832 | end loop |
michael@0 | 833 | |
michael@0 | 834 | 3) Reverse the order of all the bits in the above ShannonCode() |
michael@0 | 835 | vector, so that the most significant bit becomes the least |
michael@0 | 836 | significant bit. For example, the value 0x1234 (hex) would |
michael@0 | 837 | become 0x2C48 (hex). |
michael@0 | 838 | |
michael@0 | 839 | 4) Restore the order of Shannon-Fano codes as originally stored |
michael@0 | 840 | within the file. |
michael@0 | 841 | |
michael@0 | 842 | Example: |
michael@0 | 843 | |
michael@0 | 844 | This example will show the encoding of a Shannon-Fano tree |
michael@0 | 845 | of size 8. Notice that the actual Shannon-Fano trees used |
michael@0 | 846 | for Imploding are either 64 or 256 entries in size. |
michael@0 | 847 | |
michael@0 | 848 | Example: 0x02, 0x42, 0x01, 0x13 |
michael@0 | 849 | |
michael@0 | 850 | The first byte indicates 3 values in this table. Decoding the |
michael@0 | 851 | bytes: |
michael@0 | 852 | 0x42 = 5 codes of 3 bits long |
michael@0 | 853 | 0x01 = 1 code of 2 bits long |
michael@0 | 854 | 0x13 = 2 codes of 4 bits long |
michael@0 | 855 | |
michael@0 | 856 | This would generate the original bit length array of: |
michael@0 | 857 | (3, 3, 3, 3, 3, 2, 4, 4) |
michael@0 | 858 | |
michael@0 | 859 | There are 8 codes in this table for the values 0 thru 7. Using |
michael@0 | 860 | the algorithm to obtain the Shannon-Fano codes produces: |
michael@0 | 861 | |
michael@0 | 862 | Reversed Order Original |
michael@0 | 863 | Val Sorted Constructed Code Value Restored Length |
michael@0 | 864 | --- ------ ----------------- -------- -------- ------ |
michael@0 | 865 | 0: 2 1100000000000000 11 101 3 |
michael@0 | 866 | 1: 3 1010000000000000 101 001 3 |
michael@0 | 867 | 2: 3 1000000000000000 001 110 3 |
michael@0 | 868 | 3: 3 0110000000000000 110 010 3 |
michael@0 | 869 | 4: 3 0100000000000000 010 100 3 |
michael@0 | 870 | 5: 3 0010000000000000 100 11 2 |
michael@0 | 871 | 6: 4 0001000000000000 1000 1000 4 |
michael@0 | 872 | 7: 4 0000000000000000 0000 0000 4 |
michael@0 | 873 | |
michael@0 | 874 | The values in the Val, Order Restored and Original Length columns |
michael@0 | 875 | now represent the Shannon-Fano encoding tree that can be used for |
michael@0 | 876 | decoding the Shannon-Fano encoded data. How to parse the |
michael@0 | 877 | variable length Shannon-Fano values from the data stream is beyond |
michael@0 | 878 | the scope of this document. (See the references listed at the end of |
michael@0 | 879 | this document for more information.) However, traditional decoding |
michael@0 | 880 | schemes used for Huffman variable length decoding, such as the |
michael@0 | 881 | Greenlaw algorithm, can be successfully applied. |
michael@0 | 882 | |
michael@0 | 883 | The compressed data stream begins immediately after the |
michael@0 | 884 | compressed Shannon-Fano data. The compressed data stream can be |
michael@0 | 885 | interpreted as follows: |
michael@0 | 886 | |
michael@0 | 887 | loop until done |
michael@0 | 888 | read 1 bit from input stream. |
michael@0 | 889 | |
michael@0 | 890 | if this bit is non-zero then (encoded data is literal data) |
michael@0 | 891 | if Literal Shannon-Fano tree is present |
michael@0 | 892 | read and decode character using Literal Shannon-Fano tree. |
michael@0 | 893 | otherwise |
michael@0 | 894 | read 8 bits from input stream. |
michael@0 | 895 | copy character to the output stream. |
michael@0 | 896 | otherwise (encoded data is sliding dictionary match) |
michael@0 | 897 | if 8K dictionary size |
michael@0 | 898 | read 7 bits for offset Distance (lower 7 bits of offset). |
michael@0 | 899 | otherwise |
michael@0 | 900 | read 6 bits for offset Distance (lower 6 bits of offset). |
michael@0 | 901 | |
michael@0 | 902 | using the Distance Shannon-Fano tree, read and decode the |
michael@0 | 903 | upper 6 bits of the Distance value. |
michael@0 | 904 | |
michael@0 | 905 | using the Length Shannon-Fano tree, read and decode |
michael@0 | 906 | the Length value. |
michael@0 | 907 | |
michael@0 | 908 | Length <- Length + Minimum Match Length |
michael@0 | 909 | |
michael@0 | 910 | if Length = 63 + Minimum Match Length |
michael@0 | 911 | read 8 bits from the input stream, |
michael@0 | 912 | add this value to Length. |
michael@0 | 913 | |
michael@0 | 914 | move backwards Distance+1 bytes in the output stream, and |
michael@0 | 915 | copy Length characters from this position to the output |
michael@0 | 916 | stream. (if this position is before the start of the output |
michael@0 | 917 | stream, then assume that all the data before the start of |
michael@0 | 918 | the output stream is filled with zeros). |
michael@0 | 919 | end loop |
michael@0 | 920 | |
michael@0 | 921 | Tokenizing - Method 7 |
michael@0 | 922 | -------------------- |
michael@0 | 923 | |
michael@0 | 924 | This method is not used by PKZIP. |
michael@0 | 925 | |
michael@0 | 926 | Deflating - Method 8 |
michael@0 | 927 | ----------------- |
michael@0 | 928 | |
michael@0 | 929 | The Deflate algorithm is similar to the Implode algorithm using |
michael@0 | 930 | a sliding dictionary of up to 32K with secondary compression |
michael@0 | 931 | from Huffman/Shannon-Fano codes. |
michael@0 | 932 | |
michael@0 | 933 | The compressed data is stored in blocks with a header describing |
michael@0 | 934 | the block and the Huffman codes used in the data block. The header |
michael@0 | 935 | format is as follows: |
michael@0 | 936 | |
michael@0 | 937 | Bit 0: Last Block bit This bit is set to 1 if this is the last |
michael@0 | 938 | compressed block in the data. |
michael@0 | 939 | Bits 1-2: Block type |
michael@0 | 940 | 00 (0) - Block is stored - All stored data is byte aligned. |
michael@0 | 941 | Skip bits until next byte, then next word = block |
michael@0 | 942 | length, followed by the ones compliment of the block |
michael@0 | 943 | length word. Remaining data in block is the stored |
michael@0 | 944 | data. |
michael@0 | 945 | |
michael@0 | 946 | 01 (1) - Use fixed Huffman codes for literal and distance codes. |
michael@0 | 947 | Lit Code Bits Dist Code Bits |
michael@0 | 948 | --------- ---- --------- ---- |
michael@0 | 949 | 0 - 143 8 0 - 31 5 |
michael@0 | 950 | 144 - 255 9 |
michael@0 | 951 | 256 - 279 7 |
michael@0 | 952 | 280 - 287 8 |
michael@0 | 953 | |
michael@0 | 954 | Literal codes 286-287 and distance codes 30-31 are |
michael@0 | 955 | never used but participate in the huffman construction. |
michael@0 | 956 | |
michael@0 | 957 | 10 (2) - Dynamic Huffman codes. (See expanding Huffman codes) |
michael@0 | 958 | |
michael@0 | 959 | 11 (3) - Reserved - Flag a "Error in compressed data" if seen. |
michael@0 | 960 | |
michael@0 | 961 | Expanding Huffman Codes |
michael@0 | 962 | ----------------------- |
michael@0 | 963 | If the data block is stored with dynamic Huffman codes, the Huffman |
michael@0 | 964 | codes are sent in the following compressed format: |
michael@0 | 965 | |
michael@0 | 966 | 5 Bits: # of Literal codes sent - 256 (256 - 286) |
michael@0 | 967 | All other codes are never sent. |
michael@0 | 968 | 5 Bits: # of Dist codes - 1 (1 - 32) |
michael@0 | 969 | 4 Bits: # of Bit Length codes - 3 (3 - 19) |
michael@0 | 970 | |
michael@0 | 971 | The Huffman codes are sent as bit lengths and the codes are built as |
michael@0 | 972 | described in the implode algorithm. The bit lengths themselves are |
michael@0 | 973 | compressed with Huffman codes. There are 19 bit length codes: |
michael@0 | 974 | |
michael@0 | 975 | 0 - 15: Represent bit lengths of 0 - 15 |
michael@0 | 976 | 16: Copy the previous bit length 3 - 6 times. |
michael@0 | 977 | The next 2 bits indicate repeat length (0 = 3, ... ,3 = 6) |
michael@0 | 978 | Example: Codes 8, 16 (+2 bits 11), 16 (+2 bits 10) will |
michael@0 | 979 | expand to 12 bit lengths of 8 (1 + 6 + 5) |
michael@0 | 980 | 17: Repeat a bit length of 0 for 3 - 10 times. (3 bits of length) |
michael@0 | 981 | 18: Repeat a bit length of 0 for 11 - 138 times (7 bits of length) |
michael@0 | 982 | |
michael@0 | 983 | The lengths of the bit length codes are sent packed 3 bits per value |
michael@0 | 984 | (0 - 7) in the following order: |
michael@0 | 985 | |
michael@0 | 986 | 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15 |
michael@0 | 987 | |
michael@0 | 988 | The Huffman codes should be built as described in the Implode algorithm |
michael@0 | 989 | except codes are assigned starting at the shortest bit length, i.e. the |
michael@0 | 990 | shortest code should be all 0's rather than all 1's. Also, codes with |
michael@0 | 991 | a bit length of zero do not participate in the tree construction. The |
michael@0 | 992 | codes are then used to decode the bit lengths for the literal and |
michael@0 | 993 | distance tables. |
michael@0 | 994 | |
michael@0 | 995 | The bit lengths for the literal tables are sent first with the number |
michael@0 | 996 | of entries sent described by the 5 bits sent earlier. There are up |
michael@0 | 997 | to 286 literal characters; the first 256 represent the respective 8 |
michael@0 | 998 | bit character, code 256 represents the End-Of-Block code, the remaining |
michael@0 | 999 | 29 codes represent copy lengths of 3 thru 258. There are up to 30 |
michael@0 | 1000 | distance codes representing distances from 1 thru 32k as described |
michael@0 | 1001 | below. |
michael@0 | 1002 | |
michael@0 | 1003 | Length Codes |
michael@0 | 1004 | ------------ |
michael@0 | 1005 | Extra Extra Extra Extra |
michael@0 | 1006 | Code Bits Length Code Bits Lengths Code Bits Lengths Code Bits Length(s) |
michael@0 | 1007 | ---- ---- ------ ---- ---- ------- ---- ---- ------- ---- ---- --------- |
michael@0 | 1008 | 257 0 3 265 1 11,12 273 3 35-42 281 5 131-162 |
michael@0 | 1009 | 258 0 4 266 1 13,14 274 3 43-50 282 5 163-194 |
michael@0 | 1010 | 259 0 5 267 1 15,16 275 3 51-58 283 5 195-226 |
michael@0 | 1011 | 260 0 6 268 1 17,18 276 3 59-66 284 5 227-257 |
michael@0 | 1012 | 261 0 7 269 2 19-22 277 4 67-82 285 0 258 |
michael@0 | 1013 | 262 0 8 270 2 23-26 278 4 83-98 |
michael@0 | 1014 | 263 0 9 271 2 27-30 279 4 99-114 |
michael@0 | 1015 | 264 0 10 272 2 31-34 280 4 115-130 |
michael@0 | 1016 | |
michael@0 | 1017 | Distance Codes |
michael@0 | 1018 | -------------- |
michael@0 | 1019 | Extra Extra Extra Extra |
michael@0 | 1020 | Code Bits Dist Code Bits Dist Code Bits Distance Code Bits Distance |
michael@0 | 1021 | ---- ---- ---- ---- ---- ------ ---- ---- -------- ---- ---- -------- |
michael@0 | 1022 | 0 0 1 8 3 17-24 16 7 257-384 24 11 4097-6144 |
michael@0 | 1023 | 1 0 2 9 3 25-32 17 7 385-512 25 11 6145-8192 |
michael@0 | 1024 | 2 0 3 10 4 33-48 18 8 513-768 26 12 8193-12288 |
michael@0 | 1025 | 3 0 4 11 4 49-64 19 8 769-1024 27 12 12289-16384 |
michael@0 | 1026 | 4 1 5,6 12 5 65-96 20 9 1025-1536 28 13 16385-24576 |
michael@0 | 1027 | 5 1 7,8 13 5 97-128 21 9 1537-2048 29 13 24577-32768 |
michael@0 | 1028 | 6 2 9-12 14 6 129-192 22 10 2049-3072 |
michael@0 | 1029 | 7 2 13-16 15 6 193-256 23 10 3073-4096 |
michael@0 | 1030 | |
michael@0 | 1031 | The compressed data stream begins immediately after the |
michael@0 | 1032 | compressed header data. The compressed data stream can be |
michael@0 | 1033 | interpreted as follows: |
michael@0 | 1034 | |
michael@0 | 1035 | do |
michael@0 | 1036 | read header from input stream. |
michael@0 | 1037 | |
michael@0 | 1038 | if stored block |
michael@0 | 1039 | skip bits until byte aligned |
michael@0 | 1040 | read count and 1's compliment of count |
michael@0 | 1041 | copy count bytes data block |
michael@0 | 1042 | otherwise |
michael@0 | 1043 | loop until end of block code sent |
michael@0 | 1044 | decode literal character from input stream |
michael@0 | 1045 | if literal < 256 |
michael@0 | 1046 | copy character to the output stream |
michael@0 | 1047 | otherwise |
michael@0 | 1048 | if literal = end of block |
michael@0 | 1049 | break from loop |
michael@0 | 1050 | otherwise |
michael@0 | 1051 | decode distance from input stream |
michael@0 | 1052 | |
michael@0 | 1053 | move backwards distance bytes in the output stream, and |
michael@0 | 1054 | copy length characters from this position to the output |
michael@0 | 1055 | stream. |
michael@0 | 1056 | end loop |
michael@0 | 1057 | while not last block |
michael@0 | 1058 | |
michael@0 | 1059 | if data descriptor exists |
michael@0 | 1060 | skip bits until byte aligned |
michael@0 | 1061 | read crc and sizes |
michael@0 | 1062 | endif |
michael@0 | 1063 | |
michael@0 | 1064 | Decryption |
michael@0 | 1065 | ---------- |
michael@0 | 1066 | |
michael@0 | 1067 | The encryption used in PKZIP was generously supplied by Roger |
michael@0 | 1068 | Schlafly. PKWARE is grateful to Mr. Schlafly for his expert |
michael@0 | 1069 | help and advice in the field of data encryption. |
michael@0 | 1070 | |
michael@0 | 1071 | PKZIP encrypts the compressed data stream. Encrypted files must |
michael@0 | 1072 | be decrypted before they can be extracted. |
michael@0 | 1073 | |
michael@0 | 1074 | Each encrypted file has an extra 12 bytes stored at the start of |
michael@0 | 1075 | the data area defining the encryption header for that file. The |
michael@0 | 1076 | encryption header is originally set to random values, and then |
michael@0 | 1077 | itself encrypted, using three, 32-bit keys. The key values are |
michael@0 | 1078 | initialized using the supplied encryption password. After each byte |
michael@0 | 1079 | is encrypted, the keys are then updated using pseudo-random number |
michael@0 | 1080 | generation techniques in combination with the same CRC-32 algorithm |
michael@0 | 1081 | used in PKZIP and described elsewhere in this document. |
michael@0 | 1082 | |
michael@0 | 1083 | The following is the basic steps required to decrypt a file: |
michael@0 | 1084 | |
michael@0 | 1085 | 1) Initialize the three 32-bit keys with the password. |
michael@0 | 1086 | 2) Read and decrypt the 12-byte encryption header, further |
michael@0 | 1087 | initializing the encryption keys. |
michael@0 | 1088 | 3) Read and decrypt the compressed data stream using the |
michael@0 | 1089 | encryption keys. |
michael@0 | 1090 | |
michael@0 | 1091 | Step 1 - Initializing the encryption keys |
michael@0 | 1092 | ----------------------------------------- |
michael@0 | 1093 | |
michael@0 | 1094 | Key(0) <- 305419896 |
michael@0 | 1095 | Key(1) <- 591751049 |
michael@0 | 1096 | Key(2) <- 878082192 |
michael@0 | 1097 | |
michael@0 | 1098 | loop for i <- 0 to length(password)-1 |
michael@0 | 1099 | update_keys(password(i)) |
michael@0 | 1100 | end loop |
michael@0 | 1101 | |
michael@0 | 1102 | Where update_keys() is defined as: |
michael@0 | 1103 | |
michael@0 | 1104 | update_keys(char): |
michael@0 | 1105 | Key(0) <- crc32(key(0),char) |
michael@0 | 1106 | Key(1) <- Key(1) + (Key(0) & 000000ffH) |
michael@0 | 1107 | Key(1) <- Key(1) * 134775813 + 1 |
michael@0 | 1108 | Key(2) <- crc32(key(2),key(1) >> 24) |
michael@0 | 1109 | end update_keys |
michael@0 | 1110 | |
michael@0 | 1111 | Where crc32(old_crc,char) is a routine that given a CRC value and a |
michael@0 | 1112 | character, returns an updated CRC value after applying the CRC-32 |
michael@0 | 1113 | algorithm described elsewhere in this document. |
michael@0 | 1114 | |
michael@0 | 1115 | Step 2 - Decrypting the encryption header |
michael@0 | 1116 | ----------------------------------------- |
michael@0 | 1117 | |
michael@0 | 1118 | The purpose of this step is to further initialize the encryption |
michael@0 | 1119 | keys, based on random data, to render a plaintext attack on the |
michael@0 | 1120 | data ineffective. |
michael@0 | 1121 | |
michael@0 | 1122 | Read the 12-byte encryption header into Buffer, in locations |
michael@0 | 1123 | Buffer(0) thru Buffer(11). |
michael@0 | 1124 | |
michael@0 | 1125 | loop for i <- 0 to 11 |
michael@0 | 1126 | C <- buffer(i) ^ decrypt_byte() |
michael@0 | 1127 | update_keys(C) |
michael@0 | 1128 | buffer(i) <- C |
michael@0 | 1129 | end loop |
michael@0 | 1130 | |
michael@0 | 1131 | Where decrypt_byte() is defined as: |
michael@0 | 1132 | |
michael@0 | 1133 | unsigned char decrypt_byte() |
michael@0 | 1134 | local unsigned short temp |
michael@0 | 1135 | temp <- Key(2) | 2 |
michael@0 | 1136 | decrypt_byte <- (temp * (temp ^ 1)) >> 8 |
michael@0 | 1137 | end decrypt_byte |
michael@0 | 1138 | |
michael@0 | 1139 | After the header is decrypted, the last 1 or 2 bytes in Buffer |
michael@0 | 1140 | should be the high-order word/byte of the CRC for the file being |
michael@0 | 1141 | decrypted, stored in Intel low-byte/high-byte order. Versions of |
michael@0 | 1142 | PKZIP prior to 2.0 used a 2 byte CRC check; a 1 byte CRC check is |
michael@0 | 1143 | used on versions after 2.0. This can be used to test if the password |
michael@0 | 1144 | supplied is correct or not. |
michael@0 | 1145 | |
michael@0 | 1146 | Step 3 - Decrypting the compressed data stream |
michael@0 | 1147 | ---------------------------------------------- |
michael@0 | 1148 | |
michael@0 | 1149 | The compressed data stream can be decrypted as follows: |
michael@0 | 1150 | |
michael@0 | 1151 | loop until done |
michael@0 | 1152 | read a character into C |
michael@0 | 1153 | Temp <- C ^ decrypt_byte() |
michael@0 | 1154 | update_keys(temp) |
michael@0 | 1155 | output Temp |
michael@0 | 1156 | end loop |
michael@0 | 1157 | |
michael@0 | 1158 | In addition to the above mentioned contributors to PKZIP and PKUNZIP, |
michael@0 | 1159 | I would like to extend special thanks to Robert Mahoney for suggesting |
michael@0 | 1160 | the extension .ZIP for this software. |
michael@0 | 1161 | |
michael@0 | 1162 | References: |
michael@0 | 1163 | |
michael@0 | 1164 | Fiala, Edward R., and Greene, Daniel H., "Data compression with |
michael@0 | 1165 | finite windows", Communications of the ACM, Volume 32, Number 4, |
michael@0 | 1166 | April 1989, pages 490-505. |
michael@0 | 1167 | |
michael@0 | 1168 | Held, Gilbert, "Data Compression, Techniques and Applications, |
michael@0 | 1169 | Hardware and Software Considerations", John Wiley & Sons, 1987. |
michael@0 | 1170 | |
michael@0 | 1171 | Huffman, D.A., "A method for the construction of minimum-redundancy |
michael@0 | 1172 | codes", Proceedings of the IRE, Volume 40, Number 9, September 1952, |
michael@0 | 1173 | pages 1098-1101. |
michael@0 | 1174 | |
michael@0 | 1175 | Nelson, Mark, "LZW Data Compression", Dr. Dobbs Journal, Volume 14, |
michael@0 | 1176 | Number 10, October 1989, pages 29-37. |
michael@0 | 1177 | |
michael@0 | 1178 | Nelson, Mark, "The Data Compression Book", M&T Books, 1991. |
michael@0 | 1179 | |
michael@0 | 1180 | Storer, James A., "Data Compression, Methods and Theory", |
michael@0 | 1181 | Computer Science Press, 1988 |
michael@0 | 1182 | |
michael@0 | 1183 | Welch, Terry, "A Technique for High-Performance Data Compression", |
michael@0 | 1184 | IEEE Computer, Volume 17, Number 6, June 1984, pages 8-19. |
michael@0 | 1185 | |
michael@0 | 1186 | Ziv, J. and Lempel, A., "A universal algorithm for sequential data |
michael@0 | 1187 | compression", Communications of the ACM, Volume 30, Number 6, |
michael@0 | 1188 | June 1987, pages 520-540. |
michael@0 | 1189 | |
michael@0 | 1190 | Ziv, J. and Lempel, A., "Compression of individual sequences via |
michael@0 | 1191 | variable-rate coding", IEEE Transactions on Information Theory, |
michael@0 | 1192 | Volume 24, Number 5, September 1978, pages 530-536. |