modules/libjar/appnote.txt

branch
TOR_BUG_9701
changeset 15
b8a032363ba2
equal deleted inserted replaced
-1:000000000000 0:2bb08910fb51
1 Revised: 03/01/1999
2
3 Disclaimer
4 ----------
5
6 Although PKWARE will attempt to supply current and accurate
7 information relating to its file formats, algorithms, and the
8 subject programs, the possibility of error can not be eliminated.
9 PKWARE therefore expressly disclaims any warranty that the
10 information contained in the associated materials relating to the
11 subject programs and/or the format of the files created or
12 accessed by the subject programs and/or the algorithms used by
13 the subject programs, or any other matter, is current, correct or
14 accurate as delivered. Any risk of damage due to any possible
15 inaccurate information is assumed by the user of the information.
16 Furthermore, the information relating to the subject programs
17 and/or the file formats created or accessed by the subject
18 programs and/or the algorithms used by the subject programs is
19 subject to change without notice.
20
21 General Format of a ZIP file
22 ----------------------------
23
24 Files stored in arbitrary order. Large zipfiles can span multiple
25 diskette media.
26
27 Overall zipfile format:
28
29 [local file header + file data + data_descriptor] . . .
30 [central directory] end of central directory record
31
32
33 A. Local file header:
34
35 local file header signature 4 bytes (0x04034b50)
36 version needed to extract 2 bytes
37 general purpose bit flag 2 bytes
38 compression method 2 bytes
39 last mod file time 2 bytes
40 last mod file date 2 bytes
41 crc-32 4 bytes
42 compressed size 4 bytes
43 uncompressed size 4 bytes
44 filename length 2 bytes
45 extra field length 2 bytes
46
47 filename (variable size)
48 extra field (variable size)
49
50 B. Data descriptor:
51
52 crc-32 4 bytes
53 compressed size 4 bytes
54 uncompressed size 4 bytes
55
56 This descriptor exists only if bit 3 of the general
57 purpose bit flag is set (see below). It is byte aligned
58 and immediately follows the last byte of compressed data.
59 This descriptor is used only when it was not possible to
60 seek in the output zip file, e.g., when the output zip file
61 was standard output or a non seekable device.
62
63 C. Central directory structure:
64
65 [file header] . . . end of central dir record
66
67 File header:
68
69 central file header signature 4 bytes (0x02014b50)
70 version made by 2 bytes
71 version needed to extract 2 bytes
72 general purpose bit flag 2 bytes
73 compression method 2 bytes
74 last mod file time 2 bytes
75 last mod file date 2 bytes
76 crc-32 4 bytes
77 compressed size 4 bytes
78 uncompressed size 4 bytes
79 filename length 2 bytes
80 extra field length 2 bytes
81 file comment length 2 bytes
82 disk number start 2 bytes
83 internal file attributes 2 bytes
84 external file attributes 4 bytes
85 relative offset of local header 4 bytes
86
87 filename (variable size)
88 extra field (variable size)
89 file comment (variable size)
90
91 End of central dir record:
92
93 end of central dir signature 4 bytes (0x06054b50)
94 number of this disk 2 bytes
95 number of the disk with the
96 start of the central directory 2 bytes
97 total number of entries in
98 the central dir on this disk 2 bytes
99 total number of entries in
100 the central dir 2 bytes
101 size of the central directory 4 bytes
102 offset of start of central
103 directory with respect to
104 the starting disk number 4 bytes
105 zipfile comment length 2 bytes
106 zipfile comment (variable size)
107
108 D. Explanation of fields:
109
110 version made by (2 bytes)
111
112 The upper byte indicates the compatibility of the file
113 attribute information. If the external file attributes
114 are compatible with MS-DOS and can be read by PKZIP for
115 DOS version 2.04g then this value will be zero. If these
116 attributes are not compatible, then this value will
117 identify the host system on which the attributes are
118 compatible. Software can use this information to determine
119 the line record format for text files etc. The current
120 mappings are:
121
122 0 - MS-DOS and OS/2 (FAT / VFAT / FAT32 file systems)
123 1 - Amiga 2 - VAX/VMS
124 3 - Unix 4 - VM/CMS
125 5 - Atari ST 6 - OS/2 H.P.F.S.
126 7 - Macintosh 8 - Z-System
127 9 - CP/M 10 - Windows NTFS
128 11 thru 255 - unused
129
130 The lower byte indicates the version number of the
131 software used to encode the file. The value/10
132 indicates the major version number, and the value
133 mod 10 is the minor version number.
134
135 version needed to extract (2 bytes)
136
137 The minimum software version needed to extract the
138 file, mapped as above.
139
140 general purpose bit flag: (2 bytes)
141
142 Bit 0: If set, indicates that the file is encrypted.
143
144 (For Method 6 - Imploding)
145 Bit 1: If the compression method used was type 6,
146 Imploding, then this bit, if set, indicates
147 an 8K sliding dictionary was used. If clear,
148 then a 4K sliding dictionary was used.
149 Bit 2: If the compression method used was type 6,
150 Imploding, then this bit, if set, indicates
151 3 Shannon-Fano trees were used to encode the
152 sliding dictionary output. If clear, then 2
153 Shannon-Fano trees were used.
154
155 (For Method 8 - Deflating)
156 Bit 2 Bit 1
157 0 0 Normal (-en) compression option was used.
158 0 1 Maximum (-ex) compression option was used.
159 1 0 Fast (-ef) compression option was used.
160 1 1 Super Fast (-es) compression option was used.
161
162 Note: Bits 1 and 2 are undefined if the compression
163 method is any other.
164
165 Bit 3: If this bit is set, the fields crc-32, compressed
166 size and uncompressed size are set to zero in the
167 local header. The correct values are put in the
168 data descriptor immediately following the compressed
169 data. (Note: PKZIP version 2.04g for DOS only
170 recognizes this bit for method 8 compression, newer
171 versions of PKZIP recognize this bit for any
172 compression method.)
173
174 Bit 4: Reserved for use with method 8, for enhanced
175 deflating.
176
177 Bit 5: If this bit is set, this indicates that the file is
178 compressed patched data. (Note: Requires PKZIP
179 version 2.70 or greater)
180
181 Bit 6: Currently unused.
182
183 Bit 7: Currently unused.
184
185 Bit 8: Currently unused.
186
187 Bit 9: Currently unused.
188
189 Bit 10: Currently unused.
190
191 Bit 11: Currently unused.
192
193 Bit 12: Reserved by PKWARE for enhanced compression.
194
195 Bit 13: Reserved by PKWARE.
196
197 Bit 14: Reserved by PKWARE.
198
199 Bit 15: Reserved by PKWARE.
200
201 compression method: (2 bytes)
202
203 (see accompanying documentation for algorithm
204 descriptions)
205
206 0 - The file is stored (no compression)
207 1 - The file is Shrunk
208 2 - The file is Reduced with compression factor 1
209 3 - The file is Reduced with compression factor 2
210 4 - The file is Reduced with compression factor 3
211 5 - The file is Reduced with compression factor 4
212 6 - The file is Imploded
213 7 - Reserved for Tokenizing compression algorithm
214 8 - The file is Deflated
215 9 - Reserved for enhanced Deflating
216 10 - PKWARE Date Compression Library Imploding
217
218 date and time fields: (2 bytes each)
219
220 The date and time are encoded in standard MS-DOS format.
221 If input came from standard input, the date and time are
222 those at which compression was started for this data.
223
224 CRC-32: (4 bytes)
225
226 The CRC-32 algorithm was generously contributed by
227 David Schwaderer and can be found in his excellent
228 book "C Programmers Guide to NetBIOS" published by
229 Howard W. Sams & Co. Inc. The 'magic number' for
230 the CRC is 0xdebb20e3. The proper CRC pre and post
231 conditioning is used, meaning that the CRC register
232 is pre-conditioned with all ones (a starting value
233 of 0xffffffff) and the value is post-conditioned by
234 taking the one's complement of the CRC residual.
235 If bit 3 of the general purpose flag is set, this
236 field is set to zero in the local header and the correct
237 value is put in the data descriptor and in the central
238 directory.
239
240 compressed size: (4 bytes)
241 uncompressed size: (4 bytes)
242
243 The size of the file compressed and uncompressed,
244 respectively. If bit 3 of the general purpose bit flag
245 is set, these fields are set to zero in the local header
246 and the correct values are put in the data descriptor and
247 in the central directory.
248
249 filename length: (2 bytes)
250 extra field length: (2 bytes)
251 file comment length: (2 bytes)
252
253 The length of the filename, extra field, and comment
254 fields respectively. The combined length of any
255 directory record and these three fields should not
256 generally exceed 65,535 bytes. If input came from standard
257 input, the filename length is set to zero.
258
259 disk number start: (2 bytes)
260
261 The number of the disk on which this file begins.
262
263 internal file attributes: (2 bytes)
264
265 The lowest bit of this field indicates, if set, that
266 the file is apparently an ASCII or text file. If not
267 set, that the file apparently contains binary data.
268 The remaining bits are unused in version 1.0.
269
270 Bits 1 and 2 are reserved for use by PKWARE.
271
272 external file attributes: (4 bytes)
273
274 The mapping of the external attributes is
275 host-system dependent (see 'version made by'). For
276 MS-DOS, the low order byte is the MS-DOS directory
277 attribute byte. If input came from standard input, this
278 field is set to zero.
279
280 relative offset of local header: (4 bytes)
281
282 This is the offset from the start of the first disk on
283 which this file appears, to where the local header should
284 be found.
285
286 filename: (Variable)
287
288 The name of the file, with optional relative path.
289 The path stored should not contain a drive or
290 device letter, or a leading slash. All slashes
291 should be forward slashes '/' as opposed to
292 backwards slashes '\' for compatibility with Amiga
293 and Unix file systems etc. If input came from standard
294 input, there is no filename field.
295
296 extra field: (Variable)
297
298 This is for future expansion. If additional information
299 needs to be stored in the future, it should be stored
300 here. Earlier versions of the software can then safely
301 skip this file, and find the next file or header. This
302 field will be 0 length in version 1.0.
303
304 In order to allow different programs and different types
305 of information to be stored in the 'extra' field in .ZIP
306 files, the following structure should be used for all
307 programs storing data in this field:
308
309 header1+data1 + header2+data2 . . .
310
311 Each header should consist of:
312
313 Header ID - 2 bytes
314 Data Size - 2 bytes
315
316 Note: all fields stored in Intel low-byte/high-byte order.
317
318 The Header ID field indicates the type of data that is in
319 the following data block.
320
321 Header ID's of 0 thru 31 are reserved for use by PKWARE.
322 The remaining ID's can be used by third party vendors for
323 proprietary usage.
324
325 The current Header ID mappings defined by PKWARE are:
326
327 0x0007 AV Info
328 0x0009 OS/2
329 0x000a NTFS
330 0x000c VAX/VMS
331 0x000d Unix
332 0x000f Patch Descriptor
333
334 Several third party mappings commonly used are:
335
336 0x4b46 FWKCS MD5 (see below)
337 0x07c8 Macintosh
338 0x4341 Acorn/SparkFS
339 0x4453 Windows NT security descriptor (binary ACL)
340 0x4704 VM/CMS
341 0x470f MVS
342 0x4c41 OS/2 access control list (text ACL)
343 0x4d49 Info-ZIP VMS (VAX or Alpha)
344 0x5455 extended timestamp
345 0x5855 Info-ZIP Unix (original, also OS/2, NT, etc)
346 0x6542 BeOS/BeBox
347 0x756e ASi Unix
348 0x7855 Info-ZIP Unix (new)
349 0xfd4a SMS/QDOS
350
351 The Data Size field indicates the size of the following
352 data block. Programs can use this value to skip to the
353 next header block, passing over any data blocks that are
354 not of interest.
355
356 Note: As stated above, the size of the entire .ZIP file
357 header, including the filename, comment, and extra
358 field should not exceed 64K in size.
359
360 In case two different programs should appropriate the same
361 Header ID value, it is strongly recommended that each
362 program place a unique signature of at least two bytes in
363 size (and preferably 4 bytes or bigger) at the start of
364 each data area. Every program should verify that its
365 unique signature is present, in addition to the Header ID
366 value being correct, before assuming that it is a block of
367 known type.
368
369 -OS/2 Extra Field:
370
371 The following is the layout of the OS/2 attributes "extra"
372 block. (Last Revision 09/05/95)
373
374 Note: all fields stored in Intel low-byte/high-byte order.
375
376 Value Size Description
377 ----- ---- -----------
378 (OS/2) 0x0009 2 bytes Tag for this "extra" block type
379 TSize 2 bytes Size for the following data block
380 BSize 4 bytes Uncompressed Block Size
381 CType 2 bytes Compression type
382 EACRC 4 bytes CRC value for uncompress block
383 (var) variable Compressed block
384
385 The OS/2 extended attribute structure (FEA2LIST) is
386 compressed and then stored in it's entirety within this
387 structure. There will only ever be one "block" of data in
388 VarFields[].
389
390 -UNIX Extra Field:
391
392 The following is the layout of the Unix "extra" block.
393 Note: all fields are stored in Intel low-byte/high-byte
394 order.
395
396 Value Size Description
397 ----- ---- -----------
398 (UNIX) 0x000d 2 bytes Tag for this "extra" block type
399 TSize 2 bytes Size for the following data block
400 Atime 4 bytes File last access time
401 Mtime 4 bytes File last modification time
402 Uid 2 bytes File user ID
403 Gid 2 bytes File group ID
404 (var) variable Variable length data field
405
406 The variable length data field will contain file type
407 specific data. Currently the only values allowed are
408 the original "linked to" file names for hard or symbolic
409 links.
410
411 -VAX/VMS Extra Field:
412
413 The following is the layout of the VAX/VMS attributes
414 "extra" block.
415
416 Note: all fields stored in Intel low-byte/high-byte order.
417
418 Value Size Description
419 ----- ---- -----------
420 (VMS) 0x000c 2 bytes Tag for this "extra" block type
421 TSize 2 bytes Size of the total "extra" block
422 CRC 4 bytes 32-bit CRC for remainder of the block
423 Tag1 2 bytes VMS attribute tag value #1
424 Size1 2 bytes Size of attribute #1, in bytes
425 (var.) Size1 Attribute #1 data
426 .
427 .
428 .
429 TagN 2 bytes VMS attribute tage value #N
430 SizeN 2 bytes Size of attribute #N, in bytes
431 (var.) SizeN Attribute #N data
432
433 Rules:
434
435 1. There will be one or more of attributes present, which
436 will each be preceded by the above TagX & SizeX values.
437 These values are identical to the ATR$C_XXXX and
438 ATR$S_XXXX constants which are defined in ATR.H under
439 VMS C. Neither of these values will ever be zero.
440
441 2. No word alignment or padding is performed.
442
443 3. A well-behaved PKZIP/VMS program should never produce
444 more than one sub-block with the same TagX value. Also,
445 there will never be more than one "extra" block of type
446 0x000c in a particular directory record.
447
448 -NTFS Extra Field:
449
450 The following is the layout of the NTFS attributes
451 "extra" block.
452
453 Note: all fields stored in Intel low-byte/high-byte order.
454
455 Value Size Description
456 ----- ---- -----------
457 (NTFS) 0x000a 2 bytes Tag for this "extra" block type
458 TSize 2 bytes Size of the total "extra" block
459 Reserved 4 bytes Reserved for future use
460 Tag1 2 bytes NTFS attribute tag value #1
461 Size1 2 bytes Size of attribute #1, in bytes
462 (var.) Size1 Attribute #1 data
463 .
464 .
465 .
466 TagN 2 bytes NTFS attribute tage value #N
467 SizeN 2 bytes Size of attribute #N, in bytes
468 (var.) SizeN Attribute #N data
469
470 For NTFS, values for Tag1 through TagN are as follows:
471 (currently only one set of attributes is defined for NTFS)
472
473 Tag Size Description
474 ----- ---- -----------
475 0x0001 2 bytes Tag for attribute #1
476 Size1 2 bytes Size of attribute #1, in bytes
477 Mtime 8 bytes File last modification time
478 Atime 8 bytes File last access time
479 Ctime 8 bytes File creation time
480
481 -PATCH Descriptor Extra Field:
482
483 The following is the layout of the Patch Descriptor "extra"
484 block.
485
486 Note: all fields stored in Intel low-byte/high-byte order.
487
488 Value Size Description
489 ----- ---- -----------
490 (Patch) 0x000f 2 bytes Tag for this "extra" block type
491 TSize 2 bytes Size of the total "extra" block
492 Version 2 bytes Version of the descriptor
493 Flags 4 bytes Actions and reactions (see below)
494 OldSize 4 bytes Size of the file about to be patched
495 OldCRC 4 bytes 32-bit CRC of the file to be patched
496 NewSize 4 bytes Size of the resulting file
497 NewCRC 4 bytes 32-bit CRC of the resulting file
498
499 Actions and reactions
500
501 Bits Description
502 ---- ----------------
503 0 Use for autodetection
504 1 Treat as selfpatch
505 2-3 RESERVED
506 4-5 Action (see below)
507 6-7 RESERVED
508 8-9 Reaction (see below) to absent file
509 10-11 Reaction (see below) to newer file
510 12-13 Reaction (see below) to unknown file
511 14-15 RESERVED
512 16-31 RESERVED
513
514 Actions
515
516 Action Value
517 ------ -----
518 none 0
519 add 1
520 delete 2
521 patch 3
522
523 Reactions
524
525 Reaction Value
526 -------- -----
527 ask 0
528 skip 1
529 ignore 2
530 fail 3
531
532 - FWKCS MD5 Extra Field:
533
534 The FWKCS Contents_Signature System, used in
535 automatically identifying files independent of filename,
536 optionally adds and uses an extra field to support the
537 rapid creation of an enhanced contents_signature:
538
539 Header ID = 0x4b46
540 Data Size = 0x0013
541 Preface = 'M','D','5'
542 followed by 16 bytes containing the uncompressed file's
543 128_bit MD5 hash(1), low byte first.
544
545 When FWKCS revises a zipfile central directory to add
546 this extra field for a file, it also replaces the
547 central directory entry for that file's uncompressed
548 filelength with a measured value.
549
550 FWKCS provides an option to strip this extra field, if
551 present, from a zipfile central directory. In adding
552 this extra field, FWKCS preserves Zipfile Authenticity
553 Verification; if stripping this extra field, FWKCS
554 preserves all versions of AV through PKZIP version 2.04g.
555
556 FWKCS, and FWKCS Contents_Signature System, are
557 trademarks of Frederick W. Kantor.
558
559 (1) R. Rivest, RFC1321.TXT, MIT Laboratory for Computer
560 Science and RSA Data Security, Inc., April 1992.
561 ll.76-77: "The MD5 algorithm is being placed in the
562 public domain for review and possible adoption as a
563 standard."
564
565 file comment: (Variable)
566
567 The comment for this file.
568
569 number of this disk: (2 bytes)
570
571 The number of this disk, which contains central
572 directory end record.
573
574 number of the disk with the start of the central
575 directory: (2 bytes)
576
577 The number of the disk on which the central
578 directory starts.
579
580 total number of entries in the central dir on
581 this disk: (2 bytes)
582
583 The number of central directory entries on this disk.
584
585 total number of entries in the central dir: (2 bytes)
586
587 The total number of files in the zipfile.
588
589 size of the central directory: (4 bytes)
590
591 The size (in bytes) of the entire central directory.
592
593 offset of start of central directory with respect to
594 the starting disk number: (4 bytes)
595
596 Offset of the start of the central directory on the
597 disk on which the central directory starts.
598
599 zipfile comment length: (2 bytes)
600
601 The length of the comment for this zipfile.
602
603 zipfile comment: (Variable)
604
605 The comment for this zipfile.
606
607 D. General notes:
608
609 1) All fields unless otherwise noted are unsigned and stored
610 in Intel low-byte:high-byte, low-word:high-word order.
611
612 2) String fields are not null terminated, since the
613 length is given explicitly.
614
615 3) Local headers should not span disk boundaries. Also, even
616 though the central directory can span disk boundaries, no
617 single record in the central directory should be split
618 across disks.
619
620 4) The entries in the central directory may not necessarily
621 be in the same order that files appear in the zipfile.
622
623 UnShrinking - Method 1
624 ----------------------
625
626 Shrinking is a Dynamic Ziv-Lempel-Welch compression algorithm
627 with partial clearing. The initial code size is 9 bits, and
628 the maximum code size is 13 bits. Shrinking differs from
629 conventional Dynamic Ziv-Lempel-Welch implementations in several
630 respects:
631
632 1) The code size is controlled by the compressor, and is not
633 automatically increased when codes larger than the current
634 code size are created (but not necessarily used). When
635 the decompressor encounters the code sequence 256
636 (decimal) followed by 1, it should increase the code size
637 read from the input stream to the next bit size. No
638 blocking of the codes is performed, so the next code at
639 the increased size should be read from the input stream
640 immediately after where the previous code at the smaller
641 bit size was read. Again, the decompressor should not
642 increase the code size used until the sequence 256,1 is
643 encountered.
644
645 2) When the table becomes full, total clearing is not
646 performed. Rather, when the compressor emits the code
647 sequence 256,2 (decimal), the decompressor should clear
648 all leaf nodes from the Ziv-Lempel tree, and continue to
649 use the current code size. The nodes that are cleared
650 from the Ziv-Lempel tree are then re-used, with the lowest
651 code value re-used first, and the highest code value
652 re-used last. The compressor can emit the sequence 256,2
653 at any time.
654
655 Expanding - Methods 2-5
656 -----------------------
657
658 The Reducing algorithm is actually a combination of two
659 distinct algorithms. The first algorithm compresses repeated
660 byte sequences, and the second algorithm takes the compressed
661 stream from the first algorithm and applies a probabilistic
662 compression method.
663
664 The probabilistic compression stores an array of 'follower
665 sets' S(j), for j=0 to 255, corresponding to each possible
666 ASCII character. Each set contains between 0 and 32
667 characters, to be denoted as S(j)[0],...,S(j)[m], where m<32.
668 The sets are stored at the beginning of the data area for a
669 Reduced file, in reverse order, with S(255) first, and S(0)
670 last.
671
672 The sets are encoded as { N(j), S(j)[0],...,S(j)[N(j)-1] },
673 where N(j) is the size of set S(j). N(j) can be 0, in which
674 case the follower set for S(j) is empty. Each N(j) value is
675 encoded in 6 bits, followed by N(j) eight bit character values
676 corresponding to S(j)[0] to S(j)[N(j)-1] respectively. If
677 N(j) is 0, then no values for S(j) are stored, and the value
678 for N(j-1) immediately follows.
679
680 Immediately after the follower sets, is the compressed data
681 stream. The compressed data stream can be interpreted for the
682 probabilistic decompression as follows:
683
684 let Last-Character <- 0.
685 loop until done
686 if the follower set S(Last-Character) is empty then
687 read 8 bits from the input stream, and copy this
688 value to the output stream.
689 otherwise if the follower set S(Last-Character) is non-empty then
690 read 1 bit from the input stream.
691 if this bit is not zero then
692 read 8 bits from the input stream, and copy this
693 value to the output stream.
694 otherwise if this bit is zero then
695 read B(N(Last-Character)) bits from the input
696 stream, and assign this value to I.
697 Copy the value of S(Last-Character)[I] to the
698 output stream.
699
700 assign the last value placed on the output stream to
701 Last-Character.
702 end loop
703
704 B(N(j)) is defined as the minimal number of bits required to
705 encode the value N(j)-1.
706
707 The decompressed stream from above can then be expanded to
708 re-create the original file as follows:
709
710 let State <- 0.
711
712 loop until done
713 read 8 bits from the input stream into C.
714 case State of
715 0: if C is not equal to DLE (144 decimal) then
716 copy C to the output stream.
717 otherwise if C is equal to DLE then
718 let State <- 1.
719
720 1: if C is non-zero then
721 let V <- C.
722 let Len <- L(V)
723 let State <- F(Len).
724 otherwise if C is zero then
725 copy the value 144 (decimal) to the output stream.
726 let State <- 0
727
728 2: let Len <- Len + C
729 let State <- 3.
730
731 3: move backwards D(V,C) bytes in the output stream
732 (if this position is before the start of the output
733 stream, then assume that all the data before the
734 start of the output stream is filled with zeros).
735 copy Len+3 bytes from this position to the output stream.
736 let State <- 0.
737 end case
738 end loop
739
740 The functions F,L, and D are dependent on the 'compression
741 factor', 1 through 4, and are defined as follows:
742
743 For compression factor 1:
744 L(X) equals the lower 7 bits of X.
745 F(X) equals 2 if X equals 127 otherwise F(X) equals 3.
746 D(X,Y) equals the (upper 1 bit of X) * 256 + Y + 1.
747 For compression factor 2:
748 L(X) equals the lower 6 bits of X.
749 F(X) equals 2 if X equals 63 otherwise F(X) equals 3.
750 D(X,Y) equals the (upper 2 bits of X) * 256 + Y + 1.
751 For compression factor 3:
752 L(X) equals the lower 5 bits of X.
753 F(X) equals 2 if X equals 31 otherwise F(X) equals 3.
754 D(X,Y) equals the (upper 3 bits of X) * 256 + Y + 1.
755 For compression factor 4:
756 L(X) equals the lower 4 bits of X.
757 F(X) equals 2 if X equals 15 otherwise F(X) equals 3.
758 D(X,Y) equals the (upper 4 bits of X) * 256 + Y + 1.
759
760 Imploding - Method 6
761 --------------------
762
763 The Imploding algorithm is actually a combination of two distinct
764 algorithms. The first algorithm compresses repeated byte
765 sequences using a sliding dictionary. The second algorithm is
766 used to compress the encoding of the sliding dictionary output,
767 using multiple Shannon-Fano trees.
768
769 The Imploding algorithm can use a 4K or 8K sliding dictionary
770 size. The dictionary size used can be determined by bit 1 in the
771 general purpose flag word; a 0 bit indicates a 4K dictionary
772 while a 1 bit indicates an 8K dictionary.
773
774 The Shannon-Fano trees are stored at the start of the compressed
775 file. The number of trees stored is defined by bit 2 in the
776 general purpose flag word; a 0 bit indicates two trees stored, a
777 1 bit indicates three trees are stored. If 3 trees are stored,
778 the first Shannon-Fano tree represents the encoding of the
779 Literal characters, the second tree represents the encoding of
780 the Length information, the third represents the encoding of the
781 Distance information. When 2 Shannon-Fano trees are stored, the
782 Length tree is stored first, followed by the Distance tree.
783
784 The Literal Shannon-Fano tree, if present is used to represent
785 the entire ASCII character set, and contains 256 values. This
786 tree is used to compress any data not compressed by the sliding
787 dictionary algorithm. When this tree is present, the Minimum
788 Match Length for the sliding dictionary is 3. If this tree is
789 not present, the Minimum Match Length is 2.
790
791 The Length Shannon-Fano tree is used to compress the Length part
792 of the (length,distance) pairs from the sliding dictionary
793 output. The Length tree contains 64 values, ranging from the
794 Minimum Match Length, to 63 plus the Minimum Match Length.
795
796 The Distance Shannon-Fano tree is used to compress the Distance
797 part of the (length,distance) pairs from the sliding dictionary
798 output. The Distance tree contains 64 values, ranging from 0 to
799 63, representing the upper 6 bits of the distance value. The
800 distance values themselves will be between 0 and the sliding
801 dictionary size, either 4K or 8K.
802
803 The Shannon-Fano trees themselves are stored in a compressed
804 format. The first byte of the tree data represents the number of
805 bytes of data representing the (compressed) Shannon-Fano tree
806 minus 1. The remaining bytes represent the Shannon-Fano tree
807 data encoded as:
808
809 High 4 bits: Number of values at this bit length + 1. (1 - 16)
810 Low 4 bits: Bit Length needed to represent value + 1. (1 - 16)
811
812 The Shannon-Fano codes can be constructed from the bit lengths
813 using the following algorithm:
814
815 1) Sort the Bit Lengths in ascending order, while retaining the
816 order of the original lengths stored in the file.
817
818 2) Generate the Shannon-Fano trees:
819
820 Code <- 0
821 CodeIncrement <- 0
822 LastBitLength <- 0
823 i <- number of Shannon-Fano codes - 1 (either 255 or 63)
824
825 loop while i >= 0
826 Code = Code + CodeIncrement
827 if BitLength(i) <> LastBitLength then
828 LastBitLength=BitLength(i)
829 CodeIncrement = 1 shifted left (16 - LastBitLength)
830 ShannonCode(i) = Code
831 i <- i - 1
832 end loop
833
834 3) Reverse the order of all the bits in the above ShannonCode()
835 vector, so that the most significant bit becomes the least
836 significant bit. For example, the value 0x1234 (hex) would
837 become 0x2C48 (hex).
838
839 4) Restore the order of Shannon-Fano codes as originally stored
840 within the file.
841
842 Example:
843
844 This example will show the encoding of a Shannon-Fano tree
845 of size 8. Notice that the actual Shannon-Fano trees used
846 for Imploding are either 64 or 256 entries in size.
847
848 Example: 0x02, 0x42, 0x01, 0x13
849
850 The first byte indicates 3 values in this table. Decoding the
851 bytes:
852 0x42 = 5 codes of 3 bits long
853 0x01 = 1 code of 2 bits long
854 0x13 = 2 codes of 4 bits long
855
856 This would generate the original bit length array of:
857 (3, 3, 3, 3, 3, 2, 4, 4)
858
859 There are 8 codes in this table for the values 0 thru 7. Using
860 the algorithm to obtain the Shannon-Fano codes produces:
861
862 Reversed Order Original
863 Val Sorted Constructed Code Value Restored Length
864 --- ------ ----------------- -------- -------- ------
865 0: 2 1100000000000000 11 101 3
866 1: 3 1010000000000000 101 001 3
867 2: 3 1000000000000000 001 110 3
868 3: 3 0110000000000000 110 010 3
869 4: 3 0100000000000000 010 100 3
870 5: 3 0010000000000000 100 11 2
871 6: 4 0001000000000000 1000 1000 4
872 7: 4 0000000000000000 0000 0000 4
873
874 The values in the Val, Order Restored and Original Length columns
875 now represent the Shannon-Fano encoding tree that can be used for
876 decoding the Shannon-Fano encoded data. How to parse the
877 variable length Shannon-Fano values from the data stream is beyond
878 the scope of this document. (See the references listed at the end of
879 this document for more information.) However, traditional decoding
880 schemes used for Huffman variable length decoding, such as the
881 Greenlaw algorithm, can be successfully applied.
882
883 The compressed data stream begins immediately after the
884 compressed Shannon-Fano data. The compressed data stream can be
885 interpreted as follows:
886
887 loop until done
888 read 1 bit from input stream.
889
890 if this bit is non-zero then (encoded data is literal data)
891 if Literal Shannon-Fano tree is present
892 read and decode character using Literal Shannon-Fano tree.
893 otherwise
894 read 8 bits from input stream.
895 copy character to the output stream.
896 otherwise (encoded data is sliding dictionary match)
897 if 8K dictionary size
898 read 7 bits for offset Distance (lower 7 bits of offset).
899 otherwise
900 read 6 bits for offset Distance (lower 6 bits of offset).
901
902 using the Distance Shannon-Fano tree, read and decode the
903 upper 6 bits of the Distance value.
904
905 using the Length Shannon-Fano tree, read and decode
906 the Length value.
907
908 Length <- Length + Minimum Match Length
909
910 if Length = 63 + Minimum Match Length
911 read 8 bits from the input stream,
912 add this value to Length.
913
914 move backwards Distance+1 bytes in the output stream, and
915 copy Length characters from this position to the output
916 stream. (if this position is before the start of the output
917 stream, then assume that all the data before the start of
918 the output stream is filled with zeros).
919 end loop
920
921 Tokenizing - Method 7
922 --------------------
923
924 This method is not used by PKZIP.
925
926 Deflating - Method 8
927 -----------------
928
929 The Deflate algorithm is similar to the Implode algorithm using
930 a sliding dictionary of up to 32K with secondary compression
931 from Huffman/Shannon-Fano codes.
932
933 The compressed data is stored in blocks with a header describing
934 the block and the Huffman codes used in the data block. The header
935 format is as follows:
936
937 Bit 0: Last Block bit This bit is set to 1 if this is the last
938 compressed block in the data.
939 Bits 1-2: Block type
940 00 (0) - Block is stored - All stored data is byte aligned.
941 Skip bits until next byte, then next word = block
942 length, followed by the ones compliment of the block
943 length word. Remaining data in block is the stored
944 data.
945
946 01 (1) - Use fixed Huffman codes for literal and distance codes.
947 Lit Code Bits Dist Code Bits
948 --------- ---- --------- ----
949 0 - 143 8 0 - 31 5
950 144 - 255 9
951 256 - 279 7
952 280 - 287 8
953
954 Literal codes 286-287 and distance codes 30-31 are
955 never used but participate in the huffman construction.
956
957 10 (2) - Dynamic Huffman codes. (See expanding Huffman codes)
958
959 11 (3) - Reserved - Flag a "Error in compressed data" if seen.
960
961 Expanding Huffman Codes
962 -----------------------
963 If the data block is stored with dynamic Huffman codes, the Huffman
964 codes are sent in the following compressed format:
965
966 5 Bits: # of Literal codes sent - 256 (256 - 286)
967 All other codes are never sent.
968 5 Bits: # of Dist codes - 1 (1 - 32)
969 4 Bits: # of Bit Length codes - 3 (3 - 19)
970
971 The Huffman codes are sent as bit lengths and the codes are built as
972 described in the implode algorithm. The bit lengths themselves are
973 compressed with Huffman codes. There are 19 bit length codes:
974
975 0 - 15: Represent bit lengths of 0 - 15
976 16: Copy the previous bit length 3 - 6 times.
977 The next 2 bits indicate repeat length (0 = 3, ... ,3 = 6)
978 Example: Codes 8, 16 (+2 bits 11), 16 (+2 bits 10) will
979 expand to 12 bit lengths of 8 (1 + 6 + 5)
980 17: Repeat a bit length of 0 for 3 - 10 times. (3 bits of length)
981 18: Repeat a bit length of 0 for 11 - 138 times (7 bits of length)
982
983 The lengths of the bit length codes are sent packed 3 bits per value
984 (0 - 7) in the following order:
985
986 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15
987
988 The Huffman codes should be built as described in the Implode algorithm
989 except codes are assigned starting at the shortest bit length, i.e. the
990 shortest code should be all 0's rather than all 1's. Also, codes with
991 a bit length of zero do not participate in the tree construction. The
992 codes are then used to decode the bit lengths for the literal and
993 distance tables.
994
995 The bit lengths for the literal tables are sent first with the number
996 of entries sent described by the 5 bits sent earlier. There are up
997 to 286 literal characters; the first 256 represent the respective 8
998 bit character, code 256 represents the End-Of-Block code, the remaining
999 29 codes represent copy lengths of 3 thru 258. There are up to 30
1000 distance codes representing distances from 1 thru 32k as described
1001 below.
1002
1003 Length Codes
1004 ------------
1005 Extra Extra Extra Extra
1006 Code Bits Length Code Bits Lengths Code Bits Lengths Code Bits Length(s)
1007 ---- ---- ------ ---- ---- ------- ---- ---- ------- ---- ---- ---------
1008 257 0 3 265 1 11,12 273 3 35-42 281 5 131-162
1009 258 0 4 266 1 13,14 274 3 43-50 282 5 163-194
1010 259 0 5 267 1 15,16 275 3 51-58 283 5 195-226
1011 260 0 6 268 1 17,18 276 3 59-66 284 5 227-257
1012 261 0 7 269 2 19-22 277 4 67-82 285 0 258
1013 262 0 8 270 2 23-26 278 4 83-98
1014 263 0 9 271 2 27-30 279 4 99-114
1015 264 0 10 272 2 31-34 280 4 115-130
1016
1017 Distance Codes
1018 --------------
1019 Extra Extra Extra Extra
1020 Code Bits Dist Code Bits Dist Code Bits Distance Code Bits Distance
1021 ---- ---- ---- ---- ---- ------ ---- ---- -------- ---- ---- --------
1022 0 0 1 8 3 17-24 16 7 257-384 24 11 4097-6144
1023 1 0 2 9 3 25-32 17 7 385-512 25 11 6145-8192
1024 2 0 3 10 4 33-48 18 8 513-768 26 12 8193-12288
1025 3 0 4 11 4 49-64 19 8 769-1024 27 12 12289-16384
1026 4 1 5,6 12 5 65-96 20 9 1025-1536 28 13 16385-24576
1027 5 1 7,8 13 5 97-128 21 9 1537-2048 29 13 24577-32768
1028 6 2 9-12 14 6 129-192 22 10 2049-3072
1029 7 2 13-16 15 6 193-256 23 10 3073-4096
1030
1031 The compressed data stream begins immediately after the
1032 compressed header data. The compressed data stream can be
1033 interpreted as follows:
1034
1035 do
1036 read header from input stream.
1037
1038 if stored block
1039 skip bits until byte aligned
1040 read count and 1's compliment of count
1041 copy count bytes data block
1042 otherwise
1043 loop until end of block code sent
1044 decode literal character from input stream
1045 if literal < 256
1046 copy character to the output stream
1047 otherwise
1048 if literal = end of block
1049 break from loop
1050 otherwise
1051 decode distance from input stream
1052
1053 move backwards distance bytes in the output stream, and
1054 copy length characters from this position to the output
1055 stream.
1056 end loop
1057 while not last block
1058
1059 if data descriptor exists
1060 skip bits until byte aligned
1061 read crc and sizes
1062 endif
1063
1064 Decryption
1065 ----------
1066
1067 The encryption used in PKZIP was generously supplied by Roger
1068 Schlafly. PKWARE is grateful to Mr. Schlafly for his expert
1069 help and advice in the field of data encryption.
1070
1071 PKZIP encrypts the compressed data stream. Encrypted files must
1072 be decrypted before they can be extracted.
1073
1074 Each encrypted file has an extra 12 bytes stored at the start of
1075 the data area defining the encryption header for that file. The
1076 encryption header is originally set to random values, and then
1077 itself encrypted, using three, 32-bit keys. The key values are
1078 initialized using the supplied encryption password. After each byte
1079 is encrypted, the keys are then updated using pseudo-random number
1080 generation techniques in combination with the same CRC-32 algorithm
1081 used in PKZIP and described elsewhere in this document.
1082
1083 The following is the basic steps required to decrypt a file:
1084
1085 1) Initialize the three 32-bit keys with the password.
1086 2) Read and decrypt the 12-byte encryption header, further
1087 initializing the encryption keys.
1088 3) Read and decrypt the compressed data stream using the
1089 encryption keys.
1090
1091 Step 1 - Initializing the encryption keys
1092 -----------------------------------------
1093
1094 Key(0) <- 305419896
1095 Key(1) <- 591751049
1096 Key(2) <- 878082192
1097
1098 loop for i <- 0 to length(password)-1
1099 update_keys(password(i))
1100 end loop
1101
1102 Where update_keys() is defined as:
1103
1104 update_keys(char):
1105 Key(0) <- crc32(key(0),char)
1106 Key(1) <- Key(1) + (Key(0) & 000000ffH)
1107 Key(1) <- Key(1) * 134775813 + 1
1108 Key(2) <- crc32(key(2),key(1) >> 24)
1109 end update_keys
1110
1111 Where crc32(old_crc,char) is a routine that given a CRC value and a
1112 character, returns an updated CRC value after applying the CRC-32
1113 algorithm described elsewhere in this document.
1114
1115 Step 2 - Decrypting the encryption header
1116 -----------------------------------------
1117
1118 The purpose of this step is to further initialize the encryption
1119 keys, based on random data, to render a plaintext attack on the
1120 data ineffective.
1121
1122 Read the 12-byte encryption header into Buffer, in locations
1123 Buffer(0) thru Buffer(11).
1124
1125 loop for i <- 0 to 11
1126 C <- buffer(i) ^ decrypt_byte()
1127 update_keys(C)
1128 buffer(i) <- C
1129 end loop
1130
1131 Where decrypt_byte() is defined as:
1132
1133 unsigned char decrypt_byte()
1134 local unsigned short temp
1135 temp <- Key(2) | 2
1136 decrypt_byte <- (temp * (temp ^ 1)) >> 8
1137 end decrypt_byte
1138
1139 After the header is decrypted, the last 1 or 2 bytes in Buffer
1140 should be the high-order word/byte of the CRC for the file being
1141 decrypted, stored in Intel low-byte/high-byte order. Versions of
1142 PKZIP prior to 2.0 used a 2 byte CRC check; a 1 byte CRC check is
1143 used on versions after 2.0. This can be used to test if the password
1144 supplied is correct or not.
1145
1146 Step 3 - Decrypting the compressed data stream
1147 ----------------------------------------------
1148
1149 The compressed data stream can be decrypted as follows:
1150
1151 loop until done
1152 read a character into C
1153 Temp <- C ^ decrypt_byte()
1154 update_keys(temp)
1155 output Temp
1156 end loop
1157
1158 In addition to the above mentioned contributors to PKZIP and PKUNZIP,
1159 I would like to extend special thanks to Robert Mahoney for suggesting
1160 the extension .ZIP for this software.
1161
1162 References:
1163
1164 Fiala, Edward R., and Greene, Daniel H., "Data compression with
1165 finite windows", Communications of the ACM, Volume 32, Number 4,
1166 April 1989, pages 490-505.
1167
1168 Held, Gilbert, "Data Compression, Techniques and Applications,
1169 Hardware and Software Considerations", John Wiley & Sons, 1987.
1170
1171 Huffman, D.A., "A method for the construction of minimum-redundancy
1172 codes", Proceedings of the IRE, Volume 40, Number 9, September 1952,
1173 pages 1098-1101.
1174
1175 Nelson, Mark, "LZW Data Compression", Dr. Dobbs Journal, Volume 14,
1176 Number 10, October 1989, pages 29-37.
1177
1178 Nelson, Mark, "The Data Compression Book", M&T Books, 1991.
1179
1180 Storer, James A., "Data Compression, Methods and Theory",
1181 Computer Science Press, 1988
1182
1183 Welch, Terry, "A Technique for High-Performance Data Compression",
1184 IEEE Computer, Volume 17, Number 6, June 1984, pages 8-19.
1185
1186 Ziv, J. and Lempel, A., "A universal algorithm for sequential data
1187 compression", Communications of the ACM, Volume 30, Number 6,
1188 June 1987, pages 520-540.
1189
1190 Ziv, J. and Lempel, A., "Compression of individual sequences via
1191 variable-rate coding", IEEE Transactions on Information Theory,
1192 Volume 24, Number 5, September 1978, pages 530-536.

mercurial