File: lzip.info, Node: Quality assurance, Next: Algorithm, Prev: Invoking lzip, Up: Top 4 Design, development, and testing of lzip ****************************************** There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult. -- C.A.R. Hoare Lzip has been designed, written, and tested with great care to replace gzip and bzip2 as the standard general-purpose compressed format for Unix-like systems. This chapter describes the lessons learned from these previous formats, and their application to the design of lzip. The lzip format specification has been reviewed carefully and is believed to be free from design errors. 4.1 Format design ================= When gzip was designed in 1992, computers and operating systems were much less capable than they are today. The designers of gzip tried to work around some of those limitations, like 8.3 file names, with additional fields in the file format. Today those limitations have mostly disappeared, and the format of gzip has proved to be unnecessarily complicated. It includes fields that were never used, others that have lost their usefulness, and finally others that have become too limited. Bzip2 was designed 5 years later, and its format is simpler than the one of gzip. Probably the worst defect of the gzip format from the point of view of data safety is the variable size of its header. If the byte at offset 3 (flags) of a gzip member gets corrupted, it may become difficult to recover the data, even if the compressed blocks are intact, because it can't be known with certainty where the compressed blocks begin. By contrast, the header of a lzip member has a fixed length of 6. The LZMA stream in a lzip member always starts at offset 6, making it trivial to recover the data even if the whole header becomes corrupt. Bzip2 also provides a header of fixed length and marks the begin and end of each compressed block with six magic bytes, making it possible to find the compressed blocks even in case of file damage. But bzip2 does not store the size of each compressed block, as lzip does. Lziprecover is able to provide unique data recovery capabilities because the lzip format is extraordinarily safe. The simple and safe design of the file format complements the embedded error detection provided by the LZMA data stream. Any distance larger than the dictionary size acts as a forbidden symbol, allowing the decompressor to detect the approximate position of errors, and leaving very little work for the check sequence (CRC and data sizes) in the detection of errors. Lzip is usually able to detect all possible bit flips in the compressed data without resorting to the check sequence. It would be difficult to write an automatic recovery tool like lziprecover for the gzip format. And, as far as I know, it has never been written. Lzip, like gzip and bzip2, uses a CRC32 to check the integrity of the decompressed data because it provides optimal accuracy in the detection of errors up to a compressed size of about 16 GiB, a size larger than that of most files. In the case of lzip, the additional detection capability of the decompressor reduces the probability of undetected errors several million times more, resulting in a combined integrity checking optimally accurate for any member size produced by lzip. Preliminary results suggest that the lzip format is safe enough to be used in critical safety avionics systems. The lzip format is designed for long-term archiving. Therefore it excludes any unneeded features that may interfere with the future extraction of the decompressed data. 4.1.1 Gzip format (mis)features not present in lzip --------------------------------------------------- 'Multiple algorithms' Gzip provides a CM (Compression Method) field that has never been used because it is a bad idea to begin with. New compression methods may require additional fields, making it impossible to implement new methods and, at the same time, keep the same format. This field does not solve the problem of format proliferation; it just makes the problem less obvious. 'Optional fields in header' Unless special precautions are taken, optional fields are generally a bad idea because they produce a header of variable size. The gzip header has 2 fields that, in addition to being optional, are zero-terminated. This means that if any byte inside the field gets zeroed, or if the terminating zero gets altered, gzip won't be able to find neither the header CRC nor the compressed blocks. 'Optional CRC for the header' Using an optional CRC for the header is not only a bad idea, it is an error; it circumvents the Hamming distance (HD) of the CRC and may prevent the extraction of perfectly good data. For example, if the CRC is used and the bit enabling it is reset by a bit flip, then the header seems to be intact (in spite of being corrupt) while the compressed blocks seem to be totally unrecoverable (in spite of being intact). Very misleading indeed. 'Metadata' The gzip format stores some metadata, like the modification time of the original file or the operating system on which compression took place. This complicates reproducible compression (obtaining identical compressed output from identical input). 4.1.2 Lzip format improvements over gzip and bzip2 -------------------------------------------------- '64-bit size field' Probably the most frequently reported shortcoming of the gzip format is that it only stores the least significant 32 bits of the uncompressed size. The size of any file larger or equal than 4 GiB gets truncated. Bzip2 does not store the uncompressed size of the file. The lzip format provides a 64-bit field for the uncompressed size. Additionally, lzip produces multimember output automatically when the size is too large for a single member, allowing for an unlimited uncompressed size. 'Distributed index' The lzip format provides a distributed index that, among other things, helps plzip to decompress several times faster than pigz and helps lziprecover do its job. Neither the gzip format nor the bzip2 format do provide an index. A distributed index is safer and more scalable than a monolithic index. The monolithic index introduces a single point of failure in the compressed file and may limit the number of members or the total uncompressed size. 4.2 Quality of implementation ============================= Our civilization depends critically on software; it had better be quality software. -- Bjarne Stroustrup 'Accurate and robust error detection' The lzip format provides 3-factor integrity checking, and the decompressors report mismatches in each factor separately. This method detects most false positives for corruption. If just one byte in one factor fails but the other two factors match the data, it probably means that the data are intact and the corruption just affects the mismatching factor (CRC, data size, or member size) in the member trailer. 'Multiple implementations' Just like the lzip format provides 3-factor protection against undetected data corruption, the development methodology of the lzip family of compressors provides 3-factor protection against undetected programming errors. Three related but independent compressor implementations, lzip, clzip, and minilzip/lzlib, are developed concurrently. Every stable release of any of them is tested to check that it produces identical output to the other two. This guarantees that all three implement the same algorithm, and makes it unlikely that any of them may contain serious undiscovered errors. In fact, no errors have been discovered in lzip since 2009. Additionally, the three implementations have been extensively tested with unzcrash, valgrind, and 'american fuzzy lop' without finding a single vulnerability or false negative. *Note Unzcrash: (lziprecover)Unzcrash. 'Dictionary size' Lzip automatically adapts the dictionary size to the size of each file. In addition to reducing the amount of memory required for decompression, this feature also minimizes the probability of being affected by RAM errors during compression. 'Exit status' Returning a warning status of 2 is a design flaw of compress that leaked into the design of gzip. Both bzip2 and lzip are free from this flaw.