File: lzip.info, Node: Algorithm, Next: File format, Prev: Quality assurance, Up: Top 5 Algorithm *********** In spite of its name (Lempel-Ziv-Markov chain-Algorithm), LZMA is not a concrete algorithm; it is more like "any algorithm using the LZMA coding scheme". LZMA compression consists in describing the uncompressed data as a succession of coding sequences from the set shown in Section 'What is coded' (*note what-is-coded::), and then encoding them using a range encoder. For example, the option '-0' of lzip uses the scheme in almost the simplest way possible; issuing the longest match it can find, or a literal byte if it can't find a match. Inversely, a much more elaborated way of finding coding sequences of minimum size than the one currently used by lzip could be developed, and the resulting sequence could also be coded using the LZMA coding scheme. Lzip currently implements two variants of the LZMA algorithm: fast (used by option '-0') and normal (used by all other compression levels). The high compression of LZMA comes from combining two basic, well-proven compression ideas: sliding dictionaries (LZ77) and markov models (the thing used by every compression algorithm that uses a range encoder or similar order-0 entropy coder as its last stage) with segregation of contexts according to what the bits are used for. Lzip is a two stage compressor. The first stage is a Lempel-Ziv coder, which reduces redundancy by translating chunks of data to their corresponding distance-length pairs. The second stage is a range encoder that uses a different probability model for each type of data: distances, lengths, literal bytes, etc. Here is how it works, step by step: 1) The member header is written to the output stream. 2) The first byte is coded literally, because there are no previous bytes to which the match finder can refer to. 3) The main encoder advances to the next byte in the input data and calls the match finder. 4) The match finder fills an array with the minimum distances before the current byte where a match of a given length can be found. 5) Go back to step 3 until a sequence (formed of pairs, repeated distances, and literal bytes) of minimum price has been formed. Where the price represents the number of output bits produced. 6) The range encoder encodes the sequence produced by the main encoder and sends the bytes produced to the output stream. 7) Go back to step 3 until the input data are finished or until the member or volume size limits are reached. 8) The range encoder is flushed. 9) The member trailer is written to the output stream. 10) If there are more data to compress, go back to step 1. During compression, lzip reads data in large blocks (one dictionary size at a time). Therefore it may block for up to tens of seconds any process feeding data to it through a pipe. This is normal. The blocking intervals get longer with higher compression levels because dictionary size increases (and compression speed decreases) with compression level. The ideas embodied in lzip are due to (at least) the following people: Abraham Lempel and Jacob Ziv (for the LZ algorithm), Andrei Markov (for the definition of Markov chains), G.N.N. Martin (for the definition of range encoding), Igor Pavlov (for putting all the above together in LZMA), and Julian Seward (for bzip2's CLI).