info lzip


File: lzip.info,  Node: Algorithm,  Next: File format,  Prev: Quality assurance,  Up: Top

5 Algorithm
***********

In spite of its name (Lempel-Ziv-Markov chain-Algorithm), LZMA is not a
concrete algorithm; it is more like "any algorithm using the LZMA coding
scheme". LZMA compression consists in describing the uncompressed data as a
succession of coding sequences from the set shown in Section 'What is
coded' (*note what-is-coded::), and then encoding them using a range
encoder. For example, the option '-0' of lzip uses the scheme in almost the
simplest way possible; issuing the longest match it can find, or a literal
byte if it can't find a match. Inversely, a much more elaborated way of
finding coding sequences of minimum size than the one currently used by
lzip could be developed, and the resulting sequence could also be coded
using the LZMA coding scheme.

   Lzip currently implements two variants of the LZMA algorithm: fast (used
by option '-0') and normal (used by all other compression levels).

   The high compression of LZMA comes from combining two basic, well-proven
compression ideas: sliding dictionaries (LZ77) and markov models (the thing
used by every compression algorithm that uses a range encoder or similar
order-0 entropy coder as its last stage) with segregation of contexts
according to what the bits are used for.

   Lzip is a two stage compressor. The first stage is a Lempel-Ziv coder,
which reduces redundancy by translating chunks of data to their
corresponding distance-length pairs. The second stage is a range encoder
that uses a different probability model for each type of data: distances,
lengths, literal bytes, etc.

   Here is how it works, step by step:

   1) The member header is written to the output stream.

   2) The first byte is coded literally, because there are no previous
bytes to which the match finder can refer to.

   3) The main encoder advances to the next byte in the input data and
calls the match finder.

   4) The match finder fills an array with the minimum distances before the
current byte where a match of a given length can be found.

   5) Go back to step 3 until a sequence (formed of pairs, repeated
distances, and literal bytes) of minimum price has been formed. Where the
price represents the number of output bits produced.

   6) The range encoder encodes the sequence produced by the main encoder
and sends the bytes produced to the output stream.

   7) Go back to step 3 until the input data are finished or until the
member or volume size limits are reached.

   8) The range encoder is flushed.

   9) The member trailer is written to the output stream.

   10) If there are more data to compress, go back to step 1.


   During compression, lzip reads data in large blocks (one dictionary size
at a time). Therefore it may block for up to tens of seconds any process
feeding data to it through a pipe. This is normal. The blocking intervals
get longer with higher compression levels because dictionary size increases
(and compression speed decreases) with compression level.

The ideas embodied in lzip are due to (at least) the following people:
Abraham Lempel and Jacob Ziv (for the LZ algorithm), Andrei Markov (for the
definition of Markov chains), G.N.N. Martin (for the definition of range
encoding), Igor Pavlov (for putting all the above together in LZMA), and
Julian Seward (for bzip2's CLI).