
File: gawk.info,  Node: Persistent Memory,  Next: Extension Philosophy,  Prev: Profiling,  Up: Advanced Features

12.7 Preserving Data Between Runs
=================================

Starting with version 5.2, 'gawk' supports "persistent memory".  This
experimental feature stores the values of all of 'gawk''s variables,
arrays and user-defined functions in a persistent heap, which resides in
a file in the filesystem.  When persistent memory is not in use (the
normal case), 'gawk''s data resides in ephemeral system memory.

   Persistent memory is enabled on certain 64-bit systems supporting the
'mmap()' and 'munmap()' system calls.  'gawk' must be compiled as a
non-PIE (Position Independent Executable) binary, since the persistent
store ends up holding pointers to functions held within the 'gawk'
executable.  This also means that to use the persistent memory, you must
use the same 'gawk' executable from run to run.

   You can see if your version of 'gawk' supports persistent memory like
so:

     $ gawk --version
     -| GNU Awk 5.2.0, API 3.2, PMA Avon 7, (GNU MPFR 4.0.1, GNU MP 6.1.2)
     -| Copyright (C) 1989, 1991-2022 Free Software Foundation.
     ...

If you see 'PMA' followed by a version indicator, then it's supported.
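
   A script can make the same check before relying on the feature.  The
following is a minimal sketch; the 'has_pma' helper is a name invented
for this example and is not part of 'gawk':

```shell
# has_pma is a hypothetical helper: it succeeds when the gawk found on
# PATH mentions "PMA" in its version banner.
has_pma() {
    banner=$(gawk --version 2>/dev/null) || return 1
    case $banner in
        *"PMA "*) return 0 ;;   # e.g. "... API 3.2, PMA Avon 7, ..."
        *)        return 1 ;;
    esac
}

if has_pma; then
    pma_status="supported"
else
    pma_status="not supported"
fi
echo "persistent memory: $pma_status"
```

The check only inspects the banner text; it does not prove that a given
backing file will work on a given filesystem.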

   As of this writing, persistent memory has only been tested on
GNU/Linux, Cygwin, Solaris 2.11, macOS systems,(1) FreeBSD 13.1 and
OpenBSD 7.1.  On all others, persistent memory is disabled by default.
You can force it to be enabled by exporting the shell variable
'REALLY_USE_PERSIST_MALLOC' with a nonempty value before running
'configure' (*note Quick Installation::).  If you do so and all the
tests pass, please let the maintainer know.

   To use persistent memory, follow these steps:

  1. Create a new, empty sparse file of the desired size, for example,
     four gigabytes.  On a GNU/Linux system, you can use the 'truncate'
     utility:

          $ truncate -s 4G data.pma

  2. Provide the path to the data file in the 'GAWK_PERSIST_FILE'
     environment variable.  This is best done by placing the value in
     the environment just for the run of 'gawk', like so:

          $ GAWK_PERSIST_FILE=data.pma gawk 'BEGIN { print ++i }'
          1

  3. Use the same data file in subsequent runs to use the preserved data
     values:

          $ GAWK_PERSIST_FILE=data.pma gawk 'BEGIN { print ++i }'
          2
          $ GAWK_PERSIST_FILE=data.pma gawk 'BEGIN { print ++i }'
          3

     As shown, in subsequent runs using the same data file, the values
     of 'gawk''s variables are preserved.  However, 'gawk''s special
     variables, such as 'NR', are reset upon each run.  Only the
     variables defined by the program are preserved across runs.
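
   The three steps can be combined into one script.  This is only a
sketch: the temporary directory and the shell variable names are
assumptions made for the example, and the runs are skipped when the
installed 'gawk' does not report PMA support:

```shell
# Sketch of the three steps.  Only GAWK_PERSIST_FILE comes from gawk;
# the directory and the shell variable names are illustrative.
workdir=$(mktemp -d)
pma_file="$workdir/data.pma"

# Step 1: create a new, empty sparse file of the desired size.
truncate -s 4G "$pma_file"

supports_pma=no
case $(gawk --version 2>/dev/null) in
    *"PMA "*) supports_pma=yes ;;
esac

# Steps 2 and 3: name the file in the environment of each run only.
results=""
if [ "$supports_pma" = yes ]; then
    for run in 1 2 3; do
        value=$(GAWK_PERSIST_FILE="$pma_file" gawk 'BEGIN { print ++i }') || break
        results="$results $value"
    done
    echo "values seen across runs:$results"
else
    echo "this gawk lacks PMA support; skipping the runs"
fi
```

Remove the directory once the preserved data is no longer needed.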

   Interestingly, the program that you execute need not be the same from
run to run; the persistent store only maintains the values of variables,
arrays, and user-defined functions, not the totality of 'gawk''s
internal state.  This lets you share data between unrelated programs,
eliminating the need for scripts to communicate via text files.
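
   As an illustration of sharing data between unrelated programs, the
following sketch has one program fill an array and a second, different
program read it.  The file location and the array name 'count' are
assumptions made for the example:

```shell
# Two unrelated gawk programs sharing an array through one backing file.
# The file location and the array name "count" are illustrative.
share_dir=$(mktemp -d)
share_file="$share_dir/shared.pma"
truncate -s 4G "$share_file"

supports_pma=no
case $(gawk --version 2>/dev/null) in
    *"PMA "*) supports_pma=yes ;;
esac

total=""
if [ "$supports_pma" = yes ]; then
    # Program 1: populate the array and exit; no text file is written.
    if GAWK_PERSIST_FILE="$share_file" \
        gawk 'BEGIN { count["apple"] = 3; count["pear"] = 5 }'
    then
        # Program 2: a different program reads what program 1 left behind.
        total=$(GAWK_PERSIST_FILE="$share_file" \
            gawk 'BEGIN { print count["apple"] + count["pear"] }') || total=""
    fi
    echo "total from the shared array: $total"
else
    echo "this gawk lacks PMA support; skipping the runs"
fi
```

With working persistent memory, the second program should see both
elements and print their sum.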

   Terence Kelly, the author of the persistent memory allocator 'gawk'
uses, provides the following advice about the backing file:

     Regarding backing file size, I recommend making it far larger than
     all of the data that will ever reside in it, assuming that the file
     system supports sparse files.  The "pay only for what you use"
     aspect of sparse files ensures that the actual storage resource
     footprint of the backing file will meet the application's needs but
     will be as small as possible.  If the file system does _not_
     support sparse files, there's a dilemma: Making the backing file
     too large is wasteful, but making it too small risks memory
     exhaustion, i.e., 'pma_malloc()' returns 'NULL'.  But persistent
     'gawk' should still work even without sparse files.
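
   One way to observe the "pay only for what you use" behavior is to
compare a backing file's apparent size with the space actually
allocated to it.  This sketch assumes the GNU 'truncate' and 'stat'
utilities; the file name is illustrative:

```shell
# Compare the logical size of a fresh backing file with the storage
# actually allocated to it.  Assumes GNU truncate and GNU stat.
sparse_dir=$(mktemp -d)
backing="$sparse_dir/big.pma"
truncate -s 4G "$backing"

apparent_bytes=$(stat -c %s "$backing")      # size the file claims to have
blocks=$(stat -c %b "$backing")              # blocks really allocated
block_size=$(stat -c %B "$backing")          # bytes per block reported by %b
allocated_bytes=$((blocks * block_size))

echo "apparent:  $apparent_bytes bytes"
echo "allocated: $allocated_bytes bytes"
```

On a file system that supports sparse files, the allocated figure
should be far smaller than the apparent one until 'gawk' stores data in
the file.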

   You can disable the use of the persistent memory allocator in 'gawk'
with the '--disable-pma' option to the 'configure' command at the time
that you build 'gawk' (*note Unix Installation::).

   You can set the 'PMA_VERBOSITY' environment variable to a value
between zero and three to control how much debugging and error
information the persistent memory allocator will print.  'gawk' sets the
default to one.  See the 'support/pma.c' source code to understand what
the different verbosity levels are.
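
   Like 'GAWK_PERSIST_FILE', the verbosity is conveniently set just for
one run.  The wrapper below is a hypothetical convenience, not part of
'gawk'; it assumes the range of zero to three is inclusive and rejects
anything else before starting 'gawk':

```shell
# run_with_pma_verbosity is a hypothetical wrapper, not part of gawk.
# Usage: run_with_pma_verbosity LEVEL BACKING_FILE gawk-arguments...
run_with_pma_verbosity() {
    level=$1 backing=$2
    shift 2
    case $level in
        0|1|2|3) ;;                        # the documented range
        *) echo "PMA_VERBOSITY must be 0, 1, 2 or 3" >&2
           return 2 ;;
    esac
    # Both variables are placed in the environment of this command only.
    PMA_VERBOSITY=$level GAWK_PERSIST_FILE=$backing gawk "$@"
}

# An out-of-range level is refused without running gawk at all.
run_with_pma_verbosity 7 data.pma 'BEGIN { print ++i }' ||
    echo "rejected before gawk was started"
```

A valid call would look like
'run_with_pma_verbosity 3 data.pma 'BEGIN { print ++i }''.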

     NOTE: If you use MPFR mode (the '-M' option) on the first run of a
     program using persistent memory, you _must_ continue to use it on
     all subsequent runs.  Similarly, if you don't use '-M' on the first
     run, do not use it on any subsequent runs.

     Mixing and matching MPFR mode and regular mode with the same
     backing file will lead to strange results and/or core dumps.
     'gawk' does not currently detect such a situation and may not do so
     in the future either.

     Additionally, the GNU/Linux CIFS filesystem is known to not work
     well with the PMA allocator.  Don't use a backing file on a CIFS
     filesystem.
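
   Since 'gawk' does not detect a change of mode itself, a wrapper
script can guard against it.  The following is a sketch of one possible
convention, not a 'gawk' feature: the 'check_pma_mode' helper and the
'.mode' file kept next to the backing file are both invented for this
example:

```shell
# check_pma_mode and the ".mode" sidecar file are hypothetical
# conventions, not gawk features.  MODE is "mpfr" (for -M) or "regular".
check_pma_mode() {
    backing=$1 mode=$2
    sidecar="$backing.mode"
    if [ ! -e "$sidecar" ]; then
        printf '%s\n' "$mode" > "$sidecar"   # first run: remember the mode
        return 0
    fi
    recorded=$(cat "$sidecar")
    [ "$recorded" = "$mode" ]                # later runs: mode must match
}

demo_dir=$(mktemp -d)
check_pma_mode "$demo_dir/data.pma" mpfr    && echo "first run with -M: ok"
check_pma_mode "$demo_dir/data.pma" mpfr    && echo "second run with -M: ok"
check_pma_mode "$demo_dir/data.pma" regular || echo "refusing to run without -M"
```

A real wrapper would call the helper first and start 'gawk' (with or
without '-M') only when it succeeds.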

   Terence Kelly has provided a separate 'Persistent-Memory 'gawk' User
Manual' document, which is included in the 'gawk' distribution.  It is
worth reading.  *Note General Introduction: (pm-gawk)Top.

   Here are additional articles and web links that provide more
information about persistent memory and why it's useful in a scripting
language like 'gawk'.

     The canonical source for Terence Kelly's Persistent Memory
     Allocator (PMA) always has the latest source code and user manual.
     Kelly may be reached directly by email.

'Persistent Memory Allocation'
     Terence Kelly, Zi Fan Tan, Jianan Li, and Haris Volos, ACM 'Queue'
     magazine, Vol.  20 No.  2 (March/April 2022), PDF
     (https://dl.acm.org/doi/pdf/10.1145/3534855), HTML
     (https://queue.acm.org/detail.cfm?id=3534855).  This paper explains
     the design of the PMA allocator used in persistent 'gawk'.

'Persistent Scripting'
     Zi Fan Tan, Jianan Li, Haris Volos, and Terence Kelly, Non-Volatile
     Memory Workshop (NVMW) 2022.  This
     paper motivates and describes a research prototype of persistent
     'gawk' and presents performance evaluations on Intel Optane
     non-volatile memory; note that the interface differs slightly.

'Persistent Memory Programming on Conventional Hardware'
     Terence Kelly, ACM 'Queue' magazine Vol.  17 No.  4 (July/Aug
     2019), PDF (https://dl.acm.org/doi/pdf/10.1145/3358955.3358957),
     HTML (https://queue.acm.org/detail.cfm?id=3358957).  This paper
     describes simple techniques for persistent memory for C/C++ code on
     conventional computers that lack non-volatile memory hardware.

'Is Persistent Memory Persistent?'
     Terence Kelly, ACM 'Queue' magazine Vol.  18 No.  2 (March/April
     2020), PDF (https://dl.acm.org/doi/pdf/10.1145/3400899.3400902),
     HTML (https://queue.acm.org/detail.cfm?id=3400902).  This paper
     describes a simple and robust testbed for testing software against
     real power failures.

'Crashproofing the Original NoSQL Key/Value Store'
     Terence Kelly, ACM 'Queue' magazine Vol.  19 No.  4 (July/Aug
     2021), PDF (https://dl.acm.org/doi/pdf/10.1145/3487019.3487353),
     HTML (https://queue.acm.org/detail.cfm?id=3487353).  This paper
     describes a crash-tolerance feature added to GNU DBM ('gdbm').

   When Terence Kelly published his papers, his collaborators produced a
prototype integration of PMA with 'gawk'.  That version used a
(mandatory!) option '--persist=FILE' to specify the file for storing
the persistent heap.  If this option is given to the current 'gawk', it
produces a fatal error message instructing the user to use the
'GAWK_PERSIST_FILE' environment variable instead.  Except for this
paragraph, that option is undocumented.

   The prototype only supported persistent data; it did not support
persistent functions.

   As noted earlier, support for persistent memory is _experimental_.
If it becomes burdensome,(2) then the feature will be removed.

   ---------- Footnotes ----------

   (1) For reasons explained in 'README_d/README.macosx', 'gawk' is
always built as an Intel architecture executable, even on M1 systems.

   (2) Meaning, there are too many bug reports, or too many strange
differences in behavior from when 'gawk' is run normally.
