Greenmask 0.2.0 - 0.2.5 Releases

Posted on 2024-12-12 by Greenmask.io
Related Open Source

PostgreSQL database anonymization and synthetic data generation tool

These releases mark major milestones, significantly expanding Greenmask's functionality and transforming it into a simple, extensible, and reliable solution for database security, data anonymization, and everyday operations. Our goal is to build a core system that serves as the foundation for comprehensive dynamic staging environments and robust data security.

These updates introduce new features such as database subsetting, pgzip support, restoration in topological order, and refactored transformers, greatly enhancing Greenmask's flexibility to meet diverse business needs. They also include numerous fixes and improvements.

Greenmask Overview

Greenmask is a powerful open-source utility for logical database dumps, anonymization, synthetic data generation, and restoration. It is stateless and does not require any changes to your database schema. It is designed to be fast, reliable, highly customizable, and backward-compatible with existing PostgreSQL utilities.

It is perfect for:

  • Backup and Restoration: Streamline daily tasks like logical backups, table restoration after truncation, or replacing pg_dump and pg_restore with ease.
  • Anonymization and Data Masking: Simplify staging environment setup and analytical tasks by anonymizing and transforming backups, ensuring consistent, secure data for faster

Greenmask on GitHub

Notable changes

  • PostgreSQL 17 support - revised ported library to support PostgreSQL 17

  • Database Subset - a new feature that lets you define a subset of the database and so scale down the dump size (#110). It serves multiple purposes and is especially useful for testing and development environments. It supports:

    • References with NULL values - Greenmask generates a LEFT JOIN query for FK references that contain NULL values, so that those rows are included in the subset.
    • Virtual references (virtual foreign keys) - create a logical FK in Greenmask that is used in the subset dependency graph. A virtual reference can be defined for a column or an expression, allowing you to extract the value from JSON and similar types.
    • Supports circular references - Greenmask will automatically resolve circular dependencies in the subset by generating a recursive query. The query is generated with integrity checks of the subset ensuring that the data gathered from circular dependencies is consistent.
    • Fully covered with documentation including troubleshooting and examples.
    • Supports FK and PK that have more than one column (or expression).
    • Multi-cycle resolution within one strongly connected component (SCC) - Greenmask generates a recursive query for the SCC whether it contains a single cycle or multiple cycles, making the subset system universal for any database schema.
    • Supports polymorphic relationships - you can define a virtual reference for a table with polymorphic references using the polymorphic_exprs attribute and use Greenmask to generate a subset for such tables.
  • Transformation conditions - execute a defined transformation only if a specified condition is met (#133).

  • Transformation inheritance - transformations are inherited by partitioned tables and by tables with foreign keys. Define once and apply to all (#229).
  • pgzip support for faster compression and decompression - setting --pgzip can speed up the dump and restoration processes through parallel compression. In some tests, dump and restore operations ran up to 5x faster.
  • Restoration in topological order - This flag ensures that dependent tables are not restored until the tables they depend on have been restored. This is useful when you want to be notified of errors as early as possible without waiting for the entire table to be restored.
  • Insert format restoration - For a flexible restoration process, Greenmask now supports data restoration in the INSERT format. It generates the insert statements based on COPY records from the dump. You do not need to re-dump your data to use this feature; it can be defined in the restore command. The list of new features related to the INSERT format:

    • Generate INSERT statements with the ON CONFLICT DO NOTHING clause if the flag --on-conflict-do-nothing is set.
    • Error exclusion list in the config to skip certain errors and continue inserting subsequent rows from the dump.
    • Use cases - incremental dump and restoration of logical data. For example, if you have a database and want to insert data into it periodically from another source, this can be combined with the database subset and transformations to bring the target database up to date.
  • Restore data batching (#173) - By default, the COPY protocol returns the error only on transaction commit. To override this behavior, use the --batch-size flag to specify the number of rows to insert in a single batch during the COPY command. This is useful when you want to control the transaction size and how often commits happen.

  • Introduced keep_null parameter for RandomPerson transformer.

  • Introduced dynamic parameters in the transformers

    • Most transformers now support dynamic parameters where applicable.
    • Dynamic parameters are strictly enforced. If you need to cast values to another type, Greenmask provides templates and predefined cast functions accessible via cast_to. These functions cover frequent operations such as UnixTimestampToDate and IntToBool.
  • The transformation logic has been significantly refactored, making transformers more customizable and flexible than before.
  • Introduced transformation engines

    • random - generates transformer values based on pseudo-random algorithms.
    • hash - generates transformer values using hash functions. Currently, it utilizes sha3 hash functions, which are secure but perform slowly. In the stable release, there will be an option to choose between sha3 and SipHash.
  • Introduced templates for static parameter values

  • Dump retention management - Introduced retention parameters (#201) for the delete command, along with two new dump statuses: failed and in progress. A dump is considered failed if it lacks a "done" heartbeat or if its last heartbeat is more than 30 minutes old. The delete command now supports the following retention parameters:

    • --dry-run: Runs the deletion operation in test mode with verbose output, without actually deleting anything.
    • --before-date 2024-08-27T23:50:54+00:00: Deletes dumps older than the specified date. The date must be provided in RFC3339Nano format, for example: 2021-01-01T00:00:00Z.
    • --retain-recent 10: Retains the N most recent dumps, where N is specified by the user.
    • --retain-for 1w2d3h4m5s6ms7us8ns: Retains dumps for the specified duration. The format supports weeks (w), days (d), hours (h), minutes (m), seconds (s), milliseconds (ms), microseconds (us), and nanoseconds (ns).
    • --prune-failed: Prunes (removes) all dumps that have failed.
    • --prune-unsafe: Prunes dumps with "unknown-or-failed" statuses. This option only works in conjunction with --prune-failed.
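
To make the --retain-for duration format concrete, here is a minimal Python sketch that converts a value such as 1w2d3h4m5s6ms7us8ns into nanoseconds. It is an illustration only and not Greenmask's own parser: the function name parse_retention_ns is hypothetical, and the choices to accept units in any order and to reject unknown text are assumptions.

```python
import re

# Unit sizes in nanoseconds, following the units listed for --retain-for.
_UNIT_NS = {
    "w": 7 * 24 * 3600 * 10**9,
    "d": 24 * 3600 * 10**9,
    "h": 3600 * 10**9,
    "m": 60 * 10**9,
    "s": 10**9,
    "ms": 10**6,
    "us": 10**3,
    "ns": 1,
}

# Two-letter units are listed first so "6ms" is not read as "6m" plus "s".
_TOKEN = re.compile(r"(\d+)(ms|us|ns|w|d|h|m|s)")


def parse_retention_ns(text: str) -> int:
    """Return the length of a duration such as '1w2d3h' in nanoseconds.

    Hypothetical helper: Greenmask's real validation rules may differ.
    """
    if not text:
        raise ValueError("empty duration")
    pos = 0
    total = 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"invalid duration at offset {pos}: {text!r}")
        total += int(match.group(1)) * _UNIT_NS[match.group(2)]
        pos = match.end()
    return total


print(parse_retention_ns("1w2d3h4m5s6ms7us8ns"))  # 788645006007008
```

Working in integer nanoseconds avoids floating-point rounding for the smallest units; divide by 10**9 if you need seconds.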

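The idea behind restoration in topological order, listed among the notable changes above, can be sketched with Python's standard graphlib module. The table names and the fk_dependencies mapping are hypothetical, and this is not how Greenmask implements the feature internally; it only shows why referenced tables end up ahead of the tables that depend on them.

```python
from graphlib import TopologicalSorter

# Hypothetical schema: each table maps to the set of tables that its
# foreign keys reference, i.e. the tables that must be restored first.
fk_dependencies = {
    "customers": set(),
    "products": set(),
    "orders": {"customers"},
    "order_items": {"orders", "products"},
}


def restore_order(dependencies):
    """Return the tables in an order where referenced tables come first."""
    return list(TopologicalSorter(dependencies).static_order())


order = restore_order(fk_dependencies)
print(order)
```

The exact order among independent tables may vary, but every table appears after the tables it references. A cyclic dependency makes static_order raise graphlib.CycleError, so cycles need separate handling in a sketch like this one.
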
Releases list:

Links

Feel free to reach out to us if you have any questions or need assistance: