Changelog

Version 3.2

20 March 2024

enhancement Model training time improvements

Model training is now roughly 2x faster across a variety of datasets.

enhancement Synthetic data quality improvements

Synthesized’s synthetic data quality has been improved when using the Pandas interface, with better support for complex data distributions resulting in more accurate synthetic data generation.

enhancement Use configuration objects in place of a large number of arguments

Previously, a large number of arguments had to be provided when using the TableSynthesizer class methods from_data_interface and from_meta_collection. This has been simplified through the use of a pydantic data transfer object: users can now pass a single TrainConfig object when creating a TableSynthesizer instance using these methods:

v3.1

from synthesized3 import TableSynthesizer

synth = TableSynthesizer.from_data_interface(
    data_interface,
    meta_overrides=[
        {"name": "colA", "type": "ConstantMeta"},
        {"name": "colB", "type": "BooleanMeta"},
    ],
    min_num_unique=5,
)

v3.2

from synthesized3 import TableSynthesizer
from synthesized3.schema import TrainConfig

synth = TableSynthesizer.from_data_interface(
    data_interface,
    config=TrainConfig(
        meta_overrides=[
            {"name": "colA", "type": "ConstantMeta"},
            {"name": "colB", "type": "BooleanMeta"},
        ],
        min_num_unique=5,
    ),
)
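
The general pattern behind this change can be sketched with a standard-library dataclass standing in for the pydantic model. This is only an illustration of grouping arguments into one validated object, not the SDK's implementation: TrainConfigSketch, describe, the validation rules, and the default value are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class TrainConfigSketch:
    """Hypothetical stand-in for a training config object (not the SDK class)."""

    meta_overrides: list[dict[str, Any]] = field(default_factory=list)
    min_num_unique: int = 5  # illustrative default, not taken from the SDK

    def __post_init__(self) -> None:
        # Validate once, at construction time, instead of in every method.
        if self.min_num_unique < 1:
            raise ValueError("min_num_unique must be at least 1")
        for override in self.meta_overrides:
            missing = {"name", "type"} - override.keys()
            if missing:
                raise ValueError(f"override is missing keys: {sorted(missing)}")


def describe(config: TrainConfigSketch) -> str:
    """Hypothetical consumer: takes one config argument rather than many."""
    names = [o["name"] for o in config.meta_overrides]
    return f"overrides={names}, min_num_unique={config.min_num_unique}"


config = TrainConfigSketch(
    meta_overrides=[
        {"name": "colA", "type": "ConstantMeta"},
        {"name": "colB", "type": "BooleanMeta"},
    ],
    min_num_unique=5,
)
summary = describe(config)
```

Because the object is built and checked in one place, the same instance can be handed to several entry points without repeating the argument list.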

security Bumping Python version support

Python 3.8 is no longer supported, in preparation for its transition to end-of-life. Python 3.11 is now supported.

Version 3.1

26 January 2024

feature YAML config auto-generation

It is now possible to automatically generate YAML config files for datasets.

feature YAML schema and hinting

A YAML schema for config files can now be set in IDEs to enable type hinting, which makes YAML config files easier to write. For example, users can press the Tab key while editing a YAML config file to see the configuration options available for the SDK.

feature Spark DateType native support

Native support for training on and synthesizing Spark DateType columns was added, in addition to the TimestampType and TimestampNTZType data types already supported.

enhancement Faster Spark Meta Extraction

Extraction of Spark dataset meta information is now 2x faster, thanks to various performance optimisations.

enhancement Automatic Sampling

Automatic detection of very high cardinality columns was added. Such columns are now automatically modelled with SamplingModel, matching the behaviour of SDK 2.9 to minimise code-conversion impact.
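
The criterion the SDK uses to decide that a column has very high cardinality is not stated here. As a rough illustration only, one plausible heuristic compares the number of distinct values with the number of rows. The function name and both thresholds below are assumptions, not SDK behaviour.

```python
def is_high_cardinality(values, ratio_threshold=0.9, min_rows=20):
    """Return True when nearly every non-null value is distinct.

    Hypothetical heuristic: thresholds are illustrative, not SDK defaults.
    """
    non_null = [v for v in values if v is not None]
    if len(non_null) < min_rows:
        # Too few rows to judge cardinality reliably.
        return False
    distinct_ratio = len(set(non_null)) / len(non_null)
    return distinct_ratio >= ratio_threshold


# A column of unique e-mail addresses versus a small categorical column.
emails = [f"user{i}@example.com" for i in range(100)]
colours = ["red", "green", "blue"] * 40

emails_flagged = is_high_cardinality(emails)
colours_flagged = is_high_cardinality(colours)
```

Under this sketch the e-mail column is flagged and the colour column is not, which is the kind of split that would route the former to a sampling-based model.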

enhancement Automatic Enumeration

Automatic detection of enumerated columns (i.e. columns whose values increase predictably, such as ID columns) was added. Such columns are now automatically modelled with EnumerationModel, matching the behaviour of SDK 2.9 to minimise code-conversion impact.
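
To make "predictable increases in values" concrete, the sketch below treats a column as enumerated when consecutive values always differ by the same non-zero step. This is an assumed definition for illustration; the function name and the minimum-length rule are hypothetical and the SDK's actual detection logic may differ.

```python
def is_enumerated(values):
    """Return True when consecutive values differ by one constant, non-zero step.

    Hypothetical check, shown for integer columns in their stored order.
    """
    if len(values) < 3:
        # Two points always fit a constant step, so require at least three.
        return False
    step = values[1] - values[0]
    if step == 0:
        return False
    return all(b - a == step for a, b in zip(values, values[1:]))


order_ids = [1001, 1002, 1003, 1004]   # typical auto-increment ID column
tens = [10, 20, 30, 40]                # constant step other than 1
ages = [34, 27, 51, 27]                # ordinary numeric column

ids_enumerated = is_enumerated(order_ids)
tens_enumerated = is_enumerated(tens)
ages_enumerated = is_enumerated(ages)
```

A column that passes such a check can be regenerated from a start value and a step, with no need to learn a distribution.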

Version 3.0

01 December 2023

enhancement Native Spark support

Synthesized’s SDK now natively supports Spark, allowing you to generate synthetic data directly from your Spark dataframes. It also supports distributed training of models on Spark clusters, so synthetic data generation can scale to large datasets.