Changelog

Version 2.0

15 July 2022

Version 2.0 of the Python SDK. (Wheel archive).

enhancement Internal Framework Rebuild

With v2.0 the underlying framework of the SDK has been rebuilt, making it easier to extend in preparation for the wealth of new features planned for upcoming versions. The internal restructure paves the way for more native integration with a host of data sources, and also brings slight performance improvements for the majority of supported datatypes.

enhancement Documentation

The documentation pages have been revamped and improved.

feature YAML configuration for command line synthesis

A command-line synthesis feature was added in v1.10. Moving towards greater integration with CI/CD and process flows, YAML files can now also be used to specify synthesis options. This means all the Synthesized manipulations can be specified in an easy-to-write YAML file and passed to the synthesize command, allowing developers, DevOps engineers, data engineers, and the like to write synthetic data specifications in clear YAML and run them without having to touch a line of Python.

Specify a config file using the -c or --config flag followed by the path to the config file, e.g.:

$ synthesize -h
usage: synthesize [-h] [-c config.yaml] [-n N] [-s steps] [-o out_file] file

Create a synthetic copy of a given csv file.

positional arguments:
  file                  The path to the original csv file.

optional arguments:
  -h, --help            show this help message and exit
  -c config.yaml, --config config.yaml
                        Path to an optional yaml config file.
  -n N                  The number of rows to synthesize. (default: The same
                        number as the original data)
  -s steps              The number of training steps. (default: Use learning
                        manager instead)
  -o out_file, --output out_file
                        The destination path for the synthesized data.
                        (default: outputs to stdout)

The YAML file structure should look something like:

---
annotations:
  customer:
    type: person
    labels:
      fullname: name
      gender: sex
      email: mail
      username: username

type_overrides:
  card_expire_date:
    type: date_time
    date_format: '%m/%y'
  serial:
    type: formatted_string
    pattern: '\d{3}-\d{2}-\d{4}'
...
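The formatted_string pattern above is a regular expression describing the shape of the generated values. As a plain-Python illustration (the sample_formatted_string helper below is hypothetical, not part of the SDK), strings matching \d{3}-\d{2}-\d{4} can be produced like so:

```python
import random
import re

def sample_formatted_string() -> str:
    # Illustrative only: build a random string matching \d{3}-\d{2}-\d{4}
    groups = (3, 2, 4)
    return "-".join(
        "".join(random.choices("0123456789", k=n)) for n in groups
    )

sample = sample_formatted_string()
assert re.fullmatch(r"\d{3}-\d{2}-\d{4}", sample) is not None
```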

breaking change Annotations

The config required for the Annotation files has been simplified: where previously the input arguments ended in _label, that suffix has now been removed, so just the keywords are required. Below is an example with the Person annotation, but the change has been made for all annotations.

v1.11

person = Person(
    name='person',
    labels=PersonLabels(
        gender_label='gender',
        title_label='title',
        firstname_label='first_name',
        lastname_label='last_name',
        email_label='email'
    )
)

v2.0

person = Person(
    name='person',
    labels=PersonLabels(
        gender='gender',
        title='title',
        firstname='first_name',
        lastname='last_name',
        email='email'
    )
)

breaking change Produce NaNs

The default value for produce_nans has been changed from False to True. Previously, the default behaviour of the SDK was to impute NaNs in the output data. After some consideration, it was decided that the default behaviour should be to most accurately represent the raw input data, NaNs included, and that imputation of NaNs is a special feature of the SDK that can be turned on at will.

To impute NaNs in the output data in v2.0, produce_nans must now be manually set to False during synthesis.

v1.11

...
synth.learn(df)

# Previously, to produce NaNs - specify parameter
synth.synthesize(1000, produce_nans=True)

# Previously, NaNs imputed by default
synth.synthesize(1000)
...

v2.0

...
synth.learn(df)

# Now, produce NaNs by default
synth.synthesize(1000)

# Now, to impute NaNs - specify parameter
synth.synthesize(1000, produce_nans=False)
...
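To make the terminology concrete: imputing means filling missing values in rather than reproducing them. A minimal pandas sketch of the distinction (illustrative only, not SDK code):

```python
import pandas as pd

raw = pd.Series([1.0, None, 3.0])

# produce_nans=True (the new default): missing values survive in the output
preserved = raw.copy()
assert preserved.isna().sum() == 1

# produce_nans=False: missing values are imputed, e.g. with the column mean
imputed = raw.fillna(raw.mean())
assert imputed.tolist() == [1.0, 2.0, 3.0]
```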

bug String nulls not cast correctly

A bug causing nulls in String category columns not to be cast properly has been fixed.

bug NaN associations with non-NaN columns

Previously, attempting a NaN association on a column with no NaNs present raised an error. A fix has been added to inform the user that there are no NaNs in the specified column and to continue the Association without that column.
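The behaviour of that fix can be sketched in plain pandas (the columns_with_nans helper below is hypothetical, not SDK code): before building the association, keep only the requested columns that actually contain NaNs and report the rest.

```python
import pandas as pd

def columns_with_nans(df: pd.DataFrame, requested: list) -> tuple:
    # Keep only the requested columns that actually contain NaNs;
    # report (rather than error on) the ones that don't.
    keep, skipped = [], []
    for col in requested:
        (keep if df[col].isna().any() else skipped).append(col)
    return keep, skipped

df = pd.DataFrame({"a": [1.0, None], "b": [1.0, 2.0]})
keep, skipped = columns_with_nans(df, ["a", "b"])
assert keep == ["a"]      # "a" contains a NaN, so the association proceeds
assert skipped == ["b"]   # "b" has no NaNs and is dropped with a warning
```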

Version 1.11

24 April 2022

Version 1.11 of the Python SDK. (Wheel archive).

bug Timedelta datatype generation

A bug causing Timedelta and NaT data generation to raise an exception in some situations has been fixed.

bug Person annotation causing error

A bug causing the Person annotation to raise an exception has been fixed.

Version 1.10

14 April 2022

Version 1.10 of the Python SDK. (Wheel archive).

feature Simple time-series synthesis

We’ve been working hard to add more advanced time-series capabilities to the SDK. This release contains the initial framework for synthesizing and assessing time-series data.

Setting DataFrame indices

synthesized.MetaExtractor.extract now has two optional arguments to specify which columns are the ID & time indices.

import pandas as pd
from synthesized import MetaExtractor
df = pd.read_csv("https://raw.githubusercontent.com/synthesized-io/datasets/master/time-series/sandp500_5yr.csv")
df_meta = MetaExtractor.extract(df, id_index="Name", time_index="date")
df_meta.set_indices(df)

The index of the DataFrame is a pd.MultiIndex, which allows the DataFrame to be neatly reformatted into a panel from which cross-sections can be taken:

df.xs("AAL")
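As a standalone illustration of the MultiIndex panel layout and the xs cross-section (toy data rather than the S&P 500 file above):

```python
import pandas as pd

# Toy panel: two entities ("AAL", "AAPL") observed on two dates
df = pd.DataFrame({
    "Name": ["AAL", "AAL", "AAPL", "AAPL"],
    "date": ["2017-01-03", "2017-01-04", "2017-01-03", "2017-01-04"],
    "close": [46.0, 45.5, 116.0, 116.6],
}).set_index(["Name", "date"])  # the index is now a pd.MultiIndex

# Cross-section: all rows for the single entity "AAL"
aal = df.xs("AAL")
assert aal["close"].tolist() == [46.0, 45.5]
```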

Time-series plots

In order to plot and compare different time-series values for different entities, we can plot time series with four different options of ShareSetting.

  1. Entities share the same plot ShareSetting.PLOT

  2. Entities have different plots but share the same x- and y-axis. ShareSetting.AXIS

  3. Entities have different plots but share the same x-axis. ShareSetting.X_AXIS

  4. No sharing. Each plot is independent. ShareSetting.NONE

For example:

# Full script
import pandas as pd
from synthesized import MetaExtractor
from synthesized.testing.plotting.series import plot_multi_index_dataframes, ShareSetting

# Account IDs to plot
categories_to_plot = [2378,  576,  704, 3818, 1972]

# Columns to plot
continuous_ids = ["balance", "index"]
categorical_ids = ["bank", "k_symbol"]
ids = continuous_ids + categorical_ids

# Load data
df_categorical = pd.read_csv("https://raw.githubusercontent.com/synthesized-io/datasets/master/time-series/transactions_sample_10k.csv")
# Reduce data down to smaller volume for processing
df_categorical = df_categorical[df_categorical.type != "VYBER"]
df_categorical = df_categorical[ids + ["account_id", "date"]]

# Extract metadata
df_categorical_meta = MetaExtractor.extract(df_categorical, id_index="account_id", time_index="date")

# Plot dataframe
plot_multi_index_dataframes(
    df_categorical,
    df_categorical_meta,
    columns_to_plot=ids,
    categories_to_group_plots=categories_to_plot,
    share_setting=ShareSetting.AXIS,
)

Figure: time-series plots of the selected accounts, with categorical columns grouped.

Synthesizing time-series with Regression Models

You can now create synthetic data using the Regression model.

feature Synthesize from the command line

After installing the SDK package with pip, users can call synthesize to create synthetic copies of csv data files directly from the command line.

Usage:

$ synthesize -h
usage: synthesize [-h] [-n N] [-s steps] [-o out_file] file

Create a synthetic copy of a given csv file.

positional arguments:
  file                  The path to the original csv file.

optional arguments:
  -h, --help            show this help message and exit
  -n N                  The number of rows to synthesize. (default: The same
                        number as the original data)
  -s steps              The number of training steps. (default: Use learning
                        manager instead)
  -o out_file, --output out_file
                        The destination path for the synthesized data.
                        (default: outputs to stdout)

bug AttributeInferenceAttackML causes OOM issues with large categorical columns

The AttributeInferenceAttackML class has been optimized to avoid allocating excessively large amounts of memory when handling categorical columns. This resolves an issue where relatively small datasets would cause out-of-memory (OOM) errors.

bug Assessor doesn’t work with null columns

Previously, the Assessor would fail when attempting to plot a dataset containing a completely empty column (NaNs only). This has been resolved.

The Assessor now returns an empty plot containing the text "NaN" for these columns.

bug Support FormatPreservingTransformer with MaskingTransformerFactory

Previously, there was no way to create synthesized.privacy.FormatPreservingTransformer using synthesized.privacy.MaskingTransformerFactory. Attempting to do so would raise an error:

ValueError: Given masking technique 'format_preserving|[abc]{3}' for column '{column}' not supported

You can now correctly create the transformer with the MaskingTransformerFactory. For example:

mtf = MaskingTransformerFactory()
df_transformer = mtf.create_transformers({"col1": r"format_preserving|\d{3}"})
fp_transformer = df_transformer._transformers[0]
assert isinstance(fp_transformer, FormatPreservingTransformer)  # True

Version 1.9

6 Feb 2022

Version 1.9 of the Python SDK. (Wheel archive).


feature Command to validate installation

After running pip install, you can now use the terminal command synth-validate to confirm the SDK is working.

This command will log licence info to the terminal and attempt to synthesize a small dataset. It should take under 1 minute to complete.

enhancement Support python 3.9

Synthesized now supports Python 3.6, 3.7, 3.8, and 3.9 on Windows, macOS, and Linux. Wheels are built and tested for all 12 platform-version combinations.