Changelog
Version 2.2
24 November 2022
Version 2.2 of the Python SDK. (Wheel archive).
enhancement Dense Layers with Batch Normalisation don’t need Bias
Dense layers can be described by

$$\mathbf{y} = f(W\mathbf{x} + \mathbf{b})$$

where $W$ and $\mathbf{b}$ are the weights and biases of the layer and $f$ is the activation function.

When batch normalisation is used, it is applied before the activation function and normalises the pre-activation $\mathbf{z} = W\mathbf{x} + \mathbf{b}$ by the mean and standard deviation of the batch. Batch normalisation also scales the output with two learned parameters $\gamma$ and $\beta$, i.e.

$$\mathrm{BN}(\mathbf{z}) = \gamma \frac{\mathbf{z} - \mathbb{E}_B[\mathbf{z}]}{\sqrt{\mathrm{Var}_B[\mathbf{z}]}} + \beta$$

The scaled $\mathrm{BN}(\mathbf{z})$ is then passed to the activation function $f$.

The expectation value of $\mathbf{z}$ over a given batch $B$, $\mathbb{E}_B[\mathbf{z}]$, is given by

$$\mathbb{E}_B[\mathbf{z}] = \mathbb{E}_B[W\mathbf{x}] + \mathbf{b}$$

Substituting this and the first equation into the expression for $\mathrm{BN}(\mathbf{z})$ gives

$$\mathrm{BN}(\mathbf{z}) = \gamma \frac{W\mathbf{x} + \mathbf{b} - \mathbb{E}_B[W\mathbf{x}] - \mathbf{b}}{\sqrt{\mathrm{Var}_B[W\mathbf{x}]}} + \beta = \gamma \frac{W\mathbf{x} - \mathbb{E}_B[W\mathbf{x}]}{\sqrt{\mathrm{Var}_B[W\mathbf{x}]}} + \beta$$

where the bias $\mathbf{b}$ cancels, meaning the bias has no effect on the output of the layer.
This means that dense layers followed by batch normalisation carry the unnecessary overhead of learning a bias, which takes longer to train and results in a larger overall model. This redundancy was addressed in this enhancement.
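The cancellation can be checked numerically. Below is a minimal NumPy sketch (illustrative only, not the SDK's implementation) that applies batch normalisation to a dense layer's pre-activations with and without a bias:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dense layer: pre-activations z = x @ W + b for a batch of inputs.
x = rng.normal(size=(64, 8))   # batch of 64 inputs with 8 features
W = rng.normal(size=(8, 4))    # weights
b = rng.normal(size=(4,))      # bias

def batch_norm(z, gamma=1.5, beta=0.5, eps=1e-5):
    """Normalise over the batch axis, then scale by gamma and shift by beta."""
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    return gamma * (z - mean) / np.sqrt(var + eps) + beta

out_with_bias = batch_norm(x @ W + b)
out_without_bias = batch_norm(x @ W)

# The mean subtraction removes the bias, so the two outputs match.
print(np.allclose(out_with_bias, out_without_bias))  # True
```

Because the bias is removed by the mean subtraction, the layer can simply be built without one, saving parameters and training time.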
enhancement Use value_counts instead of Moving Average in CategoricalValue
Instead of calculating a moving average of the categorical counts during training (which fluctuates), the categorical value counts are now computed once before training begins and fixed as constants during training.
This has three benefits:
- It is faster to train, as we no longer calculate a moving average.
- It is more accurate, as the counts are computed over the entire dataframe rather than being an estimate of the frequencies.
- It allows us to JIT compile the model in TensorFlow. The moving-average layer was the only TF layer that could not be JIT compiled.
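The idea can be sketched with plain pandas (an illustration of the approach, not the SDK's CategoricalValue internals): compute the category frequencies once up front and reuse them as fixed constants, rather than re-estimating them every training step:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "blue", "red", "green", "red", "blue"]})

# One pass over the full dataframe before training begins.
category_probs = df["colour"].value_counts(normalize=True)

print(category_probs["red"])   # 0.5
print(category_probs["blue"])  # ~0.333
```

Since the frequencies come from the whole dataframe, they are exact rather than a running estimate, and the resulting constants introduce no non-compilable ops into the training graph.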
enhancement TensorFlow matrix multiplication speed-up
The performance of learning and synthesizing has been improved by utilizing TensorFlow's compilation optimizations for matrix multiplication. This optimization requires configuration changes and improves the HighDimSynthesizer, TimeSeriesSynthesizer and EventSynthesizer.
feature TimeSeriesSynthesizer for regular time series and EventSynthesizer for event-based synthesis beta
In addition to tabular data, Synthesized now supports two more forms of data:
- Time series: synthesize regularly spaced time-series data.
- Event data: create synthetic event-based data.
import pandas as pd
from synthesized import TimeSeriesSynthesizer
df = pd.read_csv(...)
synth = TimeSeriesSynthesizer(
df,
id_idx="id",
time_idx="timestamp",
event_cols=["event"],
)
synth.fit(df, epochs=15, steps_per_epoch=5000)
synth.synthesize(200)
feature Add .from_df() constructor to HighDimSynthesizer
As a shortcut to quickly create a HighDimSynthesizer from a pandas.DataFrame, the .from_df() constructor has been added.
feature Optionally use StandardScalar instead of QuantileTransformer
Previously, the QuantileTransformer was always used when training any model. However, this is a highly non-linear process and can negatively impact a model's ability to impute NaN values. Now, it is possible to configure the ContinuousTransformer to optionally use a StandardScalar instead of the QuantileTransformer.
synth = HighDimSynthesizer(df_meta, config=HighDimConfig(quantile=False))
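To see why a linear scaler can be gentler, here is a minimal NumPy sketch of standard scaling, the affine transform behind scalers of this kind (illustrative only, not the SDK's ContinuousTransformer): unlike a rank-based quantile transform, it is a simple affine map that preserves the shape of the data and is exactly invertible.

```python
import numpy as np

x = np.array([1.0, 2.0, 2.5, 4.0, 10.0])

# Standard scaling: an affine map z = (x - mean) / std ...
mean, std = x.mean(), x.std()
z = (x - mean) / std

# ... which is exactly invertible by the same linear formula,
# unlike a highly non-linear, rank-based quantile transform.
x_roundtrip = z * std + mean
print(np.allclose(x, x_roundtrip))  # True
```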
feature Optionally show the training metrics with the progress callbacks
It is now possible to set 3 different levels of verbosity (0, 1, 2) for the training progress of HighDimSynthesizer:
synth.learn(df, verbose=0)
bug Histogram probabilities do not sum to 1
When synthesizing some forms of categorical data, an error was thrown due to the Histogram module not pulling through the correct probabilities for categories to appear. This has now been fixed.
Version 2.1
5 August 2022
Version 2.1 of the Python SDK. (Wheel archive).
feature PyPI integration
Synthesized is now available for install via PyPI! See Installation.
feature 30 Day Trial Licence
Synthesized now supports a free 30-day trial licence which can be requested on import of synthesized or by running the synth-validate CLI command. See Setting the licence key.
Version 2.0
15 July 2022
Version 2.0 of the Python SDK. (Wheel archive).
enhancement Internal Framework Rebuild
With v2.0 the underlying framework of the SDK has been rebuilt, making it easier to extend in preparation for a wealth of new features planned for upcoming versions. The internal restructure paves the way for more native integration with a host of datasources, as well as providing some slight performance improvements with the majority of supported datatypes.
feature YAML configuration for command line synthesis
Previously, in v1.10, a command line synthesis feature was added. Moving towards greater integration with CI/CD and process flows, YAML files can now also be used to specify synthesis feature options. This means all the Synthesized manipulations can be specified in an easy-to-write YAML file and passed to the synthesize command, allowing developers, devops engineers, data engineers, and the like to write synthetic data specifications in clear YAML and run them without having to touch a line of Python.
Specify a config file using the -c or --config flag followed by the name of the config file, e.g.:
$ synthesize -h
usage: synthesize [-h] [-c config.yaml] [-n N] [-s steps] [-o out_file] file
Create a synthetic copy of a given csv file.
positional arguments:
file The path to the original csv file.
optional arguments:
-h, --help show this help message and exit
-c config.yaml, --config config.yaml
Path to an optional yaml config file.
-n N The number of rows to synthesize. (default: The same
number as the original data)
-s steps The number of training steps. (default: Use learning
manager instead)
-o out_file, --output out_file
The destination path for the synthesized data.
(default: outputs to stdout)
The YAML file structure should look something like:
---
annotations:
customer:
type: person
labels:
fullname: name
gender: sex
email: mail
username: username
type_overrides:
card_expire_date:
type: date_time
date_format: '%m/%y'
serial:
type: formatted_string
pattern: '\d{3}-\d{2}-\d{4}'
...
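As an illustration of what a formatted_string pattern describes (this sketch is not the SDK's generator), synthetic values are drawn so that each matches the configured regular expression, e.g. \d{3}-\d{2}-\d{4} above:

```python
import random
import re

def sample_formatted_string(rng: random.Random) -> str:
    """Draw one string matching the pattern \\d{3}-\\d{2}-\\d{4} (illustrative)."""
    digits = lambda n: "".join(str(rng.randrange(10)) for _ in range(n))
    return f"{digits(3)}-{digits(2)}-{digits(4)}"

rng = random.Random(0)
sample = sample_formatted_string(rng)
print(re.fullmatch(r"\d{3}-\d{2}-\d{4}", sample) is not None)  # True
```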
breaking change Annotations
The config required for the Annotation files has been simplified. Where previously the input arguments ended in
_label
, now the _label
ending has been removed so just the keywords are required. Below is an example with the
Person annotations, but the change has been made for all annotations.
v1.11: Person(fullname_label="fullname", gender_label="gender")
v2.0: Person(fullname="fullname", gender="gender")
breaking change Produce NaNs
The default value for produce_nans
has been changed from False
to True
.
Previously, the default behaviour of the SDK was to impute NaNs in the output data. After some consideration, it was
decided that the default behaviour should be to most accurately represent the raw input data, NaNs included, and that
imputation of NaNs is a special feature of the SDK that can be turned on at will.
To ensure NaNs are imputed in the output data in v2.0, produce_nans must now be manually set to False during synthesis.
v1.11: synth.synthesize(num_rows=1000)
v2.0: synth.synthesize(num_rows=1000, produce_nans=False)
Version 1.11
24 April 2022
Version 1.11 of the Python SDK. (Wheel archive).
Version 1.10
14 April 2022
Version 1.10 of the Python SDK. (Wheel archive).
feature Simple time-series synthesis
We’ve been working hard to add more advanced time-series capabilities to the SDK. This release contains the initial framework for synthesizing and assessing time-series data.
Setting DataFrame indices
MetaExtractor.extract now has two optional arguments to specify which columns are the ID & time indices.
import pandas as pd
from synthesized import MetaExtractor
df = pd.read_csv("https://raw.githubusercontent.com/synthesized-io/datasets/master/time-series/sandp500_5yr.csv")
df_meta = MetaExtractor.extract(df, id_index="Name", time_index="date")
df_meta.set_indices(df)
The index of the DataFrame is a pd.MultiIndex and allows the DataFrame to be neatly reformatted into a panel from which cross-sections can be taken:
df.xs("AAL")
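The cross-section behaviour can be shown with a small toy pd.MultiIndex frame (illustrative data, not the S&P 500 dataset above):

```python
import pandas as pd

# A toy panel indexed by (entity id, date), mirroring the id/time indices above.
df = pd.DataFrame(
    {"close": [10.0, 11.0, 20.0, 21.0]},
    index=pd.MultiIndex.from_tuples(
        [("AAL", "2013-02-08"), ("AAL", "2013-02-11"),
         ("AAPL", "2013-02-08"), ("AAPL", "2013-02-11")],
        names=["Name", "date"],
    ),
)

# .xs() takes the cross-section for one entity, dropping that index level,
# leaving a frame indexed by date alone.
aal = df.xs("AAL")
print(list(aal["close"]))  # [10.0, 11.0]
```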
Time-series plots
In order to plot and compare different time-series values for different entities, we can plot time series with four different options of ShareSetting:
- Entities share the same plot: ShareSetting.PLOT
- Entities have different plots but share the same x- and y-axes: ShareSetting.AXIS
- Entities have different plots but share the same x-axis: ShareSetting.X_AXIS
- No sharing; each plot is independent: ShareSetting.NONE
For example:
# Full script
import pandas as pd
from synthesized import MetaExtractor
from synthesized.testing.plotting.series import plot_multi_index_dataframes, ShareSetting
# Account IDs to plot
categories_to_plot = [2378, 576, 704, 3818, 1972]
# Columns to plot
continuous_ids = ["balance", "index"]
categorical_ids = ["bank", "k_symbol"]
ids = continuous_ids + categorical_ids
# Load data
df_categorical = pd.read_csv("https://raw.githubusercontent.com/synthesized-io/datasets/master/time-series/transactions_sample_10k.csv")
# Reduce data down to smaller volume for processing
df_categorical = df_categorical[df_categorical.type != "VYBER"]
df_categorical = df_categorical[ids + ["account_id", "date"]]
# Extract metadata
df_categorical_meta = MetaExtractor.extract(df_categorical, id_index="account_id", time_index="date")
# Plot dataframe
plot_multi_index_dataframes(df_categorical, df_categorical_meta, columns_to_plot=ids, categories_to_group_plots=categories_to_plot, share_setting=ShareSetting.AXES)
feature Synthesize from the command line
Calling synthesize after installing the SDK package with pip will allow users to create synthetic copies of csv data files from the command line.
Usage:
$ synthesize -h
usage: synthesize [-h] [-n N] [-s steps] [-o out_file] file
Create a synthetic copy of a given csv file.
positional arguments:
file The path to the original csv file.
optional arguments:
-h, --help show this help message and exit
-n N The number of rows to synthesize. (default: The same number as the
original data)
-s steps The number of training steps. (default: Use learning manager instead)
-o out_file, --output out_file
The destination path for the synthesized data. (default: outputs to
stdout)
bug AttributeInferenceAttackML causes OOM issues with large categorical columns
The AttributeInferenceAttackML class has been optimized to avoid allocating excessively large amounts of memory when handling categorical columns. This resolves an issue where relatively small datasets would cause out-of-memory (OOM) issues.
bug Assessor doesn’t work with null columns
Previously, the Assessor would fail when attempting to plot a dataset containing a completely empty column (NaNs only). This has been resolved.
The Assessor now returns an empty plot containing the text "NaN" for these columns.
bug Support FormatPreservingTransformer with MaskingTransformerFactory
Previously, there was no way to create a synthesized.privacy.FormatPreservingTransformer using the synthesized.privacy.MaskingTransformerFactory. Attempting to do so would raise an error:
ValueError: Given masking technique 'format_preserving|[abc]{3}' for column '{column}' not supported
You can now correctly create the transformer with the MaskingTransformerFactory. For example:
mtf = MaskingTransformerFactory()
df_transformer = mtf.create_transformers({"col1": r"format_preserving|\d{3}"})
fp_transformer = df_transformer._transformers[0]
assert isinstance(fp_transformer, FormatPreservingTransformer)  # True
Version 1.9
6 Feb 2022
Version 1.9 of the Python SDK. (Wheel archive).