Changelog#

Version 1.10 #

14 April 2022

Version 1.10 of the Python SDK (wheel archive).


🧿 feature Simple time-series synthesis#

We’ve been working hard to add more advanced time-series capabilities to the SDK. This release contains the initial framework for synthesizing and assessing time-series data.

Setting DataFrame indices#

synthesized.MetaExtractor.extract() now accepts two optional arguments, id_index and time_index, which specify the columns that hold the ID and time indices.

In [1]: import pandas as pd

In [2]: from synthesized import MetaExtractor

In [3]: df = pd.read_csv("https://raw.githubusercontent.com/synthesized-io/datasets/master/time-series/sandp500_5yr.csv")

In [4]: df_meta = MetaExtractor.extract(df, id_index="Name", time_index="date")

In [5]: df_meta.set_indices(df)
Out[5]: 
                  open   high    low  close    volume
Name date                                            
AAL  2013-02-08  15.07  15.12  14.63  14.75   8407500
     2013-02-11  14.89  15.01  14.26  14.46   8882000
     2013-02-12  14.45  14.51  14.10  14.27   8126000
     2013-02-13  14.30  14.94  14.25  14.66  10259500
     2013-02-14  14.94  14.96  13.16  13.99  31879900
...                ...    ...    ...    ...       ...
ZTS  2018-02-01  76.84  78.27  76.69  77.82   2982259
     2018-02-02  77.53  78.12  76.73  76.78   2595187
     2018-02-05  76.64  76.92  73.18  73.83   2962031
     2018-02-06  72.74  74.56  72.13  73.27   4924323
     2018-02-07  72.70  75.00  72.69  73.86   4534912

[618101 rows x 5 columns]

The index of the DataFrame is now a pd.MultiIndex, so the DataFrame can be treated as a panel from which cross sections can be taken:

In [6]: df.xs("AAL")
Out[6]: 
             open   high    low  close    volume
date                                            
2013-02-08  15.07  15.12  14.63  14.75   8407500
2013-02-11  14.89  15.01  14.26  14.46   8882000
2013-02-12  14.45  14.51  14.10  14.27   8126000
2013-02-13  14.30  14.94  14.25  14.66  10259500
2013-02-14  14.94  14.96  13.16  13.99  31879900
...           ...    ...    ...    ...       ...
2018-02-01  54.00  54.64  53.59  53.88   3623078
2018-02-02  53.49  53.99  52.03  52.10   5109361
2018-02-05  51.99  52.39  49.75  49.76   6878284
2018-02-06  49.32  51.50  48.79  51.18   6782480
2018-02-07  50.91  51.98  50.89  51.40   4845831

[1259 rows x 5 columns]
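
The same panel layout can be reproduced with plain pandas, which may help when checking what the indexed DataFrame should look like. The sketch below assumes that the indexing step behaves like DataFrame.set_index on the ID and time columns; the SDK's internals are not shown in this changelog. The toy data and the names panel, aaa and first_day are invented for illustration.

```python
import pandas as pd

# Toy data, invented purely for illustration (not taken from the S&P 500 file).
df = pd.DataFrame({
    "Name": ["AAA", "AAA", "BBB", "BBB"],
    "date": pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-01", "2020-01-02"]
    ),
    "close": [10.0, 10.5, 20.0, 19.5],
})

# Build the (ID, time) MultiIndex by hand with plain pandas.
panel = df.set_index(["Name", "date"]).sort_index()

# Cross section for one entity: the "Name" level is dropped, "date" remains.
aaa = panel.xs("AAA")

# Cross section for one date across all entities: "Name" remains as the index.
first_day = panel.xs(pd.Timestamp("2020-01-01"), level="date")
```

Taking the cross section on the outer level gives one entity's full history, while passing level="date" gives a snapshot of every entity at a single point in time.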

Time-series plots#

To plot and compare time-series values across different entities, plot_multi_index_dataframes accepts one of four ShareSetting options:

  1. Entities share the same plot. ShareSetting.PLOT

  2. Entities have different plots but share the same x- and y-axis. ShareSetting.AXIS

  3. Entities have different plots but share the same x-axis. ShareSetting.X_AXIS

  4. No sharing. Each plot is independent. ShareSetting.NONE

For example:

# Full script
import pandas as pd
from synthesized import MetaExtractor
from synthesized.testing.plotting.series import plot_multi_index_dataframes, ShareSetting

# Account IDs to plot
categories_to_plot = [2378, 576, 704, 3818, 1972]

# Columns to plot
continuous_ids = ["balance", "index"]
categorical_ids = ["bank", "k_symbol"]
ids = continuous_ids + categorical_ids

# Load data
df_categorical = pd.read_csv("https://raw.githubusercontent.com/synthesized-io/datasets/master/time-series/transactions_sample_10k.csv")
# Reduce data down to smaller volume for processing
df_categorical = df_categorical[df_categorical.type != "VYBER"]
df_categorical = df_categorical[ids + ["account_id", "date"]]

# Extract metadata
df_categorical_meta = MetaExtractor.extract(df_categorical, id_index="account_id", time_index="date")

# Plot dataframe
plot_multi_index_dataframes(
    df_categorical,
    df_categorical_meta,
    columns_to_plot=ids,
    categories_to_group_plots=categories_to_plot,
    share_setting=ShareSetting.AXES,
)

Figure: time-series data plotted by category.
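
One way to picture the four sharing modes is as different subplot layouts. The sketch below is not the SDK's implementation: SharingMode is a hypothetical stand-in enum (its member names are assumptions), and subplot_kwargs is an illustrative helper that maps each mode onto the nrows, sharex and sharey arguments of matplotlib's plt.subplots.

```python
from enum import Enum


class SharingMode(Enum):
    """Hypothetical stand-in for the SDK's ShareSetting (names are assumed)."""
    PLOT = "plot"            # all entities drawn on one plot
    BOTH_AXES = "both_axes"  # one plot per entity, shared x- and y-axis
    X_AXIS = "x_axis"        # one plot per entity, shared x-axis only
    NONE = "none"            # one plot per entity, nothing shared


def subplot_kwargs(mode: SharingMode, n_entities: int) -> dict:
    """Translate a sharing mode into keyword arguments for plt.subplots."""
    if n_entities < 1:
        raise ValueError("n_entities must be at least 1")
    if mode is SharingMode.PLOT:
        # A single set of axes holds every entity.
        return {"nrows": 1, "sharex": False, "sharey": False}
    if mode is SharingMode.BOTH_AXES:
        return {"nrows": n_entities, "sharex": True, "sharey": True}
    if mode is SharingMode.X_AXIS:
        return {"nrows": n_entities, "sharex": True, "sharey": False}
    # SharingMode.NONE: every plot is independent.
    return {"nrows": n_entities, "sharex": False, "sharey": False}


kwargs = subplot_kwargs(SharingMode.X_AXIS, n_entities=5)
```

With matplotlib installed, the result could be passed on as plt.subplots(**kwargs) to create the corresponding grid of axes.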

Synthesizing time-series with Regression Models#

You can now create synthetic data using the Regression model.

🧿 feature Synthesize from the command line#

After the SDK package has been installed with pip, the synthesize command creates synthetic copies of CSV data files from the command line.

Usage:

$ synthesize -h
usage: synthesize [-h] [-n N] [-s steps] [-o out_file] file

Create a synthetic copy of a given csv file.

positional arguments:
file                  The path to the original csv file.

optional arguments:
-h, --help            show this help message and exit
-n N                  The number of rows to synthesize. (default: The same number as the
                        original data)
-s steps              The number of training steps. (default: Use learning manager instead)
-o out_file, --output out_file
                        The destination path for the synthesized data. (default: outputs to
                        stdout)
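
When the command is driven from a script, the options in the help text above can be assembled programmatically. The sketch below only builds the argument list and uses just the flags shown in the usage message. The helper build_synthesize_command and the file names original.csv and synthetic.csv are hypothetical, and shlex.join requires Python 3.8 or later.

```python
import shlex
from typing import List, Optional


def build_synthesize_command(
    csv_path: str,
    n_rows: Optional[int] = None,
    steps: Optional[int] = None,
    out_file: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list matching the usage line of `synthesize -h`."""
    cmd = ["synthesize"]
    if n_rows is not None:
        cmd += ["-n", str(n_rows)]   # number of rows to synthesize
    if steps is not None:
        cmd += ["-s", str(steps)]    # number of training steps
    if out_file is not None:
        cmd += ["-o", out_file]      # destination path (default: stdout)
    cmd.append(csv_path)             # positional argument: the original CSV
    return cmd


cmd = build_synthesize_command("original.csv", n_rows=1000, out_file="synthetic.csv")
print(shlex.join(cmd))  # synthesize -n 1000 -o synthetic.csv original.csv
```

On a machine where the SDK is installed, the list could then be run with subprocess.run(cmd, check=True).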

🧿 bug AttributeInferenceAttackML causes OOM issues with large categorical columns#

The AttributeInferenceAttackML has been optimized to avoid allocating excessively large amounts of memory when handling categorical columns. This resolves an issue where even relatively small datasets could trigger out-of-memory (OOM) errors.

🧿 bug Assessor doesn’t work with null columns#

Previously, the Assessor would fail when attempting to plot a dataset containing a completely empty column (NaNs only). This has been resolved.

The Assessor now returns an empty plot containing the text “NaN” for these columns.
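
To find out in advance which columns of a dataset fall into this case, a completely empty column can be detected with plain pandas. This is a minimal sketch on an invented toy frame; it does not show how the Assessor itself performs the check.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration: "empty" holds only missing values.
df = pd.DataFrame({
    "amount": [1.0, 2.5, np.nan],
    "empty": [np.nan, np.nan, np.nan],
})

# A column is completely empty when every one of its entries is missing.
all_nan_columns = df.columns[df.isna().all()].tolist()
print(all_nan_columns)  # ['empty']
```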

🧿 bug Support FormatPreservingTransformer with MaskingTransformerFactory#

Previously, there was no way to create a FormatPreservingTransformer using MaskingTransformerFactory. Attempting to do so would raise an error:

ValueError: Given masking technique 'format_preserving|[abc]{3}' for column '{column}' not supported

You can now create the transformer with the MaskingTransformerFactory. For example:

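# Build the transformers from a column -> masking technique mapping
mtf = MaskingTransformerFactory()
df_transformer = mtf.create_transformers({"col1": r"format_preserving|\d{3}"})
# The first (and only) transformer should be format-preserving
fp_transformer = df_transformer._transformers[0]
assert isinstance(fp_transformer, FormatPreservingTransformer)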
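
The masking technique string in the example combines a technique name and a regular expression, separated by a pipe. The sketch below uses only the standard library to show how such a string can be read and which values the pattern \d{3} accepts. Splitting on the first pipe is an assumption about the format, not a description of the SDK's parser.

```python
import re

spec = r"format_preserving|\d{3}"

# Split on the first "|" only, so the regex part may itself contain "|".
technique, _, pattern = spec.partition("|")

regex = re.compile(pattern)
print(technique)                     # format_preserving
print(bool(regex.fullmatch("042")))  # True: exactly three digits
print(bool(regex.fullmatch("42a")))  # False: contains a non-digit
```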

Version 1.9 #

6 February 2022

Version 1.9 of the Python SDK (wheel archive).


🧿 feature Command to validate installation#

After running pip install, you can now use the terminal command synth-validate to confirm the SDK is working.

This command will log licence info to the terminal and attempt to synthesize a small dataset. It should take under 1 minute to complete.

🧿 enhancement Support python 3.9#

Synthesized now supports Python 3.6, 3.7, 3.8, and 3.9 on Windows, macOS and Linux. Wheels are built and tested for all 12 combinations of Python version and operating system.