synthesized.complex.TwoTableSynthesizer

class TwoTableSynthesizer(df_metas, keys, relation=None)

Synthesizer that can learn to generate data from two tables.

The tables can have a one-to-one or one-to-many relationship and are linked with a single primary and foreign key. This ensures that the generated data:

  1. Keeps the referential integrity between the two tables, i.e. the primary and foreign keys can be joined.

  2. Accurately generates the distribution of the foreign key counts in the case of a one-to-many relationship.
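The two properties above can be checked on any pair of tables with plain pandas. The following is a minimal sketch, not part of the library: the helper name `check_two_table_properties` and the toy data are assumptions made for illustration.

```python
import pandas as pd


def check_two_table_properties(df_parent, df_child, primary_key, foreign_key):
    """Hypothetical helper: return (orphan_count, counts_per_parent)."""
    # Referential integrity: child rows whose foreign key has no matching parent row.
    orphan_count = int((~df_child[foreign_key].isin(df_parent[primary_key])).sum())
    # Foreign key counts: number of child rows per parent row (0 if a parent has none).
    counts = (
        df_child[foreign_key]
        .value_counts()
        .reindex(df_parent[primary_key], fill_value=0)
    )
    return orphan_count, counts


# Toy data, invented for this example.
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
transactions = pd.DataFrame(
    {"transaction_id": [10, 11, 12], "customer_id": [1, 1, 3]}
)

orphans, counts = check_two_table_properties(
    customers, transactions, "customer_id", "customer_id"
)
print(orphans)          # child rows without a parent
print(counts.tolist())  # child rows per customer, in customer order
```

Running the same helper on the original and the synthetic tables and comparing the two `counts` distributions is one way to inspect the second property.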

It is assumed that each row in the first table is a unique entity, such as a customer, with no duplicates. The second table must relate to the first through a foreign key that matches the primary key, e.g. the history of purchases for each customer.

Parameters
  • df_metas (Tuple[DataFrameMeta, DataFrameMeta]) – A tuple of the extracted DataFrameMeta for the two tables. Both tables must have a unique primary key, and the second table is assumed to have a many-to-one relationship with the first table through a single corresponding foreign key.

  • keys (Tuple[str, str]) – The column names that identify the primary key of each table. The primary key of the first table must exist in the second table as a foreign key.

  • relation (Dict[str, str], Optional) – A dictionary that maps the primary key column name of the first table to the foreign key column name in the second table if they are not identical. Defaults to None.
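To make the roles of keys and relation concrete, here is a small pandas-only sketch for the case where the foreign key column is named differently from the primary key. The column name `cust_id` and the toy data are assumptions, and the lookup logic only illustrates the naming convention described above; how the library resolves relation internally is not shown here.

```python
import pandas as pd

# Toy data, invented for this example. The foreign key in the second table
# is called "cust_id" rather than "customer_id".
customers = pd.DataFrame({"customer_id": [1, 2], "age": [34, 51]})
transactions = pd.DataFrame(
    {
        "transaction_id": [10, 11, 12],
        "cust_id": [1, 1, 2],
        "amount": [5.0, 7.5, 3.2],
    }
)

keys = ("customer_id", "transaction_id")  # primary key of each table
relation = {"customer_id": "cust_id"}     # primary key name -> foreign key name

# Resolve the foreign key name, falling back to the primary key name
# when no relation is given (the relation=None case).
primary_key = keys[0]
foreign_key = relation.get(primary_key, primary_key) if relation else primary_key

# Both primary keys should be unique before training.
keys_are_unique = customers[keys[0]].is_unique and transactions[keys[1]].is_unique

# The two tables join on the resolved column names.
joined = customers.merge(
    transactions, left_on=primary_key, right_on=foreign_key, how="inner"
)
print(keys_are_unique, len(joined))
```

With identically named columns, relation can be left as None and the same join works with `foreign_key == primary_key`.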

Example

Load two tables that have a primary and foreign key relation.

>>> df_customer = pd.read_csv('customer_table.csv')
>>> df_transactions = pd.read_csv('transaction_table.csv')

Extract the DataFrameMeta for each table:

>>> df_metas = (MetaExtractor.extract(df_customer), MetaExtractor.extract(df_transactions))

Initialise the TwoTableSynthesizer. The column names of the primary keys of each table are specified using the keys parameter. The foreign key in df_transactions is ‘customer_id’, which has the same column name as the primary key in df_customer:

>>> synthesizer = TwoTableSynthesizer(df_metas=df_metas, keys=('customer_id', 'transaction_id'))

Train the Synthesizer:

>>> synthesizer.learn(df_train=(df_customer, df_transactions))

Generate 1000 rows of new data:

>>> df_customer_synthetic, df_transaction_synthetic = synthesizer.synthesize(num_rows=1000)

Methods

__init__(df_metas, keys[, relation])

Initialize self.

learn(df_train[, num_iterations, callback, …])

Train the TwoTableSynthesizer.

synthesize(num_rows[, produce_nans, …])

Generate the given number of new data rows for table 1, and the associated rows of table 2.

learn(df_train, num_iterations=None, callback=None, callback_freq=0)

Train the TwoTableSynthesizer.

Parameters
  • df_train (Tuple[pd.DataFrame]) – The training data for each table.

  • callback (Optional[Callable[[Synthesizer, int, dict], bool]]) – A callback function, e.g. for logging purposes. Takes the synthesizer instance, the iteration number, and a dictionary of values (usually the losses) as arguments. Aborts training if the return value is True.

  • callback_freq (int) – Callback frequency.

  • num_iterations (Optional[int]) –

Return type

None
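A callback matching the signature described above might look like the following sketch. It stops training once a monitored loss has not improved for a few consecutive calls. The factory name `make_early_stopping_callback` and the dictionary key `"total-loss"` are hypothetical; the actual keys of the values dictionary are not documented here and should be inspected first.

```python
from typing import Any, Callable, Dict


def make_early_stopping_callback(
    loss_key: str, patience: int
) -> Callable[[Any, int, Dict[str, Any]], bool]:
    """Hypothetical factory: build a callback that returns True (abort)
    after `patience` consecutive calls without improvement."""
    best = float("inf")
    stale = 0

    def callback(synthesizer: Any, iteration: int, values: Dict[str, Any]) -> bool:
        nonlocal best, stale
        loss = float(values[loss_key])  # assumes the value converts to float
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
        return stale >= patience  # True aborts training

    return callback


# Simulate the calls a training loop would make ("total-loss" is an assumed key).
callback = make_early_stopping_callback("total-loss", patience=2)
decisions = [
    callback(None, i, {"total-loss": loss})
    for i, loss in enumerate([1.0, 0.8, 0.9, 0.85])
]
print(decisions)
```

Such a function would be passed as the callback argument of learn, together with a callback_freq value.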

synthesize(num_rows, produce_nans=False, progress_callback=None)

Generate the given number of new data rows for table 1, and the associated rows of table 2.

Parameters
  • num_rows (int) – The number of rows to generate.

  • produce_nans (bool) – Whether to produce NaNs.

  • progress_callback (Optional[Callable[[int], None]]) – Progress bar callback.

Returns

df_1 (pd.DataFrame): The generated data for table 1. df_2 (pd.DataFrame): The generated data for table 2.

Return type

Tuple[pd.DataFrame, pd.DataFrame]
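A progress_callback only needs to accept a single integer and return nothing. The sketch below records and prints whatever it is given. What the integer represents (for example a percentage or a row count) is not stated here, so the values in the simulated loop are placeholders, and `make_progress_recorder` is a hypothetical helper name.

```python
from typing import Callable, List, Tuple


def make_progress_recorder() -> Tuple[Callable[[int], None], List[int]]:
    """Hypothetical helper: return a progress callback and the list it fills."""
    seen: List[int] = []

    def progress_callback(value: int) -> None:
        seen.append(value)
        print(f"progress: {value}")

    return progress_callback, seen


progress_callback, seen = make_progress_recorder()

# Stand-in for the calls the synthesizer would make during synthesize();
# the values are placeholders.
for value in (0, 50, 100):
    progress_callback(value)
```

The returned function can then be passed as the progress_callback argument of synthesize.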