This tutorial assumes that you have already installed the Synthesized package. While it is not required to have read the single table synthesis tutorial, it is highly recommended that you do so, as it outlines best practices when generating and evaluating synthetic data. It is assumed that you are also familiar with the format of a dataset with temporal correlations as laid out in the documentation.
In this tutorial, the EventSynthesizer is introduced as a means to generate a dataset where there is a temporal component. In particular, the EventSynthesizer is used where there are irregular time intervals between consecutive events or measurements.
In the time series tutorial the TimeSeriesSynthesizer was introduced as a means of generating data with a regular interval between events. This model excels at learning the correlations between temporal features and measurements (e.g. between the day of the week and the number of workers at an office). The EventSynthesizer is a complementary model that learns relationships between events in processes, focusing in particular on the correlations between a sequence of events and the interval between the events (e.g. that if event A occurs then event B is likely to occur 3 days later).
In this tutorial, the EventSynthesizer will be trained with a patient pathway dataset, detailing a sequence of events for around 20 patients as well as their age and gender. This dataset has been modified from a publicly available one on Kaggle.
import pandas as pd

df = pd.read_csv("restricted_patient_pathways.csv")
df
             UID  Age Gender        Date Event_Code
0    Id_e45a846f   26      M  2011-02-01       V035
1    Id_e45a846f   26      M  2011-02-01       8830
2    Id_e45a846f   26      M  2011-02-01       0549
3    Id_e45a846f   26      M  2011-02-01       0539
4    Id_e45a846f   26      M  2011-02-01       5276
..           ...  ...    ...         ...        ...
359  Id_e45d67f9   63      M  2013-06-01       4610
360  Id_e45d67f9   63      M  2013-06-01       9921
361  Id_e45d67f9   63      M  2013-06-01       V035
362  Id_e45d67f9   63      M  2013-06-01       9921
363  Id_e45d67f9   63      M  2013-11-01       5276

[364 rows × 5 columns]
As with every use case, it is best practice to perform an exploratory data analysis (EDA) prior to using any of the features of the SDK. In a dataset such as this, EDA may involve some form of process mining to understand details of each patient’s event log. For many tasks it is unnecessary to obtain a full process map and it suffices to identify aggregated features such as the most frequent start and end events, possible pairs of consecutive events and average path lengths. In this case, all event logs begin with the same event:
             UID  Age Gender        Date Event_Code
0    Id_e45a846f   26      M  2011-02-01       V035
21   Id_e45af97e   60      M  2011-02-01       V035
36   Id_e45c0a2a   51      M  2011-04-01       V035
54   Id_e45c3054   43      M  2011-03-01       V035
72   Id_e45c56f9   29      M  2011-04-01       V035
92   Id_e45c5768   70      F  2011-04-01       V035
110  Id_e45ca513   43      F  2011-04-01       V035
127  Id_e45ca516   50      F  2011-07-01       V035
146  Id_e45ca518   56      F  2011-12-01       V035
169  Id_e45ca521   61      M  2011-10-01       V035
186  Id_e45ca555   22      F  2013-04-01       V035
204  Id_e45ccbe5   39      F  2011-05-01       V035
219  Id_e45ccc02   54      M  2011-09-01       V035
236  Id_e45cf32c   78      M  2011-12-01       V035
256  Id_e45d19b0   51      F  2011-07-01       V035
270  Id_e45d19fb   29      M  2011-12-01       V035
289  Id_e45d4113   54      M  2012-12-01       V035
303  Id_e45d4114   54      F  2011-11-01       V035
317  Id_e45d4115   74      F  2012-01-01       V035
337  Id_e45d67c4   29      M  2011-02-01       V035
348  Id_e45d67f9   63      M  2012-05-01       V035
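Aggregate features like these can be computed with plain pandas. As a hedged illustration (using a small hypothetical event log rather than the restricted dataset itself), the start events, end events and average path length are obtained from a groupby over the entity identifier:

```python
import pandas as pd

# Hypothetical toy event log standing in for the restricted dataset
log = pd.DataFrame({
    "UID":        ["a", "a", "a", "b", "b"],
    "Event_Code": ["V035", "8830", "5276", "V035", "9921"],
})

grouped = log.groupby("UID")["Event_Code"]
start_events = grouped.first().value_counts()  # most frequent start events
end_events = grouped.last().value_counts()     # most frequent end events
avg_path_length = grouped.size().mean()        # average events per entity

print(start_events.index[0], avg_path_length)  # V035 2.5
```

This relies on the rows of each entity already being in chronological order, as they are in the patient pathway dataset; otherwise a sort on the timestamp column would be needed first.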
This tutorial will be concerned with generating a dataset that contains new entities and produces realistic event data, respecting the logic that precludes impossible pairs of events occurring consecutively. In addition, PII such as patient IDs and age will be masked to make the generated dataset compliant with any governance requirements.
Where masking is required, it is recommended that the privacy masks are applied to the original dataset prior to any training.
In this case, the combination of "Age" and "Gender" could be used to identify a patient (as could the value of "UID"); however, we will not mask the "UID" values before synthesis as they are required by the EventSynthesizer when specifying entities. For more information on this see the documentation on time series column specification. (At any rate, the "UID" column will not be produced when new entities are synthesized.)
To generalize the "Age" column, the RoundingMask is used along with the number of bins to bucket the data into:
from synthesized.privacy import RoundingMask

rounding_mask = RoundingMask("Age", bins=5)
df_masked = rounding_mask.fit_transform(df.copy())
df_masked
             UID             Age Gender        Date Event_Code
0    Id_e45a846f  (21.999, 29.0]      M  2011-02-01       V035
1    Id_e45a846f  (21.999, 29.0]      M  2011-02-01       8830
2    Id_e45a846f  (21.999, 29.0]      M  2011-02-01       0549
3    Id_e45a846f  (21.999, 29.0]      M  2011-02-01       0539
4    Id_e45a846f  (21.999, 29.0]      M  2011-02-01       5276
..           ...             ...    ...         ...        ...
359  Id_e45d67f9    (54.0, 63.0]      M  2013-06-01       4610
360  Id_e45d67f9    (54.0, 63.0]      M  2013-06-01       9921
361  Id_e45d67f9    (54.0, 63.0]      M  2013-06-01       V035
362  Id_e45d67f9    (54.0, 63.0]      M  2013-06-01       9921
363  Id_e45d67f9    (54.0, 63.0]      M  2013-11-01       5276

[364 rows × 5 columns]
For a complete list of the masking capabilities available as part of the SDK see our documentation.
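Conceptually, this kind of generalization is interval binning: exact values are replaced with the band they fall into. A minimal pandas-only sketch of the idea (the exact bin-edge strategy RoundingMask uses may differ, so the intervals below will not necessarily match the tutorial output):

```python
import pandas as pd

# Hypothetical ages similar to those in the pathway dataset
ages = pd.Series([26, 60, 51, 43, 29, 70, 43, 50, 56, 61, 22,
                  39, 54, 78, 51, 29, 54, 54, 74, 29, 63])

# Replace exact ages with one of five interval bands
binned = pd.cut(ages, bins=5)
print(binned.nunique())  # 5 generalized age bands
```

After binning, no individual age survives in the column, only the interval it belongs to, which is what makes the column safer to release.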
In order to ensure that the business logic within the dataset is respected during training and generation, the Association class can be leveraged to preclude impossible pairs of events happening consecutively. Association instances are applied between columns with a well defined relationship; in this case, however, the aim is to associate pairs of events within a row. Therefore, some preprocessing of the dataset is required before the Association class can be used here. In particular, in order to associate an event with the one immediately following it, an additional column needs to be created, shifted such that an event and the one immediately following it appear in the same row:
df_masked["Event_Code_n+1"] = df_masked.groupby(["UID"])["Event_Code"].shift(-1, fill_value="end")
df_masked = df_masked.reset_index(drop=True)
df_masked
             UID             Age Gender        Date Event_Code Event_Code_n+1
0    Id_e45a846f  (21.999, 29.0]      M  2011-02-01       V035           8830
1    Id_e45a846f  (21.999, 29.0]      M  2011-02-01       8830           0549
2    Id_e45a846f  (21.999, 29.0]      M  2011-02-01       0549           0539
3    Id_e45a846f  (21.999, 29.0]      M  2011-02-01       0539           5276
4    Id_e45a846f  (21.999, 29.0]      M  2011-02-01       5276           0539
..           ...             ...    ...         ...        ...            ...
359  Id_e45d67f9    (54.0, 63.0]      M  2013-06-01       4610           9921
360  Id_e45d67f9    (54.0, 63.0]      M  2013-06-01       9921           V035
361  Id_e45d67f9    (54.0, 63.0]      M  2013-06-01       V035           9921
362  Id_e45d67f9    (54.0, 63.0]      M  2013-06-01       9921           5276
363  Id_e45d67f9    (54.0, 63.0]      M  2013-11-01       5276            end

[364 rows × 6 columns]
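The importance of grouping before shifting can be seen on a toy frame: without the groupby, the last event of one entity would leak into the next-event column of the entity that follows it. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "UID":        ["a", "a", "b", "b"],
    "Event_Code": ["X", "Y", "P", "Q"],
})

# Grouped shift: each entity's log terminates with the "end" token
df["next_grouped"] = df.groupby("UID")["Event_Code"].shift(-1, fill_value="end")
# Ungrouped shift: entity "a" is wrongly followed by entity "b"'s first event
df["next_ungrouped"] = df["Event_Code"].shift(-1, fill_value="end")

print(df.loc[1, "next_grouped"], df.loc[1, "next_ungrouped"])  # end P
```

The grouped version therefore never fabricates an event transition across two different patients.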
Note the ordering of operations: a groupby() followed by a shift() ensures that event logs for different entities are not mixed together when the column is shifted. An Association instance is then created between Event_Code and Event_Code_n+1, ready to be used on instantiation of an EventSynthesizer:
from synthesized.metadata.rules import Association

association = Association(associations=["Event_Code", "Event_Code_n+1"])
Similar to the TimeSeriesSynthesizer, the workflow for producing time-series data is slightly different from that for producing tabular data because of a number of preprocessing steps that occur under the hood. For example, rather than first performing a meta extraction and creating the model with the resulting meta object, the meta extraction is handled by the EventSynthesizer itself. The reason for this is that an event-based dataset is formed of data from multiple entities, as outlined in our documentation concerning Time-Series Synthesis, and needs to be preprocessed before meta extraction occurs. This preprocessing ensures, amongst other things, that the interval between consecutive events is calculated and that each unique entity has the same number of time steps in the dataset. Only after this preprocessing can the metadata be extracted.
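Under stated assumptions, the two preprocessing steps just described can be sketched with plain pandas. This mimics only the general idea (interval computation and per-entity padding), not the SDK's actual internals:

```python
import pandas as pd

# Toy event log: entity "a" has three events, entity "b" has one
df = pd.DataFrame({
    "UID":  ["a", "a", "a", "b"],
    "Date": pd.to_datetime(["2011-02-01", "2011-02-01",
                            "2011-06-01", "2011-03-01"]),
})

# Interval (in days) between consecutive events of the same entity
df["interval"] = df.groupby("UID")["Date"].diff().dt.days

# Pad every entity's log to the same number of time steps with missing values
max_steps = df.groupby("UID").size().max()
padded = (
    df.groupby("UID", group_keys=False)
      .apply(lambda g: g.reset_index(drop=True).reindex(range(max_steps)))
)
print(len(padded))  # 6: two entities, three time steps each
```

Entity "b" gains two all-NaN rows, so both entities end up with the same number of time steps, which is the shape a sequence model requires.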
In order to train the model with the desired number of time steps, a DeepStateConfig is specified, which is used to configure the underlying model. The max_time_steps argument controls the maximum number of time steps to process from each unique entity. By default this is set to 100. In this case, it is set to the maximum number of time steps in the event log of a single entity:
from synthesized.config import DeepStateConfig

max_time_steps = max(df["UID"].value_counts())
config = DeepStateConfig(
    parsing_nan_fraction_threshold=0.0,
    max_time_steps=max_time_steps,
    batch_size=5,
)
The parsing_nan_fraction_threshold is used in order to correctly interpret the Event_Code column as strings. When performing dtype inference, the SDK will attempt to cast each column. If more than a certain proportion of the column can be cast to a specific dtype, then that column is interpreted as that dtype. Most of the values in Event_Code are integers, however there are a few genuine strings. Since it is desired that the values of this column are understood to be strings, and the column is to be modelled categorically, the permitted proportion of values that cannot be parsed as integers is set to 0.
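The inference rule being tuned here can be illustrated with a simplified stand-in for the SDK's internal logic (this is a sketch of the general mechanism, not the actual implementation):

```python
import pandas as pd

codes = pd.Series(["8830", "0549", "V035", "9921", "5276"])

# Fraction of values that fail to parse as numbers ("V035" cannot be cast)
parsed = pd.to_numeric(codes, errors="coerce")
nan_fraction = parsed.isna().mean()  # 0.2 here

# With a threshold of 0, a single unparsable value keeps the column as strings
threshold = 0.0
treat_as_numeric = nan_fraction <= threshold
print(nan_fraction, treat_as_numeric)
```

With a more permissive threshold (say 0.25), the same column would be cast to a numeric dtype and the string codes would be lost, which is exactly what setting the threshold to 0 prevents.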
The Association instance, along with a set of column specifications, is then used to create an instance of the EventSynthesizer:
from synthesized import EventSynthesizer

id_idx = "UID"
time_idx = "Date"
const_cols = ["Age", "Gender"]
event_cols = ["Event_Code", "Event_Code_n+1"]

df_masked[time_idx] = pd.to_datetime(df_masked[time_idx])

synth = EventSynthesizer(
    df_masked,
    id_idx,
    time_idx,
    event_cols,
    const_cols,
    config=config,
    associations=[association]
)
synth.learn()
A set of new patients can then be created using the synthesize() method:
df_synth = []
for i in range(df[id_idx].nunique()):
    df_temp = synth.synthesize(n=max(df[id_idx].value_counts()))
    df_temp.loc[:, id_idx] = len(df_temp) * [i]
    # Truncate each generated event log at the first padded (missing) value
    first_nan_idx = df_temp["Event_Code"].isna().idxmax()
    df_synth.append(df_temp.loc[:first_nan_idx - 1])
df_synth = pd.concat(df_synth).reset_index(drop=True)
df_synth
              Age Gender        Date Event_Code Event_Code_n+1  UID
0    (29.0, 50.0]      F  2011-07-01       V035           0549    0
1    (29.0, 50.0]      F  2011-07-01       0549           4610    0
2    (29.0, 50.0]      F  2011-07-01       4610           0549    0
3    (29.0, 50.0]      F  2011-07-01       0549           4610    0
4    (29.0, 50.0]      F  2011-09-16       4610           0549    0
..            ...    ...         ...        ...            ...  ...
358  (50.0, 54.0]      F  2012-06-30       0539           0549   20
359  (50.0, 54.0]      F  2012-06-30       0549           0539   20
360  (50.0, 54.0]      F  2012-06-30       0539           4097   20
361  (50.0, 54.0]      F  2012-06-30       4097           V035   20
362  (50.0, 54.0]      F  2012-06-30       V035            end   20

[363 rows × 6 columns]
While the event logs in the training data were of different lengths, a fixed number of rows, supplied as n, is required in the call to synthesize(). However, by truncating the event log for each entity at the first missing value in the postprocessing, event logs with different numbers of time steps can be generated. As mentioned above, the EventSynthesizer preprocesses the training input such that the event logs are of the same length. It does so by padding shorter event logs with missing values, and truncating longer event logs. As such, it is acceptable to interpret NaN values as an end token and use them to truncate the generated data.
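The truncation step used in the generation loop can be isolated on a toy series; isna().idxmax() returns the index of the first True in the mask, i.e. the first padded value:

```python
import pandas as pd

events = pd.Series(["V035", "8830", "5276", None, None])  # padded event log

first_nan_idx = events.isna().idxmax()  # index of the first padding value
truncated = events.loc[:first_nan_idx - 1]
print(len(truncated))  # 3 real events survive
```

One caveat worth noting: if a generated log contains no missing values at all, idxmax() returns the first index (0), so in practice a guard such as events.isna().any() is advisable before truncating.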
As discussed in the introduction, the evaluation of event-based data is highly domain specific and dependent on the downstream task at hand. Numerous open-source packages are available to aid in the task of process mining (for example, pm4py). For most data access use cases, for example where data is needed for developing feature engineering pipelines, it suffices to analyse aggregate metrics. Here, the cross tables between "Event_Code" and "Event_Code_n+1" will be analysed to confirm that the generated dataset adheres to the same business logic as the original, in the sense that events that do not occur consecutively in the original data do not occur consecutively in the generated data:
from synthesized.testing.plotting.distributions import plot_cross_tables

df_masked["Event_Code"] = df_masked["Event_Code"].astype(str)
df_masked["Event_Code_n+1"] = df_masked["Event_Code_n+1"].astype(str)
df_synth["Event_Code"] = df_synth["Event_Code"].astype(str)
df_synth["Event_Code_n+1"] = df_synth["Event_Code_n+1"].astype(str)

plot_cross_tables(df_masked, df_synth, "Event_Code", "Event_Code_n+1")
A cross table is a way of graphically showing how many times the values in two columns occur in the same row. In this case, since the two columns being compared in the cross table are Event_Code and Event_Code_n+1, the cross table shows how many times one event occurs immediately after another.
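The underlying computation is an ordinary contingency table, which pd.crosstab reproduces directly. A sketch on a hypothetical pair of event columns (the SDK's plotting helper presumably computes something similar before rendering):

```python
import pandas as pd

df = pd.DataFrame({
    "Event_Code":     ["A", "A", "B", "B", "A"],
    "Event_Code_n+1": ["B", "B", "A", "end", "B"],
})

ct = pd.crosstab(df["Event_Code"], df["Event_Code_n+1"])
# A zero cell means the row event was never immediately followed
# by the column event
print(ct.loc["A", "B"], ct.loc["A", "end"])  # 3 0
```

Comparing where the zero cells fall in the original and generated tables is exactly the business-logic check described above.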
Where there are zeros in the cross table of the original data there are zeros in the generated one, confirming that the Association has precluded impossible journeys from occurring. Note that there are a few additional zeros in the generated table where there are values in the original. This is to be expected due to the (deliberately) statistically noisy procedure of generating synthetic data. In all cases where this occurs, it is due to a very small number of occurrences of these event pairs in the training data.