Event Based Data

Event-based (sometimes called irregular) time-series data is data in which the time interval between consecutive events or measurements is not constant. An example is a set of bank transactions made by many users over a given period:

import pandas as pd
df = pd.read_csv("simple_fraud.csv")
df
         customer      age gender     merchant   amount  fraud                date
0     C1021896897  46 - 55   MALE    M17379832   559.85      1 2022-03-10 00:00:00
1     C1021896897  46 - 55   MALE   M480139044   138.93      0 2022-03-10 13:18:10
2     C1021896897  46 - 55   MALE   M855959430   187.65      1 2022-03-11 02:36:20
3     C1021896897  46 - 55   MALE   M547558035   119.71      0 2022-03-11 15:54:30
4     C1021896897  46 - 55   MALE   M480139044  1510.34      1 2022-03-12 05:12:41
...           ...      ...    ...          ...      ...    ...                 ...
7035   C989321907  46 - 55   MALE    M85975013    28.11      0 2022-04-12 19:28:27
7036   C989321907  46 - 55   MALE  M1823072687    37.82      0 2022-05-03 07:40:48
7037   C989321907  46 - 55   MALE  M1823072687     3.26      0 2022-05-04 23:35:18
7038   C989321907  46 - 55   MALE  M1823072687    25.60      0 2022-06-03 21:56:34
7039   C989321907  46 - 55   MALE    M85975013    75.15      0 2022-06-14 10:41:49

[7040 rows x 7 columns]

The dataset above lists transactions made by individual customers, together with the transaction amount, the merchant involved, and whether the transaction was fraudulent.
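The irregular spacing can be checked directly by measuring the gap between consecutive transactions of the same customer. The following is a minimal sketch using only pandas; the small sample frame is built by hand from the last five rows of the table above, and the variable names sample and gaps are illustrative only.

```python
import pandas as pd

# Five rows copied from the tail of the table above (customer C989321907).
sample = pd.DataFrame({
    "customer": ["C989321907"] * 5,
    "date": pd.to_datetime([
        "2022-04-12 19:28:27",
        "2022-05-03 07:40:48",
        "2022-05-04 23:35:18",
        "2022-06-03 21:56:34",
        "2022-06-14 10:41:49",
    ]),
})

# Time elapsed since the previous transaction of the same customer.
sample = sample.sort_values(["customer", "date"])
gaps = sample.groupby("customer")["date"].diff().dropna()
print(gaps)

# A regularly sampled series would have one distinct gap; here they all differ.
print(gaps.nunique())
```

For these rows the gaps range from under two days to almost thirty days, which is what makes the data event-based rather than regularly sampled.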

The plot below shows the transaction events over time for one particular customer, with each transaction flagged as fraudulent or not.

Real Transaction Data

As with the TimeSeriesSynthesizer, we configure the model to train over a maximum number of time steps, here set to the largest number of transactions recorded for any single customer:

from synthesized import DeepStateConfig
config = DeepStateConfig()
value_counts = df["customer"].value_counts()
config.max_time_steps = max(value_counts)

We then instantiate the EventSynthesizer with this DeepStateConfig and a specification of the columns:

from synthesized import EventSynthesizer
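synth = EventSynthesizer(
    df,
    id_idx="customer",                           # column identifying each sequence
    time_idx="date",                             # timestamp of each event
    event_cols=["merchant", "amount", "fraud"],  # values that vary per event
    const_cols=["gender", "age"],                # values constant for a given id
    config=config,
)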
synth.learn()
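Because const_cols are treated as constant for a given id, it can be worth confirming that they really are constant per customer before training. The helper below is a sketch in plain pandas and is not part of the synthesized SDK; the name non_constant_columns is hypothetical, and the sample frame is built by hand from rows in the table above.

```python
import pandas as pd

# A few rows taken from the table above; only the columns needed for the check.
sample = pd.DataFrame({
    "customer": ["C1021896897", "C1021896897", "C989321907", "C989321907"],
    "age": ["46 - 55"] * 4,
    "gender": ["MALE"] * 4,
    "amount": [559.85, 138.93, 28.11, 37.82],
})


def non_constant_columns(df, id_col, cols):
    """Return the columns in `cols` that take more than one value for some id."""
    distinct = df.groupby(id_col)[cols].nunique(dropna=False)
    return [c for c in cols if (distinct[c] > 1).any()]


# gender and age never change within a customer, so nothing is reported.
print(non_constant_columns(sample, "customer", ["gender", "age"]))

# amount changes from one transaction to the next, so it is reported.
print(non_constant_columns(sample, "customer", ["gender", "amount"]))
```

A column reported by this check is probably better placed in event_cols than in const_cols.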

Synthesizing data then follows nearly the same process as for regular time-series data, with a few differences in the arguments that should be specified:

  • n: number of new time-steps to synthesize

  • df_exogenous: Optional exogenous variables linked to the time-series. Must have exactly n rows.

  • id: This optional argument can be used to specify the unique ID of the sequence. If provided, it must correspond to an ID in the raw dataset used during training. If this argument is not specified then a random ID is sampled and time-series data is generated.

  • df_const: Constant values linked to the given id. Note that the EventSynthesizer treats the initial timestamp for a given unique identity as a constant, referring to it as f"{time_idx}_0" (date_0 in this example). The remaining columns of the DataFrame should be those provided in const_cols on instantiation. If id is provided then this argument should also be specified.

from random import randint
n = randint(min(value_counts), max(value_counts))
df_synth = synth.synthesize(n=n)

Synthetic Transaction Data

Because no id was specified, the result is an entirely new customer with its own transaction history. Note that, like the original data, the synthetic data contains fraudulent transactions.
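To generate events for a customer seen during training instead, the id and df_const arguments described earlier would be supplied. The sketch below shows one way df_const might be assembled with pandas: the const_cols values plus the customer's first timestamp under the name f"{time_idx}_0". The single-row layout is an assumption rather than something confirmed here, and the final synthesize call is left commented out because it uses only the argument names listed above, not a verified signature. The sample frame is built by hand from the first rows of the table; in practice the full df would be used.

```python
import pandas as pd

# First three rows of the table above (customer C1021896897).
sample = pd.DataFrame({
    "customer": ["C1021896897"] * 3,
    "age": ["46 - 55"] * 3,
    "gender": ["MALE"] * 3,
    "date": pd.to_datetime([
        "2022-03-10 00:00:00",
        "2022-03-10 13:18:10",
        "2022-03-11 02:36:20",
    ]),
})

id_idx, time_idx, const_cols = "customer", "date", ["gender", "age"]
customer_id = "C1021896897"

rows = sample[sample[id_idx] == customer_id]

# One row: the constant columns plus the earliest timestamp, named f"{time_idx}_0".
# ASSUMPTION: a single-row DataFrame is the expected layout for df_const.
df_const = (
    rows[const_cols]
    .iloc[[0]]
    .reset_index(drop=True)
    .assign(**{f"{time_idx}_0": rows[time_idx].min()})
)
print(df_const)

# Hypothetical call, using only the argument names listed in this section:
# df_synth = synth.synthesize(n=10, id=customer_id, df_const=df_const)
```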