Regular Time-Series Data

Regular time-series data has a constant time interval between measurements. The example dataset used here contains a year's worth of daily stock information for a subset of 25 companies in the S&P 500:

import pandas as pd
df = pd.read_csv("sp_500_subset.csv")
df
            date   open    high    low  close   volume Name
0     2013-02-08  44.55  44.630  44.25  44.57  2312792  AEP
1     2013-02-11  44.57  44.740  44.48  44.73  1372326  AEP
2     2013-02-12  44.73  44.940  44.60  44.90  1695150  AEP
3     2013-02-13  44.87  45.040  44.76  44.93  1324472  AEP
4     2013-02-14  44.76  44.810  44.41  44.77  3665271  AEP
...          ...    ...     ...    ...    ...      ...  ...
6295  2014-02-03  95.10  95.920  93.52  93.62  5448605  UPS
6296  2014-02-04  94.14  94.230  93.19  93.89  3537284  UPS
6297  2014-02-05  93.85  94.325  93.50  93.76  4595639  UPS
6298  2014-02-06  94.15  94.933  94.01  94.74  4195577  UPS
6299  2014-02-07  95.30  95.600  94.57  95.37  2898813  UPS

[6300 rows x 7 columns]

Note that this dataset is regular, not event-based, because consecutive measurements are separated by a constant interval of one business day.

Currently, the TimeSeriesSynthesizer requires that measurements cover identical time periods for every unique entity: each entity's series should start and end on the same day, and the interval between measurements should be the same for every series in the dataset.
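
The alignment requirement above can be checked with plain pandas before training. The following is a minimal sketch, not part of the synthesized package: the helper name has_aligned_time_index and the toy DataFrame are hypothetical, and the check only confirms that every entity has exactly one row for every timestamp. It does not verify that the timestamps themselves are evenly spaced.

```python
import pandas as pd


def has_aligned_time_index(df: pd.DataFrame, id_col: str, time_col: str) -> bool:
    """Return True if every entity has exactly one row for every timestamp."""
    timestamps = pd.to_datetime(df[time_col])
    # Rows: entities, columns: timestamps, cells: number of matching rows.
    counts = pd.crosstab(df[id_col], timestamps)
    # Aligned data has exactly one measurement per (entity, timestamp) pair.
    return bool((counts == 1).all().all())


# Toy data (not the S&P 500 file): two entities sharing three dates.
toy = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "B", "B"],
    "date": ["2013-02-08", "2013-02-11", "2013-02-12"] * 2,
    "close": [1.0, 1.1, 1.2, 2.0, 2.1, 2.2],
})

aligned = has_aligned_time_index(toy, "Name", "date")
# Dropping the last row leaves entity "B" one date short.
misaligned = has_aligned_time_index(toy.iloc[:-1], "Name", "date")
print(aligned, misaligned)
```

For the dataset above, the equivalent call would be has_aligned_time_index(df, "Name", "date").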

In order to train the model with the desired number of time steps, we need to specify max_time_steps in DeepStateConfig, which configures the underlying model. The max_time_steps argument controls the maximum number of time steps to process from each unique entity. By default, this is set to 100.

from synthesized.config import DeepStateConfig
config = DeepStateConfig(max_time_steps=50)
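
When choosing max_time_steps, it can help to know how many time steps each entity actually has. In the dataset above, 6300 rows spread evenly over 25 companies gives 252 rows per company, which is more than both the default of 100 and the value of 50 used here. The sketch below uses only pandas; the helper name entities_exceeding and the toy data are hypothetical, and it makes no assumption about which rows the library selects when a series is longer than the limit.

```python
import pandas as pd

MAX_TIME_STEPS = 50  # same value as passed to DeepStateConfig above


def entities_exceeding(df: pd.DataFrame, id_col: str, max_time_steps: int) -> pd.Series:
    """Return the row count of each entity that has more than max_time_steps rows."""
    steps = df.groupby(id_col).size()
    return steps[steps > max_time_steps]


# Toy data (hypothetical): entity "A" has 60 rows, entity "B" has 40.
toy = pd.DataFrame({
    "Name": ["A"] * 60 + ["B"] * 40,
    "close": range(100),
})

too_long = entities_exceeding(toy, "Name", MAX_TIME_STEPS)
print(too_long)
```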

The DeepStateConfig instance can then be passed to the TimeSeriesSynthesizer upon model creation, along with the original dataset and the time-series specification:

from synthesized import TimeSeriesSynthesizer
synth = TimeSeriesSynthesizer(
    df,
    id_idx="Name",
    time_idx="date",
    event_cols=["open", "high", "low", "close", "volume"],
    const_cols=[],
    exog_cols=[],
    config=config
)
synth.learn()

In this case, there are no columns containing constant values (per unique ID) or exogenous variables.

It is not required to call MetaExtractor.extract(df) prior to creating a TimeSeriesSynthesizer object, unlike the HighDimSynthesizer. The preprocessing carried out on instantiation also means that the original DataFrame is not required to be passed in as an argument to the learn() method.

Having trained the model, we are now in a position to synthesize a new sequence of time-series data. Compared to the HighDimSynthesizer, the TimeSeriesSynthesizer accepts additional arguments that make it possible to generate synthetic time-series data for specific unique entities or over specific time periods:

  • n: The total number of time-steps to synthesize. This includes any time-steps used by df_time_series to prime the generator.

  • df_exogenous: Exogenous variables linked to the time-series. Must have the same number of rows as the value n. Currently this argument must be provided for regular time-series synthesis. Note that the TimeSeriesSynthesizer also treats the column specified by time_idx as an exogenous variable during synthesis.

  • id: This optional argument can be used to specify the unique ID of the sequence. If provided, it must correspond to an ID in the raw dataset used during training. If this argument is not specified then a random ID is sampled and time-series data is generated.

  • df_const: Constant values linked to the given id. This should be a DataFrame containing a single row of values, where the columns correspond to those given in const_cols on instantiating the TimeSeriesSynthesizer. If id is provided and const_cols was defined at initialization, then this argument should also be specified.

  • df_time_series: Time-series measurements linked to the given id. Providing df_time_series can influence the underlying state space model to generate a particular initial state that is then used during time-series generation. The provided DataFrame must contain the same columns specified in event_cols. Any number of rows ≤ n can be provided. The example below uses the first 50 rows for the entity GWW in the original dataset to prime the model; the remaining 50 rows are new data points.

df_synth = synth.synthesize(
    n=100,
    id="GWW",
    df_exogenous=df.loc[df["Name"] == "GWW", "date"].iloc[:100].to_frame().reset_index(drop=True),
    df_time_series=df.loc[df["Name"] == "GWW", ["open", "high", "low", "close", "volume"]].iloc[:50]
)

Figure: Synthetic and original regular time-series data
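
The row-count constraints described in the argument list (df_exogenous must have exactly n rows, df_time_series at most n rows) can be enforced when slicing the inputs. The following is a minimal pandas-only sketch under those stated constraints; the helper build_synthesis_inputs, its parameter n_prime, and the toy data are hypothetical and not part of the synthesized API.

```python
import pandas as pd


def build_synthesis_inputs(df, entity, n, n_prime, id_col, time_col, event_cols):
    """Slice df_exogenous (n rows) and df_time_series (n_prime rows) for one entity."""
    if n_prime > n:
        raise ValueError("df_time_series may have at most n rows")
    rows = df.loc[df[id_col] == entity]
    if len(rows) < n:
        raise ValueError(f"{entity!r} has only {len(rows)} rows, need {n}")
    # The time column acts as the exogenous variable; it must have exactly n rows.
    df_exogenous = rows[[time_col]].iloc[:n].reset_index(drop=True)
    # The first n_prime measurements are the ones used to prime the generator.
    df_time_series = rows[event_cols].iloc[:n_prime]
    return df_exogenous, df_time_series


# Toy data (hypothetical): two entities with six business days each.
dates = ["2013-02-08", "2013-02-11", "2013-02-12",
         "2013-02-13", "2013-02-14", "2013-02-15"]
toy = pd.DataFrame({
    "Name": ["GWW"] * 6 + ["UPS"] * 6,
    "date": dates * 2,
    "close": [float(i) for i in range(12)],
})

df_exogenous, df_time_series = build_synthesis_inputs(
    toy, entity="UPS", n=4, n_prime=2,
    id_col="Name", time_col="date", event_cols=["close"],
)
print(df_exogenous.shape, df_time_series.shape)
```

The two returned DataFrames correspond to the df_exogenous and df_time_series arguments of synth.synthesize in the example above, with n and id passed alongside them.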