Regular Time-Series Data
Regular time-series data has constant time intervals between measurements. We will use an example dataset containing one year of stock data for a subset of 25 companies in the S&P 500:
import pandas as pd
df = pd.read_csv("sp_500_subset.csv")
df
date open high low close volume Name
0 2013-02-08 44.55 44.630 44.25 44.57 2312792 AEP
1 2013-02-11 44.57 44.740 44.48 44.73 1372326 AEP
2 2013-02-12 44.73 44.940 44.60 44.90 1695150 AEP
3 2013-02-13 44.87 45.040 44.76 44.93 1324472 AEP
4 2013-02-14 44.76 44.810 44.41 44.77 3665271 AEP
... ... ... ... ... ... ... ...
6295 2014-02-03 95.10 95.920 93.52 93.62 5448605 UPS
6296 2014-02-04 94.14 94.230 93.19 93.89 3537284 UPS
6297 2014-02-05 93.85 94.325 93.50 93.76 4595639 UPS
6298 2014-02-06 94.15 94.933 94.01 94.74 4195577 UPS
6299 2014-02-07 95.30 95.600 94.57 95.37 2898813 UPS
[6300 rows x 7 columns]
Note that this dataset is regular rather than event-based: measurements are separated by a constant interval of one business day.
Currently the TimeSeriesSynthesizer requires that measurements are taken over identical time-periods for every
unique entity, i.e. measurements should start and end on the same day, and the interval should be the same for each
set of time-series data across the whole dataset.
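Before training, it can help to verify this requirement on the raw data. The following is a minimal sketch in plain pandas, not part of the Synthesized SDK. The helper name check_aligned_series is hypothetical, and the toy frame only stands in for the stock dataset.

```python
import pandas as pd


def check_aligned_series(df, id_col, time_col):
    """Return True if every entity has exactly the same set of timestamps.

    Hypothetical helper for illustration; not part of the Synthesized SDK.
    """
    # Collect the sorted timestamps of each entity as a hashable tuple.
    stamps = {
        key: tuple(sorted(group[time_col]))
        for key, group in df.groupby(id_col)
    }
    # Aligned data yields a single distinct tuple shared by all entities.
    return len(set(stamps.values())) == 1


# Toy data: two entities measured on the same two business days.
toy = pd.DataFrame({
    "Name": ["A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2013-02-08", "2013-02-11", "2013-02-08", "2013-02-11"]
    ),
    "close": [1.0, 1.1, 2.0, 2.1],
})

aligned = check_aligned_series(toy, "Name", "date")
# Dropping the last row of entity "B" breaks the alignment.
misaligned = check_aligned_series(toy.iloc[:-1], "Name", "date")
```

On the real dataset the same call would be check_aligned_series(df, "Name", "date"). A False result suggests that some entities start, end, or skip measurements on different days.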
To train the model with the desired number of time steps, specify max_time_steps in DeepStateConfig, which is used to configure the underlying model. The max_time_steps argument controls the maximum number of time steps to process from each unique entity. By default this is set to 100.
from synthesized.config import DeepStateConfig
config = DeepStateConfig(max_time_steps=50)
The DeepStateConfig instance can then be passed to the TimeSeriesSynthesizer upon model creation, along with the original dataset and the time-series specification:
from synthesized import TimeSeriesSynthesizer
synth = TimeSeriesSynthesizer(
df,
id_idx="Name",
time_idx="date",
event_cols=["open", "high", "low", "close", "volume"],
const_cols=[],
exog_cols=[],
config=config
)
synth.learn()
In this case, there are no columns containing constant values (per unique ID) or exogenous variables.
Unlike the HighDimSynthesizer, it is not necessary to call MetaExtractor.extract(df) before creating a TimeSeriesSynthesizer object.
The preprocessing carried out on instantiation also means that the original DataFrame does not need to be passed
as an argument to the learn() method.
Having trained the model, we can now synthesize a new sequence of time-series data. Compared to the HighDimSynthesizer, the TimeSeriesSynthesizer accepts additional arguments that make it possible to generate synthetic time-series data for specific unique entities or over specific time periods:
- n: The total number of time steps to synthesize. This includes any time steps used by df_time_series to prime the generator.
- df_exogenous: Exogenous variables linked to the time series. Must have the same number of rows as the value of n. Currently this argument must be provided for regular time-series synthesis. Note that the TimeSeriesSynthesizer also treats the column specified by time_idx as an exogenous variable during synthesis.
- id: This optional argument can be used to specify the unique ID of the sequence. If provided, it must correspond to an ID in the raw dataset used during training. If this argument is not specified, a random ID is sampled and time-series data is generated for it.
- df_const: Constant values linked to the given id. This should be a DataFrame containing a single row of values, where the columns correspond to those given in const_cols on instantiating the TimeSeriesSynthesizer. If id is provided and const_cols was defined at initialization, this argument should also be specified.
- df_time_series: Time-series measurements linked to the given id. Providing df_time_series can influence the underlying state space model to generate a particular initial state that is then used during time-series generation. The provided DataFrame must contain the same columns specified in event_cols. Any number of rows ≤ n can be provided.

The example below uses the first 50 rows of the original data for GWW to prime the model; the remaining 50 rows are new data points.
df_synth = synth.synthesize(
n=100,
id="GWW",
df_exogenous=df.loc[df["Name"] == "GWW", "date"].iloc[:100].to_frame().reset_index(drop=True),
df_time_series=df.loc[df["Name"] == "GWW", ["open", "high", "low", "close", "volume"]].iloc[:50]
)
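The row-count constraints described above (df_exogenous has exactly n rows, df_time_series has at most n rows) can be checked before calling synthesize. The sketch below uses only pandas. The helper build_synthesis_inputs and its parameter names are hypothetical and not part of the Synthesized SDK, and the toy frame stands in for the stock dataset. Resetting the index of both frames is a choice made here for consistency, not a documented requirement.

```python
import pandas as pd


def build_synthesis_inputs(df, entity, n, n_prime, id_col, time_col, event_cols):
    """Slice the exogenous and priming frames for one entity.

    Hypothetical helper for illustration; not part of the Synthesized SDK.
    """
    if n_prime > n:
        raise ValueError("n_prime must not exceed n")
    rows = df.loc[df[id_col] == entity].sort_values(time_col)
    if len(rows) < n:
        raise ValueError(f"{entity!r} has only {len(rows)} rows, need {n}")
    # df_exogenous needs exactly n rows; here it holds only the time column.
    df_exogenous = rows[[time_col]].iloc[:n].reset_index(drop=True)
    # The priming frame holds the event columns for the first n_prime steps.
    df_time_series = rows[event_cols].iloc[:n_prime].reset_index(drop=True)
    return df_exogenous, df_time_series


# Toy data standing in for the stock dataset: one entity, six business days.
toy = pd.DataFrame({
    "Name": ["A"] * 6,
    "date": pd.bdate_range("2013-02-08", periods=6),
    "close": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5],
})

exog, prime = build_synthesis_inputs(
    toy, entity="A", n=4, n_prime=2,
    id_col="Name", time_col="date", event_cols=["close"],
)
```

With the real data, the two returned frames would correspond to the df_exogenous and df_time_series arguments in the synthesize call above.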
