Hyperparameter Tuning
The full source code for this example is available for download here.
Prerequisites
This tutorial assumes that you have already installed the Synthesized package, including the tune module. The tune module can be installed with:
pip install synthesized[tune]
If you are new to Synthesized, it is recommended you start with the quickstart guide and/or single table synthesis tutorial before jumping into this tutorial.
Introduction
Hyperparameter tuning is the process of finding the best set of hyperparameters for a machine learning model. For a synthetic data model this means finding the set of parameters that produce the highest quality synthetic data.
The SynthOptimizer class handles this process for you. Its default usage is straightforward, and it is flexible enough to be targeted at a more specific use case.
This tutorial demonstrates hyperparameter optimisation for the HighDimSynthesizer model, although the process is very similar for the TimeSeriesSynthesizer model. A custom loss function and a custom build and train function are used to illustrate how the tuning process can be run for a specific use case.
A public credit scoring dataset from Kaggle, also available from the synthesized.util module as an example dataset, is used to train the synthetic data model.
import synthesized
data = synthesized.util.get_example_data()
Possible Hyperparameters
When using the HighDimSynthesizer or TimeSeriesSynthesizer model, the following hyperparameters are available:
HighDimSynthesizer
- "latent_size" (int): The size of the latent space within which the data representation is learned. Larger, more complex datasets with many columns may require a larger latent space.
- "capacity" (int): The capacity of the model. This is a proxy for the complexity of the relationships between data features that the model can learn. More complex and interdependent datasets may require increased capacity.
- "learning_rate" (float): The learning rate of the deep model fitting.
- "batch_size" (int): The batch size used for training.
- "weight_decay" (float): The weight decay used for training.
- "num_layers" (int): The number of layers in the networks of the model.
- "continuous_weight" (float): The relative weight of the continuous loss.
- "categorical_weight" (float): The relative weight of the categorical loss.

TimeSeriesSynthesizer
- "latent_size" (int): The size of the latent space within which the data representation is learned. Larger, more complex datasets with many columns may require a larger latent space.
- "capacity" (int): The capacity of the model. This is a proxy for the complexity of the relationships between data features that the model can learn. More complex and interdependent datasets may require increased capacity.
- "num_time_layers" (int): The number of deep layers dedicated to modelling the time dimension.
- "num_id_layers" (int): The number of deep layers dedicated to the identification of each individual timeseries in the dataset.
- "id_capacity" (int): The capacity allocated to modelling the identification of each individual timeseries in the dataset.
- "learning_rate" (float): The learning rate of the deep model fitting.
- "batch_size" (int): The batch size used for training.
- "max_time_steps" (int): The maximum number of time steps that the model should learn.
- "dropout_rate" (float): The dropout rate used for training.
Defining the hyperparameter search space
The first step is to define the search space over each of the hyperparameters chosen for investigation. The total search space is defined as a list of parameter dictionaries. The search space for a single hyperparameter is defined by a dictionary with the following key-value pairs:
Required elements:
- "name": name of the parameter (string)
- "type": type of search strategy, one of "range", "fixed", or "choice" (string)
and one of the following:
- "bounds" for range parameters (list of two values, lower bound first)
- "values" for choice parameters (list of values)
- "value" for fixed parameters (single value)
Optional elements:
- "log_scale" for float-valued range parameters (bool)
- "value_type" to specify the type that values of this parameter should take; expects "float", "int", "bool" or "str"
- "is_fidelity" (bool) and "target_value" (float) for fidelity parameters
- "is_ordered" for choice parameters (bool)
- "is_task" for task parameters (bool)
- "digits" (int) for float-valued range parameters
In this example, a small search space with two hyperparameters, capacity and latent_size, is built:
parameters = [
{"name": "capacity", "type": "range", "bounds": [32, 64], "value_type": "int"},
{"name": "latent_size", "type": "range", "bounds": [32, 64], "value_type": "int"},
]
Running the hyperparameter search
The hyperparameter search can now be run using the SynthOptimizer class, which takes the following arguments:
Required arguments:
- orig_df (pd.DataFrame): The original dataset to train the synthetic model on.
- parameters (list): The list of hyperparameter dictionaries.
Optional arguments:
- build_and_train_function (function): The function that constructs and trains the synthetic model. If not provided then a default model is built and trained.
- synthetic_model (string): The type of model that will be built by default, if build_and_train_function is not provided. This can take the values "HighDimSynthesizer" or "TimeSeriesSynthesizer". By default this is "HighDimSynthesizer".
- loss_name (string): The name of the loss metric to use for the hyperparameter search.
- custom_loss_function (function): The function that computes the loss of the synthetic model.
- max_parallelism (int): The maximum number of parallel processes (models trained in parallel) to use for the hyperparameter search.
- num_cpus (int): The number of CPUs to use for the hyperparameter search (per parallel process).
See below for more details on how to specify a custom build_and_train_function. Where build_and_train_function is not specified, the default process of model creation and metadata extraction is carried out, as described in the single table synthesis tutorial.
To initialize the hyperparameter search, provide the original dataset and the hyperparameter search space:
from synthesized.tune import SynthOptimizer
optimizer = SynthOptimizer(
orig_df=data,
parameters=parameters,
max_parallelism=4,
num_cpus=2
)
To run the optimizer, call the optimize method. This method takes only one argument: the number of trials. Ideally the number of trials should be greater than 10, but this will depend on the size of the search space.
optimizer.optimize(20)
A logging window will be displayed showing the progress of the hyperparameter search:
Hyperparameter search results
Once the hyperparameter search has completed, the SynthOptimizer will contain the results of the search and provides useful utility methods for accessing and visualising them. The optimizer.plot_results() method is called to visualize the results.
This will produce 4 useful plots that help in understanding the results of the hyperparameter search.
optimizer.plot_results()
The first is the feature importance:
This shows the relative importance of each hyperparameter in the search, based on the parameter’s impact on the loss.
The second plot shows the loss as a function of each hyperparameter in the search range. This plot can be useful for understanding where the optimum lies with respect to each hyperparameter; for example, if the optimum lies at the edge of the search range for a given hyperparameter, it may be worth extending the search range for that hyperparameter. The blue boundaries around the distribution show an uncertainty estimate on the loss:
The third plot shows 2D distributions for combinations of parameters in terms of their impact on the loss and the uncertainty about the loss. This can be useful to understand relationships between hyperparameters.
The final plot shows the loss as a function of the number of trials. This can be useful to give an impression of how big an impact the hyperparameter tuning is having and whether it is worth running more trials.
Best hyperparameters
Finally, the best hyperparameters are retrieved from the optimizer using the get_best_params() method. The output of this method is a dictionary containing the following key information:
- params_raw - The best hyperparameters that were observed during the hyperparameter search.
- params_raw_mean_value - The mean value of the loss metric for the best hyperparameters.
- params_estimate - The estimated optimal parameters based on the hyperparameter search results. These may not have been observed during the hyperparameter search, but are implied from the results.
- params_estimate_mean_value - The mean value of the loss metric for the estimated optimal hyperparameters.
- params_estimate_variance - The variance of the loss metric for the estimated optimal hyperparameters.
optimizer.get_best_params()
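The returned dictionary can then be inspected directly. As a minimal sketch using the keys listed above:
best = optimizer.get_best_params()

# Hyperparameters actually observed during the search, with the mean loss they achieved
print(best["params_raw"], best["params_raw_mean_value"])

# Optimal hyperparameters implied by the search, with the mean loss and its variance
print(best["params_estimate"], best["params_estimate_mean_value"], best["params_estimate_variance"])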
The results of all the trials can be displayed as a dataframe using the get_trial_results_as_df() method:
optimizer.get_trial_results_as_df()
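If a record of the search is needed, the trials dataframe can be handled with standard pandas tooling; for example (the file name here is arbitrary):
trials_df = optimizer.get_trial_results_as_df()
trials_df.to_csv("tuning_trial_results.csv", index=False)  # save all trial results for later analysis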
Advanced usage
Custom loss function
In some cases, there may be a specific and measurable evaluation metric for the synthetic data being generated. In these cases it is possible to define a custom loss function to use for the hyperparameter search that will specifically evaluate the synthetic data quality for the use case.
In this example, the synthetic data will be used to train a downstream classifier model. A custom loss function can then be defined to train the classifier on the synthetic data and evaluate its performance on real data (a process known as train-synthetic-test-real).
A custom loss function can be passed as an argument to SynthOptimizer to use it in the hyperparameter optimisation. This function must take two inputs: the trained synthetic model and the original dataset. It must return a dictionary mapping the metric name (string) to its value (float). Note that the SynthOptimizer class will always attempt to minimise the loss function, so if you wish to maximise a metric the function should return the negative of that metric.
An example implementation for a custom loss function that trains a classifier on the synthetic data and evaluates its performance is shown below:
import pandas as pd
from synthesized import HighDimSynthesizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
def custom_loss_function(synth: HighDimSynthesizer, orig_df: pd.DataFrame):
    # drop nans
    orig_df_no_nan = orig_df.dropna()

    # Generate synthetic data
    # Don't produce nans for logistic regression
    synthetic_data = synth.sample(len(orig_df), produce_nans=False)

    # Train a classifier on the synthetic data
    model = LogisticRegression()
    X_train = synthetic_data.drop(columns=["SeriousDlqin2yrs"])
    y_train = synthetic_data["SeriousDlqin2yrs"]

    # Evaluate the model on the original data
    X_test = orig_df_no_nan.drop(columns=["SeriousDlqin2yrs"])
    y_test = orig_df_no_nan["SeriousDlqin2yrs"]

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Return negated metrics since the optimizer always minimises the loss
    return {
        "accuracy_score": -accuracy_score(y_test, y_pred),
        "precision_score": -precision_score(y_test, y_pred),
        "recall_score": -recall_score(y_test, y_pred),
    }
This custom loss function can then be passed to the SynthOptimizer class and the optimization run as before. The function returns a dictionary of metrics describing the classifier's performance. All of these metrics are tracked in the optimisation, but only the accuracy is optimized for. The metric to be optimized is passed as the value of loss_name to the SynthOptimizer:
optimizer = SynthOptimizer(
orig_df=data,
parameters=parameters,
custom_loss_function=custom_loss_function,
loss_name="accuracy_score",
)
optimizer.optimize(20)
Custom build and train function
Analogous to defining a custom loss function, it is also possible to customize the manner in which the model is initialized and trained by passing a custom build and train function to the SynthOptimizer class. This function must take the following arguments:
- parameters (dict): A dictionary of hyperparameters to use for the synthetic model.
- orig_df (pd.DataFrame): The original dataset to train the synthetic model on.
and must return a trained synthetic model. As an example, a function called build_and_train_function can be used to create and train an instance of the TimeSeriesSynthesizer in which max_time_steps is fixed, while the number of training epochs and the batch size per training step are optimized:
from synthesized import TimeSeriesSynthesizer
from synthesized.config import DeepStateConfig
parameters = [
{"name": "epochs", "type": "range", "bounds": [5, 15], "value_type": "int"},
{"name": "batch_size", "type": "range", "bounds": [2, 10], "value_type": "int"}
]
df_time_series = pd.read_csv("sp_500_subset.csv")

def build_and_train_function(parameters, orig_df):
    # Fix max_time_steps while taking batch_size from the search parameters
    config = DeepStateConfig(max_time_steps=50, batch_size=parameters["batch_size"])
    synth = TimeSeriesSynthesizer(
        orig_df,
        id_idx="Name",
        time_idx="date",
        event_cols=["open", "close", "high", "low", "volume"],
        config=config,
    )
    synth.learn(epochs=parameters["epochs"])
    return synth
(The dataset "sp_500_subset.csv" is available for download in the time-series tutorial).
The build and train function is then passed to the SynthOptimizer class as an argument:
optimizer = SynthOptimizer(
orig_df=df_time_series,
parameters=parameters,
build_and_train_function=build_and_train_function,
)
optimizer.optimize(20)
For more information on the use of the TimeSeriesSynthesizer, see our time-series synthesis tutorial.