# Single Table Synthesis - Spark notebook


This tutorial explains the most important features of the Synthesized SDK through a complete end-to-end generation
of synthetic data from an original dataset. The examples show how an end-to-end synthesis can be achieved with the
default parameters, and then go into more detail on how non-default settings can be configured at each stage of the
process. The generative models used by the SDK have been tuned on a wide variety of datasets and data types, so the
"out-of-the-box" behaviour of the SDK is performant and produces synthetic data statistically similar to the original.
Where custom scenarios and constraints are required, a user can define a custom configuration when using the SDK.


## Fitness Dataset

This example will use a [modified public dataset from Kaggle](https://www.kaggle.com/datasets/ddosad/datacamps-data-science-associate-certification)
detailing the attendance of a range of fitness classes available at a gym.


When using Spark, data would normally be pulled from a Delta Lake table or a similar data source with a well-defined
schema that Spark can read directly.
To keep this tutorial standalone, a CSV file is used instead. Because CSV files do not store a schema, the schema is
specified manually in the code below.


In [None]:
import pyspark.sql as ps
import pyspark.sql.types as st

# Create a Spark session here (connect to a cluster). On Databricks the `spark` variable already exists, so skip this step.
spark = ps.SparkSession.builder.master("local[1]").appName("synthesized-single-table-tutorial").getOrCreate()

schema = st.StructType([
    st.StructField(name="booking_id", dataType=st.IntegerType()),
    st.StructField(name="months_as_member", dataType=st.IntegerType()),
    st.StructField(name="weight", dataType=st.FloatType()),
    st.StructField(name="days_before", dataType=st.IntegerType()),
    st.StructField(name="day_of_week", dataType=st.IntegerType()),
    st.StructField(name="time", dataType=st.StringType()),
    st.StructField(name="category", dataType=st.StringType()),
    st.StructField(name="attended", dataType=st.IntegerType()),
    st.StructField(name="membership", dataType=st.StringType()),
])

df = spark.read.csv("fitness_data.csv", header=True, schema=schema)

df.show(10)
df.printSchema()


This dataset has a number of interesting features:

- The column `booking_id` is an enumerated column, i.e. it increases in constant sized steps from a starting value
- The column `day_of_week` communicates what day of the week the class was held, but as an integer
- There is a `membership` column, communicating what type of plan the member is on
    - "Anytime" members can attend classes on any day of the week
    - "Weekend" members can only attend classes on a Saturday or Sunday
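
The membership rule above is a constraint that synthetic rows should also respect. A minimal sketch of how it could be
checked is given here. The rows are hypothetical, and the encoding of `day_of_week` (1 = Monday through 7 = Sunday) is
an assumption made for illustration, since the integer encoding is not stated in this tutorial.

```python
# Hypothetical rows for illustration only; not taken from the real dataset.
# ASSUMPTION: day_of_week uses 1 = Monday ... 7 = Sunday, so 6 and 7 are the weekend.
WEEKEND_DAYS = {6, 7}

rows = [
    {"booking_id": 1, "day_of_week": 3, "membership": "Anytime"},
    {"booking_id": 2, "day_of_week": 6, "membership": "Weekend"},
    {"booking_id": 3, "day_of_week": 7, "membership": "Anytime"},
    {"booking_id": 4, "day_of_week": 2, "membership": "Weekend"},  # breaks the rule
]


def violates_membership_rule(row):
    """Return True when a "Weekend" member is booked on a weekday."""
    return row["membership"] == "Weekend" and row["day_of_week"] not in WEEKEND_DAYS


# Collect the booking IDs of all rows that break the rule.
violations = [row["booking_id"] for row in rows if violates_membership_rule(row)]
print(violations)
```

The same check can be applied to both the original and the synthetic data once the actual encoding of `day_of_week`
has been confirmed.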


This tutorial will demonstrate how the SDK can be used to create synthetic data with the
same statistical fidelity as the original dataset, and new values of `booking_id` in order to supplement the original data.
This is a common task within an organisation, especially where the volume of the original dataset is small.
Before any synthetic data is generated, it is best practice to perform some exploratory data analysis (EDA) to gain a deeper understanding of the dataset.
For example, EDA could be done to determine whether there are missing values, understand the distribution of outliers or calculate/plot
any univariate and multivariate statistics that may aid in the development of any hypothesis testing. This is by no means a comprehensive
list of techniques and tools encompassed within EDA and the actual EDA process is highly individual and dependent on how
the synthetic data will be leveraged downstream.
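
As a rough illustration of the kind of checks meant here, the sketch below counts missing values and computes two
simple univariate summaries. It uses plain Python on a handful of hypothetical records (the values are invented for
illustration); in practice the same questions would be asked of the real dataframe with Spark or Pandas.

```python
from collections import Counter

# Hypothetical records for illustration; None marks a missing value.
records = [
    {"weight": 79.6, "category": "Strength", "attended": 0},
    {"weight": None, "category": "HIIT", "attended": 1},
    {"weight": 74.5, "category": "HIIT", "attended": 0},
    {"weight": 86.1, "category": None, "attended": 1},
]

columns = list(records[0])

# Number of missing values in each column.
missing = {col: sum(r[col] is None for r in records) for col in columns}

# Frequency of each observed category (missing values excluded).
category_counts = Counter(r["category"] for r in records if r["category"] is not None)

# Mean of the observed weights (missing values excluded).
weights = [r["weight"] for r in records if r["weight"] is not None]
mean_weight = sum(weights) / len(weights)

print(missing)
print(category_counts.most_common())
print(round(mean_weight, 2))
```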


## Simple Synthesis

The fastest way to get up and running with the SDK is using the Python API functions `train` and `generate`.
First, create and train a Synthesizer with the `train` function:


In [None]:
from synthesized3.api import train

synth = train(df)


Second, generate synthetic data using the `generate` function:


In [None]:
from synthesized3.api import generate

df_synth = generate(synth, 1000, spark=spark)

df_synth.show(10)


## Model and Meta inspection

Internally the `train` function above performs a few steps, including extraction of metadata about each column in the
dataset and deciding how each column is to be modelled. It is possible to overwrite the default values by supplying
the `train` function with a `SynthesizerConfig` object. Before providing an override though (next section),
the default meta information can be inspected as follows:


#### Meta inspection


In [None]:
metas = synth.meta_collection.metas
print("Metas:")
for meta in metas:
    print(meta)


In the next section it is demonstrated how to get the Synthesizer to treat one or more of these columns differently
(e.g. to treat `day_of_week` as a category, or `attended` as a boolean).


#### Model inspection

The default models to be used can be inspected similarly:

In [None]:
models = synth.model_collection.models
print("Models:")
for model in models:
    print(model)


All columns are being modelled by the DeepTableModel.
In the next section it is demonstrated how to get the Synthesizer to model one or more columns differently
(e.g. to make the synthetic `booking_id` column be systematically generated in a sequential order, an
`EnumerationModel` can be used).


## Model and Meta overrides

The default Models and Metas of the Synthesizer can be overwritten by supplying the `train` function with a
`SynthesizerConfig` object with custom Model and Meta specifications. The whole `SynthesizerConfig` configuration can
be specified in a YAML file and loaded in, or specified via the Python API. Both setups are demonstrated below.

Meta overrides and Model overrides (for one or multiple columns) can be specified in one `SynthesizerConfig`
configuration, or just one can be specified if only one is required.
In the example below, both are specified in one config.

In this example a Meta override is used to specify that the `attended` column should be treated as a boolean,
and a Model override is used to specify that the `booking_id` column should be modelled using an `EnumerationModel`
(which systematically generates values in a sequential order with a fixed step size).


#### YAML

First, place the following YAML code in a file named `config.yaml`:
```yaml
# config.yaml file
train:
  meta_overrides:
    attended: BooleanMeta
  model_overrides:
    - columns:
        - booking_id
      type: EnumerationModel
      kwargs:
        start: 1000
```

In [None]:
# Using config.yaml file as specified above:
import yaml
from synthesized3.api import load_and_validate_config

with open("config.yaml") as f:
    config = yaml.safe_load(f)

synth_config = load_and_validate_config(config)

synth = train(df, synth_config)


#### Python API

The same can be achieved by interacting directly with the Python API:

In [None]:
from synthesized3.schema import SynthesizerConfig

synth_config = SynthesizerConfig(
    train={
        "meta_overrides": {
            "attended": "BooleanMeta",
        },
        "model_overrides": [
            {"columns": ["booking_id"],
             "type": "EnumerationModel",
             "kwargs":{"start": 1000}},
        ],
    }
)

synth = train(df, synth_config)


The Metas being used can be inspected as before:


In [None]:
print(synth.meta_collection["attended"])


It can be seen that `attended` now has a `BooleanMeta`.
The Models being used can be inspected as before:


In [None]:
for model in synth.model_collection.models:
    print(model)


It can be seen that `booking_id` is now being modelled with an EnumerationModel.

Generate data from this synthesizer:


In [None]:
from synthesized3.api import generate

df_synth = generate(synth, num_rows=1000, spark=spark)

df_synth.show(10)


It can be seen that the `booking_id` column does indeed have values enumerating from 1000 (specified in the config)
in step sizes of 1 (the default).
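
Rather than relying on visual inspection alone, the enumeration can be verified programmatically. The helper below is
a hypothetical illustration written for this tutorial, not part of the SDK, and the list of IDs stands in for the
values of the synthetic `booking_id` column.

```python
def is_enumerated(values, start, step=1):
    """Return True if values equal start, start + step, start + 2 * step, ..."""
    return all(v == start + i * step for i, v in enumerate(values))


# Hypothetical values standing in for the synthetic `booking_id` column.
booking_ids = [1000, 1001, 1002, 1003, 1004]

print(is_enumerated(booking_ids, start=1000))
print(is_enumerated([1000, 1002, 1003], start=1000))
```

With a Spark dataframe, the values could first be brought to the driver, for example with
`[row["booking_id"] for row in df_synth.select("booking_id").collect()]`, keeping in mind that row order is not
guaranteed unless the dataframe is sorted.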


## Evaluation

#### Insight package
In this section it is assumed the Synthesized `insight` package for evaluating data has been installed:

```
pip install insight
```

Further explanations and documentation of the `insight` package can be found
[here](https://github.com/synthesized-io/insight).


#### Data Quality Evaluation

When evaluating the quality of synthetic data, there are generally three perspectives to consider:

- Statistical Quality
- Predictive Utility
- Privacy

It is recommended best practice to evaluate the statistical quality of the synthetic data (i.e. its fidelity
compared to the original data) before considering other metrics, to ensure that the synthetic data quality meets the
requirements of the user.
Here, the focus will be on evaluating only the statistical quality of the synthetic data.


#### Note:

The evaluation code currently uses Pandas exclusively, so any Spark dataframes (both raw and synthetic) need to be
converted to Pandas dataframes before running the code in this section. This can be done in one line of code:

```
df_pandas = df_spark.toPandas()
```

This will collect the entire dataframe on the driver node. If the dataframe is too large to fit in the driver's memory,
take a representative sample of the data before converting it to a Pandas dataframe. A representative sample has
approximately the same statistical patterns and correlations as the population, so comparisons made on it remain valid.
A random sample can be taken and converted to a Pandas dataframe using the following code:

```
fraction_of_data_to_sample = 0.05
df_pandas = df_spark.sample(fraction_of_data_to_sample).toPandas()
```
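
The claim that a random sample preserves the statistical patterns of the population can be illustrated with a small
plain-Python sketch. It assumes that each row is kept independently with probability equal to the sampling fraction,
and the population of membership values is hypothetical.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 70% "Anytime" and 30% "Weekend" members.
population = ["Anytime"] * 7000 + ["Weekend"] * 3000

# ASSUMPTION: each row is kept independently with probability `fraction`.
fraction = 0.05
sample = [row for row in population if rng.random() < fraction]


def proportions(values):
    """Map each distinct value to its share of the list."""
    counts = Counter(values)
    return {key: count / len(values) for key, count in counts.items()}


print(len(sample))
print(proportions(population))
print(proportions(sample))
```

The sample proportions are close to, but not exactly equal to, those of the population, and the sample size is only
approximately 5% of the rows. PySpark's `DataFrame.sample` also accepts a `seed` argument, which makes the sample
reproducible between runs.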

#### Evaluation code

A simple first step is to analyse whether similar numbers of missing values are present in the original and synthetic
dataframes. Where there are no missing values in the original, it should be confirmed there are none in the synthetic.


In [None]:
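# Sample both Spark dataframes and convert them to Pandas.
# Note: from here on, `df` and `df_synth` refer to Pandas dataframes, not Spark dataframes.
fraction_of_data_to_sample = 0.05
df = df.sample(fraction_of_data_to_sample).toPandas()
df_synth = df_synth.sample(fraction_of_data_to_sample).toPandas()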


In [None]:
import pandas as pd

pd.concat([df.isna().sum(), df_synth.isna().sum()], axis=1, keys=["original_nan_count", "synthetic_nan_count"])


Next, the Synthesized `insight` package is used to compare the univariate and bivariate statistics
across the synthetic and original datasets. As an initial sanity check, the distributions of the continuous and
categorical features in the original and synthetic datasets are visually compared (excluding the `"booking_id"` feature
since this is an enumerated unique ID):


In [None]:
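import insight.plotting as p

# Drop the enumerated `booking_id` column, which is a unique ID rather than a feature.
fig = p.plot_dataset([df.drop(columns=["booking_id"]), df_synth.drop(columns=["booking_id"])])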


A quick visual inspection confirms that the synthetic data is qualitatively similar to the original. Calculating
appropriate distances between the distributions in the two datasets allows the user to better quantify the statistical
quality of the data. For categorical distributions, for example, the Earth Mover's distance can be calculated:


In [None]:
import insight.metrics as m
emd = m.EarthMoversDistance()
emd(df["category"], df_synth["category"])
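
To give some intuition for the number returned, the Earth Mover's distance is the minimum amount of probability mass
that has to be moved to turn one distribution into the other. For unordered categories, if moving mass between any two
distinct categories is given a cost of 1, this quantity equals the total variation distance. The sketch below computes
that distance by hand on hypothetical values. Whether `insight` uses exactly this ground distance is an assumption
here, so the sketch is an aid to interpretation rather than a reimplementation of the metric.

```python
from collections import Counter


def total_variation_distance(a, b):
    """Half the sum of absolute differences between two category distributions."""
    p, q = Counter(a), Counter(b)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p[c] / len(a) - q[c] / len(b)) for c in categories)


# Hypothetical category values for illustration.
original = ["HIIT", "HIIT", "Yoga", "Cycling"]
synthetic = ["HIIT", "Yoga", "Yoga", "Cycling"]

print(total_variation_distance(original, synthetic))  # 0.25
```

A value of 0 means the two category distributions are identical, and a value of 1 means they share no categories.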


#### Note

The above analysis should not be considered exhaustive and the evaluation of the statistical quality of the data should be
informed by the downstream problem at hand and the EDA of the original dataset prior to any synthesis jobs being run.
