Fine-tuning with the SDK


The Synthesized package is designed to work easily out of the box, creating realistic synthetic data that can be used as a drop-in replacement in many data science and machine learning tasks. The product comes with recommended default settings that have been tested across many different scenarios and datasets, and evaluated under a range of metrics.

However, for more advanced users and in certain situations, the default settings may not achieve the desired utility or privacy in the synthetic data. The Synthesized package allows further configuration for these specific cases. On this page we give instructions for making the most of the Synthesized package in such cases.

To demonstrate these capabilities, we use an example dataset and work through some steps that improve on the quality of data generated with the default settings. We walk through three fine-tuning steps:

  1. Altering the type of modelling of a column to improve realism.

  2. Reformatting data so the Synthesizer can understand it more easily.

  3. Annotating data to protect PII.

Example Dataset

Let’s load an example dataset and see what we can do with the default Synthesized setup. We use a dataset taken from Kaggle, a selection of bookings from two separate hotels. It originally appeared in the article hotel booking demand dataset by Nuno Antonio, Ana de Almeida and Luis Nunes, in Data in Brief.

The columns contain a selection of different categorical, continuous and date data.

In [1]: import pandas as pd

In [2]: df = pd.read_csv("https://raw.githubusercontent.com/synthesized-io/synthesized-notebooks/master/data/hotel_bookings.csv")

Default Settings

We first run the data through the default process as detailed in the quickstart.

In [3]: import synthesized

In [4]: df_meta = synthesized.MetaExtractor.extract(df)

In [5]: synthesizer = synthesized.HighDimSynthesizer(df_meta)

In [6]: synthesizer.fit(df)

In [7]: df_synth = synthesizer.sample(len(df))

To see how well this data is synthesized we can use the evaluation tools in the Synthesized package. Here we look at the Kolmogorov-Smirnov distance between real and synthetic columns that the Synthesizer treats as continuous, and the Earth Mover’s distance between the columns it treats as categorical. Smaller distances mean the real and synthetic columns are closer in distribution.

In [8]: import matplotlib.pyplot as plt

In [9]: from synthesized.testing import Assessor

In [10]: from synthesized.insight.metrics import KolmogorovSmirnovDistance, EarthMoversDistance

In [11]: assessor = Assessor(df_meta)

In [12]: assessor.show_first_order_metric_distances(df, df_synth, KolmogorovSmirnovDistance())

In [13]: assessor.show_first_order_metric_distances(df, df_synth, EarthMoversDistance())
(Figures: Kolmogorov-Smirnov and Earth Mover’s distances between real and synthetic columns, default settings)
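Both metrics can also be computed directly with scipy, which can help when interpreting the plots above. A minimal sketch on toy data, using scipy.stats rather than the Synthesized package's own implementation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
real = rng.normal(0.0, 1.0, size=5000)   # stand-in for a real column
synth = rng.normal(0.1, 1.0, size=5000)  # stand-in for a synthetic column

# Kolmogorov-Smirnov distance: largest gap between the two empirical CDFs
ks = stats.ks_2samp(real, synth).statistic

# Earth Mover's (Wasserstein-1) distance between the two samples
emd = stats.wasserstein_distance(real, synth)

print(f"KS distance: {ks:.3f}, EMD: {emd:.3f}")
```

A small mean shift like this yields small distances under both metrics; identical samples would give distances near zero.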

The quality of the generated data here is good, but a number of columns have issues; the most obvious are the company and days_in_waiting_list columns. Let’s see if we can make some common-sense changes to how we generate data to improve this process.

Altering the modelling of the HighDimSynthesizer to improve realism

This stage of fine-tuning uses model overrides to change how the HighDimSynthesizer treats certain columns. As we can see in the plots above, the two columns days_in_waiting_list and company aren’t very close in distribution to their original counterparts. This is because, by default, the HighDimSynthesizer has decided to model them as continuous distributions. We can see this by examining the HighDimSynthesizer._df_model attribute.

In [14]: df_meta = synthesized.MetaExtractor.extract(df)

In [15]: synthesizer = synthesized.HighDimSynthesizer(df_meta)

In [16]: synthesizer._df_model["company"], synthesizer._df_model["days_in_waiting_list"],
Out: (KernelDensityEstimate(meta=<Scale[i8]: Integer(name=company)>),
 KernelDensityEstimate(meta=<Scale[i8]: Integer(name=days_in_waiting_list)>))

The KernelDensityEstimate model tells us that the Synthesizer treats these columns as continuous. If instead we want to change this, we can create Histogram models and pass them in as an argument to the HighDimSynthesizer:

In [17]: from synthesized.model.models import Histogram

In [18]: waiting_list_model = Histogram(df_meta["days_in_waiting_list"])

In [19]: company_model = Histogram(df_meta["company"])

In [20]: synthesizer = synthesized.HighDimSynthesizer(df_meta, type_overrides=[waiting_list_model, company_model])

In [21]: synthesizer.fit(df)

In [22]: df_synth = synthesizer.sample(len(df))

In [23]: from synthesized.testing import Assessor

In [24]: from synthesized.insight.metrics import KolmogorovSmirnovDistance, EarthMoversDistance

In [25]: assessor = Assessor(df_meta)

In [26]: assessor.show_first_order_metric_distances(df, df_synth, KolmogorovSmirnovDistance())

In [27]: assessor.show_first_order_metric_distances(df, df_synth, EarthMoversDistance())

In [28]: plt.show()
(Figures: Kolmogorov-Smirnov and Earth Mover’s distances after applying the Histogram overrides)

We can see now that these columns are much closer in terms of Kolmogorov-Smirnov distance.
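The improvement makes sense: a histogram model reproduces the exact mass at each discrete value, while a kernel density estimate smooths it out. A minimal illustration with numpy and scipy on a toy column (not the Synthesized models themselves):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# toy column resembling days_in_waiting_list: mostly zeros plus a long tail
real = np.concatenate([np.zeros(900), rng.integers(1, 400, size=100)])

# a Gaussian KDE smears the point mass at zero (and can produce negatives)
kde_sample = stats.gaussian_kde(real).resample(1000, seed=0)[0]

# a histogram-style model simply resamples the observed values
hist_sample = rng.choice(real, size=1000)

ks_kde = stats.ks_2samp(real, kde_sample).statistic
ks_hist = stats.ks_2samp(real, hist_sample).statistic
print(f"KDE KS: {ks_kde:.3f}, histogram KS: {ks_hist:.3f}")
```

The KDE's Kolmogorov-Smirnov distance is far larger than the histogram's, because the smoothing destroys the spike at zero that dominates the column.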

Reformatting data so the Synthesizer can understand it more easily

The next change we’re going to make is to merge the “arrival date” columns into a single column: the HighDimSynthesizer isn’t able to recognise that these columns are linked and together represent a date. If we merge them into the standard format “dd/mm/yyyy”, the HighDimSynthesizer will recognise the result as a date. A side effect is that training for the other columns may also improve, since the model now has a better understanding of the data.

# mappings tell pandas how to alter each column
In [29]: month_map = {
   ....:     "January": "01",
   ....:     "February": "02",
   ....:     "March": "03",
   ....:     "April": "04",
   ....:     "May": "05",
   ....:     "June": "06",
   ....:     "July": "07",
   ....:     "August": "08",
   ....:     "September": "09",
   ....:     "October": "10",
   ....:     "November": "11",
   ....:     "December": "12",
   ....: }
   ....: 

In [30]: def day_map(i):
   ....:     if i < 10:
   ....:         return f"0{i}"
   ....:     return str(i)
   ....: 

In [31]: df_altered = df.copy()

# use mappings to create new column
In [32]: df_altered["arrival_date"] = df["arrival_date_day_of_month"].map(day_map).str.cat(
   ....:                                 df["arrival_date_month"].map(month_map), sep="/").str.cat(
   ....:                                 df["arrival_date_year"].astype(str), sep="/")
   ....: 

# drop rest of arrival date columns
In [33]: df_altered = df_altered.drop(columns=["arrival_date_day_of_month", "arrival_date_month", "arrival_date_year", "arrival_date_week_number"])
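As a quick sanity check, the merged column can be parsed back with pandas; note also that str.zfill(2) is a built-in alternative to the hand-written day_map. A sketch on a toy frame with illustrative values, not the hotel data:

```python
import pandas as pd

# illustrative frame with the three original arrival-date columns
df = pd.DataFrame({
    "arrival_date_day_of_month": [1, 15, 31],
    "arrival_date_month": ["July", "August", "December"],
    "arrival_date_year": [2015, 2016, 2017],
})

month_map = {"July": "07", "August": "08", "December": "12"}

# zero-pad the day with zfill instead of a custom day_map function
merged = (df["arrival_date_day_of_month"].astype(str).str.zfill(2)
          .str.cat(df["arrival_date_month"].map(month_map), sep="/")
          .str.cat(df["arrival_date_year"].astype(str), sep="/"))

# parsing succeeds, confirming the "dd/mm/yyyy" format is well formed
parsed = pd.to_datetime(merged, format="%d/%m/%Y")
print(merged.tolist())  # ['01/07/2015', '15/08/2016', '31/12/2017']
```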

To confirm that the HighDimSynthesizer will treat this column as a date, we can index into our df_meta variable and check that the column has been extracted as a DateTime meta.

In [34]: df_meta = synthesized.MetaExtractor.extract(df_altered)

In [35]: print(df_meta["arrival_date"])
Out: <Affine[M8[ns]]: DateTime(name=arrival_date)>

Finally, we can synthesize our data as before; we could also keep the type_overrides argument from the previous section.

In [36]: synthesizer = synthesized.HighDimSynthesizer(df_meta) # type_overrides can be provided here again

In [37]: synthesizer.fit(df_altered)

In [38]: df_synth = synthesizer.sample(len(df_altered))

Now the generated dates will make logical sense, and the quality of our data is improved.
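If downstream code expects the original column layout, the synthesized date can be split back into its components with pandas. A hypothetical sketch on toy output in the merged "dd/mm/yyyy" format:

```python
import pandas as pd

# toy stand-in for synthesized output (illustrative values)
df_synth = pd.DataFrame({"arrival_date": ["05/03/2016", "28/11/2015"]})

dates = pd.to_datetime(df_synth["arrival_date"], format="%d/%m/%Y")

# recover the original column layout, including the derived week number
df_synth["arrival_date_day_of_month"] = dates.dt.day
df_synth["arrival_date_month"] = dates.dt.month_name()
df_synth["arrival_date_year"] = dates.dt.year
df_synth["arrival_date_week_number"] = dates.dt.isocalendar().week

print(df_synth["arrival_date_day_of_month"].tolist())  # [5, 28]
```

This guarantees the recovered day, month, year and week number are mutually consistent, which the independently-modelled original columns could not.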

Annotating data to protect PII

Another potential issue in this data is that the country of origin is learnt as a categorical variable alongside the rest of the data. Instead, we may want to generate this column independently: this prevents potential linkage attacks against the country column, and also lets us generate country codes not present in the original dataset. We use the FormattedString annotation to specify a regex of country codes to generate from.

In [39]: from synthesized.metadata.value.categorical import FormattedString

In [40]: country_codes_regex = ("(AFG|ALA|ALB|DZA|ASM|AND|AGO|AIA|ATA|ATG|ARG|ARM|ABW|AUS|AUT|AZE|BHS|BHR|BGD|BRB|BLR|BEL|BLZ|"
   ....:                         "BEN|BMU|BTN|BOL|BES|BIH|BWA|BVT|BRA|IOT|BRN|BGR|BFA|BDI|CPV|KHM|CMR|CAN|CYM|CAF|TCD|CHL|CHN|"
   ....:                         "CXR|CCK|COL|COM|COG|COD|COK|CRI|CIV|HRV|CUB|CUW|CYP|CZE|DNK|DJI|DMA|DOM|ECU|EGY|SLV|GNQ|ERI|"
   ....:                         "EST|SWZ|ETH|FLK|FRO|FJI|FIN|FRA|GUF|PYF|ATF|GAB|GMB|GEO|DEU|GHA|GIB|GRC|GRL|GRD|GLP|GUM|GTM|"
   ....:                         "GGY|GIN|GNB|GUY|HTI|HMD|VAT|HND|HKG|HUN|ISL|IND|IDN|IRN|IRQ|IRL|IMN|ISR|ITA|JAM|JPN|JEY|JOR|"
   ....:                         "KAZ|KEN|KIR|PRK|KOR|KWT|KGZ|LAO|LVA|LBN|LSO|LBR|LBY|LIE|LTU|LUX|MAC|MDG|MWI|MYS|MDV|MLI|MLT|"
   ....:                         "MHL|MTQ|MRT|MUS|MYT|MEX|FSM|MDA|MCO|MNG|MNE|MSR|MAR|MOZ|MMR|NAM|NRU|NPL|NLD|NCL|NZL|NIC|NER|"
   ....:                         "NGA|NIU|NFK|MKD|MNP|NOR|OMN|PAK|PLW|PSE|PAN|PNG|PRY|PER|PHL|PCN|POL|PRT|PRI|QAT|REU|ROU|RUS|"
   ....:                         "RWA|BLM|SHN|KNA|LCA|MAF|SPM|VCT|WSM|SMR|STP|SAU|SEN|SRB|SYC|SLE|SGP|SXM|SVK|SVN|SLB|SOM|ZAF|"
   ....:                         "SGS|SSD|ESP|LKA|SDN|SUR|SJM|SWE|CHE|SYR|TWN|TJK|TZA|THA|TLS|TGO|TKL|TON|TTO|TUN|TUR|TKM|TCA|"
   ....:                         "TUV|UGA|UKR|ARE|GBR|USA|UMI|URY|UZB|VUT|VEN|VNM|VGB|VIR|WLF|ESH|YEM|ZMB|ZWE)")
   ....: 

In [41]: address = FormattedString(
   ....:     name="country",
   ....:     pattern=country_codes_regex
   ....: )
   ....: 

In [42]: df_meta = synthesized.MetaExtractor.extract(df_altered, annotations=[address])
In [43]: synthesizer = synthesized.HighDimSynthesizer(df_meta)

In [44]: synthesizer.fit(df_altered)

In [45]: df_synth = synthesizer.sample(len(df_altered))
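As a final check, one can verify that annotated values match the supplied pattern using Python's re module. A sketch with a shortened, illustrative version of the country-code regex and a toy stand-in for the synthesized column:

```python
import re
import pandas as pd

# shortened, illustrative version of the full country-code pattern above
pattern = re.compile(r"(GBR|USA|FRA|DEU|PRT|ESP)")

# toy stand-in for a synthesized country column
countries = pd.Series(["GBR", "PRT", "USA"])

# every generated value should fully match the annotation's regex
all_match = countries.map(lambda c: bool(pattern.fullmatch(c))).all()
print(all_match)  # True
```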