Fine-tuning with the SDK#

The Synthesized package is designed to work easily out of the box, creating realistic synthetic data that can serve as a drop-in replacement in many data science and machine learning tasks. The product ships with recommended default settings which have been tested across many different scenarios and datasets, and evaluated under a range of metrics.

For more advanced users, and in certain situations, the default settings may not achieve the desired utility or privacy in the synthetic data. In these cases, the Synthesized package allows for further configuration.

On this page we give a few pointers for getting the most out of the Synthesized package. Using an example dataset, we work through steps that improve the quality of the generated data over the default settings. We cover three areas of fine-tuning:

  1. Specifying column types to improve realism.

  2. Reformatting data for the Synthesizer to more easily understand.

  3. Annotating data to prevent PII identification.

Example Dataset#

Let’s load an example dataset and see what we can do with the default Synthesized setup. We use a dataset taken from Kaggle, which is a selection of bookings from two separate hotels. It originally appeared in the article hotel booking demand dataset by Nuno Antonio, Ana de Almeida and Luis Nunes in Data in Brief.

The columns contain a selection of different categorical, continuous and date data.

In [1]: import pandas as pd

In [2]: df = pd.read_csv("https://raw.githubusercontent.com/synthesized-io/synthesized-notebooks/master/data/hotel_bookings.csv")

Default Settings#

We first run the data through the default process as detailed in the quickstart guide.

In [3]: import synthesized

In [4]: df_meta = synthesized.MetaExtractor.extract(df)

In [5]: synthesizer = synthesized.HighDimSynthesizer(df_meta)

In [6]: synthesizer.fit(df)

In [7]: df_synth = synthesizer.sample(len(df))

To see how well this data is synthesized we can use the evaluation tools in the Synthesized package. Here we’re looking at the Kolmogorov-Smirnov distance between the real and synthetic columns that the Synthesizer treats as continuous, and the Earth Mover’s distance between the columns it treats as categorical. Smaller distances mean that the columns are closer in distribution; usually a distance below 0.1 means that many of the statistical properties are preserved.
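To build intuition for what these two metrics measure, the underlying statistics can be computed directly with scipy (an assumption here: scipy is not part of the Synthesized API shown on this page, this is just an illustration of the metrics themselves on toy data):

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=1000)
close = rng.normal(loc=0.0, scale=1.0, size=1000)  # same distribution as real
far = rng.normal(loc=3.0, scale=1.0, size=1000)    # clearly shifted distribution

# smaller distance => closer distributions, for both metrics
print(ks_2samp(real, close).statistic < ks_2samp(real, far).statistic)          # True
print(wasserstein_distance(real, close) < wasserstein_distance(real, far))      # True
```

The Assessor plots below report these per-column distances between the real and synthetic data.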

In [8]: import matplotlib.pyplot as plt

In [9]: from synthesized.testing import Assessor

In [10]: from synthesized.insight.metrics import KolmogorovSmirnovDistance, EarthMoversDistance

In [11]: assessor = Assessor(df_meta)

In [12]: assessor.show_first_order_metric_distances(df, df_synth, KolmogorovSmirnovDistance())

In [13]: assessor.show_first_order_metric_distances(df, df_synth, EarthMoversDistance())
[Plots: per-column Kolmogorov-Smirnov and Earth Mover's distances between the real and synthetic data]

The quality of the generated data here is good, but a number of columns show issues, the most obvious being the days_in_waiting_list column. Let’s see if we can make some common-sense changes to how we generate data to improve this.

Specifying column types to improve realism#

This stage of fine-tuning involves the use of model overrides to change how the HighDimSynthesizer treats certain columns. Currently, as we can see in the above plots, the column days_in_waiting_list isn’t very close in distribution to the original column. This is because, by default, the HighDimSynthesizer has decided to treat it as a continuous distribution. We can see this by examining the HighDimSynthesizer._df_model attribute.
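To see why a smooth density can be a poor fit here, consider a toy column shaped like days_in_waiting_list: nearly all the mass sits on a single discrete value (zero), which a histogram captures exactly but a kernel density estimate smears out. A minimal sketch on synthetic toy data (not the hotel dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
# a days_in_waiting_list-like column: overwhelmingly zero, with a few large values
col = np.concatenate([np.zeros(950, dtype=int), rng.integers(1, 400, size=50)])

# 95% of the rows share one exact value - a spike a continuous model will blur
print((col == 0).mean())  # 0.95
```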

In [14]: df_meta = synthesized.MetaExtractor.extract(df)

In [15]: synthesizer = synthesized.HighDimSynthesizer(df_meta)

In [16]: synthesizer._df_model["days_in_waiting_list"]
Out: KernelDensityEstimate(meta=<Scale[i8]: Integer(name=days_in_waiting_list)>)

The KernelDensityEstimate model tells us that the synthesizer thinks this column is continuous. If instead we want to treat it as categorical, we can create a Histogram model and pass it as an argument to the HighDimSynthesizer.

In [17]: from synthesized.model.models import Histogram

In [18]: waiting_list_model = Histogram(df_meta["days_in_waiting_list"])

In [19]: synthesizer = synthesized.HighDimSynthesizer(df_meta, type_overrides=[waiting_list_model])

In [20]: synthesizer.fit(df)

In [21]: df_synth = synthesizer.sample(len(df))

In [22]: from synthesized.testing import Assessor

In [23]: from synthesized.insight.metrics import KolmogorovSmirnovDistance, EarthMoversDistance

In [24]: assessor = Assessor(df_meta)

In [25]: assessor.show_first_order_metric_distances(df, df_synth, KolmogorovSmirnovDistance())

In [26]: assessor.show_first_order_metric_distances(df, df_synth, EarthMoversDistance())

In [27]: plt.show()
[Plots: per-column Kolmogorov-Smirnov and Earth Mover's distances after the type override]

We can see that the days_in_waiting_list column is now much closer in distribution to the original.

Reformatting data for the Synthesizer to more easily understand#

The “arrival_date” information in this dataset is spread across several columns (e.g. arrival_date_month and arrival_date_year). For the HighDimSynthesizer to recognise that these columns are linked and together represent a date, we can merge them into the standard format “dd/mm/yyyy”. Not only will the distribution of these columns improve, but the training for other columns may also improve now that the Synthesizer has a better understanding of the data.

# mappings tell pandas how to alter each column
In [28]: month_map = {
   ....:     "January": "01",
   ....:     "February": "02",
   ....:     "March": "03",
   ....:     "April": "04",
   ....:     "May": "05",
   ....:     "June": "06",
   ....:     "July": "07",
   ....:     "August": "08",
   ....:     "September": "09",
   ....:     "October": "10",
   ....:     "November": "11",
   ....:     "December": "12",
   ....: }
   ....: 

In [29]: def day_map(i):
   ....:     if i < 10:
   ....:         return f"0{i}"
   ....:     return str(i)
   ....: 
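As a side note, the zero-padding done by day_map can also be written with Python’s built-in str.zfill, which pads a string with leading zeros to a fixed width:

```python
# str.zfill(2) left-pads with zeros to a width of two characters
print(str(7).zfill(2))   # 07
print(str(15).zfill(2))  # 15
```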

In [30]: df_altered = df.copy()

# use mappings to create new column
In [31]: df_altered["arrival_date"] = df["arrival_date_day_of_month"].map(day_map).str.cat(
   ....:                                 df["arrival_date_month"].map(month_map), sep="/").str.cat(
   ....:                                 df["arrival_date_year"].astype(str), sep="/")
   ....: 

# drop rest of arrival date columns
In [32]: df_altered = df_altered.drop(columns=["arrival_date_day_of_month", "arrival_date_month", "arrival_date_year", "arrival_date_week_number"])
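The same transformation can also be sketched more compactly with pd.to_datetime, since the "%B" format code parses full English month names directly, removing the need for an explicit month map. A minimal sketch on toy rows (illustrative values, not the real dataset):

```python
import pandas as pd

# toy rows with the same three arrival-date columns
toy = pd.DataFrame({
    "arrival_date_day_of_month": [1, 15],
    "arrival_date_month": ["July", "August"],
    "arrival_date_year": [2015, 2016],
})

# join the parts into strings like "1 July 2015", parse, then reformat
combined = (toy["arrival_date_day_of_month"].astype(str) + " "
            + toy["arrival_date_month"] + " "
            + toy["arrival_date_year"].astype(str))
arrival = pd.to_datetime(combined, format="%d %B %Y").dt.strftime("%d/%m/%Y")
print(arrival.tolist())  # ['01/07/2015', '15/08/2016']
```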

To check that the HighDimSynthesizer is treating this column as a date, all we need to do is index into our df_meta variable and confirm it is a DateTime class.

In [33]: df_meta = synthesized.MetaExtractor.extract(df_altered)

In [34]: print(df_meta["arrival_date"])
Out: <Affine[M8[ns]]: DateTime(name=arrival_date)>

Finally, we can synthesize our data as before. We could also keep the type_overrides argument from the previous section.

In [35]: synthesizer = synthesized.HighDimSynthesizer(df_meta) # type_overrides can be provided here again

In [36]: synthesizer.fit(df_altered)

In [37]: df_synth = synthesizer.sample(len(df_altered))

Now the output of these dates will make logical sense and the quality of our data will be improved.

Annotating data to prevent PII identification#

Another potential issue in this data is that the origin country is learnt as categorical data along with the rest of the data. We might instead want to generate this column independently: doing so removes the potential for linkage attacks involving the country, and also lets us generate values for countries not present in the original dataset. We use the FormattedString annotation to specify a regex of country codes to generate from.

In [38]: from synthesized.metadata.value.categorical import FormattedString

In [39]: country_codes_regex = ("(AFG|ALA|ALB|DZA|ASM|AND|AGO|AIA|ATA|ATG|ARG|ARM|ABW|AUS|AUT|AZE|BHS|BHR|BGD|BRB|BLR|BEL|BLZ|"
   ....:                         "BEN|BMU|BTN|BOL|BES|BIH|BWA|BVT|BRA|IOT|BRN|BGR|BFA|BDI|CPV|KHM|CMR|CAN|CYM|CAF|TCD|CHL|CHN|"
   ....:                         "CXR|CCK|COL|COM|COG|COD|COK|CRI|CIV|HRV|CUB|CUW|CYP|CZE|DNK|DJI|DMA|DOM|ECU|EGY|SLV|GNQ|ERI|"
   ....:                         "EST|SWZ|ETH|FLK|FRO|FJI|FIN|FRA|GUF|PYF|ATF|GAB|GMB|GEO|DEU|GHA|GIB|GRC|GRL|GRD|GLP|GUM|GTM|"
   ....:                         "GGY|GIN|GNB|GUY|HTI|HMD|VAT|HND|HKG|HUN|ISL|IND|IDN|IRN|IRQ|IRL|IMN|ISR|ITA|JAM|JPN|JEY|JOR|"
   ....:                         "KAZ|KEN|KIR|PRK|KOR|KWT|KGZ|LAO|LVA|LBN|LSO|LBR|LBY|LIE|LTU|LUX|MAC|MDG|MWI|MYS|MDV|MLI|MLT|"
   ....:                         "MHL|MTQ|MRT|MUS|MYT|MEX|FSM|MDA|MCO|MNG|MNE|MSR|MAR|MOZ|MMR|NAM|NRU|NPL|NLD|NCL|NZL|NIC|NER|"
   ....:                         "NGA|NIU|NFK|MKD|MNP|NOR|OMN|PAK|PLW|PSE|PAN|PNG|PRY|PER|PHL|PCN|POL|PRT|PRI|QAT|REU|ROU|RUS|"
   ....:                         "RWA|BLM|SHN|KNA|LCA|MAF|SPM|VCT|WSM|SMR|STP|SAU|SEN|SRB|SYC|SLE|SGP|SXM|SVK|SVN|SLB|SOM|ZAF|"
   ....:                         "SGS|SSD|ESP|LKA|SDN|SUR|SJM|SWE|CHE|SYR|TWN|TJK|TZA|THA|TLS|TGO|TKL|TON|TTO|TUN|TUR|TKM|TCA|"
   ....:                         "TUV|UGA|UKR|ARE|GBR|USA|UMI|URY|UZB|VUT|VEN|VNM|VGB|VIR|WLF|ESH|YEM|ZMB|ZWE)")
   ....: 

In [40]: address = FormattedString(
   ....:     name="country",
   ....:     pattern=country_codes_regex
   ....: )
   ....: 

In [41]: df_meta = synthesized.MetaExtractor.extract(df_altered, annotations=[address])

In [42]: synthesizer = synthesized.HighDimSynthesizer(df_meta)

In [43]: synthesizer.fit(df_altered)

In [44]: df_synth = synthesizer.sample(len(df_altered))
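A quick sanity check after sampling is to confirm that every value in the annotated column matches the pattern, using Python’s re module. A minimal sketch over a toy subset of the pattern and some hand-written sample values (in practice you would check df_synth["country"] against the full country_codes_regex above):

```python
import re

# toy subset of the country-code alternation used above (illustrative)
pattern = "(GBR|USA|FRA|DEU|PRT)"
samples = ["PRT", "GBR", "FRA"]  # stand-ins for values drawn from df_synth["country"]

# re.fullmatch requires the whole string to match the pattern
print(all(re.fullmatch(pattern, s) for s in samples))  # True
```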