Batch processing for large datasets

Sometimes you will need to work with more data than can comfortably fit and be processed on the resources available - e.g., synthesizing a 50GB dataset on a machine with only 16GB of RAM. There are a few ways to tackle this problem, listed here in order of preference:

1. Use a machine with more resources

The easiest solution, and often the fastest to implement, is simply to scale up the machine running the software. This is particularly applicable to cloud-based deployments, where a larger machine can often be initialized at the click of a button and be available within a few minutes.

If you cannot spin up a larger instance (or doing so would take too long), or your data is so large that it cannot fit on any single available instance, consider one of the following options.

2. Take a representative subset of the data

The synthesizer tends not to require huge quantities (e.g., tens of GBs) of data to detect and replicate the trends seen within a dataset; usually a few hundred MBs is enough for most complex trends. The synthesizer will automatically terminate training when it has detected trends to a high enough threshold. After this point, re-training on new data will generally have little effect on the quality of the output data: what usually happens is that the synthesizer fits to the new set of data while losing the relationships it learned from the previous set. If you have randomly split your data, the relationships being learned will be the same because both sets of data will be representative, so overall there should not be a statistically significant difference.

Training on larger datasets also takes more time and, as noted above, beyond a certain point tends not to give improved data quality. It is therefore recommended to take a representative subset of a large dataset and use that for training: data quality will generally not be affected, and training will progress faster, allowing data teams to analyze the data sooner and reducing iteration times when fine-tuning the SDK.

For very large datasets, representative samples can usually be achieved by taking random samples. To be more stringent about ensuring distributions are preserved (e.g., if you do not believe random sampling will guarantee enough representation), stratified sampling can be applied. For use cases where minority cases (e.g., fraud detection) are the main interest, all of the minority cases can be included in the sample, especially if their representation in the original dataset is so sparse that random sampling will not guarantee their presence. A chunk-by-chunk approach to building such a sample is sketched below.
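
For illustration, the following is a minimal sketch of building such a subset chunk by chunk with pandas, keeping every minority case while randomly sampling the rest. The file name, the 5% sampling fraction, and the is_fraud column are assumptions made for this example, not part of the SDK.

import pandas as pd

sample_fraction = 0.05  # illustrative sampling fraction
parts = []

with pd.read_csv("my_data.csv", iterator=True, chunksize=100_000) as chunks:
    for chunk in chunks:
        # Keep every minority case (here a hypothetical "is_fraud" flag) ...
        minority = chunk[chunk["is_fraud"] == 1]
        # ... and take a random sample of the remaining rows.
        majority = chunk[chunk["is_fraud"] == 0].sample(
            frac=sample_fraction, random_state=42
        )
        parts.append(pd.concat([minority, majority]))

df_sample = pd.concat(parts, ignore_index=True)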

As an exercise, it may be instructive to train one synthesizer on the full dataset using the batch method below, and another on a subset of the data, and then compare the data generated by the two models. You should find the results similar.

3. Batch process the large data

Finally, the SDK offers the ability to perform batch training. The idea is to load data from disk in chunks and train on each chunk iteratively until the whole dataset has been processed. For large datasets this tends not to be ideal, as the model will learn features from later batches more strongly than from earlier ones, which is why it is often preferable to use a representative subset instead (see above).

Batch training should happen in two stages: First, the metadata for the entire dataset should be extracted. Second, the data patterns should be learned (i.e., the synthesizer trained). Once the synthesizer has been trained, synthetic data generation can proceed as before.

Example 1. Batch Training Script:
import pandas as pd
from synthesized import MetaExtractor, HighDimSynthesizer

# Variable setup
file_path = "my_data.csv"
number_of_rows_in_batch = 100_000

# First: Extract all metadata in batches
with pd.read_csv(file_path, iterator=True, chunksize=number_of_rows_in_batch) as data_chunks:
    df_meta = None
    for chunk_number, df in enumerate(data_chunks):
        print(f"Extracting meta from chunk {chunk_number} of the data. Chunk size {len(df)}")
        if df_meta is None:
            df_meta = MetaExtractor.extract(df)
        else:
            df_meta.update_meta(df)

# Second: Train the synthesizer in batches
synth = HighDimSynthesizer(df_meta)
with pd.read_csv(file_path, iterator=True, chunksize=number_of_rows_in_batch) as data_chunks:
    for chunk_number, df in enumerate(data_chunks):
        print(f"Training synthesizer on chunk {chunk_number} of the data. Chunk size {len(df)}")
        synth.learn(df)

# Finally: Synthesize data as normal
df_synth = synth.synthesize(1000)
To save large quantities of synthetic data, generate in batches and write each batch to disk in a similar fashion.
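
A minimal sketch of that pattern, continuing from the training script above (the output path and row counts are illustrative assumptions):

output_path = "my_synthetic_data.csv"   # illustrative output file
rows_per_batch = 100_000
total_rows = 10_000_000

for batch_number in range(total_rows // rows_per_batch):
    # Generate one batch of synthetic rows with the trained synthesizer
    df_batch = synth.synthesize(rows_per_batch)
    # Append the batch to disk, writing the header only for the first batch
    df_batch.to_csv(
        output_path,
        mode="a",
        header=(batch_number == 0),
        index=False,
    )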