Spark/PySpark allows for fast, large-scale data processing across the machines of a cluster. For now, the Synthesized package is designed to run on a single machine with all data held in memory. To incorporate it into a Spark pipeline, users need to collect a sample of the data they are interested in onto a single machine and train the Synthesizer there.
Synthesized requires data to be in the pandas.DataFrame format. The PySpark method toPandas converts a Spark DataFrame into the required format, and spark.createDataFrame converts a synthesized pandas.DataFrame back into a Spark DataFrame.
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# create a simple Spark DataFrame
df_spark = spark.createDataFrame([
    Row(a=1, b=2., c='string1'),
    Row(a=2, b=3., c='string2'),
    Row(a=4, b=5., c='string3'),
    Row(a=1, b=2., c='string1'),
    Row(a=2, b=3., c='string2'),
    Row(a=4, b=5., c='string3'),
    Row(a=1, b=2., c='string1'),
    Row(a=2, b=3., c='string2'),
    Row(a=4, b=5., c='string3'),
])

# sample a subset, collect it on the driver and convert to pandas
df = df_spark.sample(fraction=2/3).toPandas()
Now that we have a sample of the data in a pandas.DataFrame, we can use the standard workflow for the Synthesized package, documented in the Single Table Guide.
import synthesized

df_meta = synthesized.MetaExtractor.extract(df)
synth = synthesized.HighDimSynthesizer(df_meta)
synth.fit(df)
df_synth = synth.sample(num_rows=100)

# convert back to Spark
df_synth_spark = spark.createDataFrame(df_synth)
Incorporating into a distributed pipeline
The key step in this process is:
df = df_spark.sample(fraction=2/3).toPandas()
Here we take a sample of the data and load it into memory on the driver node of the cluster. It is therefore important that the sampled data fits comfortably in the driver's memory and that the driver node has enough processing power to train the Synthesizer. If it is preferable to run the Synthesizer on a worker node instead, the Synthesized package can be set up on that node and a similar script executed there remotely.
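On a real dataset a fixed fraction such as 2/3 will rarely be appropriate, so one option is to derive the fraction from a target number of rows that is known to fit on the driver. The following is a minimal sketch of that idea, not part of the Synthesized package: the helper name sample_fraction and the target of 100,000 rows are hypothetical, and the Spark calls appear only as comments so the helper can be run on its own.

```python
def sample_fraction(total_rows: int, target_rows: int) -> float:
    """Return a sampling fraction in (0, 1] that yields roughly target_rows rows.

    Hypothetical helper for illustration only.
    """
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    if target_rows <= 0:
        raise ValueError("target_rows must be positive")
    # Never ask for more than the whole dataset.
    return min(1.0, target_rows / total_rows)


# Possible use with the df_spark DataFrame from the example above
# (assumes a running Spark session; 100_000 is an arbitrary target):
#
#   target_rows = 100_000
#   fraction = sample_fraction(df_spark.count(), target_rows)
#   df = (
#       df_spark.sample(fraction=fraction, seed=42)
#       .limit(target_rows)  # sample() is approximate, so cap the row count
#       .toPandas()
#   )
```

Spark's sample does not guarantee an exact number of rows, which is why the sketch adds limit as a hard cap before collecting to the driver. Note that df_spark.count() triggers a full pass over the data, so a cached or previously known row count may be preferable.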