Type 1: Cross-sectional Data
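Cross-sectional data is the standard time-independent case: each row is an independent observation with no ordering between rows. This is the kind of dataset the HighDimSynthesizer is used on.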
Note
Possible actions:
Synthesize new panel members
Synthesize the same panel members
import pandas as pd
from synthesized import HighDimSynthesizer, MetaExtractor
# Original Data
df = pd.read_csv("claim_prediction.csv")
print(df)
      age  sex     bmi  children  smoker  region      charges  insuranceclaim
0      19    0  27.900         0       1       3  16884.92400               1
1      18    1  33.770         1       0       2   1725.55230               1
2      28    1  33.000         3       0       2   4449.46200               0
3      33    1  22.705         0       0       1  21984.47061               0
4      32    1  28.880         0       0       1   3866.85520               1
...   ...  ...     ...       ...     ...     ...          ...             ...
1333   50    1  30.970         3       0       1  10600.54830               0
1334   18    0  31.920         0       0       0   2205.98080               1
1335   18    0  36.850         0       0       2   1629.83350               1
1336   21    0  25.800         0       0       3   2007.94500               0
1337   61    0  29.070         0       1       1  29141.36030               1
[1338 rows x 8 columns]
df_meta = MetaExtractor.extract(df)
from synthesized.model import DataFrameModel
DataFrameModel(df_meta).fit(df).plot();

synth = HighDimSynthesizer(df_meta)
synth.learn(df_train=df)
Reached Stopping Criteria, finishing training: 100%|██████████
df_synth = synth.synthesize(num_rows=len(df))
print(df_synth)
      age  sex        bmi  children  smoker  region       charges  insuranceclaim
0      41    1  24.320000         2       1       2  19682.501953               1
1      63    0  25.840000         0       0       2  17583.591797               0
2      46    0  26.410000         0       0       3   7333.937500               0
3      58    0  35.725399         3       1       2  47455.164062               1
4      20    0  21.469999         0       0       3   1656.546021               0
...   ...  ...        ...       ...     ...     ...           ...             ...
1333   40    0  36.067009         3       0       1   7745.080078               0
1334   42    0  36.443886         5       1       3  22296.542969               1
1335   34    1  25.741701         2       0       0   6357.533691               0
1336   44    0  34.099998         1       0       3   7381.229980               1
1337   51    1  29.196758         0       0       0   8529.495117               1
[1338 rows x 8 columns]
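A quick way to sanity-check the synthetic output is to compare per-column means against the original data. The following is a minimal sketch that uses only pandas. The helper name compare_means is hypothetical and is not part of the synthesized package, and the two small frames simply hold the first five rows of each printout above as stand-ins for df and df_synth.

```python
import pandas as pd

COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region", "charges", "insuranceclaim"]

# First five rows of the original data, copied from the printout above.
df_head = pd.DataFrame(
    [
        [19, 0, 27.900, 0, 1, 3, 16884.92400, 1],
        [18, 1, 33.770, 1, 0, 2, 1725.55230, 1],
        [28, 1, 33.000, 3, 0, 2, 4449.46200, 0],
        [33, 1, 22.705, 0, 0, 1, 21984.47061, 0],
        [32, 1, 28.880, 0, 0, 1, 3866.85520, 1],
    ],
    columns=COLUMNS,
)

# First five rows of the synthetic data, copied from the printout above.
df_synth_head = pd.DataFrame(
    [
        [41, 1, 24.320000, 2, 1, 2, 19682.501953, 1],
        [63, 0, 25.840000, 0, 0, 2, 17583.591797, 0],
        [46, 0, 26.410000, 0, 0, 3, 7333.937500, 0],
        [58, 0, 35.725399, 3, 1, 2, 47455.164062, 1],
        [20, 0, 21.469999, 0, 0, 3, 1656.546021, 0],
    ],
    columns=COLUMNS,
)


def compare_means(original: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-column means of both frames and their difference."""
    summary = pd.DataFrame(
        {
            "original_mean": original.mean(),
            "synthetic_mean": synthetic.mean(),
        }
    )
    summary["difference"] = summary["synthetic_mean"] - summary["original_mean"]
    return summary


summary = compare_means(df_head, df_synth_head)
print(summary)
```

On the full data the call would be compare_means(df, df_synth). With only five rows per frame the differences are large and not meaningful. Means are also only a first check, since they say nothing about correlations between columns.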