Data Quality Automation

Many data science and machine learning projects never mature simply because they lack high-quality data. At Synthesized, we have identified the issues that lead to low-quality data and developed a number of features to ensure our synthetic data has the highest possible utility, including:

  • Data Rebalancing: when training classification models, it is often desirable to use a dataset in which the classes are equally represented. In practice, however, data is frequently highly unbalanced, which can degrade model accuracy and lead to biased results. Using the ConditionalSampler, a synthetic dataset can be generated with user-defined marginal distributions while retaining the statistical properties of the raw dataset as closely as possible.

  • Rules: just as the ConditionalSampler can be used to control the distributions of classes within the synthetic data, various rules can be enforced to constrain the values and correlations present in a synthetic dataset. While the HighDimSynthesizer is highly optimised and will generally learn the correlations present in an input dataset, it may not always learn deterministic rules and can occasionally generate synthetic data with impossible results. Pre-defining these deterministic rules prevents such cases and allows the information in a synthetic dataset to be tuned for a custom scenario.

  • Annotations: certain fields within a dataset are often linked. For example, the combination of ('title', 'first_name', 'last_name', 'gender', 'email') describes a unique person. Furthermore, the information in these fields is often interrelated: in the above example, an email address is often composed of a first name and a last name. To model linked fields appropriately, a user can annotate a set of columns as comprising one unique entity. When annotated, Synthesized will learn to generate realistic entities as a whole, rather than generating individual attributes independently.

  • Data Imputation: missing values can severely degrade the utility of a training dataset, especially when they are clustered in similar regions of the data. The DataImputer can be used to fill in missing values in your dataset with values that accurately reflect the population distributions.
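
To make two of the problems above concrete, the following is a minimal sketch in plain Python that measures class imbalance and finds rows breaking a deterministic rule. It does not use the Synthesized SDK: the helper names class_proportions and rule_violations, the toy rows, and the rule itself are all hypothetical and serve only to illustrate the kind of check that rebalancing and rules are meant to address.

```python
from collections import Counter


def class_proportions(rows, column):
    """Return the share of rows belonging to each class in `column`."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def rule_violations(rows, rule):
    """Return the rows for which the deterministic `rule` does not hold."""
    return [row for row in rows if not rule(row)]


# Hypothetical toy data: an unbalanced 'fraud' label and one impossible row.
rows = [
    {"age": 34, "years_employed": 10, "fraud": 0},
    {"age": 51, "years_employed": 25, "fraud": 0},
    {"age": 29, "years_employed": 3, "fraud": 0},
    {"age": 22, "years_employed": 30, "fraud": 1},  # employed longer than alive
]

# Share of each class; a large gap between classes signals imbalance.
proportions = class_proportions(rows, "fraud")

# Assumed rule for illustration: nobody is employed for longer than their age.
violations = rule_violations(rows, lambda r: r["years_employed"] < r["age"])
```

Running the same two checks on a raw dataset and on its synthetic counterpart is one simple way to confirm that the chosen marginals and rules have had the intended effect.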

Overriding default behaviour

The SDK will automatically infer data types and model the data accordingly. For example, the SDK will automatically detect that a column of strings is actually a datetime, or that a set of floating point values should actually be interpreted as integers. However, there are occasions where a user may want to adjust the default behaviour of the SDK, either overriding how the data is interpreted or how it is modelled. For more information on overriding the default behaviour of the SDK see Overrides.
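
The idea of inference followed by an explicit override can be sketched in plain Python as below. This is only an illustration of the concept, not the SDK's implementation or API: the function names, the single assumed date format, the type labels, and the overrides dictionary are all hypothetical.

```python
from datetime import datetime


def looks_like_datetime(values, fmt="%Y-%m-%d"):
    """True if there is at least one non-empty string and all of them parse with `fmt`."""
    present = [v for v in values if v]
    if not present:
        return False
    for v in present:
        try:
            datetime.strptime(v, fmt)
        except ValueError:
            return False
    return True


def looks_like_integer(values):
    """True if every float in `values` has no fractional part."""
    return all(float(v).is_integer() for v in values)


def infer_column_type(name, values, overrides=None):
    """Infer a type label for a column, unless the user has overridden it."""
    if overrides and name in overrides:
        return overrides[name]  # the user's choice wins over inference
    if all(isinstance(v, str) for v in values):
        return "datetime" if looks_like_datetime(values) else "string"
    if all(isinstance(v, float) for v in values):
        return "integer" if looks_like_integer(values) else "float"
    return "unknown"


# Inferred by default, then overridden for the hypothetical 'count' column.
inferred = infer_column_type("count", [1.0, 2.0, 3.0])
overridden = infer_column_type("count", [1.0, 2.0, 3.0], overrides={"count": "float"})
```

Here the default inference treats whole-valued floats as integers, while the override keeps the column as a float, which mirrors the kind of adjustment a user might want when the automatic interpretation does not match their intent.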