FAQs
Basics
Does the SDK support time-series data generation, preserving relationships both between features and over time in the resultant synthetic output?
Yes! We support time-series data with measurements taken at regular and irregular intervals. See more about our TimeSeriesSynthesizer and EventSynthesizer here.
Data Quality
When synthesizing data, do you have a mechanism for ensuring that the synthesized result maintains relationships between attributes, e.g. Attribute C = Attribute A * Attribute B?
By default, the SDK learns the correlations present in the original data and reproduces them when generating synthetic data. However, it is sometimes helpful for a user to describe a set of pre-determined rules that constrain the output of the SDK and rule out impossible relationships in the synthetic data. For this, the SDK lets users define configurable expressions, associations and rules that are enforced on the output synthetic data.
-
Expressions such as "col1 = 2 * col2 + col3" can be used so that, instead of synthesizing column "col1", the HighDimSynthesizer computes its value from "col2" and "col3".
-
Strict associations such as (make, model) can also be passed to the model to make sure the output doesn’t contain impossible combinations such as (make, model) = ("Volkswagen", "Fiesta") or (make, model) = ("Ford", "Polo").
-
More complex rules such as "x1 < x2 - 3" or "x3 in ('A', 'B', 'C')" can also be configured, and the model will only output data that satisfies the given rules.
For more information on adding constraints to the synthetic data see Rules.
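To make the three kinds of constraint concrete, here is a minimal plain-Python sketch of what each one means for a single row. It does not use the SDK's API, and how the SDK enforces constraints internally is not shown: the function names, the sample rows and the set of allowed (make, model) pairs are all made up for illustration.

```python
# Illustration only: plain Python, not the Synthesized SDK API.
# All names below (ALLOWED_PAIRS, apply_expression, ...) are hypothetical.

# Association: only these (make, model) pairs are considered possible.
ALLOWED_PAIRS = {("Volkswagen", "Polo"), ("Ford", "Fiesta")}


def apply_expression(row):
    """Expression: col1 is computed from col2 and col3 rather than generated."""
    out = dict(row)
    out["col1"] = 2 * out["col2"] + out["col3"]
    return out


def satisfies_association(row):
    """Association: reject impossible (make, model) combinations."""
    return (row["make"], row["model"]) in ALLOWED_PAIRS


def satisfies_rules(row):
    """Rules: x1 < x2 - 3 and x3 in ('A', 'B', 'C')."""
    return row["x1"] < row["x2"] - 3 and row["x3"] in ("A", "B", "C")


candidates = [
    {"col2": 1, "col3": 4, "make": "Ford", "model": "Fiesta", "x1": 0, "x2": 5, "x3": "A"},
    {"col2": 2, "col3": 0, "make": "Ford", "model": "Polo", "x1": 0, "x2": 5, "x3": "B"},
    {"col2": 3, "col3": 1, "make": "Volkswagen", "model": "Polo", "x1": 4, "x2": 5, "x3": "C"},
]

# Keep only rows that respect the association and the rules,
# then fill in col1 from the expression.
valid_rows = [
    apply_expression(r)
    for r in candidates
    if satisfies_association(r) and satisfies_rules(r)
]
```

Only the first candidate survives: the second pairs "Ford" with "Polo", and the third violates "x1 < x2 - 3".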
Does rebalancing data introduce bias, i.e. when upsampling a minority class using the ConditionalSampler, is a model trained on this rebalanced dataset now biased towards this minority class?
No, in fact without rebalancing a model is likely to be biased against underrepresented groups within the dataset.
When large dataset imbalances are present, ML models can achieve apparently stellar performance just by predicting everything as the majority outcome. For example, in a dataset detailing bank transactions where 99% are legitimate transactions, such a model would have an accuracy of 99%. While these results might look near optimal, they are much less impressive once the imbalance is taken into account: the real value lies in correctly identifying the transactions that were fraudulent, and such a model identifies none of them.
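The arithmetic behind the 99% figure can be checked with a few lines of Python. The labels below are made up to match the example (1% fraudulent), and the "model" is simply a constant prediction of the majority class.

```python
# Worked example of the 99% figure, using made-up labels.
# 1 = fraudulent, 0 = legitimate.
n_total = 10_000
n_fraud = 100  # 1% of transactions

y_true = [1] * n_fraud + [0] * (n_total - n_fraud)
y_pred = [0] * n_total  # a "model" that always predicts the majority class

# Accuracy: share of all transactions labelled correctly.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / n_total

# Recall on the fraud class: share of fraudulent transactions that were caught.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fraud_recall = caught / n_fraud

print(f"accuracy = {accuracy:.2%}, fraud recall = {fraud_recall:.2%}")
```

Accuracy alone hides the failure; a per-class metric such as recall on the fraud class makes it visible.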
Synthesized has done a lot of work and research on mitigating biases in original data. We have published a number of articles and projects such as Fairlens[https://www.synthesized.io/fairlens] around this topic. The SDK offers a scalable rebalancing solution that goes beyond simple rebalancing of individual classes, and enables tweaking and reshaping of arbitrary groups within a dataset. This allows users to generate a range of custom scenarios for testing and development purposes.
Analysis and Evaluation
Do you offer analysis of the synthetic dataset to demonstrate that the characteristics of the dataset are aligned with those of the original dataset?
Yes, Synthesized provides two ways to evaluate the quality of generated data:
-
Statistical Resemblance
Synthesized provides an extensive list of metrics to evaluate the statistical resemblance of both original and synthesized datasets. Metrics are related to distribution distances (Kolmogorov–Smirnov test and Earth Mover’s Distance) and correlation distances (Kendall’s Tau and Cramer’s V among others).
-
Data Utility
Synthesized also provides a framework of metrics to compare the performance of Machine Learning models on both original and synthetic datasets. All steps in model performance evaluation (data pre-processing, training the model, computing the metric, and so on) happen under the hood but can be customized.
These frameworks are also available with a plotting and visualization framework to make the evaluation experience more user-friendly.
See the Data Evaluation documentation for more details.
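As an illustration of a distribution distance, the sketch below computes the two-sample Kolmogorov–Smirnov statistic (the largest gap between two empirical CDFs) in plain Python. It is not the SDK's evaluation code; the function name and sample values are hypothetical, and in practice a library routine such as `scipy.stats.ks_2samp` would normally be used.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every value that appears in either sample.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Made-up columns: one original, one similar synthetic, one very different.
original = [1, 2, 3, 4, 5]
similar = [1, 2, 3, 4, 6]
different = [6, 7, 8, 9, 10]

d_similar = ks_statistic(original, similar)
d_different = ks_statistic(original, different)
```

A value near 0 means the two columns have almost the same distribution; a value of 1 means their ranges do not overlap at all.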
Does your solution include built-in privacy statistics to easily compare synthetic output with original, e.g. closest points?
Yes, Synthesized provides a framework to evaluate the privacy in the generated data, and ensure data quality is maintained.
For privacy, the user can:
-
Simulate different attribute inference attacks using CAP (Correct Attribution Probability) or ML (Machine Learning) models.
-
Simulate a linkage attack for different values of t-closeness and k-distance, highlighting vulnerable groups within the data.
These frameworks are also available with a plotting and visualization framework to make the evaluation experience more user-friendly.
See our Privacy Evaluation documentation for more details.
Is there a mechanism for a user to perform visual checks on the synthetic, de-sensitized data and confirm its security?
Yes, Synthesized provides visualisations of linkage attacks on output synthetic data to highlight vulnerable groups.
Privacy and de-identification of PII
Do you offer differential privacy?
Yes, Synthesized can generate a synthetic data "twin". Users are given the option of specifying differential privacy requirements in the generation procedure.
Read more in our Differential Privacy guide.
Does the SDK have an option to guarantee that no synthetic data points coincide with any real data point?
Yes. Generally speaking, privacy in our solution is ensured by design, since data is sampled from random noise and transformed into a realistic distribution.
In order to ensure validity of this design we perform a number of statistical attacks to demonstrate that values are truly synthetic.
However, let’s consider the following case: imagine original data containing only one column which has few categorical values, say true and false. By random chance it might be the case that output synthetic data will match the original data (note that random generation is considered totally secure).
This means that when data has low cardinality (the ratio of the number of unique values in a column to the number of rows in the table), some records produced by our Synthesizer can duplicate original records by chance. However, to completely guard against the possibility of any duplicated records in the synthetic and original data, we provide the Sanitizer class, which removes synthetic data output that matches the original.
For more information on the Sanitizer class, see the Strict Synthesis guide.
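The idea of dropping synthetic records that coincide with original ones can be sketched in a few lines of plain Python. This is only an illustration of exact-match filtering, not the Sanitizer class or its implementation; the rows and variable names are made up.

```python
# Illustration only: not the Sanitizer class or its implementation.
original_rows = [
    ("alice", 34, True),
    ("bob", 51, False),
]
synthetic_rows = [
    ("carol", 29, True),
    ("bob", 51, False),  # coincides with an original record by chance
    ("dave", 40, False),
]

# A set gives constant-time membership checks on whole rows.
original_set = set(original_rows)

# Keep only synthetic rows that do not appear verbatim in the original data.
sanitized_rows = [row for row in synthetic_rows if row not in original_set]
```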
Does the SDK support attribute-level data anonymization where information relating to a data subject (e.g. a client's name) is removed, thereby eliminating the possibility of identifying the data subject?
Yes, Synthesized supports two ways of attribute-level anonymization.
-
Data obfuscation
If needed, the user can obfuscate any data with any of the following techniques:
-
Nulling. The contents of a column can be completely removed, and the output dataset would contain an empty column.
-
Random strings. Generate random strings with similar format to input values, for example "490GH830L" could be transformed into "L3N8O3H2M".
-
Generalization. Individual values of attributes are replaced with a broader category. For example, the value '19' of the attribute 'Age' may be replaced by ' ≤ 20', the value '23' by '20 < Age ≤ 30' , etc.
For more details refer to the Privacy Masks documentation.
-
Fake data
Synthesized also supports the substitution of production values with realistic "fake" data. Continuity can be maintained across attributes in a row using the Synthesized annotation feature (e.g. a real name will be replaced by a "fake" name and this same "fake" name can be used to create a "fake" email address etc).
Generated "fake" data is coherent across columns.
For example, the combination of "title", "first_name", "last_name", "gender" and "email" describes a unique person in a dataset, and there is a strict relationship between these attributes: when "title" is "Mrs" or "Ms", "gender" will be "Female" and "first_name" will contain a female name.
When it is important to maintain the correct description of an entity in the generated synthetic data, the dataset can be easily annotated to link the appropriate fields together. When annotated, Synthesized will learn to generate realistic entities as a whole, rather than independently generating individual attributes.
There are currently four annotation entities:
-
Person: Labels such as title, gender, fullname, firstname, lastname, email, username, password, mobile_number, home_number, work_number can be annotated and generated.
-
Address: Labels such as postcode, county, city, district, street, house, flat, house_name, full_address can be annotated and generated. Available on-demand, we can also provide a dictionary of real UK addresses, so that the generated addresses are real (although still independent from the original dataset).
-
Bank Account: Labels such as bic, sort_code, account can be annotated and generated.
-
Formatted String: For a more flexible string generation one can annotate a column as a formatted string, give it a pattern (in form of regular expression), and the software will generate random strings based on the given regular expression. For example, pattern="[A-Z]{4}-[0-9]{3}-[A-Z0-9]{6}" may generate "KWNF-971-K20X8B" or any other string that follows that pattern.
Some of the fake values generated for person and addresses are integrated into the product, while other values come from random data generators. Real addresses can be sampled from a given dictionary, and Synthesized can provide a dictionary of real UK addresses.
In Entity Annotation we explain how you can annotate your attributes to apply fake data instead of synthetic data.
-
Both processes are irreversible: with access to the synthetic data alone, one cannot reverse-engineer it to obtain the original values.
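The three obfuscation techniques listed above (nulling, random strings and generalization) can be sketched in plain Python as follows. This is an illustration of the ideas rather than the SDK's Privacy Masks; the function names are hypothetical, and "similar format" is assumed here to mean same length, upper-case letters and digits, with separators kept in place.

```python
import random
import string


def null_out(value):
    """Nulling: drop the value entirely."""
    return None


def random_string_like(value, rng):
    """Random strings: same length, random upper-case letters and digits."""
    alphabet = string.ascii_uppercase + string.digits
    return "".join(
        rng.choice(alphabet) if ch.isalnum() else ch  # keep separators as-is
        for ch in value
    )


def generalize_age(age):
    """Generalization: replace an exact age with a broader band."""
    if age <= 20:
        return "≤ 20"
    lower = ((age - 1) // 10) * 10
    return f"{lower} < Age ≤ {lower + 10}"


rng = random.Random(0)  # seeded only to make the example repeatable
masked_id = random_string_like("490GH830L", rng)
age_bands = [generalize_age(a) for a in (19, 23, 30, 31)]
```

None of the three functions keeps enough information to recover the input, which is what makes this kind of masking irreversible.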
How can the SDK ensure that "fake" values maintain the characteristics of the original values? For instance, when replacing a Credit Card Number (CCN), is an American Express CCN replaced by a fake American Express CCN, a Visa CCN by a fake Visa CCN, and so on?
Synthesized ensures that synthetic data related to collections of columns maintains the characteristics of the replaced values using custom business rules available as part of the interface. For complex cases, the user can use the FormattedString annotation to ensure data is generated following specific patterns.
For example, a synthetic American Express CCN could be generated with pattern="3\d{3} \d{6} \d{5}", a Visa CCN with pattern="4\d{3} \d{4} \d{4} \d{4}", and a MasterCard CCN with pattern="[25]\d{3} \d{4} \d{4} \d{4}". For more information on the use of the FormattedString annotation see Entity Annotation.
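The sketch below shows how the three patterns quoted above keep card brands distinct, using only Python's standard `re` and `random` modules. It does not use the FormattedString annotation itself; `fake_ccn` and `brand_of` are hypothetical helpers, and the generator is written by hand for these three patterns rather than derived from the regular expressions.

```python
import random
import re

# Patterns quoted in the text above.
PATTERNS = {
    "American Express": r"3\d{3} \d{6} \d{5}",
    "Visa": r"4\d{3} \d{4} \d{4} \d{4}",
    "MasterCard": r"[25]\d{3} \d{4} \d{4} \d{4}",
}


def digits(n, rng):
    """Return n random decimal digits as a string."""
    return "".join(rng.choice("0123456789") for _ in range(n))


def fake_ccn(brand, rng):
    """Hand-written generator for each pattern (hypothetical helper, not the SDK)."""
    if brand == "American Express":
        return f"3{digits(3, rng)} {digits(6, rng)} {digits(5, rng)}"
    if brand == "Visa":
        first = "4"
    elif brand == "MasterCard":
        first = rng.choice("25")
    else:
        raise ValueError(f"unknown brand: {brand}")
    return f"{first}{digits(3, rng)} {digits(4, rng)} {digits(4, rng)} {digits(4, rng)}"


def brand_of(ccn):
    """Return the brand whose pattern matches the whole string, or None."""
    for brand, pattern in PATTERNS.items():
        if re.fullmatch(pattern, ccn):
            return brand
    return None


rng = random.Random(42)  # seeded only to make the example repeatable
replacements = {brand: fake_ccn(brand, rng) for brand in PATTERNS}
```

Because each pattern fixes the leading digit and the grouping, a value generated for one brand never matches another brand's pattern. These strings only follow the format; they are not checked against the Luhn checksum.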
If the SDK includes its own catalogs of "fake" sensitive values, does it allow for consistent "faking" across related sensitive attributes? For example, if a multi-part client address (split into separate attributes for street, city, state, zip code, etc.) were to be faked, in changing the city from Berlin to Munich, would the postal code be similarly faked to a Munich postal code?
Yes. Addresses and names are produced in a consistent way, i.e. addresses created within specific areas will have the correct postcode, first names are consistent with generated genders, and so on. During synthesis, related attributes are generated conditionally. For instance, for addresses we first generate a postcode and then sample an address from that postcode. For names, we first synthesize gender and then synthesize title and names depending on that.
For more information on how related sensitive fields are handled see Entity Annotation.
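The conditional order described above (postcode first, then an address within it; gender first, then title and name) can be sketched as follows. The lookup tables and function names are invented for this example and are not the SDK's catalogs or API.

```python
import random

# Hypothetical lookup tables, invented for this example.
ADDRESSES_BY_POSTCODE = {
    "10115": [("Berlin", "Example Street 1"), ("Berlin", "Sample Road 20")],
    "80331": [("Munich", "Example Square 8"), ("Munich", "Sample Lane 3")],
}
TITLES_BY_GENDER = {"Female": ["Mrs", "Ms"], "Male": ["Mr"]}
FIRST_NAMES_BY_GENDER = {"Female": ["Anna", "Maria"], "Male": ["Jonas", "Paul"]}


def fake_address(rng):
    """Pick the postcode first, then an address that belongs to it."""
    postcode = rng.choice(sorted(ADDRESSES_BY_POSTCODE))
    city, street = rng.choice(ADDRESSES_BY_POSTCODE[postcode])
    return {"postcode": postcode, "city": city, "street": street}


def fake_person(rng):
    """Pick the gender first, then a title and first name conditioned on it."""
    gender = rng.choice(sorted(TITLES_BY_GENDER))
    return {
        "gender": gender,
        "title": rng.choice(TITLES_BY_GENDER[gender]),
        "first_name": rng.choice(FIRST_NAMES_BY_GENDER[gender]),
    }


rng = random.Random(7)  # seeded only to make the example repeatable
address = fake_address(rng)
person = fake_person(rng)
```

Because the city and street are only ever drawn from the entry for the chosen postcode, a Munich city can never be paired with a Berlin postcode, and the same holds for titles and names with respect to gender.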
Does the SDK have an option to guarantee that (pseudo-)identifying combinations of values are removed from the synthetic data?
Yes, this is the main advantage of synthetic data over approaches such as masking and generalisation. Masked/generalised data can still contain a combination of attributes that can be used to link records statistically. Synthetic data doesn’t contain real rows of data at all: every subset of attributes is generated and does not correspond to any real record.
To validate this property internally, we perform statistical attacks on synthetic data to demonstrate that it cannot be statistically linked back to the original data.
Deployment and Integrations
Does the SDK support container installation?
Yes, the SDK is available as a Docker image. See here for more details.
Does the SDK support on-premises installation?
Yes.
Does the SDK support cloud installation?
Yes, Synthesized supports MS Azure, AWS and GCP private cloud installations.
If additional consultancy services are needed, is this something you offer and can you outline the costs?
Yes, Synthesized end user training is provided as part of the subscription. We structure training in blocks of 2-hour sessions for up to 5 people. Previous experience shows end users can be trained in less than 4 hours.
Admin training is provided during the installation and configuration phase. Additional admin related questions and issues are managed as standard support tickets, included as part of the subscription cost. If required, additional admin training can be provided free of charge.
Synthesized has a dedicated Customer Success team to provide our customers with best-in-class support.
Additional Synthesized end user training (up to 5 users, delivered remotely) can be tailored and delivered as 2-hour blocks priced at £1,000 each.
Consultancy services, should they be required, are available and priced at £1,200 per day per engineer.