BigID integration
Starting from release 1.10, Synthesized TDK can enrich its transformation process with insights loaded from a BigID instance. Once you have a BigID-enabled distribution, enabling this functionality takes only a few lines of additional configuration.
BigID support is currently available for TDK CLI only.
Configuration
Obtain BigID API authentication token
A BigID refresh token is required to enable interaction with a BigID instance. Follow the instructions at the BigID developer portal to obtain one.
Configure BigID sensitivity classification
Many classifiers in BigID are regex-based, and TDK produces synthetic data that is extremely close in shape and format to the original data. As a result, the BigID sensitivity classification mechanism can produce false positives, wrongly marking synthetic data as sensitive. To avoid that, TDK output data can be excluded from sensitivity classification by adjusting the sensitivity definition to exclude objects with certain tags. The name and value of the tag that TDK applies can be set in the TDK configuration. Consult your BigID user guide for details on how to adjust sensitivity settings to take the tag into account.
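To see why a regex-based classifier can flag synthetic output, consider the minimal sketch below. The classifier regex and both sample values are hypothetical and are not taken from BigID; they only illustrate that a value with the same shape as real data satisfies the same pattern.

```python
import re

# Hypothetical regex-based classifier, standing in for an email classifier.
EMAIL_CLASSIFIER = re.compile(r"[\w.+-]+@[\w-]+\.[a-z]{2,}")


def is_flagged(value: str) -> bool:
    """Return True if the whole value matches the classifier regex."""
    return EMAIL_CLASSIFIER.fullmatch(value) is not None


original = "jane.doe@example.com"  # real-looking source value (made up)
synthetic = "qwkzt@hbxmpr.org"     # synthetic value with the same shape

# Both values are flagged, although only the first resembles real data.
flags = {value: is_flagged(value) for value in (original, synthetic)}
print(flags)
```

Because the classifier cannot tell the two values apart, excluding tagged output objects from the sensitivity definition is the more reliable way to avoid the false positive.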
Configure TDK
By default, the TDK configuration is read from a file named bigid.yaml in the current directory. The default location can be overridden with the BIGID_CONFIG_FILE environment variable, which can contain an absolute or a relative path. Example:
BIGID_CONFIG_FILE=/home/johndoe/configs/custom_bigid_config.yaml tdk --help
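The lookup rule described above can be sketched as follows. This is an illustration, not TDK's actual implementation: the function name resolve_config_path is hypothetical, and resolving a relative path against the current directory is an assumption.

```python
import os
from pathlib import Path

DEFAULT_CONFIG_NAME = "bigid.yaml"


def resolve_config_path(env=None, cwd=None) -> Path:
    """Pick the config file: BIGID_CONFIG_FILE if set, else ./bigid.yaml."""
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else Path(cwd)
    configured = env.get("BIGID_CONFIG_FILE")
    candidate = Path(configured) if configured else Path(DEFAULT_CONFIG_NAME)
    # Assumption: a relative path is interpreted relative to the current directory.
    return candidate if candidate.is_absolute() else cwd / candidate


print(resolve_config_path({}, "/work"))
print(resolve_config_path({"BIGID_CONFIG_FILE": "configs/custom.yaml"}, "/work"))
```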
The configuration file supports a number of settings which are explained in the table below.
Setting | Meaning | Default value |
---|---|---|
auth_token_env | Name of the environment variable to read the BigID API auth token from. | BIGID_AUTH_TOKEN |
base_url | Base BigID API URL. All API calls are relative to this URL. | - |
input_data_source | Name of the input BigID data source used to retrieve input table metadata. | - |
classifier_mapping | Multi-level mapping of sensitivity group → sensitivity level → classifier → transformation parameters. | - |
output.data_source | Name of the output data source where output table metadata will reside. | - |
output.tag_name | Name of the tag to apply to the output tables. | - |
output.tag_value | Value of the tag to apply to the output tables. | - |
scan_max_time_secs | Maximum time, in seconds, that the output data source scan can take. | - |
fallback_to_classifier_regex | Whether to use regular expressions from classifiers if transformation parameters are not specified explicitly in the classifier mapping. | false |
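As a rough illustration of how the two documented defaults and the auth_token_env indirection fit together, here is a small sketch. The helper names with_defaults and read_auth_token are hypothetical, and the error raised for a missing token is an assumption about reasonable behaviour rather than what TDK actually does.

```python
import os

# Defaults from the settings table; settings without a default are omitted.
DEFAULTS = {
    "auth_token_env": "BIGID_AUTH_TOKEN",
    "fallback_to_classifier_regex": False,
}


def with_defaults(settings: dict) -> dict:
    """Overlay user-provided settings on top of the documented defaults."""
    return {**DEFAULTS, **settings}


def read_auth_token(settings: dict, env=None) -> str:
    """Read the token from the environment variable named by auth_token_env."""
    env = os.environ if env is None else env
    var_name = with_defaults(settings)["auth_token_env"]
    token = env.get(var_name)
    if not token:
        # Assumption: a missing token is treated as an error.
        raise RuntimeError(f"environment variable {var_name} is not set")
    return token


settings = {"auth_token_env": "MY_BIGID_AUTH_TOKEN"}
print(read_auth_token(settings, {"MY_BIGID_AUTH_TOKEN": "example-token"}))
```

Keeping the token in an environment variable, and only its name in the file, means the configuration file itself never contains the secret.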
Example
Suppose the following configuration is saved in a file named bigid-example.yaml in the current directory:
bigid:
  auth_token_env: MY_BIGID_AUTH_TOKEN
  base_url: https://my-bigid-host/api/v1
  input_data_source: my_input
  classifier_mapping:
    sensitivity:
      Default:
        High:
          Email:
            type: formatted_string_generator
            pattern: '\w+@\w{5,}\.(com|org|net)'
  output:
    data_source: my_output
    tag_name: synthetic
    tag_value: true
  scan_max_time_secs: 30
The TDK CLI can then be invoked as follows:
MY_BIGID_AUTH_TOKEN=<your refresh token here> \
BIGID_CONFIG_FILE=bigid-example.yaml \
tdk <CLI options>
With the configuration above, TDK will fetch metadata from a BigID instance via the API endpoint https://my-bigid-host/api/v1. For authentication, it will use the refresh token stored in the environment variable named MY_BIGID_AUTH_TOKEN.
TDK will look for classifier metadata using the my_input data source in the specified BigID instance. Any table with the High sensitivity level in the Default sensitivity group will be scanned for the presence of Email classifiers. Any column with the Email classifier will be processed with a formatted string generator, with output strings formatted according to the \w+@\w{5,}\.(com|org|net) regex.
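The group → level → classifier lookup described here can be sketched with a nested dictionary that mirrors the example classifier_mapping. The function find_transformation and the sample strings are hypothetical; this only illustrates the lookup order and the shape of values the pattern accepts, not how TDK resolves mappings internally.

```python
import re

# Mirrors the classifier_mapping from the example configuration.
CLASSIFIER_MAPPING = {
    "sensitivity": {
        "Default": {
            "High": {
                "Email": {
                    "type": "formatted_string_generator",
                    "pattern": r"\w+@\w{5,}\.(com|org|net)",
                }
            }
        }
    }
}


def find_transformation(mapping, group, level, classifier):
    """Walk group -> level -> classifier; return the params or None."""
    node = mapping.get("sensitivity", {})
    for key in (group, level, classifier):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


params = find_transformation(CLASSIFIER_MAPPING, "Default", "High", "Email")
print(params["type"])

# The pattern requires at least five word characters between "@" and the dot.
for sample in ("alice@example.org", "bob@abc.com"):
    print(sample, re.fullmatch(params["pattern"], sample) is not None)
```

A lookup for a group, level, or classifier that is absent from the mapping returns None in this sketch, which corresponds to the case where no transformation parameters are specified explicitly.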
When the TDK run completes, the data source named my_output will be rescanned so that the resulting table appears. After that, the tag synthetic: true will be applied to the output table object.
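One way to picture the role of scan_max_time_secs is a bounded wait: poll until the rescan finishes or the time budget runs out. The sketch below is an assumption about the general pattern only. The function wait_for_scan, its is_scan_done callback, and the polling interval are all hypothetical and do not describe TDK's or BigID's actual API.

```python
import time


def wait_for_scan(is_scan_done, max_time_secs, poll_interval_secs=1.0,
                  clock=time.monotonic, sleep=time.sleep) -> bool:
    """Poll is_scan_done() until it returns True or max_time_secs elapses."""
    deadline = clock() + max_time_secs
    while True:
        if is_scan_done():
            return True   # scan finished within the time budget
        if clock() >= deadline:
            return False  # time budget exhausted
        sleep(poll_interval_secs)


# Usage with a stand-in check that succeeds on the third poll.
attempts = []


def fake_scan_done() -> bool:
    attempts.append(1)
    return len(attempts) >= 3


print(wait_for_scan(fake_scan_done, max_time_secs=30, poll_interval_secs=0.01))
```

The clock and sleep parameters are injectable so that the timeout behaviour can be exercised without real waiting.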