BigID integration

Starting from release 1.10, Synthesized TDK can enrich its transformation process with insights loaded from a BigID instance. Once you have a BigID-enabled distribution, only a few lines of additional configuration are needed to enable this functionality.

BigID support is currently available for TDK CLI only.

Configuration

Obtain BigID API authentication token

A BigID refresh token is required to enable interaction with a BigID instance. Follow the instructions at the BigID developer portal to obtain one.

Configure BigID sensitivity classification

Many classifiers in BigID are regex-based, and TDK produces synthetic data that is extremely close in shape and format to the original data. As a result, the BigID sensitivity classification mechanism can produce false positives, wrongly marking synthetic data as sensitive. To avoid that, TDK output data can be excluded from sensitivity classification by adjusting the sensitivity definition to exclude objects with a certain tag. The tag name and value to apply can be set in the TDK configuration. Consult your BigID user guide for details on how to adjust sensitivity settings to take the tag into account.

Configure TDK

By default, the TDK configuration is read from a file named bigid.yaml in the current directory. The default location can be overridden with the BIGID_CONFIG_FILE environment variable, which can contain an absolute or a relative path. Example:

BIGID_CONFIG_FILE=/home/johndoe/configs/custom_bigid_config.yaml tdk --help
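The lookup rule described above can be mirrored in a small wrapper that fails early when the file is missing. This is a minimal sketch, not part of TDK: the function names resolve_bigid_config and check_bigid_config are hypothetical, and treating an empty BIGID_CONFIG_FILE as unset is an assumption of the sketch rather than documented TDK behavior.

```shell
# Hypothetical helper (not part of TDK): prints the path that the BigID
# configuration is documented to be read from.
resolve_bigid_config() {
    # Use BIGID_CONFIG_FILE when it is set and non-empty (assumption:
    # empty counts as unset), otherwise fall back to the documented default.
    printf '%s\n' "${BIGID_CONFIG_FILE:-bigid.yaml}"
}

# Hypothetical pre-flight check: succeeds only if the resolved file is readable.
check_bigid_config() {
    config_path=$(resolve_bigid_config)
    if [ ! -r "$config_path" ]; then
        echo "BigID config not found or not readable: $config_path" >&2
        return 1
    fi
}
```

A wrapper script could then run, for example, `check_bigid_config && tdk --help`, so that a mistyped path is reported before TDK starts.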

The configuration file supports the settings explained in the table below.

BigID-specific TDK configuration
Setting	Meaning	Default value
auth_token_env	Name of the environment variable to take BigID API auth token from.	BIGID_AUTH_TOKEN
base_url	Base BigID API URL. All API calls are relative to this URL.	-
input_data_source	Name of the input BigID data source which is used to retrieve input tables metadata.	-
classifier_mapping	Multi-level mapping of sensitivity group → sensitivity level → classifier → transformation params	-
output.data_source	Name of the output datasource where output tables metadata will reside.	-
output.tag_name	Name of the tag to apply to the output tables.	-
output.tag_value	Value of the tag to apply to the output tables.	-
scan_max_time_secs	Maximum time, in seconds, that the output datasource scan can take.	-
fallback_to_classifier_regex	Whether to use regular expressions from classifiers if transformation parameters are not specified explicitly in the classifier mapping.	false

Example

Suppose the following configuration is saved in a file named bigid-example.yaml in the current directory:

bigid-example.yaml

bigid:
  auth_token_env: MY_BIGID_AUTH_TOKEN
  base_url: https://my-bigid-host/api/v1
  input_data_source: my_input
  classifier_mapping:
    sensitivity:
      Default:
        High:
          Email:
            type: formatted_string_generator
            pattern: '\w+@\w{5,}\.(com|org|net)'
  output:
    data_source: my_output
    tag_name: synthetic
    tag_value: true
    scan_max_time_secs: 30

The TDK CLI tool can then be invoked as follows:

MY_BIGID_AUTH_TOKEN=<your refresh token here> \
BIGID_CONFIG_FILE=bigid-example.yaml \
tdk <CLI options>

With the configuration above, TDK will fetch metadata from a BigID instance via the API endpoint https://my-bigid-host/api/v1. For authentication, it will use the refresh token stored in the environment variable named MY_BIGID_AUTH_TOKEN.
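Because the token is passed indirectly, through a variable whose name is itself configurable, a missing or misspelled variable is easy to overlook. The following is a minimal sketch of a pre-flight check, not part of TDK: the function name require_token_env is hypothetical, and the name passed to it must be kept in sync with auth_token_env by hand.

```shell
# Hypothetical helper (not part of TDK): succeeds when the variable whose
# NAME is passed as $1 holds a non-empty value. Note that it inspects shell
# variables, so a variable that is set but not exported also passes.
require_token_env() {
    var_name=${1:-BIGID_AUTH_TOKEN}   # documented default of auth_token_env
    # Reject anything that is not a plain variable name before using eval.
    case $var_name in
        [0-9]*|*[!A-Za-z0-9_]*)
            echo "invalid variable name: $var_name" >&2
            return 2 ;;
    esac
    # Indirect lookup: read the value of the variable named by $var_name.
    eval "token_value=\${$var_name:-}"
    if [ -z "$token_value" ]; then
        echo "$var_name is not set; provide a BigID refresh token first" >&2
        return 1
    fi
}
```

With the example configuration, `require_token_env MY_BIGID_AUTH_TOKEN` would be called before the tdk command.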

TDK will look for classifier metadata in the my_input data source of the specified BigID instance. Any table with the High sensitivity level in the Default sensitivity group will be scanned for the presence of Email classifiers. Any column with the Email classifier will be processed with a formatted string generator, which produces output strings that match the \w+@\w{5,}\.(com|org|net) regex.
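To get a feel for which strings this pattern admits, it can be tried against a few sample values. The sketch below is not part of TDK and the function name matches_email_pattern is hypothetical. It rewrites \w as [[:alnum:]_], because POSIX grep does not define \w; that this matches TDK's own interpretation of \w is an assumption.

```shell
# Hypothetical helper (not part of TDK): succeeds when $1 matches, as a whole
# string, a POSIX ERE rendering of the pattern from bigid-example.yaml.
matches_email_pattern() {
    # -E: extended regex, -q: no output, -x: the whole line must match.
    printf '%s\n' "$1" |
        LC_ALL=C grep -Eqx '[[:alnum:]_]+@[[:alnum:]_]{5,}\.(com|org|net)'
}
```

For instance, `matches_email_pattern 'jdoe@example.com'` succeeds, while `matches_email_pattern 'jdoe@mail.com'` fails because the part after @ has fewer than five word characters.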

Once the TDK run completes, the data source named my_output will be rescanned so that the resulting table appears in BigID. After that, the tag synthetic: true will be applied to the output table object.