CSV Support
CSV (Comma Separated Values) is a textual file format used to store tabular data. Values are separated by commas or other delimiters, and each new row is on a new line. TDK supports CSV-to-CSV data transformations.
Directory layout
TDK reads data from an input directory and writes the transformed data to an output directory. Input and output locations can be local directories or S3 locations (see S3 Support below). An input directory models a single logical schema with one or more tables.
TDK expects the input directory to have a certain structure. In the root directory, there should be a directory per logical table. Each input table directory must contain at least one file. Different table directories can have different schemas, but all the files in a single table directory should have the same schema. For example, the following directory structure contains three logical tables, users, orders, and products:
my-input-dir/
├── orders/
│   ├── orders1.csv
│   └── orders2.csv
├── products/
│   └── products1.csv
└── users/
    └── users1.csv
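As a minimal sketch (assuming the exported CSV files already exist locally; the source file names here are hypothetical), such a layout can be assembled with standard shell commands:

mkdir -p my-input-dir/orders my-input-dir/products my-input-dir/users
cp orders-part1.csv my-input-dir/orders/orders1.csv
cp orders-part2.csv my-input-dir/orders/orders2.csv
cp products-export.csv my-input-dir/products/products1.csv
cp users-export.csv my-input-dir/users/users1.csv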
At the start of a transformation, the output directory must be empty. When the transformation has finished, the output directory will have the same structure as the input directory, with the same table directories, but with just one file per directory, named data.csv. For example, the output directory for the above input directory will look like this:
my-output-dir/
├── orders/
│   └── data.csv
├── products/
│   └── data.csv
└── users/
    └── data.csv
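After a run, a quick way to confirm the expected layout is to list the files in the output directory; for the example above, it should contain exactly one data.csv per table directory:

find my-output-dir -type f
# prints:
# my-output-dir/orders/data.csv
# my-output-dir/products/data.csv
# my-output-dir/users/data.csv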
File format
TDK expects input files to have a certain format and produces output files in the same format:
- The first row of the file is considered the header row, and the rest of the rows are data rows.
- Values are delimited by commas (,).
- Each new row is on a new line.
- The quote character is ".
- Empty strings are indistinguishable from nulls: both quoted and unquoted empty strings are treated as nulls. For example, the following two CSV files are equivalent to a single-row logical table with a name field set to Alice and a nickname field set to null:
name,nickname
Alice,
and
name,nickname
Alice,""
An example CSV file is shown below:
name,age
Alice,25
Bob,57
Charlie,98
Schema inference
TDK infers the input file schema by reading a limited number of sample files (currently five) from each table directory. If the schema is not the same across all sample files, the transformation will fail. TDK recognises integer, floating-point, and string data types, each of which can additionally be nullable.
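For illustration, here is a hypothetical pair of sample files (the file names and values are made up) that pass schema inference:

users/users1.csv:
name,age
Alice,25
Bob,57

users/users2.csv:
name,age
Dana,
Eve,41

Both files share the header name,age, so inference succeeds: name is inferred as a string, and age as a nullable integer (the empty value in users2.csv is treated as null, as described above). If users2.csv instead had a different header, such as name,email, the schemas would not match and the transformation would fail.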
Configuration
Since the root input directory represents a schema and nested directories represent tables, the TDK configuration for CSV transformations looks identical to the configuration for database transformations. The schema name must always be public (lower-case, case-sensitive).
The following is an example of a CSV transformation configuration for the input directory structure shown above:
default_config:
  mode: MASKING
  safety_mode: STRICT
  schema_creation_mode: DO_NOT_CREATE
tables:
  - table_name_with_schema: "public.orders"
    target_ratio: 0.98
  - table_name_with_schema: "public.products"
    mode: GENERATION
    target_row_number: 100
  - table_name_with_schema: "public.users"
    mode: GENERATION
    target_ratio: 5
Running a CSV transformation
The CSV transformation execution is similar to database transformations, with the difference that file URLs use the file scheme instead of jdbc. The following is an example of a CSV transformation execution:
TDK_WORKINGDIRECTORY_ENABLED=true TDK_WORKINGDIRECTORY_PATH=/my-working-directory \
tdk -c config.yaml --input-url file:///my-input-dir/ --output-url file:///my-output-dir/
Note that the working directory option must be enabled in order to run CSV transformations.
S3 Support
TDK supports reading and writing CSV files from and to S3. The input and output URLs must use the s3 scheme, and the URL must not include user info. The general format is s3://<endpoint>/<bucket>/<"directory"/>. Examples of valid URLs: s3://s3.amazonaws.com/my-bucket/my-input-dir/ and s3://my-minio-server:9000/my-bucket/input-dir/.
In the S3 URL pattern above, the "directory" is given in quotes because S3 does not have a concept of directories; it is a flat key-value store. A directory structure is typically emulated using forward slashes in object keys: a "directory" is just an object key ending with a forward slash. For example, the object key my-bucket/my-input-dir/ is considered a directory named my-input-dir in the bucket my-bucket. The object itself doesn't have to exist; rather, it acts as a prefix for other objects "under" the directory (e.g., my-bucket/my-input-dir/file1.csv or my-bucket/my-input-dir/file2.csv).
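To make this concrete, the local input layout shown earlier would correspond to a flat set of object keys such as the following (using the my-bucket bucket and my-input-dir prefix from the examples above):

my-bucket/my-input-dir/orders/orders1.csv
my-bucket/my-input-dir/orders/orders2.csv
my-bucket/my-input-dir/products/products1.csv
my-bucket/my-input-dir/users/users1.csv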
Note: The implementation was tested against Amazon S3 and MinIO, but it may work with other S3-compatible storage implementations, as long as they can be accessed via an endpoint and support the "user + password" (or "access key + secret key") authentication method.
Configuration
S3 API-compatible services often require fine-tuning the S3 client to accommodate differences between implementations. At the moment, such configuration options are set globally for the TDK CLI or the Governor instance. These options must be provided as environment variables:
- TDK_S3_PROTOCOL (optional): The S3 client protocol. Defaults to https.
- TDK_S3_REGION (optional): The S3 client region. For AWS S3, refer to the official AWS documentation for a list of regions. For MinIO, this option is not required. For other S3-compatible services, refer to the service’s documentation.
- TDK_S3_FORCE_PATH_STYLE (optional): If set to true, the S3 client will use path-style URLs. Defaults to false. For MinIO, this option must be set to true. For other S3-compatible services, refer to the service’s documentation.
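For example, a MinIO deployment served over plain HTTP (an assumption about the deployment; adjust to your setup) would typically be configured as follows:

export TDK_S3_PROTOCOL=http          # assumes the MinIO endpoint is not behind TLS
export TDK_S3_FORCE_PATH_STYLE=true  # required for MinIO, as noted above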
Credentials
S3 credentials for the TDK CLI are provided via the --input-username, --input-password, --output-username, and --output-password options. The --input-username and --input-password options are used for reading data from S3, and the --output-username and --output-password options for writing.
For AWS S3, the username must be an access key ID and the password must be the corresponding secret access key (see the official AWS documentation for more information on what they are and how to obtain them).
The following example puts the information above together into a CSV transformation command for data on AWS S3. The example assumes that the input and the output data are stored under the my-bucket bucket in the eu-west-1 region. For authentication, the access key ID MY_AWS_ACCESS_KEY_ID and the secret key MY_AWS_SECRET_ACCESS_KEY are used for both the input and the output data.
export TDK_WORKINGDIRECTORY_ENABLED=true
export TDK_WORKINGDIRECTORY_PATH=wd
export TDK_S3_REGION=eu-west-1
tdk -c config.yaml \
--input-url=s3://s3.amazonaws.com/my-bucket/my-input-dir/ \
--output-url=s3://s3.amazonaws.com/my-bucket/my-output-dir/ \
--input-username='MY_AWS_ACCESS_KEY_ID' \
--input-password='MY_AWS_SECRET_ACCESS_KEY' \
--output-username='MY_AWS_ACCESS_KEY_ID' \
--output-password='MY_AWS_SECRET_ACCESS_KEY'
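For comparison, here is a hedged sketch of the same transformation against a MinIO server; the endpoint my-minio-server:9000 and the MY_MINIO_* credentials are placeholders, and TDK_S3_REGION is omitted because MinIO does not require it:

export TDK_WORKINGDIRECTORY_ENABLED=true
export TDK_WORKINGDIRECTORY_PATH=wd
export TDK_S3_PROTOCOL=http          # placeholder assumption: MinIO served over plain HTTP
export TDK_S3_FORCE_PATH_STYLE=true  # required for MinIO
tdk -c config.yaml \
--input-url=s3://my-minio-server:9000/my-bucket/my-input-dir/ \
--output-url=s3://my-minio-server:9000/my-bucket/my-output-dir/ \
--input-username='MY_MINIO_ACCESS_KEY' \
--input-password='MY_MINIO_SECRET_KEY' \
--output-username='MY_MINIO_ACCESS_KEY' \
--output-password='MY_MINIO_SECRET_KEY'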