CSV Support

CSV (Comma-Separated Values) is a plain-text file format for tabular data. Values are separated by commas or other delimiters, and each row is on its own line. TDK supports CSV-to-CSV data transformations.

Directory layout

TDK reads data from an input directory and writes the transformed data to an output directory. At the moment, input and output directories must be local directories. An input directory models a single logical schema with one or more tables.

TDK expects the input directory to have a certain structure. The root directory should contain one directory per logical table, and each table directory must contain at least one file. Different table directories can have different schemas, but all the files in a single table directory should have the same schema. For example, the following directory structure contains three logical tables (users, orders, and products):

my-input-dir/
├── orders/
│   ├── orders1.csv
│   └── orders2.csv
├── products/
│   └── products1.csv
└── users/
    └── users1.csv
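
The layout rules above can be checked before a run. The following is a minimal Python sketch, not part of TDK: the helper name list_tables is hypothetical, and it assumes that every immediate subdirectory of the input root is a table directory.

```python
from pathlib import Path


def list_tables(input_root):
    """Map each table directory name to its sorted list of file names.

    Hypothetical helper (not part of TDK). Raises ValueError if a table
    directory contains no files, mirroring the rule that each table
    directory must contain at least one file.
    """
    tables = {}
    for table_dir in sorted(p for p in Path(input_root).iterdir() if p.is_dir()):
        files = sorted(f.name for f in table_dir.iterdir() if f.is_file())
        if not files:
            raise ValueError(f"table directory {table_dir.name!r} contains no files")
        tables[table_dir.name] = files
    return tables
```

For the directory shown above, list_tables("my-input-dir") would return a dictionary with the keys orders, products, and users.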

At the start of a transformation, the output directory must be empty. When the transformation has finished, the output directory has the same table directories as the input directory, but each table directory contains a single file named data.csv. For example, the output directory for the above input directory will look like this:

my-output-dir/
├── orders/
│   └── data.csv
├── products/
│   └── data.csv
└── users/
    └── data.csv
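
The two output rules (an empty output directory at the start, one data.csv per table at the end) can be expressed as a small pre-flight check. This is a sketch only: expected_output_files is a hypothetical helper, not a TDK function, and it assumes each subdirectory of the input root is a table.

```python
from pathlib import Path


def expected_output_files(input_root, output_root):
    """Return the paths expected after a run: one data.csv per table.

    Hypothetical helper (not part of TDK). Raises ValueError if the
    output directory already exists and is not empty.
    """
    output_root = Path(output_root)
    if output_root.exists() and any(output_root.iterdir()):
        raise ValueError(f"output directory {output_root} must be empty")
    table_names = sorted(p.name for p in Path(input_root).iterdir() if p.is_dir())
    return [output_root / name / "data.csv" for name in table_names]
```

Comparing this list with the files actually present after a run is one way to confirm that every table was written.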

File format

TDK expects input files to have a certain format and produces output files in the same format:

  • The first row of the file is considered the header row, and the rest of the rows are data rows.

  • Values are delimited by commas (,).

  • Each new row is on a new line.

  • The quote character is ".

  • Empty strings are indistinguishable from nulls: both quoted and unquoted empty strings are treated as nulls. For example, each of the following two CSV files is equivalent to a single-row logical table with a name field set to Alice and a nickname field set to null:

name,nickname
Alice,

and

name,nickname
Alice,""

An example CSV file is shown below:

name,age
Alice,25
Bob,57
Charlie,98
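
The null-handling rule above can be illustrated with Python's standard csv module, used here as a stand-in rather than as TDK's actual parser. The helper read_rows is hypothetical; it reads a header row followed by data rows and maps every empty field to None.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text into dicts, mapping empty fields to None."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {key: (value if value != "" else None) for key, value in row.items()}
        for row in reader
    ]


unquoted = "name,nickname\nAlice,\n"
quoted = 'name,nickname\nAlice,""\n'

# Both spellings of the empty field produce the same logical row.
print(read_rows(unquoted) == read_rows(quoted))  # True
print(read_rows(quoted))  # [{'name': 'Alice', 'nickname': None}]
```

Because the two inputs parse to the same row, a genuinely empty string cannot be round-tripped through this format.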

Schema inference

TDK infers the input file schema by reading a limited number of sample files (currently, 5) from each table directory. If the schema is not the same across all sample files, the transformation will fail. TDK recognises integer, floating-point and string data types, each of which can also be nullable.
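
To make the idea concrete, the following Python sketch infers column types from up to five sample files in a table directory. It is an illustration under stated assumptions, not TDK's algorithm: the function names are hypothetical, "same schema" is interpreted here as identical header rows, Python's own int and float parsing stands in for TDK's type rules, and a column with no non-empty values is arbitrarily reported as a nullable string.

```python
import csv
from pathlib import Path

SAMPLE_LIMIT = 5  # number of files sampled per table directory


def infer_type(values):
    """Return (type_name, nullable) for a column's raw string values."""
    present = [v for v in values if v != ""]
    nullable = len(present) < len(values)
    if not present:
        return "string", True  # assumption: nothing to infer from
    for type_name, parse in (("integer", int), ("float", float)):
        try:
            for v in present:
                parse(v)
            return type_name, nullable
        except ValueError:
            continue
    return "string", nullable


def infer_table_schema(table_dir):
    """Infer {column: (type_name, nullable)} from the sampled files."""
    files = sorted(p for p in Path(table_dir).iterdir() if p.is_file())
    header = None
    columns = {}
    for path in files[:SAMPLE_LIMIT]:
        with path.open(newline="") as handle:
            rows = [row for row in csv.reader(handle) if row]
        if not rows:
            raise ValueError(f"{path.name} has no header row")
        if header is None:
            header = rows[0]
            columns = {name: [] for name in header}
        elif rows[0] != header:
            raise ValueError(f"header of {path.name} differs from earlier sample files")
        for row in rows[1:]:
            for index, name in enumerate(header):
                # Short rows are padded with empty (null) values.
                columns[name].append(row[index] if index < len(row) else "")
    return {name: infer_type(values) for name, values in columns.items()}
```

Types are inferred over the combined rows of all sampled files, so a column that is empty in only one file is still reported as nullable rather than causing a mismatch.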

Configuration

Since the root input directory represents a schema and nested directories represent tables, the TDK configuration for CSV transformations looks identical to the configuration for database transformations. The schema name must always be public (lower-case, case-sensitive).

The following is an example of a CSV transformation configuration for the input directory structure shown above:

default_config:
  mode: MASKING
safety_mode: STRICT
schema_creation_mode: DO_NOT_CREATE

tables:
   - table_name_with_schema: "public.orders"
     target_ratio: 0.98

   - table_name_with_schema: "public.products"
     mode: GENERATION
     target_row_number: 100

   - table_name_with_schema: "public.users"
     mode: GENERATION
     target_ratio: 5
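
The table_name_with_schema values in this configuration follow directly from the directory names. As a small illustration (the helper table_names_with_schema is hypothetical and not part of TDK), they can be derived in Python like this:

```python
from pathlib import Path

SCHEMA_NAME = "public"  # the schema name is always "public" for CSV input


def table_names_with_schema(input_root):
    """Return the qualified table names for each table directory, sorted."""
    return [
        f"{SCHEMA_NAME}.{path.name}"
        for path in sorted(Path(input_root).iterdir())
        if path.is_dir()
    ]
```

For the input directory shown earlier, this would yield public.orders, public.products, and public.users, matching the entries in the configuration above.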

Running a CSV transformation

Running a CSV transformation is similar to running a database transformation, except that the input and output URLs use the file scheme instead of jdbc. The following is an example of a CSV transformation execution:

TDK_WORKINGDIRECTORY_ENABLED=true TDK_WORKINGDIRECTORY_PATH=/my-working-directory \
    tdk -c config.yaml --input-url file:///my-input-dir/ --output-url file:///my-output-dir/

Note that the working directory option must be enabled in order to run CSV transformations.
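
When the command is launched from a script, the file URLs and environment variables can be assembled programmatically. The sketch below is a hypothetical wrapper, not part of TDK: it uses only the flags and environment variables shown in the example above, and it builds the invocation without executing anything.

```python
import os
from pathlib import Path


def dir_url(path):
    """Return a file:// URL with a trailing slash for a local directory."""
    return Path(path).resolve().as_uri() + "/"


def build_tdk_invocation(config, input_dir, output_dir, working_dir):
    """Return (argv, env) for a CSV transformation; nothing is executed."""
    argv = [
        "tdk",
        "-c", str(config),
        "--input-url", dir_url(input_dir),
        "--output-url", dir_url(output_dir),
    ]
    # Copy the current environment and enable the working directory option.
    env = dict(
        os.environ,
        TDK_WORKINGDIRECTORY_ENABLED="true",
        TDK_WORKINGDIRECTORY_PATH=str(working_dir),
    )
    return argv, env
```

The returned pair can be passed to subprocess.run(argv, env=env, check=True), assuming the tdk executable is available on the PATH.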