Cloud Run/Scheduler

Google Cloud Run is a fully managed serverless platform that runs Docker containers and can scale to zero, making it a cost-effective option for running batch workflows.

Google Cloud Scheduler is a fully managed cron service that allows HTTP endpoints to be triggered on a schedule.

Together, Cloud Run and Cloud Scheduler can be used to schedule the training of Synthesized models and the generation of synthetic data - e.g. to create an overnight scheduled batch synthesis.

GCP Cloud Composer can also be used for this purpose, and example documentation is available for that setup.

Whilst the Cloud Run/Cloud Scheduler setup is one option for triggering batch processes, in the first instance you should utilise the data pipeline infrastructure that already exists within your organisation. Integrate Synthesized into existing pipelines via a triggering process similar to the one described below, a Python script utilising Synthesized, or another method.

The Setup

  1. Create and serve a processing script

  2. Create Docker image

  3. Run Docker image on Cloud Run

  4. Schedule batch run with Cloud Scheduler

Prerequisites

  • Synthesized License

  • GCP account and credentials with access to Cloud Run, Cloud Scheduler, and Cloud Storage

  • Access to the Synthesized .whl file if building the Docker image from scratch yourself, or to the Synthesized Docker registry if not

  • Docker installed locally, to build the image

Create and serve a processing script

If using Cloud Run Jobs, you don’t necessarily need a service endpoint to trigger the batch processing, as you can simply execute the Job directly (a sketch is shown below). The main example that follows exposes a Flask endpoint so that the processing can be triggered over HTTP by Cloud Scheduler.
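For illustration, a Cloud Run Jobs setup might look like the following sketch, using the same image built in the "Create Docker image" step below. The region and job name are illustrative, and the --command/--args override assumes a hypothetical one-shot variant of the processing script (e.g. run_synthesis.py, whose __main__ block simply calls main() and exits rather than starting Flask):

# Create a Cloud Run Job from the same image, overriding the container command
# to run a hypothetical one-shot script instead of the gunicorn web server
gcloud run jobs create synthesis-job \
  --image gcr.io/{{ GCP-PROJECT-ID }}/synthesizer:0.0.1 \
  --region europe-west2 \
  --command python \
  --args run_synthesis.py

# Run the job once; Cloud Scheduler can also trigger job executions on a schedule
gcloud run jobs execute synthesis-job --region europe-west2

The remainder of this guide uses the service-based approach.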
synthesis.py
import synthesized
from datetime import datetime
import os
from google.cloud import storage
from flask import Flask

app = Flask(__name__)  (3)


def get_data():
    """Method to be replaced with data import function"""
    return synthesized.util.get_example_data()  (1)


@app.route("/")  (3)
def main():
    # Load data
    df = get_data()

    # Synthesize data  (4)
    df_meta = synthesized.MetaExtractor.extract(df)
    synthesizer = synthesized.HighDimSynthesizer(df_meta)
    synthesizer.learn(df)
    df_synth = synthesizer.synthesize(num_rows=42)

    # Save file locally
    filename = f"synthetic_data_{str(datetime.now().date())}.csv"
    df_synth.to_csv(filename)

    # Push file to GCP bucket  (2)
    client = storage.Client()  # Requires GOOGLE_APPLICATION_CREDENTIALS to point to a service account key file
    # If the bucket does not yet exist, create it first, e.g. client.create_bucket("db-gcp-container-demo")
    bucket = client.bucket("db-gcp-container-demo")
    new_blob = bucket.blob(filename)
    new_blob.upload_from_filename(filename=filename)

    return f"Successfully synthesized data. Uploaded file {filename} to GCP."


if __name__ == "__main__":
    app.run(debug=True, host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
1 In this example, we use data from synthesized.util.get_example_data() as our input; in practice the get_data() stub would be replaced with a function that loads data from a real source
2 To demonstrate a more realistic output setup, we save a CSV file locally and then push it to Google Cloud Storage
3 To allow the processing to be triggered over HTTP, a service endpoint is exposed at the root location (i.e. /) that runs the process when called
4 Data synthesis follows the same patterns explained in Single Table Synthesis
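Before containerizing, you can optionally run the script locally to check it end to end. This is a sketch: it assumes a local Python environment with synthesized and the packages from requirements.txt installed, and the SYNTHESIZED_KEY and GOOGLE_APPLICATION_CREDENTIALS environment variables exported:

python synthesis.py           # starts the Flask development server on port 8080
curl http://localhost:8080/   # calls the root endpoint, triggering one synthesis run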

Create Docker image

Create a Dockerfile (a file named Dockerfile in your project directory; see https://docs.docker.com/engine/reference/builder/) with content similar to the following:

Dockerfile
FROM synthesizedio.jfrog.io/synthesized-docker/synthesized:1.10

COPY . ./

# For more production-ready systems, the following keys can be injected when
# the container is spun up or pulled from a secrets manager in the synthesis script
ENV SYNTHESIZED_KEY=<ENTER_YOUR_SYNTHESIZED_KEY_HERE>
ENV GOOGLE_APPLICATION_CREDENTIALS="./gcr_key.json"
ENV PORT=8080

RUN pip install --no-cache-dir -r requirements.txt

# Run the web service on container startup. Here we use the gunicorn
# webserver, with one worker process and 8 threads.
# For environments with multiple CPU cores, increase the number of workers
# to be equal to the cores available.
# Timeout is set to 0 to disable the timeouts of the workers to allow Cloud Run to handle instance scaling.
# Note: the following command assumes a module named "synthesis" (synthesis.py) containing a Flask app named "app"
CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 synthesis:app

Place the Dockerfile, synthesis.py file, and the following requirements.txt file in the same location, and open a terminal at that location.

requirements.txt
google-cloud-storage
flask
gunicorn

With all files in the current directory, build and push the Docker image to GCR:

docker build -t gcr.io/{{ GCP-PROJECT-ID }}/synthesizer:0.0.1 .
docker push gcr.io/{{ GCP-PROJECT-ID }}/synthesizer:0.0.1
You will need your local Docker setup to be registered with the Synthesized Docker registry so that you can pull the pre-built Synthesized Docker image, and to be authenticated with a GCP account that has permission to push to GCR: https://cloud.google.com/container-registry/docs/advanced-authentication. You can read more about working with GCR at https://cloud.google.com/container-registry/docs
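Optionally, before deploying, you can smoke-test the image locally. This sketch assumes the gcr_key.json service account key referenced in the Dockerfile was present in the build context and that a valid SYNTHESIZED_KEY was set:

docker run --rm -p 8080:8080 gcr.io/{{ GCP-PROJECT-ID }}/synthesizer:0.0.1
curl http://localhost:8080/   # in a second terminal; triggers one synthesis run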

Run Docker image on Cloud Run

Once you have pushed the Docker image to GCR (Google Container Registry), you can quickly spin up a Cloud Run service from it by following the instructions at https://cloud.google.com/run/docs/quickstarts/deploy-container
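For reference, the equivalent command-line deployment might look like the following sketch. The service name, region, memory, and timeout values are illustrative; model training is memory- and CPU-intensive, so size the service accordingly:

gcloud run deploy synthesizer \
  --image gcr.io/{{ GCP-PROJECT-ID }}/synthesizer:0.0.1 \
  --region europe-west2 \
  --memory 4Gi \
  --timeout 3600 \
  --no-allow-unauthenticated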

Schedule batch run with Cloud Scheduler

Once your Cloud Run service is deployed, it will have a URL associated with it that triggers the processing when called. Copy this URL and set up a recurring trigger on Google Cloud Scheduler by following the instructions at https://cloud.google.com/scheduler/docs/schedule-run-cron-job
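As a sketch, the equivalent gcloud command for a nightly 02:00 trigger might look like this. The job name, location, time zone, service account, and {{ CLOUD-RUN-URL }} are placeholders; if the service does not allow unauthenticated access, the service account needs the Cloud Run Invoker role:

gcloud scheduler jobs create http nightly-synthesis \
  --location europe-west2 \
  --schedule "0 2 * * *" \
  --time-zone "Europe/London" \
  --uri "{{ CLOUD-RUN-URL }}/" \
  --http-method GET \
  --oidc-service-account-email scheduler-invoker@{{ GCP-PROJECT-ID }}.iam.gserviceaccount.com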

Summary

With the above set up, you should have a cron job on Cloud Scheduler that triggers the Cloud Run endpoint, which executes the processing script, which in turn saves the output synthetic data to Cloud Storage.
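To verify a scheduled run, you can list the bucket contents and look for the dated output file (gsutil shown for illustration; the bucket name matches the one used in synthesis.py):

gsutil ls gs://db-gcp-container-demo/
# expect a file such as gs://db-gcp-container-demo/synthetic_data_2024-01-01.csv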