
Manage VantageCloud Lake Compute Clusters with Apache Airflow

Overview

This tutorial demonstrates how to use the Teradata Airflow compute cluster operators to manage VantageCloud Lake compute clusters. The objective is to execute the dbt transformations defined in the jaffle_shop dbt project on VantageCloud Lake compute clusters.

Note

On Windows, use the Windows Subsystem for Linux (WSL) to try this quickstart example.

Prerequisites

  • Ensure you have the necessary credentials and access rights to use Teradata VantageCloud Lake.
    Tip

    To request a VantageCloud Lake environment, refer to the form provided in this link. If you already have a VantageCloud Lake environment and seek guidance on configuration, please consult this guide.

  • Python 3.8, 3.9, 3.10, or 3.11, with python3-venv and python3-pip installed.

Run in PowerShell:
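A minimal example, assuming this step installs WSL as the note above suggests (a reboot may be required afterwards):

    wsl --install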

Install Apache Airflow and Astronomer Cosmos

  1. Create a new Python environment to manage Airflow and its dependencies, activate it, and install Astronomer Cosmos (see the combined sketch after this list):

    Note

    Installing Astronomer Cosmos will install Apache Airflow as well.

  2. Install the Apache Airflow Teradata provider.

  3. Set the AIRFLOW_HOME environment variable.
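The three steps above might look like the following (a minimal sketch; the environment name airflow_env and the AIRFLOW_HOME path are examples):

    # 1. Create and activate a virtual environment, then install Astronomer Cosmos
    #    (this pulls in Apache Airflow as a dependency)
    python3 -m venv airflow_env
    source airflow_env/bin/activate
    pip install astronomer-cosmos

    # 2. Install the Apache Airflow Teradata provider
    pip install apache-airflow-providers-teradata

    # 3. Point AIRFLOW_HOME at the Airflow working directory
    export AIRFLOW_HOME=~/airflow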

Install dbt

  1. Create a new Python environment to manage dbt and its dependencies, and activate it.

  2. Install the dbt-teradata and dbt-core modules (see the sketch after this list).
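A minimal sketch of these two steps (the environment name dbt_env is referenced again later when configuring Airflow):

    python3 -m venv dbt_env
    source dbt_env/bin/activate
    pip install dbt-core dbt-teradata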

Create a database

Note

A database client connected to VantageCloud Lake is needed to execute SQL statements. Vantage Editor Desktop or DBeaver can be used for this purpose.

Let's create the jaffle_shop database in the VantageCloud Lake instance, with TD_OFSSTORAGE as the default storage.
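A hedged example of the statement (the space allocations are placeholders; adjust them, and the STORAGE clause, to match your environment):

    CREATE DATABASE jaffle_shop AS
      PERMANENT = 110e6,
      SPOOL = 220e6,
      STORAGE = TD_OFSSTORAGE;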

Create a database user

Note

A database client connected to VantageCloud Lake is needed to execute SQL statements. Vantage Editor Desktop or DBeaver can be used to execute the CREATE USER query.

Let's create a lake_user user in the VantageCloud Lake instance.
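For example (space allocations are placeholders; the password matches the connection details used later in Airflow):

    CREATE USER lake_user AS
      PERMANENT = 110e6,
      SPOOL = 220e6,
      PASSWORD = lake_user;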

Grant access to user

Note

A database client connected to VantageCloud Lake is needed to execute SQL statements. Vantage Editor Desktop or DBeaver can be used to execute the GRANT queries.

Let's grant the user lake_user the privileges required to manage compute clusters.
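A sketch of the kind of grants involved; the compute-cluster privilege names below are an assumption, so verify them against the VantageCloud Lake documentation for your environment:

    -- Access to the project database
    GRANT ALL ON jaffle_shop TO lake_user;

    -- Compute-cluster management privileges (names assumed; verify in the docs)
    GRANT CREATE COMPUTE GROUP TO lake_user;
    GRANT DROP COMPUTE GROUP TO lake_user;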

Setup dbt project

  1. Clone the jaffle_shop repository and cd into the project directory.
  2. Make a new folder, dbt, inside the $AIRFLOW_HOME/dags folder, then copy the jaffle_shop dbt project into the $AIRFLOW_HOME/dags/dbt directory (see the sketch after this list).
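A minimal sketch of both steps (the repository URL assumes the upstream dbt-labs project; use the repository linked from the original guide if it differs):

    git clone https://github.com/dbt-labs/jaffle_shop.git
    cd jaffle_shop
    mkdir -p $AIRFLOW_HOME/dags/dbt
    cp -r . $AIRFLOW_HOME/dags/dbt/jaffle_shop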

Configure Apache Airflow

  1. Switch to the virtual environment where Apache Airflow was installed in Install Apache Airflow and Astronomer Cosmos.

  2. Configure the environment variables listed below to activate the test connection button and to prevent Airflow from loading sample DAGs and default connections in the Airflow UI.

  3. Define the path of the jaffle_shop project as the environment variable dbt_project_home_dir.

  4. Define the path to the virtual environment where dbt-teradata was installed as the environment variable dbt_venv_dir (see the combined sketch after this list).

    Note

    You might need to change /../../ to the specific path where the dbt_env virtual environment is located.
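A combined sketch of these four steps (all paths are examples; keep the note above in mind for dbt_venv_dir):

    # 1. Activate the environment where Apache Airflow and Astronomer Cosmos were installed
    source airflow_env/bin/activate

    # 2. Enable the connection test button; skip example DAGs and default connections
    export AIRFLOW__CORE__TEST_CONNECTION=Enabled
    export AIRFLOW__CORE__LOAD_EXAMPLES=false
    export AIRFLOW__DATABASE__LOAD_DEFAULT_CONNECTIONS=false

    # 3. Path to the jaffle_shop project copied into the dags folder
    export dbt_project_home_dir=$AIRFLOW_HOME/dags/dbt/jaffle_shop

    # 4. Path to the dbt virtual environment (adjust /../../ as noted above)
    export dbt_venv_dir=/../../dbt_env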

Start Apache Airflow web server

  1. Run the Airflow web server.
  2. Access the Airflow UI: visit http://localhost:8080 in the browser and log in with the admin account details shown in the terminal.
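One way to run everything locally is Airflow's standalone mode (a sketch; it initializes the metadata database, creates an admin user, and starts the web server and scheduler in one process):

    airflow standalone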

Define a connection to VantageCloud Lake in Apache Airflow

  1. Click on Admin - Connections
  2. Click on + to define a new connection to the Teradata VantageCloud Lake instance.
  3. Define the new connection with the id teradata_lake using the Teradata VantageCloud Lake instance details:
    • Connection Id: teradata_lake
    • Connection Type: Teradata
    • Database Server URL (required): Teradata VantageCloud Lake instance hostname or IP to connect to.
    • Database: jaffle_shop
    • Login (required): lake_user
    • Password (required): lake_user
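Equivalently, the same connection can be created from the command line (a sketch; replace the host value with your VantageCloud Lake instance):

    airflow connections add teradata_lake \
        --conn-type teradata \
        --conn-host <lake-instance-hostname> \
        --conn-schema jaffle_shop \
        --conn-login lake_user \
        --conn-password lake_user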

Define DAG in Apache Airflow

DAGs in Airflow are defined as Python files. The DAG below runs the dbt transformations defined in the jaffle_shop dbt project using VantageCloud Lake compute clusters. Copy the Python code below and save it as airflow-vcl-compute-clusters-manage.py under the directory $AIRFLOW_HOME/files/dags.
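A simplified sketch of such a DAG, assuming the Teradata provider's compute cluster operators and Cosmos's DbtTaskGroup with its Teradata profile mapping; the dag_id, compute profile and group names, schedule, and sizing are placeholders rather than the guide's exact values:

    import os
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.teradata.operators.teradata_compute_cluster import (
        TeradataComputeClusterProvisionOperator,
        TeradataComputeClusterSuspendOperator,
    )
    from cosmos import DbtTaskGroup, ExecutionConfig, ProfileConfig, ProjectConfig
    from cosmos.profiles import TeradataUserPasswordProfileMapping

    # Paths come from the environment variables defined while configuring Airflow.
    dbt_project_path = os.environ["dbt_project_home_dir"]
    dbt_executable = os.path.join(os.environ["dbt_venv_dir"], "bin", "dbt")

    # Map the teradata_lake Airflow connection to a dbt profile for Cosmos.
    profile_config = ProfileConfig(
        profile_name="jaffle_shop",
        target_name="dev",
        profile_mapping=TeradataUserPasswordProfileMapping(
            conn_id="teradata_lake",
            profile_args={"schema": "jaffle_shop"},
        ),
    )

    with DAG(
        dag_id="airflow_vcl_compute_clusters_manage",
        start_date=datetime(2024, 1, 1),
        schedule=None,
        catchup=False,
    ):
        # Provision a compute cluster before running the transformations
        # (profile and group names are placeholders).
        provision = TeradataComputeClusterProvisionOperator(
            task_id="provision_compute_cluster",
            teradata_conn_id="teradata_lake",
            compute_profile_name="jaffle_shop_profile",
            compute_group_name="jaffle_shop_group",
        )

        # Run the jaffle_shop dbt transformations through the same connection.
        transform = DbtTaskGroup(
            group_id="jaffle_shop_transform",
            project_config=ProjectConfig(dbt_project_path),
            profile_config=profile_config,
            execution_config=ExecutionConfig(dbt_executable_path=dbt_executable),
        )

        # Suspend the compute cluster once the transformations finish.
        suspend = TeradataComputeClusterSuspendOperator(
            task_id="suspend_compute_cluster",
            teradata_conn_id="teradata_lake",
            compute_profile_name="jaffle_shop_profile",
            compute_group_name="jaffle_shop_group",
        )

        provision >> transform >> suspend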

Load DAG

Once the DAG file is copied to $AIRFLOW_HOME/dags, Apache Airflow displays the DAG in the UI under the DAGs section. It can take 2 to 3 minutes for the DAG to appear in the Apache Airflow UI.

Run DAG

Trigger the DAG from the Apache Airflow UI.

Summary

In this quick start guide, we explored how to utilize Teradata VantageCloud Lake compute clusters to execute dbt transformations using Teradata compute cluster operators for Airflow.

Further reading
