본문으로 건너뛰기

Execute Airflow workflows that use dbt with Teradata Vantage

Overview

This tutorial demonstrates how to install Airflow on an AWS EC2 VM, configure the workflow to use dbt, and run it against a Teradata Vantage database. Airflow is a task scheduling tool that is typically used to build data pipelines to process and load data. In this example, we go through the Airflow installation process, which creates a Docker-based Airflow environment. Once Airflow is installed, we run several Airflow DAG (Direct Acyclic Graph, or simply workflow) examples that load data into a Teradata Vantage database.

Prerequsites

  1. Access to AWS (Amazon Web Services) with permissions to create a VM.

This tutorial can be adjusted to other compute platforms or even on a bare metal machine as long as it has a computing and storage capacity comparable to the machine mentioned in this document (t2.2xlarge EC2 on AWS with approximately 100GB of storage) and is connected to the internet. If you decide to use a different compute platform, some steps in the tutorial will have to be altered.

  1. An SSH client.

If you are on a Mac or a Linux machine, these tools are already included. If you are on Windows, consider PuTTY or MobaXterm.

  1. Access to a Teradata Vantage database. If you don't have access to Teradata Vantage, explore Vantage Express - a free edition for developers.

Install and execute Airflow

Create a VM

  1. Go to the AWS EC2 console and click on Launch instance.
  2. Select Red Hat for OS image.
  3. Select t2.2xlarge for instance type.
  4. Create a new key pair or use an existing one.
  5. Apply network settings that will allow you ssh to the server and the server will have outbound connectivity to the Internet. Usually, applying the default settings will do.
  6. Assign 100GB of storage.

Install Python

  1. ssh to the machine using ec2-user user.

  2. Check if python is installed (should be Python 3.7 or higher). Type python or python3 on the command line.

  3. If python is not installed (you are getting command not found message) run the commands below to install it. The commands may require you to confirm the installation by typing y and enter.

Create an Airflow environment

  1. Create the Airflow directory structure (from the ec2-user home directory /home/ec2-user)
  1. Use your preferred file transfer tool (scp, PuTTY, MobaXterm, or similar) to upload airflow.cfg file to airflow/config directory.

Install Docker

Docker is a containerization tool that allows us to install Airflow in a containerized environment.

노트

The steps must be executed in airflow directory.

  1. Uninstall podman (RHEL containerization tool)
  1. Install yum utilities
  1. Add docker to yum repository.
  1. Install docker.
  1. Start docker as a service. The first command runs the docker service automatically when the system starts up next time. The second command starts Docker now.
  1. Check if Docker is installed correctly. This command should return an empty list of containers (since we have not started any container yet):

Install docker-compose and docker environment configuration files

  1. Upload Dockerfile and Dockerfile files to the VM and save them in airflow directory.

What docker-compose.yaml and Dockerfile do docker-compose.yaml and Dockerfile files are necessary to build the environment during the installation. The docker-compose.yaml file downloads and installs the Airflow docker container. The container includes the web ui, a Postgres database for metadata, the scheduler, 3 workers (so 3 tasks can be run in parallel), the trigger and the nginx web server to show the docs produced by dbt. In addition host directories are mounted on containers and various other install processes are performed. Dockerfile will additionally install needed packages in each container.

If you would like to learn more what docker-compose.yaml and Dockerfile files do, examine these files. There are comments which clarify what is installed and why.

  1. Install docker-compose (necessary to run the yaml file).
노트

The instructions are based on version 1.29.2. Check out https://github.com/docker/compose/releases site for the latest release and update the command below as needed.

  1. Test your docker-compose installation. The command should return the docker-compose version, for example docker-compose version 1.29.2, build 5becea4c:

Install a test dbt project

노트

These steps set up a sample dbt project. dbt tool itself will be installed on the containers later by docker-compose.

  1. Install git:
  1. Get the sample jaffle shop dbt project:
노트

The dbt directories will be created under the home directory (not under airflow). The home directory in our example is /home/ec2-user.

  1. Create the airflowtest and jaffle_shop users/databases on your Teradata database by using your preferred database tool (Teradata Studio Express, bteq or similar). Log into the database as dbc, then execute the commands (change the passwords if needed):
  1. Create the dbt configuration directory:
  1. Copy profiles.yml into the .dbt directory.

  2. Edit the file so it corresponds to your Teradata database setup. At a minium, you will need to change the host, user and password. Use jaffle_shop user credentials you set up in step 3.

Create the Airflow environment in Docker

  1. Run the docker environment creation script in the airflow directory where Dockerfile and docker-compose.yaml:

This can take 5-10 minutes, when the installation is complete you should see on the screen a message similar to this:

This means the Airflow webserver is ready to accept calls.

  1. Now Airflow should be up. The terminal session that we were using during the installation will be used to display log messages, so it is recommended to open another terminal session for subsequent steps. To check the Airflow installation type:

The result should be something like:

  1. OPTIONAL: If you want to delete the docker installation (for example to update the docker-compose.yaml and the Dockerfile files and recreate a different environment), the command is (from the airflow directory where these files are located):

Once the stack is down, update the configuration files and restart by running the command in step 1.

  1. To test if the Airflow web UI works, type the following urls on your browser. Replace <VM_IP_ADDRESS> with the external IP address of the VM:
  • DAG UI: http://<YOUR_IP_ADDRESS>:8080/home - username: airflow / password: airflow
  • Flower Airflow UI (worker control): http://<YOUR_IP_ADDRESS>:5555/

Run an Airflow DAG

  1. Copy airflow_dbt_integration.py, db_test_example_dag.py, discover_dag.py, variables.json files to /home/ec2-user/airflow/dags.
  2. Examine the files:
  • airflow_dbt_integration.py - a simple Teradata sql example that creates a few tables and runs queries.
  • db_test_example_dag.py - runs a dbt example [i.e. integration of dbt and airflow with a Teradata database). In this example a fictitious jaffle_shop data model is created, loaded and the documentation for this project is produced (you can view it by pointing your browser to http://<VM_IP_ADDRESS>:4000/)
노트

[Adjust db_test_example_dag.py] db_test_example_dag.py needs to be updated so that the Teradata database IP address points to your database.

  • discover_dag.py - an example on how to load various types of data files (CSV, Parquet, JSON). The source code file contains comments that explain what the program does and how to use it. This example relies on variables.json file. The file needs to be imported into Airflow. It will happen in subsequent steps.
  1. Wait for a few minutes until these dag files are picked up by the airflow tool. Once they are picked up they will appear on the list of dags on the Airflow home page.

  2. Import variables.json file as a variable file into Airflow:

  • Click on Admin -> Variables menu item to go to the Variables page Airflow admin dropdown
  • Click on Choose File, then select variable.json in your file explorer and click on Import Variables Airflow admin dropdown
  • Edit the variables to match your environment
  1. Run the dags from the UI and check the logs.

Summary

This tutorial aimed at providing a hands on exercise on how to install an Airflow environment on a Linux server and how to use Airflow to interact with a Teradata Vantage database. An additional example is provided on how to integrate Airflow and the data modelling and maintenance tool dbt to create and load a Teradata Vantage database.

Further reading

노트

If you have any questions or need further assistance, please visit our community forum where you can get support and interact with other community members.

기타 관심 사항