Aller au contenu principal

Automate Data Movement and Transformation with Airflow, Airbyte, and dbt in Teradata Vantage

Overview

This quickstart demonstrates how to leverage Apache Airflow as orchestration tool to automate the movement of data to Teradata using Airbyte. Once the data is in Teradata, trigger dbt to perform transformations, ensuring the data is clean, reliable and ready for analysis.

Prerequisites

Demo project setup

  1. Clone the tutorial repository:
  1. Navigate to the directory:

Open the code in your preferred IDE.

Setting up Teradata instance

Follow the instructions from Getting started with ClearScape Analytics Experience and acquire following:

  • Host
  • Username
  • Password

Jot it down, it will be needed in next step.

Airbyte setup

  1. After you install Airbyte OSS locally using abctl, generate and copy the credentials:
Remarque

At the moment of writing this quickstart, Airbyte installed with abctl performs unauthenticated pulls even if you pass in Docker credentials: https://github.com/airbytehq/airbyte/issues/46309. Since Docker limits unauthenticated pulls you may encounter 429 Too Many Requests error. If you run into this error, launch Airbyte using run-ab-platform.sh script as described in the Airbyte quickstart.

Airbyte Credentials

You can learn more about it from here

  1. Launch Airbyte UI by opening http://localhost:8000/ in your preferred browser and login with credentials from previous step.

Airbyte UI

  1. Create a source

    • Go to the Sources tab and click on + New source.
    • Search for “faker” using the search bar and select Sample Data (Faker).
    • Adjust the Count and optional fields as needed for your use case. You can also leave as is.
    • Click on Set up source.
  2. Create a destination

    • Go to the Destinations tab and click on + New destination.
    • Search for “teradata” using the search bar and select Teradata Vantage.
    • Enter the connection details as needed.
      • Host (your Teradata instance hostname)
      • Password (your Teradata instance password)
      • Default Schema ('ecommerce')
    • Click on Set up destination.

    Airbyte Teradata Connection

  3. Create a connection

    • Go to the Connections tab and click on + New connection.
    • Select the source and destination you just created.
    • Enter the connection details as needed.
    • Click on Set up connection.

    Airbyte Connection

Airflow setup

Airflow is at the center of the whole tutorial. It is responsible for making Airbyte move data to Teradata and trigger dbt to transform the data.

  1. Navigate to orchestration directory
  1. Change the environment file name from .env.example to .env.

  2. Link Airbyte connection to the Airflow DAG

    • To execute the DAG in Airflow to trigger Airbyte, connection_id needs to be edited in elt_dag.py file
    • On the connection page, from the URL, copy the part between connections and status. It is the value of connection_id variable.

    Airbyte Connection id

    • Open elt_dag.py file inside the airflow/dags directory and add your Airbyte connection id on line 28

    Connection ID value

  3. Build custom Airflow image

  1. Launch Airflow container

This might take a few minutes initially as it sets up necessary databases and metadata.

  1. Open Airflow UI

    • Open Airflow UI by pasting http://localhost:8080 to browser of your choice.
    • Default username and password is airflow (unless you changed it in .env file)
  2. Create Airflow connection with Airbyte

    • Go to the Admin > Connections tab
    • Click on the + button to create a new connection
    • Edit the page with following values
      • Connection id:airbyte_connection
      • Connection Type:Airbyte
      • Host:host.docker.internal
      • Port:8000
      • Login:Your Airbyte username
      • Password:Your airbyte password (you can get it by running abctl local credentials command in terminal)
    • Click on the Test button, and make sure you get a Connection successfully tested message at the top. Then, you can Save the connection.

    Airflow Airbyte Connection

  3. Create Airflow connection with Teradata

  • Go to the Admin > Connections tab

    • Click on the + button to create a new connection
    • Edit the page with following values
      • Connection id:teradata_connection
      • Connection Type:Teradata
      • Database Server URL:Your Teradata instance hostname
      • Username:demo_user
      • Password:Your Teradata instance password
    • Click on the Test button, and make sure you get a Connection successfully tested message at the top. Then, you can Save the connection.

    Airflow Teradata Connection

Orchestration with Airflow

After making sure all the steps till now is working fine, it is time to run your data pipeline.

  1. Make sure all the workflows are enabled.

DAGs Enabled

  1. In Airflow UI, go to DAGs section, locate elt_dag and click on "Trigger DAG" under the "Action" column.
    • This will initiate the complete data pipeline, starting with the Airbyte sync from Faker to Teradata, followed by dbt transforming the raw data into staging and marts models.
    • The status of elt_dag can be check by clicking on elt_dag and then on Graph.

DAGs

DAGs two running

  1. Confirm the sync status in the Airbyte UI.

Airbyte Sync Status

  1. After the completion of both the DAGs, check the dbt_dag graph.

DAGs two complete

dbt transformation

Remarque

The dbt project is present in the dbt_project folder and is handled by Airflow with the help of Astronomer Cosmos

  1. You can check the newly created views in the transformed_data dataset on Teradata Vantage. This can be done by using Teradata Studio or other database client UI tools like DBeaver.

Conclusion

After completing all the above steps, you should have a working stack of Airbyte, dbt and Airflow with Teradata. It can be used as starting point for your projects and can be adapted as per your scenario.

Également intéressant