Transform data Loaded with Airbyte using dbt
Overview
This tutorial demonstrates how to use dbt (Data Build Tool) to transform external data load through Airbyte (an Open-Source Extract Load tool) in Teradata Vantage.
This tutorial is based on the original dbt Jaffle Shop tutorial with a small change, instead of using the dbt seed
command, the Jaffle Shop dataset is loaded from Google Sheets into Teradata Vantage using Airbyte. Data loaded through airbyte is contained in JSON columns as can be seen in the picture below:
Prerequisites
- Access to a Teradata Vantage Instance.
Hinweis
If you need a test instance of Vantage, you can provision one for free at https://clearscape.teradata.com
- Sample data: The sample data Jaffle Shop Dataset can be found in Google Sheets.
- Reference dbt project repository: Jaffle Project with Airbyte.
- Python 3.7, 3.8, 3.9, 3.10 or 3.11 installed.
Sample Data Loading
- Follow the steps in the Airbyte tutorial. Make sure you load data from the Jaffle Shop spreadsheet and not the default dataset referenced by the Airbyte tutorial. Also, set the
Default Schema
in the Teradata destination toairbyte_jaffle_shop
.
When you configure a Teradata destination in Airbyte, it will ask for a Default Schema
. Set the Default Schema
to airbyte_jaffle_shop
.
Clone the project
Clone the tutorial repository and change the directory to the project directory:
Install dbt
-
Create a new python environment to manage dbt and its dependencies. Activate the environment:
HinweisYou can activate the virtual environment in Windows by executing the corresponding batch file
./myenv/Scripts/activate
. -
Install
dbt-teradata
module and its dependencies. The core dbt module is included as a dependency so you don't have to install it separately:
Configure dbt
-
Initialize a dbt project.
The dbt project wizard will ask you for a project name and database management system to use in the project. In this demo, we define the project name as
dbt_airbyte_demo
. Since we are using the dbt-teradata connector, the only database management system available is Teradata. -
Configure the
profiles.yml
file located in the$HOME/.dbt
directory. If theprofiles.yml
file is not present, you can create a new one. -
Adjust
server
,username
,password
to match your Teradata instance'sHOST
,Username
,Password
respectively. -
In this configuration,
schema
stands for the database that contains the sample data, in our case that is the default schema that we defined in Airbyteairbyte_jaffle_shop
. -
Once the
profiles.yml
file is ready, we can validate the setup. Go to the dbt project folder and run the command:If the debug command returned errors, you likely have an issue with the content of
profiles.yml
. If the setup is correct, you will get messageAll checks passed!
The Jaffle Shop dbt project
jaffle_shop
is a fictional restaurant that takes orders online. The data of this business consists of tables for customers
, orders
and payments
that follow the entity relations diagram below:
The data in the source system is normalized. A dimensional model based on the same data, more suitable for analytics tools, is presented below:
dbt transformations
The complete dbt project encompassing the transformations detailed below is located at Jaffle Project with Airbyte.
The reference dbt project performs two types of transformations.
- First, it transforms the raw data (in JSON format), loaded from Google Sheets via Airbyte, into staging views. At this stage the data is normalized.
- Next, it transforms the normalized views into a dimensional model ready for analytics.
The following diagram shows the transformation steps in Teradata Vantage using dbt:
As in all dbt projects, the folder models
contains the data models that the project materializes as tables, or views, according to the corresponding configurations at the project, or individual model level.
The models can be organized into different folders according to their purpose in the organization of the data warehouse/lake. Common folder layouts include a folder for staging
, a folder for core
, and a folder for marts
. This structure can be simplified without affecting the workings of dbt.
Staging models
In the original dbt Jaffle Shop tutorial the project's data is loaded from csv files located in the ./data
folder through dbt's seed
command. The seed
command is commonly used to load data from tables, however, this command is not designed to perform data loading.
In this demo we are assuming a more typical setup in which a tool designed for data loading, Airbyte, was used to load data into the datawarehouse/lake. Data loaded through Airbyte though is represented as raw JSON strings. From these raw data we are creating normalized staging views. We perform this task through the following staging models.
- The
stg_customers
model creates the normalized staging view forcustomers
from the_airbyte_raw_customers
table. - The
stg_orders
model creates the normalized view fororders
from the_airbyte_raw_orders
table - The
stg_payments
model creates the normalized view forpayments
from the_airbyte_raw_payments
table.
As the method of extracting JSON strings remains consistent across all staging models, we will provide a detailed explanation for the transformations using just one of these models as an example.
Below an example of transforming raw JSON data into a view through the stg_orders.sql
model :
- In this model the source is defined as the raw table
_airbyte_raw_orders
. - This raw table columns contains both metadata, and the actual ingested data. The data column is called
_airbyte_data
. - This column is of Teradata JSON type. This type supports the method JSONExtractValue for retrieving scalar values from the JSON object.
- In this model we are retrieving each of the attributes of interest and adding meaningful aliases in order to materialize a view.
Dimensional models (marts)
Building a Dimensional Model is a two-step process:
- First, we take the normalized views in
stg_orders
,stg_customers
,stg_payments
and build denormalized intermediate join tablescustomer_orders
,order_payments
,customer_payments
. You will find the definitions of these tables in./models/marts/core/intermediate
. - In the second step, we create the
dim_customers
andfct_orders
models. These constitute the dimensional model tables that we want to expose to our BI tool. You will find the definitions of these tables in./models/marts/core
.
Executing transformations
For executing the transformations defined in the dbt project we run:
You will get the status of each model as given below:
Test data
To ensure that the data in the dimensional model is correct, dbt allows us to define and execute tests against the data.
The tests are defined in ./models/marts/core/schema.yml
and ./models/staging/schema.yml
. Each column can have multiple tests configured under the tests
key.
- For example, we expect that
fct_orders.order_id
column will contain unique, non-null values.
To validate that the data in the produced tables satisfies the test conditions run:
If the data in the models satisfies all the test cases, the result of this command will be as below:
Generate documentation
Our model consists of just a few tables. In a scenario with more sources of data, and a more complex dimensional model, documenting the data lineage and what is the purpose of each of the intermediate models is very important.
Generating this type of documentation with dbt is very straight forward.
This will produce html files in the ./target
directory.
You can start your own server to browse the documentation. The following command will start a server and open up a browser tab with the docs' landing page:
Lineage graph
Summary
This tutorial demonstrated how to use dbt to transform raw JSON data loaded through Airbyte into dimensional model in Teradata Vantage. The sample project takes raw JSON data loaded in Teradata Vantage, creates normalized views and finally produces a dimensional data mart. We used dbt to transform JSON into Normalized views and multiple dbt commands to create models (dbt run
), test the data (dbt test
), and generate and serve model documentation (dbt docs generate
, dbt docs serve
).
Further reading
If you have any questions or need further assistance, please visit our community forum where you can get support and interact with other community members.