
Data Transfer from AWS S3 to Teradata Vantage Using dagster-teradata

Overview

This document explains how to transfer data in CSV, JSON, and Parquet formats from AWS S3 to Teradata Vantage using dagster-teradata. It covers the setup, configuration, and execution steps needed to build the transfer pipeline.

Prerequisites

  • Access to a Teradata Vantage instance.

    Note

    If you need a test instance of Vantage, you can provision one for free at https://clearscape.teradata.com

  • Python 3.9 or higher (Python 3.12 is recommended).

  • pip
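The Python prerequisite can be checked from the interpreter itself. The following is a minimal sketch: the helper name meets_minimum is hypothetical, and the 3.9 and 3.12 thresholds are taken from the list above.

```python
import sys

MINIMUM = (3, 9)       # lowest version listed in the prerequisites
RECOMMENDED = (3, 12)  # version the guide recommends


def meets_minimum(version_info=sys.version_info, minimum=MINIMUM):
    """Return True when the given version tuple is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if not meets_minimum():
        print(f"Python {current} is too old; 3.9 or higher is required.")
    elif tuple(sys.version_info[:2]) < RECOMMENDED:
        print(f"Python {current} works; 3.12 is recommended.")
    else:
        print(f"Python {current} meets the recommendation.")
```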

Setting Up a Virtual Environment

A virtual environment is recommended to isolate project dependencies and avoid conflicts with system-wide Python packages. Here’s how to set it up:

Run in PowerShell:
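A virtual environment can also be created from Python with the standard-library venv module, which works the same way on Windows, macOS, and Linux. This is a minimal sketch, not the guide's own command: the function name create_virtual_env and any directory name passed to it are hypothetical.

```python
import venv
from pathlib import Path


def create_virtual_env(target_dir, with_pip=True):
    """Create a virtual environment in `target_dir` and return its path."""
    env_path = Path(target_dir).resolve()
    # venv.create builds the directory layout and, if requested, bootstraps pip.
    venv.create(env_path, with_pip=with_pip)
    return env_path
```

For example, create_virtual_env(".venv") creates the environment in the current directory. It still has to be activated from the shell, using the activation script in its Scripts directory on Windows or its bin directory on macOS and Linux.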

Install dagster and dagster-teradata

With your virtual environment active, the next step is to install dagster and the Teradata provider package (dagster-teradata) to interact with Teradata Vantage.

  1. Install the Required Packages:

  2. Verify the Installation:

    To confirm that Dagster is correctly installed, run:

    If installed correctly, it should show the version of Dagster.
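The installed versions can also be read from Python through the standard-library importlib.metadata module. This is a minimal sketch; the helper name installed_version is hypothetical, and the package names are the two mentioned above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("dagster", "dagster-teradata"):
        found = installed_version(name)
        print(f"{name}: {found if found else 'not installed'}")
```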

Initialize a Dagster Project

Now that you have the necessary packages installed, the next step is to create a new Dagster project.

Scaffold a New Dagster Project

Run the following command:

This command will create a new project named dagster-teradata-s3. It will automatically generate the following directory structure:

Refer to the Dagster documentation to learn more about this directory structure.

You need to modify the definitions.py file inside the dagster-teradata-s3/dagster-teradata-s3 directory.

Step 1: Open definitions.py in the dagster-teradata-s3/dagster-teradata-s3 Directory

Locate and open the file where Dagster job definitions are configured.
This file manages resources, jobs, and assets needed for the Dagster project.

Step 2: Implement AWS S3 to Teradata Transfer in Dagster

Explanation of the Code

  1. Resource Configuration for S3 and Teradata:

    • The code configures resources for interacting with S3 and Teradata.
    • The S3Resource is created using AWS credentials (access key, secret key, and session token) from environment variables.
    • The TeradataResource is set up with connection details (host, user, password, database) for Teradata from environment variables.
  2. Defining Operations:

    • drop_existing_table: This operation uses the Teradata resource to drop the "people" table in Teradata.
    • ingest_s3_to_teradata: This operation checks if the "Tables Dropped" status was returned from the previous operation. If true, it ingests data from an S3 bucket to the Teradata table people using the S3 resource. If the table wasn't dropped, it raises an error.
  3. Job Execution:

    • The example_job is defined to execute the two operations sequentially: first, drop the existing table, and then ingest data from S3 to Teradata.
    • The job is registered under the Definitions object for execution within the Dagster environment.
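The status hand-off between the two operations can be illustrated in plain Python, without Dagster or a database. This is a sketch under stated assumptions, not the project's definitions.py: FakeTeradata and its methods drop_table and load_from_s3 are hypothetical stand-ins that only record calls, and the functions are ordinary functions rather than Dagster ops.

```python
class FakeTeradata:
    """Hypothetical stand-in that records calls instead of running SQL."""

    def __init__(self):
        self.calls = []

    def drop_table(self, table_name):
        self.calls.append(("drop_table", table_name))

    def load_from_s3(self, source_key, table_name):
        self.calls.append(("load_from_s3", source_key, table_name))


def drop_existing_table(teradata):
    """Drop the target table and report a status for the next step."""
    teradata.drop_table("people")
    return "Tables Dropped"


def ingest_s3_to_teradata(teradata, status, source_key):
    """Load from S3 only when the previous step reported success."""
    if status != "Tables Dropped":
        raise ValueError(f"Expected status 'Tables Dropped', got {status!r}")
    teradata.load_from_s3(source_key, "people")


def example_job(teradata, source_key):
    """Run both steps in order, passing the status from one to the next."""
    status = drop_existing_table(teradata)
    ingest_s3_to_teradata(teradata, status, source_key)
```

Passing the return value of the first function into the second is what makes the steps run in sequence: the ingest step cannot start until the drop step has produced its status.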

Running the Pipeline

After setting up the project, you can now run your Dagster pipeline:

  1. Start the Dagster Dev Server: In your terminal, navigate to the root directory of your project and run:

    dagster dev

    After you run the command, the Dagster logs are displayed directly in the terminal. Any errors encountered during startup are also logged here. Once you see a message similar to:

    It indicates that the Dagster web server is running successfully. At this point, you can proceed to the next step.

  2. Access the Dagster UI: Open a web browser and navigate to http://127.0.0.1:3000. This will open the Dagster UI where you can manage and monitor your pipelines.
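Whether the web server is answering can also be checked from a script before opening the browser. This is a minimal sketch using only the standard library; the helper name is_server_up is hypothetical, and the default URL is the address given above.

```python
import urllib.error
import urllib.request


def is_server_up(url="http://127.0.0.1:3000", timeout=2.0):
    """Return True if an HTTP server answers at `url`, False otherwise."""
    # An empty ProxyHandler forces a direct connection, which is what a
    # local development server needs even when proxy variables are set.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except OSError:
        # Covers connection refused, timeouts, and URLError.
        return False
```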

dagster-teradata-s31.png

In the Dagster UI, you will see the following:

  • The job example_job is displayed.
  • In the middle, you can view the lineage of each @op, showing its dependencies and how each operation is related to others.

dagster-teradata-s32.png

Go to the "Launchpad" and provide the configuration for the S3Resource and the TeradataResource as follows:

Replace <AWS_ACCESS_KEY_ID>, <AWS_SECRET_ACCESS_KEY>, <AWS_SESSION_TOKEN>, <TERADATA_HOST>, <TERADATA_USER>, <TERADATA_PASSWORD>, and <TERADATA_DATABASE> with the actual values for your S3 and Teradata configuration. Once the configuration is done, click on "Launch Run" to start the process.
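The shape of this configuration can be sketched as a Python mapping built from the same seven values. Several names here are assumptions rather than confirmed details: the resource keys "s3" and "teradata", the field names inside each config block (modelled on the credentials and connection details described earlier), and the helper name build_run_config.

```python
# Maps each assumed resource field to the placeholder it is filled from.
S3_FIELDS = {
    "aws_access_key_id": "AWS_ACCESS_KEY_ID",
    "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY",
    "aws_session_token": "AWS_SESSION_TOKEN",
}
TERADATA_FIELDS = {
    "host": "TERADATA_HOST",
    "user": "TERADATA_USER",
    "password": "TERADATA_PASSWORD",
    "database": "TERADATA_DATABASE",
}


def build_run_config(env):
    """Build a resources config mapping from a dict of environment values."""
    needed = list(S3_FIELDS.values()) + list(TERADATA_FIELDS.values())
    missing = [name for name in needed if not env.get(name)]
    if missing:
        raise ValueError("Missing values: " + ", ".join(missing))
    return {
        "resources": {
            # Resource keys "s3" and "teradata" are assumptions.
            "s3": {"config": {f: env[v] for f, v in S3_FIELDS.items()}},
            "teradata": {
                "config": {f: env[v] for f, v in TERADATA_FIELDS.items()}
            },
        }
    }
```

Calling build_run_config(os.environ) reads the values from the current environment and fails early, naming every value that is missing, instead of launching a run with incomplete credentials.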

dagster-teradata-s33.png

The Dagster UI allows you to visualize the pipeline's progress, view logs, and inspect the status of each step.

Arguments Supported by s3_blob_to_teradata

  • s3 (S3Resource):
    The S3Resource object used to interact with the S3 bucket.

  • s3_source_key (str):
    The URI specifying the location of the S3 bucket. The URI format is:
    /s3/YOUR-BUCKET.s3.amazonaws.com/YOUR-BUCKET-NAME
    For more details, refer to:
    Teradata Documentation - Native Object Store

  • teradata_table (str):
    The name of the Teradata table to which the data will be loaded.

  • public_bucket (bool):
    Indicates whether the provided S3 bucket is public. If True, the objects within the bucket can be accessed via a URL without authentication. If False, the bucket is considered private, and authentication must be provided.
    Defaults to False.

  • teradata_authorization_name (str):
    The name of the Teradata Authorization Database Object, which controls access to the S3 object store.
    For more details, refer to:
    Teradata Vantage Native Object Store - Setting Up Access
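The s3_source_key format shown above can be assembled and checked with a small helper. This is a minimal sketch: both function names are hypothetical, and treating the trailing segment as a path inside the bucket is an assumption about the URI layout.

```python
S3_PREFIX = "/s3/"
S3_HOST_SUFFIX = ".s3.amazonaws.com"


def build_s3_source_key(bucket, path=""):
    """Build a source key of the form /s3/<bucket>.s3.amazonaws.com/<path>."""
    if not bucket or "/" in bucket:
        raise ValueError("bucket must be a non-empty name without slashes")
    return f"{S3_PREFIX}{bucket}{S3_HOST_SUFFIX}/{path.lstrip('/')}"


def looks_like_s3_source_key(value):
    """Check that a string follows the /s3/<host>/<path> layout."""
    if not value.startswith(S3_PREFIX):
        return False
    host, _, _ = value[len(S3_PREFIX):].partition("/")
    # The host must have a bucket name in front of the fixed suffix.
    return host.endswith(S3_HOST_SUFFIX) and len(host) > len(S3_HOST_SUFFIX)
```

Checking the value before a run catches a common slip, such as passing an s3:// URL instead of the /s3/ form.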

Summary

This guide shows how to use dagster-teradata to transfer CSV, JSON, and Parquet data from AWS S3 to Teradata Vantage.
