Run large bulkloads efficiently with Teradata Parallel Transporter (TPT)
Overview
We often need to move large volumes of data into Vantage. Teradata offers the Teradata Parallel Transporter (TPT) utility, which can efficiently load large amounts of data into Teradata Vantage. This how-to demonstrates how to use TPT. In this scenario, we will load over 300k records, more than 40 MB of data, in a couple of seconds.
Prerequisites
- Access to a Teradata Vantage instance.

  Note: if you need a test instance of Vantage, you can provision one for free at https://clearscape.teradata.com

- Download Teradata Tools and Utilities (TTU) - supported platforms: Windows, MacOS, Linux (requires registration).
Install TTU
The steps differ per platform:

- Windows: run the installation script in PowerShell.
- MacOS and Linux: unzip the downloaded file, go to the unzipped directory and run the installation script.
Get sample data
We will be working with the US tax filings for nonprofit organizations. Nonprofit tax filings are public data. The US Internal Revenue Service publishes them in an S3 bucket. Let's grab a summary of filings for 2020: https://storage.googleapis.com/clearscape_analytics_demo_data/TPT/index_2020.csv. You can use your browser, `wget`, or `curl` to save the file locally.
Create a database
Let's create a database in Vantage. Use your favorite SQL tool to run the following query:
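For example, a minimal sketch - the database name `irs` and the amount of permanent space are assumptions; roughly 40 MB of raw data fits comfortably in 120 MB:

```sql
-- Allocate 120e6 bytes (~120 MB) of permanent space for the loaded data
CREATE DATABASE irs
AS PERMANENT = 120e6;
```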
Run TPT
We will now run TPT. TPT is a command-line tool that can be used to load, extract, and update data in Teradata Vantage. These functions are implemented in so-called operators. For example, loading data into Vantage is handled by the Load operator. The Load operator is very efficient at uploading large amounts of data into Vantage. To be fast, it has several restrictions in place: it can only populate empty tables; inserts into already populated tables are not supported; it doesn't support tables with secondary indices; and it won't insert duplicate records, even if the table is a MULTISET table. For the full list of restrictions, check out Teradata® TPT Reference - Load Operator - Restrictions and Limitations.

TPT has its own scripting language. The language allows you to prepare the database with arbitrary SQL commands, declare the input source, and define how the data should be inserted into Vantage.

To load the CSV data into Vantage, we will define and run a job. The job will prepare the database: it will remove old log and error tables and create the target table. It will then read the file and insert the data into the database.
- Create a job variable file that will tell TPT how to connect to our Vantage database. Create file `jobvars.txt` and insert the following content. Replace `host` with the host name of your database (for example, if you are using a local Vantage Express instance, use `127.0.0.1`), `username` with the database user name, and `password` with the database password. Note that the preparation step (DDL) and the load step have their own configuration values, so the connection values need to be entered twice: once for the DDL step and once for the load step.
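As a sketch, `jobvars.txt` could look like the following. The variable names follow the TPT operator-template convention (a `DDL` prefix for the DDL operator, a `Load` prefix for the Load operator); the target and log table names are assumptions and must match whatever your job script uses:

```
/* Connection settings for the DDL (preparation) step */
DDLTdpId         = 'host'
DDLUserName      = 'username'
DDLUserPassword  = 'password'
DDLErrorList     = '3807'   /* ignore "object does not exist" when dropping tables */

/* Connection settings for the Load step */
LoadTdpId        = 'host'
LoadUserName     = 'username'
LoadUserPassword = 'password'
LoadTargetTable  = 'irs.irs_returns'
LoadLogTable     = 'irs.irs_returns_log'
```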
- Create a file with the following content and save it as `load.txt`. See the comments within the job file to understand its structure.
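The exact script depends on the CSV's columns. As a rough sketch using TPT's predefined operator templates (`$DDL`, `$LOAD`, `$FILE_READER`) - the schema, table names, and attribute values below are assumptions to adapt to your file:

```
DEFINE JOB load_irs_returns
DESCRIPTION 'Load the IRS filing index CSV into Vantage'
(
  /* Input schema - an assumption; match it to the header row of index_2020.csv */
  DEFINE SCHEMA schema_irs
  (
    in_return_id     VARCHAR(19),
    in_filing_type   VARCHAR(5),
    in_ein           VARCHAR(19),
    in_tax_period    VARCHAR(19),
    in_sub_date      VARCHAR(22),
    in_taxpayer_name VARCHAR(100),
    in_return_type   VARCHAR(5),
    in_dln           VARCHAR(19),
    in_object_id     VARCHAR(19)
  );

  /* Step 1: drop leftover log/error tables and create an empty target table.
     The Load operator can only populate an empty table. */
  STEP st_setup
  (
    APPLY
      ('DROP TABLE irs.irs_returns;'),
      ('DROP TABLE irs.irs_returns_log;'),
      ('DROP TABLE irs.irs_returns_e1;'),
      ('DROP TABLE irs.irs_returns_e2;'),
      ('CREATE TABLE irs.irs_returns (
          return_id     VARCHAR(19),
          filing_type   VARCHAR(5),
          ein           VARCHAR(19),
          tax_period    VARCHAR(19),
          sub_date      VARCHAR(22),
          taxpayer_name VARCHAR(100),
          return_type   VARCHAR(5),
          dln           VARCHAR(19),
          object_id     VARCHAR(19)
        );')
    TO OPERATOR ($DDL);
  );

  /* Step 2: read the delimited file and bulk-load it via the Load operator */
  STEP st_load
  (
    APPLY $INSERT TO OPERATOR ($LOAD)
    SELECT * FROM OPERATOR ($FILE_READER (schema_irs)
      ATTR (FileName      = 'index_2020.csv',
            Format        = 'Delimited',
            TextDelimiter = ',',
            SkipRows      = 1));
  );
);
```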
- Run the job:
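With both files in the current directory, the job can be launched with `tbuild` (`-f` names the job script, `-v` the job variables file; the trailing job name is arbitrary):

```shell
tbuild -f load.txt -v jobvars.txt irs_load
```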
A successful run will return logs that look like this:
TPT vs. NOS

In our case, the source file sits in an object storage bucket. That means we can use Native Object Storage (NOS) to ingest the data directly:
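A sketch of the NOS approach (the `/gs/` prefix addresses Google Cloud Storage, where this copy of the file is hosted; the table name is an assumption):

```sql
-- Create and populate the table straight from object storage in one statement
CREATE MULTISET TABLE irs.irs_returns_nos AS (
    SELECT *
    FROM (
        LOCATION = '/gs/storage.googleapis.com/clearscape_analytics_demo_data/TPT/index_2020.csv'
    ) AS d
) WITH DATA
NO PRIMARY INDEX;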
The NOS solution is convenient as it doesn't depend on additional tools. It can be implemented using only SQL. It performs well, especially for Vantage deployments with a high number of AMPs as NOS tasks are delegated to AMPs and run in parallel. Also, splitting the data in object storage into multiple files may further improve performance.
Summary
This how-to demonstrated how to ingest large amounts of data into Vantage. We loaded hundreds of thousands of records into Vantage in a couple of seconds using TPT.
Further reading
If you have any questions or need further assistance, please visit our community forum where you can get support and interact with other community members.