Run large bulkloads efficiently with Fastload
Deprecation notice
This how-to describes the Fastload utility. The utility has been deprecated; for new implementations, consider using Teradata Parallel Transporter (TPT).
Overview
We often need to move large volumes of data into Vantage. Teradata offers the Fastload utility, which can efficiently load large amounts of data into Teradata Vantage. This how-to demonstrates how to use Fastload. In this scenario, we will load over 300k records, over 40 MB of data, in a couple of seconds.
Prerequisites
- Access to a Teradata Vantage instance
Note
If you need a test instance of Vantage, you can provision one for free at https://clearscape.teradata.com
- Download Teradata Tools and Utilities (TTU) - supported platforms: Windows, MacOS, Linux (requires registration).
Install TTU
- Windows: Unzip the downloaded file and run `setup.exe`.
- MacOS: Unzip the downloaded file and run `TeradataToolsAndUtilitiesXX.XX.XX.pkg`.
- Linux: Unzip the downloaded file, go to the unzipped directory and run the installer script:
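For example (the installer script name can vary between TTU releases; check the README in the unzipped directory):

```shell
./setup.sh
```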
Get Sample data
We will be working with US tax filings for nonprofit organizations. Nonprofit tax filings are public data. The US Internal Revenue Service publishes them in an S3 bucket. Let's grab a summary of filings for 2020: https://s3.amazonaws.com/irs-form-990/index_2020.csv. You can use your browser, `wget`, or `curl` to save the file locally.
Create a database
Let's create a database in Vantage. Use your favorite SQL tool to run the following query:
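A minimal sketch; the database name `irs` matches the rest of this how-to, but the `PERM` size is an assumption, so adjust it to your environment:

```sql
-- PERM sized generously for the ~40 MB CSV plus error tables; adjust as needed
CREATE DATABASE irs
AS PERM = 200e6;
```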
Run Fastload
We will now run Fastload. Fastload is a command-line tool that is very efficient at uploading large amounts of data into Vantage. To achieve its speed, Fastload has several restrictions in place: it can only populate empty tables (inserts into already populated tables are not supported), it doesn't support tables with secondary indices, and it won't insert duplicate records, even if the table is a `MULTISET` table. For the full list of restrictions, check out the Teradata® Fastload Reference.
Fastload has its own scripting language. The language allows you to prepare the database with arbitrary SQL commands, declare the input source and define how the data should be inserted into Vantage. The tool supports both interactive and batch mode. In this section, we are going to use the interactive mode.
Let's start Fastload in interactive mode:
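Assuming TTU is installed and on your `PATH`, the tool is started with:

```shell
fastload
```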
First, let's log in to a Vantage database. I have Vantage Express running locally, so I'll use `localhost` as the hostname and `dbc` for the username and password:
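At the Fastload prompt, the logon command looks like this (replace the host and credentials with your own):

```sql
LOGON localhost/dbc,dbc;
```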
Now that we are logged in, I'm going to prepare the database. I'm switching to the `irs` database and making sure that the target table `irs_returns` and the error tables (more about error tables later) do not exist:
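A sketch of the preparation commands; the `DROP` statements will report an error if the tables don't exist yet, which is harmless on a first run:

```sql
DATABASE irs;
DROP TABLE irs_returns;
DROP TABLE irs_returns_err1;
DROP TABLE irs_returns_err2;
```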
I'll now create an empty table that can hold the data elements from the CSV file.
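A possible DDL; the column names and types below are inferred from the header of `index_2020.csv` and are an assumption, so adjust them if your file differs:

```sql
CREATE MULTISET TABLE irs_returns (
    return_id     INTEGER,
    filing_type   VARCHAR(5),
    ein           INTEGER,
    tax_period    INTEGER,
    sub_date      VARCHAR(22),
    taxpayer_name VARCHAR(100),
    return_type   VARCHAR(5),
    dln           BIGINT,
    object_id     BIGINT
)
PRIMARY INDEX (return_id);
```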
Now that the target table has been prepared, we can start loading the data. The `ERRORFILES` directive defines the error files. The error files are really tables that Fastload creates: the first contains information about data conversion and other issues, and the second keeps track of data uniqueness issues, e.g. primary key violations.
We instruct Fastload to save a checkpoint every 10k rows. This is useful in case we have to restart the job: Fastload will be able to resume from the last checkpoint.
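Putting the two together, the loading block begins like this (the error table names match the `DROP` statements used during preparation):

```sql
BEGIN LOADING irs_returns
    ERRORFILES irs_returns_err1, irs_returns_err2
    CHECKPOINT 10000;
```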
We also need to tell Fastload to skip the first row in the CSV file and start at record 2, since the first row contains column headers. Fastload also needs to know that it's a comma-separated file:
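In Fastload's scripting language, that looks like:

```sql
RECORD 2;                 -- skip the header row, start at record 2
SET RECORD VARTEXT ",";   -- input is comma-delimited variable-length text
```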
The `DEFINE` block specifies what columns we should expect. The `DEFINE` block also has a `FILE` attribute that points to the file with the data. Replace `FILE = /tmp/index_2020.csv;` with the location of your `index_2020.csv` file:
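A sketch of the `DEFINE` block. With `VARTEXT` input, every field is declared as `VARCHAR`; the field names (prefixed with `in_`) and lengths here are illustrative:

```sql
DEFINE in_return_id     (VARCHAR(19)),
       in_filing_type   (VARCHAR(5)),
       in_ein           (VARCHAR(19)),
       in_tax_period    (VARCHAR(19)),
       in_sub_date      (VARCHAR(22)),
       in_taxpayer_name (VARCHAR(100)),
       in_return_type   (VARCHAR(5)),
       in_dln           (VARCHAR(19)),
       in_object_id     (VARCHAR(19))
FILE = /tmp/index_2020.csv;
```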
Finally, we define the `INSERT` statement that will put the data into the database, and we close off the `LOADING` block:
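Assuming the illustrative column and field names used in this how-to:

```sql
INSERT INTO irs_returns (
    return_id, filing_type, ein, tax_period, sub_date,
    taxpayer_name, return_type, dln, object_id
) VALUES (
    :in_return_id, :in_filing_type, :in_ein, :in_tax_period, :in_sub_date,
    :in_taxpayer_name, :in_return_type, :in_dln, :in_object_id
);
END LOADING;
```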
Once the job has finished, we log off from the database to clean things up:
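```sql
LOGOFF;
```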
Here is what the entire script looks like:
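A complete sketch, assembled from the steps above; the table layout and field names are assumptions based on the `index_2020.csv` header:

```sql
LOGON localhost/dbc,dbc;

DATABASE irs;
DROP TABLE irs_returns;
DROP TABLE irs_returns_err1;
DROP TABLE irs_returns_err2;

CREATE MULTISET TABLE irs_returns (
    return_id     INTEGER,
    filing_type   VARCHAR(5),
    ein           INTEGER,
    tax_period    INTEGER,
    sub_date      VARCHAR(22),
    taxpayer_name VARCHAR(100),
    return_type   VARCHAR(5),
    dln           BIGINT,
    object_id     BIGINT
)
PRIMARY INDEX (return_id);

SET RECORD VARTEXT ",";
RECORD 2;

BEGIN LOADING irs_returns
    ERRORFILES irs_returns_err1, irs_returns_err2
    CHECKPOINT 10000;

DEFINE in_return_id     (VARCHAR(19)),
       in_filing_type   (VARCHAR(5)),
       in_ein           (VARCHAR(19)),
       in_tax_period    (VARCHAR(19)),
       in_sub_date      (VARCHAR(22)),
       in_taxpayer_name (VARCHAR(100)),
       in_return_type   (VARCHAR(5)),
       in_dln           (VARCHAR(19)),
       in_object_id     (VARCHAR(19))
FILE = /tmp/index_2020.csv;

INSERT INTO irs_returns (
    return_id, filing_type, ein, tax_period, sub_date,
    taxpayer_name, return_type, dln, object_id
) VALUES (
    :in_return_id, :in_filing_type, :in_ein, :in_tax_period, :in_sub_date,
    :in_taxpayer_name, :in_return_type, :in_dln, :in_object_id
);

END LOADING;
LOGOFF;
```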
Batch mode
To run our example in batch mode, simply save all instructions in a single file and run:
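Assuming the script was saved as `fastload.fld` (the file name is arbitrary):

```shell
fastload < fastload.fld
```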
Fastload vs. NOS
In our case, the file is in an S3 bucket, which means we can use Native Object Storage (NOS) to ingest the data:
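A sketch of the NOS approach; the `LOCATION` shorthand below reads directly from the public S3 bucket, and the column mapping may need adjusting to your table definition:

```sql
INSERT INTO irs_returns
SELECT * FROM (
    LOCATION = '/s3/s3.amazonaws.com/irs-form-990/index_2020.csv'
) AS d;
```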
The NOS solution is convenient as it doesn't depend on additional tools. It can be implemented using only SQL. It performs well, especially for Vantage deployments with a high number of AMPs as NOS tasks are delegated to AMPs and run in parallel. Also, splitting the data in object storage into multiple files may further improve performance.
Summary
This how-to demonstrated how to ingest large amounts of data into Vantage. We loaded hundreds of thousands of records into Vantage in a couple of seconds using Fastload.
Further reading
If you have any questions or need further assistance, please visit our community forum where you can get support and interact with other community members.