
Train ML models in Vantage using Database Analytic Functions

Overview

There are situations when you want to quickly validate a machine learning model idea: you have a model type in mind, you don't want to operationalize it with an ML pipeline just yet, and you simply want to test whether the relationship you suspect actually exists in the data. Sometimes, too, even a production deployment doesn't require constant retraining with MLOps. In such cases, you can use Database Analytic Functions to engineer features, train different ML models, score those models, and evaluate them with different model evaluation functions.

Prerequisites

You need access to a Teradata Vantage instance.

Note

If you need a test instance of Vantage, you can provision one for free at https://clearscape.teradata.com

Load the sample data

In this example we will use the sample data from the val database, specifically the accounts, customer, and transactions tables. We will be creating several tables along the way, and you might run into permission issues creating them in the val database, so let's create our own database, td_analytics_functions_demo.
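A minimal sketch of creating the database; the PERM and SPOOL sizes below are illustrative assumptions, so size them for your environment and run the statement as a user with sufficient free space:

```sql
-- Create a scratch database for this how-to.
-- Space allocations are assumptions; adjust for your environment.
CREATE DATABASE td_analytics_functions_demo
AS PERMANENT = 110e6,  -- ~110 MB of permanent space
   SPOOL = 120e6;      -- ~120 MB of spool space
```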

Note

You must have CREATE TABLE permissions on the database where you want to use the Database Analytic Functions.

Let's now create the accounts, customer, and transactions tables in our database td_analytics_functions_demo from the corresponding tables in the val database.
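One straightforward way to do this is with CREATE TABLE ... AS ... WITH DATA, copying each val table verbatim:

```sql
-- Copy the sample tables from the val database into our demo database.
CREATE TABLE td_analytics_functions_demo.customer AS (
    SELECT * FROM val.customer
) WITH DATA;

CREATE TABLE td_analytics_functions_demo.accounts AS (
    SELECT * FROM val.accounts
) WITH DATA;

CREATE TABLE td_analytics_functions_demo.transactions AS (
    SELECT * FROM val.transactions
) WITH DATA;
```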

Understand the sample data

Now that we have our sample tables loaded into td_analytics_functions_demo, let's explore the data. It is a simplistic, fictitious dataset of banking customers (about 700 rows), accounts (about 1,400 rows), and transactions (about 77,000 rows). They are related to each other in the following ways:

[Figure: banking data model showing how the customer, accounts, and transactions tables relate]

In later parts of this how-to we are going to explore whether we can build a model that predicts the average monthly balance a banking customer carries on their credit card, based on all non-credit-card-related variables in the tables.

Preparing the Dataset

We have data in three different tables that we want to join and create features from. Let's start by creating a joined table, as sketched below.
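A sketch of what the joined table might look like in SQL. The column names (cust_id, acct_type, starting_balance, and so on) are assumptions based on the standard val demo schema, and the exact feature definitions here are illustrative rather than canonical:

```sql
-- Aggregate per-customer features from the three tables.
-- Feature definitions are illustrative; adapt them to your data.
CREATE TABLE td_analytics_functions_demo.joined_table AS (
    SELECT
        c.cust_id
       ,MIN(c.income)          AS tot_income
       ,MIN(c.age)             AS tot_age
       ,MIN(c.years_with_bank) AS tot_cust_years
       ,MIN(c.nbr_children)    AS tot_children
       ,MIN(c.gender)          AS gender
       ,MIN(c.marital_status)  AS marital_status
       ,MIN(c.state_code)      AS state_code
       -- average balance per account type: checking (CK), savings (SV), credit card (CC)
       ,AVG(CASE WHEN a.acct_type = 'CK' THEN (a.starting_balance + a.ending_balance) / 2 END) AS ck_avg_bal
       ,AVG(CASE WHEN a.acct_type = 'SV' THEN (a.starting_balance + a.ending_balance) / 2 END) AS sv_avg_bal
       ,AVG(CASE WHEN a.acct_type = 'CC' THEN (a.starting_balance + a.ending_balance) / 2 END) AS cc_avg_bal
       -- average credit card transaction amount
       ,AVG(CASE WHEN a.acct_type = 'CC' THEN t.tran_amt END) AS cc_avg_tran_amt
    FROM td_analytics_functions_demo.customer     c
    JOIN td_analytics_functions_demo.accounts     a ON c.cust_id  = a.cust_id
    JOIN td_analytics_functions_demo.transactions t ON a.acct_nbr = t.acct_nbr
    GROUP BY c.cust_id
) WITH DATA;
```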

Let's now see how our data looks. The dataset has both categorical and continuous features, or independent variables. In our case, the dependent variable is cc_avg_bal, the customer's average credit card balance.

[Figure: sample rows from the joined table]

Feature Engineering

Looking at the data, we see several features that we can take into consideration for predicting cc_avg_bal.

TD_OneHotEncodingFit

Our dataset has some categorical features, such as gender, marital status, and state code. We will leverage the Database Analytic Function TD_OneHotEncodingFit to encode these categories as one-hot numeric vectors.
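A sketch of the fit step, assuming gender, marital_status, and state_code are the categorical columns; the category counts and output table name are illustrative:

```sql
-- Fit one-hot encoding parameters for the categorical columns.
CREATE TABLE td_analytics_functions_demo.one_hot_encoding_fit AS (
    SELECT * FROM TD_OneHotEncodingFit (
        ON td_analytics_functions_demo.joined_table AS InputTable
        USING
        IsInputDense ('true')
        TargetColumn ('gender', 'marital_status', 'state_code')
        CategoryCounts (2, 4, 33)    -- distinct categories per column (illustrative)
        OtherColumnName ('other')    -- bucket for categories beyond the counts above
        Approach ('Auto')            -- discover the categories from the data
    ) AS dt
) WITH DATA;
```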

TD_ScaleFit

If we look at the data, some columns, such as tot_income, tot_age, and ck_avg_bal, have values in very different ranges. For optimization algorithms like gradient descent, it is important to normalize values to the same scale for faster convergence, scale consistency, and better model performance. We will leverage the TD_ScaleFit function to normalize values measured on different scales.
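A sketch of the scaling fit, using RANGE scaling, which maps each column onto [0, 1]; the target column list is an illustrative subset of the continuous features:

```sql
-- Fit scaling parameters for the continuous columns.
CREATE TABLE td_analytics_functions_demo.scale_fit AS (
    SELECT * FROM TD_ScaleFit (
        ON td_analytics_functions_demo.joined_table AS InputTable
        USING
        TargetColumns ('tot_income', 'tot_age', 'ck_avg_bal',
                       'sv_avg_bal', 'cc_avg_tran_amt')
        ScaleMethod ('RANGE')    -- (x - min) / (max - min), i.e. rescale to [0, 1]
    ) AS dt
) WITH DATA;
```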

TD_ColumnTransformer

Teradata's Database Analytic Functions typically operate in pairs for data transformations. The first function is dedicated to "fitting" the data; the second then uses the parameters derived from the fitting step to execute the actual transformation. TD_ColumnTransformer takes the fit tables as input and transforms the input table columns in a single operation.
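A sketch of the combined transform, assuming the two fit tables created above; each fit table is passed as a DIMENSION input:

```sql
-- Apply both fitted transformations in a single pass.
CREATE TABLE td_analytics_functions_demo.transformed_table AS (
    SELECT * FROM TD_ColumnTransformer (
        ON td_analytics_functions_demo.joined_table         AS InputTable
        ON td_analytics_functions_demo.one_hot_encoding_fit AS OneHotEncodingFitTable DIMENSION
        ON td_analytics_functions_demo.scale_fit            AS ScaleFitTable DIMENSION
    ) AS dt
) WITH DATA;
```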

Once we perform the transformation, our categorical columns are one-hot encoded and our numeric values are scaled, as can be seen in the images below. For example, tot_income is now in the range [0, 1], and gender is one-hot encoded into gender_0, gender_1, and gender_other.

[Figure: tot_income scaled to the range [0, 1]]

[Figure: gender one-hot encoded into gender_0, gender_1, and gender_other]

Train Test Split

Now that our dataset is ready, with features scaled and encoded, let's split it into training (75%) and testing (25%) parts. Teradata's Database Analytic Functions provide the TD_TrainTestSplit function, which we'll leverage to split our dataset.
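A sketch of the split, keyed on cust_id; the seed value is arbitrary and only serves to make the split reproducible:

```sql
-- Mark each row as training (75%) or testing (25%).
CREATE TABLE td_analytics_functions_demo.split_table AS (
    SELECT * FROM TD_TrainTestSplit (
        ON td_analytics_functions_demo.transformed_table AS InputTable
        USING
        IDColumn ('cust_id')
        TrainSize (0.75)
        TestSize (0.25)
        Seed (42)    -- arbitrary; fixes the split for reproducibility
    ) AS dt
) WITH DATA;
```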

As can be seen in the image below, the function adds a new column TD_IsTrainRow.

[Figure: output with the TD_IsTrainRow column added]

We'll use TD_IsTrainRow to create two tables, one for training and the other for testing.
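For example, assuming the split_table created above, where TD_IsTrainRow is 1 for training rows and 0 for testing rows:

```sql
-- Materialize the training and testing sets from the split flag.
CREATE TABLE td_analytics_functions_demo.training_table AS (
    SELECT * FROM td_analytics_functions_demo.split_table
    WHERE TD_IsTrainRow = 1
) WITH DATA;

CREATE TABLE td_analytics_functions_demo.testing_table AS (
    SELECT * FROM td_analytics_functions_demo.split_table
    WHERE TD_IsTrainRow = 0
) WITH DATA;
```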

Training with Generalized Linear Model

We will now use the TD_GLM Database Analytic Function to train on our training dataset. TD_GLM is a generalized linear model (GLM) function that performs regression and classification analysis on data sets. Here we use a number of input columns, such as tot_income, ck_avg_bal, and cc_avg_tran_amt, plus the one-hot encoded values for marital status, gender, and state. cc_avg_bal is our dependent or response column; it is continuous, so this is a regression problem. We set Family to Gaussian for regression (it would be Binomial for classification).

The Tolerance parameter specifies the minimum improvement in prediction accuracy between iterations; if the improvement falls below it, the model stops iterating. MaxIterNum specifies the maximum number of iterations allowed. Training concludes when either condition is met first. In the example below, the model CONVERGED after 58 iterations.
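A sketch of the training call. The input column list is an illustrative subset (the one-hot column names follow the gender_0/gender_1 pattern shown earlier), and the Tolerance and MaxIterNum values are examples:

```sql
-- Train a Gaussian (regression) GLM on the training set.
CREATE TABLE td_analytics_functions_demo.GLM_model_training AS (
    SELECT * FROM TD_GLM (
        ON td_analytics_functions_demo.training_table AS InputTable
        USING
        InputColumns ('tot_income', 'ck_avg_bal', 'cc_avg_tran_amt',
                      'gender_0', 'gender_1', 'marital_status_0', 'marital_status_1')
        ResponseColumn ('cc_avg_bal')
        Family ('Gaussian')    -- 'Binomial' would make this a classifier
        MaxIterNum (300)
        Tolerance (0.001)
        Intercept ('true')
    ) AS dt
) WITH DATA;
```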

[Figure: trained GLM output showing the model CONVERGED after 58 iterations]

Scoring on Testing Dataset

We will now use our model GLM_model_training to score our testing dataset testing_table using the TD_GLMPredict Database Analytic Function.
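A sketch of the scoring call; Accumulate carries the observed cc_avg_bal through to the output so we can evaluate against it in the next step:

```sql
-- Score the held-out testing set with the trained model.
CREATE TABLE td_analytics_functions_demo.GLM_predictions AS (
    SELECT * FROM TD_GLMPredict (
        ON td_analytics_functions_demo.testing_table      AS InputTable
        ON td_analytics_functions_demo.GLM_model_training AS ModelTable DIMENSION
        USING
        IDColumn ('cust_id')
        Accumulate ('cc_avg_bal')   -- keep the observed value alongside the prediction
    ) AS dt
) WITH DATA;
```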

[Figure: scored testing dataset with GLM predictions]

Model Evaluation

Finally, we evaluate our model on the scored results. Here we use the TD_RegressionEvaluator function. The model can be evaluated based on metrics such as R2, RMSE, and F_score.
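A sketch of the evaluation call, assuming the predictions table from the previous step and that TD_GLMPredict emitted its output in a column named prediction:

```sql
-- Compare observed vs. predicted credit card balances.
SELECT * FROM TD_RegressionEvaluator (
    ON td_analytics_functions_demo.GLM_predictions AS InputTable
    USING
    ObservationColumn ('cc_avg_bal')
    PredictionColumn ('prediction')
    Metrics ('RMSE', 'R2')    -- pick the metrics relevant to your use case
) AS dt;
```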

[Figure: regression evaluation metrics for the GLM model]

Note

The purpose of this how-to is not to describe feature engineering but to demonstrate how we can leverage different Database Analytic Functions in Vantage. The model results might not be optimal, and the process of building the best possible model is beyond the scope of this article.

Summary

In this quick start we have learned how to create ML models using Teradata Database Analytic Functions. We built our own database, td_analytics_functions_demo, with customer, accounts, and transactions data from the val database. We performed feature engineering by transforming the columns using TD_OneHotEncodingFit, TD_ScaleFit, and TD_ColumnTransformer. We then used TD_TrainTestSplit to split the data into training and testing sets, trained a model on the training dataset with TD_GLM, and scored the testing dataset with TD_GLMPredict. Finally, we evaluated the scored results using the TD_RegressionEvaluator function.

Further reading

Note

If you have any questions or need further assistance, please visit our community forum where you can get support and interact with other community members.
