Aller au contenu principal

Use geographic reference data with Vantage

Overview

This post demonstrates how you can leverage any geographic dataset in GeoJson format and use it for geospatial analytics in Teradata Vantage, with just a few lines of code.

Today we be gathering reference geographical data (official maps, points of interest, etc...) form public sources and use it to support our day to day analytics.

You will learn two methods to get your GeoJson data into Teradata Vantage:

  1. Load it as a single document and use native ClearScape analytics functions to parse it into a table usable for analytics.
  2. Lightly transform it in native Python as we load it into Vantage to produce an analytics ready dataset.

The first method is a straig forward ELT pattern for semi-structured format processing in Vantage with a single SQL statement, the second one involves some lightweight preparation in (pure) Python and may allow more flexibility (for example to add early quality checks or optimize the load of large documents).

Prerequisites

You will need:

  • Access to a Teradata Vantage instance.
    Remarque

    If you need a test instance of Vantage, you can provision one for free at https://clearscape.teradata.com

  • A Python 3 interpreter
  • A SQL Client

Option 1: Load a GeoJson document into Vantage

Here we will load a GeoJson document as a single Character Large OBject (CLOB) into the Vantage Data Store and use a single SQL statement, backed by ClearScape Analytics native functions, to parse this document into a usable structure for geospatial analytics.

Get and load the GeoJson document

The http://geojson.xyz/ website is a fantastic source for open geographical data in GeoJson format. We will load the "Populated Places" dataset that provides with a list of over 1000 significant world cities.

Open you favourite Python 3 interpreter and make sure you have the following packages installed:

  • wget
  • teradatasql
  • getpass

Download and read the cities dataset:

Load the GeoJson document in Vantage

Modify this code as needed with your Vantage host name, user name and specify an advanced login mechanism if any (eg. LDAP, Kerberos...) with the logmech parameter if you need to. All the connection parameters are documented on the teradatasql PyPi page there: https://pypi.org/project/teradatasql/

The code below simply creates a Vantage connection, and opens a cursor creating a table and loading it with our file.

Use the map from Vantage

Now open your favourite SQL client and connect to your Vantage system.

We will use ClearScape analytics JSON functions to parse our GeoJson document and extract the most relevant properties and the geometry itself (the coordinates of the city) for each feature (each feature representing a city in this example). We then use the GeomFromGeoJSON function to cast our geometry as a native Vantage geometry data type (ST_GEOMETRY).

For user convenience, will wrap all this SQL code in a view:

That's all, you can now view the prepared geometry data as a table, ready to enrich your analytics:

Result:

city_namecountry_nameregion_namecode_country_isoa3city_coord
PotenzaItalyBasilicataITAPOINT (15.798996495640267 40.642002130098206)
MariehamnFinlandFinströmALDPOINT (19.949004471869102 60.096996184895431)
RamallahIndeterminatePSEPOINT (35.206209378189556 31.902944751424059)
PoitierFrench RepublicPoitou-CharentesFRAPOINT (0.333276528534554 46.583292255736581)
Clermont-FerrandFrench RepublicAuvergneFRAPOINT (3.080008095928406 45.779982115759424)

Calculate the distance between two cities:

Result:

city_coord.ST_SPHERICALDISTANCE(city_coord)
1.9265006861079421e+06

Option 2: Prepare a GeoJson document with Python and load it into Vantage

The previous example demonstrated how to load a complete document as a large object into Teradata Vantage and use built in analytic functions to parse it into a usable dataset.

This is convenient but limited: we need to parse this document every time we need to use it, as the original document is not directly usable for analytics, JSON documents are currently limited to 16MB in Vantage and it may be inconvenient to fix data quality or formatting issues within the document stored as a CLOB.

In this example, we will parse our JSON document using the Python json package and load it as a table that can be used directly and efficiently for analysis.

Python json and list manipulation functions, along with the Teradata SQL driver for Python make this process really simple and efficient.

For this example, we will use the boundaries of the world countries available on https://datahub.io.

Let's get into it.

Open you favourite Python 3 interpreter and make sure you have the following packages installed:

  • wget
  • teradatasql
  • getpass

Get and load the GeoJson document

Open the GeoJson file and type it as a dictionary

Optional Check the content of the file

The good thing about loading this JSON in memory, if you are using an interactive Python terminal, is that you can now explore the document to understand its structure. For example

What we have here is a collection of GeoFeatures (as earlier).

We will now lightly model this data in a Vantage table, for that:

  • We will load each feature as a raw.
  • We will extract the properties that look interesting for immediate analysis (in our example, the country name and ISO code).
  • We will extract the geometry itself and load it as a separate column.

To load a set of rows with a teradatasql cursor, we need to represent each row as an array (or tuples) of values, and the complete dataset as an array of all the row-arrays. This is fairly easy to do as a list comprehension

For example:

NB: Not featured here, but recommended for richer datasets, consider loading the entire and original feature payload as a separate column (this is a JSON document). This will allow you to go back to the original record and extract new properties that you may have missed during your first analysis but have become relevant, directly in SQL and without having to reload the file entirely.

Create a Vantage connection and load our file in a staging table

Modify this code as needed with your Vantage host name, user name and specify an advanced login mechanism if any (eg. LDAP, Kerberos...) with the logmech parameter if you need to. All the connection parameters are documented on the teradatasql PyPi page there: https://pypi.org/project/teradatasql/

The code below simply creates a Vantage connection, and opens a cursor creating a table and loading it with our list.

Create and our geography refernce table

The code below performs the table creation from the Python interpreter, you can also run the sql statement defined below in your prefered SQL client you might as well simply define this logic as a SQL view to avoid having to refresh this table.

We will use ClearScape analytics the GeomFromGeoJSON function to cast our geometry as a native Vantage geometry data type (ST_GEOMETRY).

Use your data

That's all, you may now query your tables using your favourite SQL client and Teradata's excellent Geospatial data types and analytic functions.

For example, using the two datasets we have loaded during this tutorial, check in what countries are

city_namecity_coordcountry_nm
AcapulcoPOINT (-99.915979046410712 16.849990864016206)Mexico Aosta
POINT (7.315002595706176 45.737001067072299)Italy AnconaPOINT (13.499940550397127 43.600373554552903)
Italy AlbanyPOINT (117.891604776075155 -35.016946595501224)Australia

Summary

Note that none of the above code does not implement any control procedure or checks to, for example, manage the state of the target tables, manage locking, control error codes, etc... This is meant to be a demonstrations of the available features to acquire and use geospatial reference data.

Consider using SQLAlchemy ORM if you are defining your pipeline in Python, xref:dbt.adoc[dbt], or your favorite ELT and orchestration toolset to create your products you can operationalize.

You now can know how to get any open geographic dataset and use it to augment your analytics with Teradata Vantage!

Remarque

If you have any questions or need further assistance, please visit our community forum where you can get support and interact with other community members.

Également intéressant