Last Updated: 2023-12-14

Background

Storing data in a data lake and querying it in place can help your organization reduce costs by avoiding more expensive data storage solutions. However, files landed in a data lake must be registered to a metastore before they can be queried. As your organization lands files from many different sources on an hourly, daily, or weekly basis, data managers may struggle to keep up with the file registration that querying the lake requires.

Role of the event manager

File registration is especially burdensome with the Azure Blob and Google Cloud Storage object stores, which don't include an event manager to manage files as they land. When the data managers in an organization can't register files quickly, data consumers have to wait for the newest data or work with stale data.

Schema discovery helps to manage data lakes

Starburst Galaxy's schema discovery feature helps your organization solve this challenge. Data managers run schema discovery on a data lake catalog to find and register new files with the metastore of their choice. This feature will decrease the amount of time it takes for the most up-to-date data to get into the hands of data consumers.

Tutorial scope

In this tutorial, you will configure a catalog in Starburst Galaxy that connects to an AWS S3 object store. You will then run schema discovery on this S3 data lake to discover existing schemas and tables.

If you are a data engineer, this tutorial will show you how easy it is to use schema discovery to facilitate data lake file registration.

Prerequisites

You need a Starburst Galaxy account to complete this tutorial. Please be sure to complete the tutorial titled Starburst Galaxy: Getting started before attempting this tutorial.

Learning outcomes

Upon successful completion of this tutorial, you will be able to:

About Starburst tutorials

Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.

As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.

Tutorial scenario

Burst Bank relies on AWS S3 object storage to house a significant portion of its data. As the bank expands, the volume of data pouring into this data lake increases daily.

Unfortunately, the data engineering team has been struggling to keep up with the task of registering new files to the metastore. This process is essential for enabling data consumers to utilize the data effectively for analytics.

Fortunately, Burst Bank utilizes Starburst Galaxy, a solution that offers schema discovery to address their current challenge. Your job is to help the data engineers at Burst Bank by showing them the schema discovery feature.

Background

Schema discovery is a process in data management that involves automatically identifying and understanding the structure of a database, data warehouse, or data lake.

In Starburst Galaxy, schema discovery works on a data lake by searching your object store to find the metadata corresponding to the schemas, tables, and partitions. Once the schemas have been discovered, a preview of the tables and columns is generated. Schemas can then be added easily, and queried in the normal way.
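To make the idea of searching an object store for schemas, tables, and partitions more concrete, here is a minimal Python sketch. It is not Starburst Galaxy's actual algorithm. It assumes a hypothetical layout of `<schema>/<table>/<col>=<value>/.../<file>`, and the object keys and the `transactions` table name are invented for illustration.

```python
from collections import defaultdict


def discover_tables(object_keys):
    """Group object keys into tables and collect their partition columns.

    Assumes a layout of <schema>/<table>/[<col>=<value>/...]<file>.
    This is an illustrative convention, not Starburst Galaxy's real logic.
    """
    tables = defaultdict(lambda: {"partition_columns": [], "files": 0})
    for key in object_keys:
        parts = key.strip("/").split("/")
        if len(parts) < 3:
            continue  # not deep enough to be <schema>/<table>/<file>
        schema, table, *rest = parts
        entry = tables[(schema, table)]
        entry["files"] += 1
        # Each directory segment shaped like col=value names a partition column.
        for segment in rest[:-1]:
            if "=" in segment:
                column = segment.split("=", 1)[0]
                if column not in entry["partition_columns"]:
                    entry["partition_columns"].append(column)
    return dict(tables)


# Hypothetical object keys, invented for this example.
keys = [
    "burst_bank/account/part-0000.parquet",
    "burst_bank/transactions/year=2023/month=12/part-0000.parquet",
    "burst_bank/transactions/year=2023/month=11/part-0001.parquet",
]
discovered = discover_tables(keys)
for (schema, table), info in sorted(discovered.items()):
    print(schema, table, info["partition_columns"], info["files"])
```

The sketch only infers structure from paths. A real discovery process would also read the files themselves to work out column names and types.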

Starburst Galaxy also tracks the schemas that have been added with schema discovery. You can use the Starburst Galaxy Web UI to see the changes made to a catalog's schema. The audit log will also display any errors associated with the columns in the current or previous schema.

Video: Use schema discovery in Starburst Galaxy

The following video walks through all the steps in this tutorial.

You can choose to watch the video and follow along using your own account. Alternatively, if you prefer, you can skip the video and proceed directly to the step-by-step instructions provided later in the tutorial.

Background

You're going to begin by signing in to Starburst Galaxy and setting your role.

For this tutorial, we have set up a shared AWS S3 training bucket containing sample data. All connection credentials will be provided in the steps below.

This is a quick step, but an important one.

Step 1: Sign in to Starburst Galaxy

Sign in to Starburst Galaxy in the usual way. If you have not already set up an account, create one before continuing.

Step 2: Set your role

Starburst Galaxy separates users by role. Configuring a new catalog will require access to a role with appropriate privileges. Today, you'll be using the accountadmin role.

Your current role is listed in the top right-hand corner of the screen.

Background

Adding a new AWS S3 catalog follows the same process as adding other data sources in Starburst Galaxy. This is one of the main ways that Starburst Galaxy is used to connect to data lakes.

The steps below will show you how to start the process of configuring a new catalog.

Step 1: Create a new catalog

Create a new catalog for your AWS S3 data source.

Step 2: Select Amazon S3 data source

Starburst Galaxy allows the creation of catalogs for a number of different data sources. In this case, you are going to create a new catalog in the Amazon S3 category.

Step 3: Input name and description

The catalog needs both a name and description. This ensures that you can find it later.

Background

When you connect Starburst Galaxy to a new data source, it is necessary to undergo an authentication process. This helps ensure that you are connecting the right data source and that you have the appropriate permissions.

Step 1: Authenticate using AWS access key

Starburst Galaxy allows you to configure several different authentication methods when creating a new catalog. This lets you connect to data sources of different types.

For this tutorial, you're going to choose the AWS access key method. It uses an access key and secret key pairing, which we will provide for this tutorial.

Background

Starburst Galaxy uses a metastore to keep track of the location of your data when it is added to the data lake, in this case to AWS S3.

You can use three different types of metastore with AWS S3:

Step 1: Select the Starburst Galaxy metastore

For this tutorial, you will use the Galaxy Metastore, which removes the need to configure and manage a separate Hive Metastore Service.

Step 2: Select the Apache Iceberg table format

Table formats define how the data files and metadata that make up a table are organized. They include modern, open table formats such as Iceberg and Delta Lake, as well as older table formats such as Hive.

In this tutorial, you will be using Iceberg, a modern open table format. It is considered best practice to use Iceberg whenever possible with Starburst Galaxy, which is designed to take advantage of its many enhanced features.

Background

Every new catalog connection includes a test before you connect it. This helps to ensure that you have input the correct credentials and allows you to quickly fix any problems before actually connecting.

Step 1: Test and Connect

You're almost there! Time to test the connection and then complete the process of creating your new AWS S3 catalog.

Step 2: Select read-only access

Starburst Galaxy allows you to grant or restrict read access. This is an important feature in production environments.

Read-only access will be sufficient for this tutorial.

Step 3: Add the catalog to a cluster

At this point, you can either add the new catalog to a cluster, or choose to skip this and connect it later.

In this tutorial, you're going to add the catalog to your cluster right away.

Background

Now that you've set up an AWS S3 catalog, it's time to test schema discovery for Burst Bank using Starburst Galaxy. This is the exciting part where you get to dive deep into the capabilities of this feature.

Let's get started!

Step 1: Run schema discovery

Getting started with schema discovery is easy. In fact, we're so sure you'll need it often that Starburst Galaxy automatically offers to run schema discovery when you add a new catalog to a cluster.

You just added the schema_discovery AWS S3 catalog, so it's time to run schema discovery on that catalog.

Step 2: Configure schema discovery

Now it's time to configure your schema discovery search by providing a Catalog location URL and Default Schema.
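Before pasting a location into the form, it can help to check that it is well formed. The Python sketch below assumes the Catalog location URL is an `s3://bucket/prefix/` URI, which is an assumption based on this being an S3 catalog. The helper name `parse_catalog_location` and the bucket name are hypothetical, so use the values given in the tutorial steps instead.

```python
from urllib.parse import urlparse


def parse_catalog_location(url):
    """Split an s3:// URI into (bucket, prefix); raise ValueError if malformed."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"expected an s3://bucket/prefix URI, got {url!r}")
    # Normalise the prefix: no leading slash, exactly one trailing slash.
    prefix = parsed.path.strip("/")
    return parsed.netloc, (f"{prefix}/" if prefix else "")


# Hypothetical location; substitute the URL provided in the tutorial steps.
bucket, prefix = parse_catalog_location("s3://example-training-bucket/burst_bank/")
print(bucket, prefix)
```

A check like this catches common slips, such as a missing scheme or an `https://` console link pasted in place of the bucket URI.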

Step 3: Review the 3 rules of schema discovery best practice

While you wait, review the following important information regarding best practices for schema naming.

Step 4: Review schema discovery and create tables

Your schema discovery process should now be complete!

Nine tables were discovered during schema discovery. Let's inspect them one by one and then create each of these tables in the schema_discovery.burst_bank schema.

Step 5: Return to the query editor

Now that you've created the new tables, it's time to test them out using queries. To do this, you'll need to go to the query editor.

Step 6: Query the account table

Now that you're in the query editor, it's time to select one of the new tables to query.

You could choose any of them, but let's go with the account table.
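As a rough illustration of the kind of query you might run, the Python sketch below builds a preview statement for the account table. The catalog, schema, and table names come from this tutorial. The helper functions and the row limit are hypothetical choices for this example, and the columns are not listed here, so it selects everything.

```python
def quote_identifier(name):
    """Double-quote a SQL identifier, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def preview_query(catalog, schema, table, limit=10):
    """Build a SELECT that previews the first rows of catalog.schema.table."""
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    qualified = ".".join(quote_identifier(part) for part in (catalog, schema, table))
    return f"SELECT * FROM {qualified} LIMIT {limit}"


# Names taken from this tutorial; the limit of 10 is an arbitrary choice.
sql = preview_query("schema_discovery", "burst_bank", "account")
print(sql)
```

The printed statement, `SELECT * FROM "schema_discovery"."burst_bank"."account" LIMIT 10`, can be pasted into the query editor. Quoting each identifier keeps the statement valid even if a discovered name contains unusual characters.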

Tutorial complete

Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.

Now that you've completed this tutorial, you should have a better understanding of just how easy it is to use schema discovery in Starburst Galaxy.

Continuous learning

At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.

Next steps

Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.

Tutorials available

Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!
