Explore table partitioning and bucketing

1. Tutorial overview

Last Updated: 2024-10-24

Background

Several factors can affect the performance of a data lake. Ensuring optimal performance requires attention to how data is ingested, stored, and processed. This tutorial focuses on how table partitioning and bucketing, used independently or together, can improve data lake performance. Through hands-on exercises, you will see how both strategies allow a query to read far less data than a full table scan when its filter aligns with the partitioning or bucketing scheme.

Learning objectives

Once you've completed this tutorial, you will be able to:

  • Create partitioned tables.
  • Explain how to write a query on a partitioned table so that less data is read.
  • Create bucketed tables.
  • Explain how to write a query on a bucketed table so that less data is read.

Prerequisites

  • You must have a Starburst Galaxy account set up before you can complete this tutorial. You can find instructions for creating a free account here.
  • You must complete the prerequisite tutorial titled Configure a Starburst Galaxy data lake catalog and schema before attempting this tutorial. It will guide you in setting up the required environment for this tutorial.

2. What is table partitioning?

Background

Table partitioning is an optimization technique that data lakes employ as an alternative to indexing. A partition is simply a way to divide data into subdirectories designed around logical groupings. In other words, the field(s) you partition a table on become subfolders of the table's data lake folder. When a WHERE clause filters on a partitioned field, only the folder(s) that meet the criteria are read; the rest are ignored.

For example, suppose you have five years of historical data. Rather than putting all of that data into a single folder, you may choose to create a folder for each year, ensuring that the data for each year is in the appropriate folder. This is considered partitioning by year. Efficiency is gained when a query includes the partition key in the predicate or join, because the query engine can avoid reading partitions that don't match.
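The five-years-of-data example can be sketched in a few lines of Python. This is a toy in-memory model, not how a query engine actually works; the folder names and rows are hypothetical. The point is that the engine can decide which folders to read by inspecting folder names alone, before touching any data.

```python
# Minimal sketch of partition pruning over a toy in-memory "folder" layout.
# (Hypothetical example data; not Starburst internals.)
partitions = {
    "year=2020": [{"orderkey": 1, "year": 2020}],
    "year=2021": [{"orderkey": 2, "year": 2021}],
    "year=2022": [{"orderkey": 3, "year": 2022}],
}

def query_by_year(year: int):
    """Read only the folder whose name matches the filter; skip the rest."""
    folder = f"year={year}"
    folders_read = [folder] if folder in partitions else []
    rows = [row for f in folders_read for row in partitions[f]]
    return rows, folders_read

rows, folders_read = query_by_year(2021)
# Only one of the three folders is touched for this filter.
```

A filter on a non-partition column would give the engine no way to rule out any folder, which is exactly the full-scan behavior you'll observe in the next section.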

3. Create and query an unpartitioned table

Background

Your first exercise will be to create two tables with identical data. The only difference will be that one is partitioned and one is not. This will allow you to compare the amount of data being read when each table is queried. You'll begin with the unpartitioned table.

Step 1: Verify cluster, catalog, and schema

When you use the query editor in Starburst Galaxy, you have the option to set the cluster, catalog, and schema for your session. This lets you avoid typing out the fully-qualified table name every time you write a query. As a reminder, you should have created the catalog and schema you'll be using during the prerequisite tutorial.

  • Log in to your Starburst Galaxy account.
  • Use the left-hand navigation menu to select Query >> Query editor.
  • Use the dropdown menus above the query editor window to select the following:
  • Cluster: aws-us-east-1-free
  • Catalog: tmp_cat
  • Schema: tmp_firstname_lastname_postalcode

Step 2: Create table

Next, it's time to create a new, unpartitioned table following the format and contents of the tpch.sf1.orders table.

  • Use the following SQL to create a new table from the tpch.sf1.orders table:
CREATE TABLE orders_no_pttn
   WITH (
      type = 'hive'
   )
AS
   SELECT *
     FROM tpch.sf1.orders;

Step 3: Query new table

Now that you've created a new table, it's time to query it. After you run the query, you'll observe how much data was read from your unpartitioned table.

  • Run the following query. Your result should return 941 rows:
SELECT * FROM orders_no_pttn
 WHERE totalprice >= 430000.00;

Step 4: Locate metrics for data read

When you run a query in Starburst Galaxy, you can learn a lot of information about the query by going to the query details page.

  • Click the Query details icon (denoted by an eye) on the bottom right of the screen.
  • Select the Advanced tab below the Query ID label.
  • In the Tables section of the Advanced tab you will see values for the number of Rows and Bytes read.
  • For the query you just ran, all rows (1.5M) were read from the orders_no_pttn table, which accounted for 33.7 MB of data. That equates to a full table scan of all records from the data lake.

Step 5: Add additional filter to query

Let's enhance the filtering of the previous query by adding a filter on the orderpriority field.

  • Run the following query, which should return 187 rows:
SELECT * FROM orders_no_pttn
 WHERE totalprice >= 430000.00
   AND orderpriority = '4-NOT SPECIFIED';
  • Use the process shown in Step 4 to verify that another full table scan occurred. As a reminder, a full table scan would be indicated by 1.5 million rows read.

4. Create and query a partitioned table

Background

In the last section, you created and queried an unpartitioned table. Through that exercise, you learned that when a table is unpartitioned, the entire table is scanned during query execution, even when filters are applied.

In this section, you'll create another orders table, but this one will be partitioned on orderpriority. This will allow you to verify that less data is read on partitioned tables when the WHERE clause is aligned to the partitioned field, thus enhancing the performance of the data lake.

Step 1: Create partitioned table

To create a partitioned version of the orders table, you'll need to modify the CTAS statement you used to create the unpartitioned table.

  • Try to run the following SQL to create the table. Notice the addition of the partitioned_by clause:
CREATE TABLE orders_pttn_priority
   WITH (
      partitioned_by = ARRAY['orderpriority'],
      type = 'hive'
   )
AS
   SELECT *
     FROM tpch.sf1.orders;
  • You should have received the following error:

  • Just as the error message states, the partition field needs to be the last column in this (Hive) table.
  • Execute the following statement that reorders the columns so that orderpriority is last:
CREATE TABLE orders_pttn_priority
   WITH (
      partitioned_by = ARRAY['orderpriority'],
      type = 'hive'
   )
AS
   SELECT orderkey, custkey, orderstatus, totalprice,
          orderdate, clerk, shippriority, comment,
          orderpriority
     FROM tpch.sf1.orders;

Step 2: Query partitioned table

It's time to query the partitioned table, using the same query from Step 3 in the previous section. This query does not include the partition column in the WHERE clause, which means that the entire table will still be scanned. After you've queried the table, you'll see how to use a query plan to confirm a full table scan.

  • Run the following query on the new table:
SELECT * FROM orders_pttn_priority
 WHERE totalprice >= 430000.00;
  • Use the information provided on the Query details page to confirm that a full table scan occurred. Recall that the table has 1.5M rows.

Step 3: Review query plan

You can also review the Query plan to see if a full table scan occurred. You should already be on the Query details page under the Advanced tab. The Query plan is located at the bottom of the page.

  • Click on the Query plan to bring up a larger view.
  • Review the lines that are circled in the image below. Because this is a partitioned column, the query plan will show you the folders that were read. The folders listed represent all of the distinct values for the orderpriority column. This is further evidence that a full table scan occurred.

Step 4: Leverage partition pruning

In this step, you'll take advantage of the partitioning on your table by adding an orderpriority filter.

  • Run the following query:
SELECT * FROM orders_pttn_priority
 WHERE totalprice >= 430000.00
   AND orderpriority = '4-NOT SPECIFIED';
  • Navigate to the Query details page to see how many rows and bytes were read while executing the query.

In our example, 300K rows and 6.92 MB were read. This represents about 20% of the full table scan's metrics. The sample data from TPCH was relatively evenly distributed across the five order priority values, which provides anecdotal evidence that only one of the partition folders was read. In the next step, you'll confirm that by looking at the query plan.
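The "about 20%" claim is easy to sanity-check with the row and byte counts reported above (a quick Python calculation, nothing more):

```python
# Sanity check on the "about 20%" figure from the query details page.
rows_read = 300_000          # rows read with the orderpriority filter
total_rows = 1_500_000       # rows in the full table
mb_read = 6.92               # MB read with the filter
total_mb = 33.7              # MB read by the full table scan

row_fraction = rows_read / total_rows    # 0.2 -> one of five priority values
byte_fraction = mb_read / total_mb       # roughly 0.205
```

Both fractions sit near one fifth, which is what you'd expect if the data is spread evenly across the five orderpriority partitions and only one was read.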

Step 5: Review query plan

Let's take a look at the query plan to further confirm that only one partition folder was read during the execution of the last query. You should already be in the Advanced tab of the Query details page.

  • Click to open an expanded view of the query plan.
  • As you did before, look for the orderpriority row in the query plan.

Note that only one folder was read. This solidifies the anecdotal evidence with actual results. When the query engine can read a subset of the partition folders, this is referred to as partition pruning.

5. What is table bucketing?

Background

Table bucketing is a data lake technique in which a table is divided into a fixed number of smaller segments, called buckets, based on the values of a specific column (or columns). Bucketing optimizes data storage and speeds up query performance, especially for queries involving joins or aggregations on the bucketed columns.

How it works

In table bucketing:

  • A column is selected as the bucketing key, often one that is frequently used in join or filter conditions.
  • A hash function is applied to the values in the bucket key column. This hash function assigns each row to one of a fixed number of buckets.
  • Each bucket is stored separately, usually in its own file, within the table directory.

For example, if a table is bucketed by user_id into 4 buckets, each unique user_id is hashed to determine its bucket. Rows with the same hash value (for user_id) end up in the same bucket.
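The user_id example can be sketched in Python. This is illustrative only: a simple modulo stands in for the bucketing hash, and Hive's actual hash function is different.

```python
# Illustrative sketch of hash bucketing (modulo stands in for the real
# Hive bucketing hash; the user_id values are hypothetical).
BUCKET_COUNT = 4

def bucket_for(user_id: int) -> int:
    """Assign a row to one of BUCKET_COUNT buckets by hashing its key."""
    return user_id % BUCKET_COUNT

# Rows with the same user_id always hash to the same bucket, so a
# filter on user_id needs to read only one bucket file.
buckets = {i: [] for i in range(BUCKET_COUNT)}
for user_id in [7, 11, 7, 42, 11]:
    buckets[bucket_for(user_id)].append(user_id)
```

Because the assignment is deterministic, the query engine can recompute the hash for a filter value like `user_id = 7` and open only that one bucket file.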

Let's take a look at this in practice.

6. Create and query a bucketed table

Background

In this section, you are going to recreate the orders table, this time bucketing it on the customer identifier column.

Step 1: Create table

  • Run the following SQL to create an orders table, bucketed on custkey:
CREATE TABLE orders_bckt_custkey
   WITH (
      bucketed_by = ARRAY['custkey'],
      bucket_count = 4,
      type = 'hive'
   )
AS
   SELECT *
     FROM tpch.sf1.orders;

You do not have access to review the data in S3, but here is an example of the contents of this table. The data was evenly spread by custkey into four separate files of similar size, and together the files add up to roughly the size of the full table.

Step 2: Query two tables

In this step, you're going to run the same query on two different tables. One will be the unpartitioned, unbucketed orders table, and the other will be the bucketed table you just created. This exercise will compare the amount of data processed for each query.

  • Begin by running the following query against the orders_no_pttn table:
SELECT *
  FROM orders_no_pttn
 WHERE custkey = 130000;
  • A look at the query details page shows that the full table was scanned:

  • Now run the same query against the orders_bckt_custkey table:
SELECT *
  FROM orders_bckt_custkey
 WHERE custkey = 130000;
  • Review the query details again, and this time you'll see that far less data was read. In fact, only about 25% of the data was read because the specific custkey could only be in one of the four buckets you created.

Remember, partitioning and bucketing can be used independently or in combination.
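The combined case can be sketched in the same toy style as before (the file layout is hypothetical, and modulo again stands in for the real bucketing hash): each partition folder holds one file per bucket, so a query filtering on both keys can narrow the scan to a single file.

```python
# Sketch of partitioning and bucketing combined (hypothetical layout).
BUCKET_COUNT = 4

def file_to_read(orderpriority: str, custkey: int) -> str:
    """Partition pruning picks the folder; bucket pruning picks the file."""
    folder = f"orderpriority={orderpriority}"   # partition pruning
    bucket = custkey % BUCKET_COUNT             # modulo stands in for the real hash
    return f"{folder}/bucket_{bucket:05d}"

# With 5 order priorities and 4 buckets, that is one file out of 20.
```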

7. Tutorial wrap-up

Tutorial complete

Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.

You should now have a better understanding of how partitioning and bucketing can improve data lake performance. The data used in this tutorial was relatively small in size, which means that the performance improvements weren't very dramatic. With larger volumes of data you will see a much more obvious improvement.

Continuous learning

At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will build on the concepts covered here.

Next steps

Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.

Tutorials available

Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!